Scikit-learn's random forests are a great first choice for tackling a machine-learning problem. They are easy to use, with only a handful of tuning parameters, yet they produce good results. The out-of-bag predictions generated while the forest is built can often stand in for a separate cross-validation step, and the fitted model makes it relatively easy to identify the most important features in the sample data.
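As a rough illustration of those two conveniences, the sketch below fits a forest with out-of-bag scoring switched on and then ranks the features by importance. The synthetic data from `make_classification` and the parameter values are assumptions chosen for the example; they are not the ASX data or the settings used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the ASX dataset from the talk).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,  # score each sample using only the trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

# Accuracy estimated from out-of-bag predictions; no held-out set was needed.
print("OOB accuracy:", forest.oob_score_)

# Importances are normalised to sum to 1; list them from most to least important.
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for index, importance in ranked:
    print(f"feature {index}: {importance:.3f}")
```

The out-of-bag score is an estimate of generalisation accuracy, so it is best read as a quick check and not as a guarantee.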
In this talk we’ll walk through using scikit-learn’s random forests, with a financial dataset of ASX equities as the running example. We’ll begin with a basic overview of the random forest algorithm, the tuning parameters available, and their impact on the effectiveness of the forest. Next we’ll cover the basic usage of scikit-learn’s random forests and, in the process, troubleshoot some common problems such as missing sample data. We’ll then discuss using out-of-bag predictions as a quick substitute for cross-validation when optimising the tuning parameters. Finally we’ll look at how to extract information from the fitted model, most notably the relative importances of the features in the sample data.
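To give a flavour of the middle part of that outline, here is a minimal sketch of one possible way to handle missing values and then compare a tuning parameter by out-of-bag score. The median imputation via `SimpleImputer`, the candidate `max_features` values, and the synthetic data are all assumptions made for illustration; the talk may well use a different approach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Synthetic stand-in data (an assumption, not the ASX dataset from the talk).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Blank out roughly 5% of the values to simulate missing sample data.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# One simple remedy: replace each missing value with its column median.
X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Compare candidate max_features settings by OOB accuracy
# instead of running a separate cross-validation loop.
oob_scores = {}
for max_features in (2, 4, 8):
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,
        random_state=0,
    )
    forest.fit(X_filled, y)
    oob_scores[max_features] = forest.oob_score_

best = max(oob_scores, key=oob_scores.get)
print("OOB accuracy by max_features:", oob_scores)
print("Best max_features:", best)
```

Because the medians here are computed from all rows before fitting, the out-of-bag scores are slightly optimistic; for a quick comparison between settings that is usually an acceptable trade-off.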
Greg has been programming in Python since 1995. He has a PhD in Computer Science and has been working in the financial services industry for over ten years.