Random Forest
Note 1:
A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System
Here is how such a system is trained. For some number of trees T:
- Sample N cases at random with replacement to create a subset of the data. The subset should be about 66% of the total set.
- At each node:
  - For some number m (see below), m predictor variables are selected at random from all the predictor variables.
  - The predictor variable that provides the best split, according to some objective function, is used to do a binary split on that node.
  - At the next node, choose another m variables at random from all predictor variables and do the same.
Depending upon the value of m, there are three slightly different systems:
- Random splitter selection: m = 1
- Breiman’s bagger: m = total number of predictor variables
- Random forest: m << number of predictor variables. Breiman suggests three possible values for m: ½√p, √p, and 2√p, where p is the total number of predictor variables.
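As a hedged sketch, the three settings above can be reproduced with scikit-learn's RandomForestClassifier, where the max_features parameter plays the role of m; the toy dataset, tree counts, and the max_samples fraction below are assumptions made only for illustration.

```python
# Sketch: mapping the three choices of m onto scikit-learn's max_features.
# Dataset and hyperparameters are arbitrary, chosen only for demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "random splitter selection (m = 1)":
        RandomForestClassifier(n_estimators=100, max_features=1, random_state=0),
    "Breiman's bagger (m = all predictors)":
        RandomForestClassifier(n_estimators=100, max_features=None, random_state=0),
    "random forest (m ~ sqrt of predictors)":
        RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               max_samples=0.66,  # roughly the 66% subset noted above
                               random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training accuracy = {model.score(X, y):.3f}")
```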
Running a Random Forest. When a new input is entered into the system, it is run down all of the trees. The result is either an average or weighted average of all of the terminal nodes that are reached or, in the case of categorical variables, a majority vote.
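A toy sketch of this aggregation step (the helper below is hypothetical; it assumes any list of fitted trees that expose a per-sample predict method):

```python
# Hypothetical helper: run one new input down every tree and combine the outputs.
from collections import Counter

def run_forest(trees, x, task="classification"):
    outputs = [tree.predict([x])[0] for tree in trees]   # one output per tree
    if task == "regression":
        return sum(outputs) / len(outputs)                # plain average of leaf values
    return Counter(outputs).most_common(1)[0][0]          # majority vote for categories
```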
Note that:
- With a large number of predictors, the eligible predictor set will be quite different from node to node.
- The greater the inter-tree correlation, the greater the random forest error rate, so one pressure on the model is to have the trees as uncorrelated as possible.
- As m goes down, both inter-tree correlation and the strength of individual trees go down. So some optimal value of m must be discovered.
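A minimal sketch of that search, assuming scikit-learn: the out-of-bag error (computed from the cases left out of each bootstrap sample) is compared across candidate values of m, passed as max_features; the candidate grid and dataset are assumptions for illustration.

```python
# Sketch: pick m by comparing out-of-bag error across candidate values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

for m in (1, 3, 5, 10, 30):                      # candidate values of m
    rf = RandomForestClassifier(n_estimators=300, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    print(f"m = {m:2d}   OOB error = {1 - rf.oob_score_:.3f}")
```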
Strengths and weaknesses. Random forest runtimes are quite fast, and they handle unbalanced and missing data well. Their weaknesses are that, when used for regression, they cannot predict beyond the range of the training data, and that they may over-fit data sets that are particularly noisy. Of course, the best test of any algorithm is how well it works on your own data set.
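A small sketch of the regression weakness, assuming scikit-learn: because a forest's prediction is an average of training-set leaf values, it cannot go beyond the target range seen during training; the toy data here are assumptions.

```python
# Sketch: a random forest regressor cannot extrapolate past its training targets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()                  # training targets lie in [0, 198]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(rf.predict([[500.0]]))                     # stays near 198, far from the true 1000
```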
Note 2:
http://blog.echen.me/2011/03/14/laymans-introduction-to-random-forests/
Note 3:
Bias and variance tradeoff
If you build a small tree, you will get a model with low variance and high bias. How do you balance the trade-off between bias and variance?
Normally, as you increase the complexity of your model, you will see a reduction in prediction error due to lower bias. As you continue to make the model more complex, you end up over-fitting it, and it will start suffering from high variance.
A champion model should maintain a balance between these two types of error. This is known as managing the bias-variance trade-off, and ensemble learning is one way to carry out this trade-off.
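A minimal sketch of that idea with scikit-learn (dataset and settings are illustrative assumptions): a single deep tree has low bias but high variance, while averaging many de-correlated trees in a random forest keeps the low bias and reduces the variance, which usually shows up as a better cross-validated score.

```python
# Sketch: compare a single deep tree against an ensemble of such trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)                        # low bias, high variance
forest = RandomForestClassifier(n_estimators=200, random_state=0)    # averaged, de-correlated trees

print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```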
Gradient Boosting
Note 1:
https://www.quora.com/What-is-an-intuitive-explanation-of-Gradient-Boosting
- There is a good explanation for it
- Also a good video explanation
Note 2:
XGBoost official page: http://xgboost.readthedocs.io/en/latest/model.html
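The core intuition from the links above can be sketched in a few lines: each new tree is fit to the residuals of the current ensemble (the negative gradient of squared error), and its shrunken predictions are added to the running model. This is an illustrative from-scratch sketch, not the XGBoost implementation; the names and settings here are assumptions.

```python
# Sketch: gradient boosting for squared error, built from plain regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = y.mean()
    pred = np.full(len(y), base)                  # start from a constant model
    trees = []
    for _ in range(n_trees):
        residual = y - pred                       # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += learning_rate * tree.predict(X)   # shrink and add the new tree
        trees.append(tree)
    return base, trees

def boost_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)
```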