All the way back in the year 1785, Marquis de Condorcet expressed a political science theorem about the relative probability of a given group of individuals arriving at a correct majority decision.
Despite its age, the theorem has been applied to a number of different fields, most notably for us the field of machine learning. To see how, let's start with one model, a decision tree with an accuracy of 60 percent.
Condorcet's Jury Theorem says that if each person is more than 50% correct, then adding more people to vote increases the probability that the majority is correct.
True to form, if we ask two more decision trees, each also with 60% accuracy, and decide by the majority vote, then the probability of the vote being right goes up to 65%!
The theorem suggests that the probability of the majority vote being correct can go up as we add more and more models. With eleven models, the majority can reach 75% accuracy!
There is a caveat. The accuracy may not improve if each model produces the same prediction, for example. The mistake of one model would not be caught by the other models. Therefore, we would want to cast a majority vote using the models that are different.
We can add more and more models.
Play with the scrollers yourself! The top controls the number of trees in the ensemble. The bottom controls each tree's accuracy.
In machine learning, this concept of multiple models working together to come to an aggregate prediction is called ensemble learning. It provides the basis for many important machine learning models, including random forests.
Ensemble learning creates a stronger model by aggregating the predictions of multiple weak models, such as decision trees (check out our previous article on Decision Trees to learn more about some of their limitations). Condorcet’s Jury Theorem suggests that the majority vote aggregation can have better accuracy than the individual models. There are other methods to aggregate predictions, such as weighted majority vote.
Random Forest is an example of ensemble learning where each model is a decision tree. In the next section, we will build a random forest model to classify if a road sign is a pedestrian crossing sign or not. These signs come in many variations, and we will use four simple features: Size, number of sides, number of colors used, and if the sign has text or symbol.
We will start with a sampling method called the Bagging Method to create multiple samples from the training data to build each tree.
One way to produce multiple models that are different is to train each model using a different training set. The Bagging (Bootstrap Aggregating) method randomly draws a fixed number of samples from the training set with replacement. This means that a data point can be drawn more than once.
This results in several training sets that are different. Here, we have three samples. Notice that some data points are shared among some samples. Each sample then can be used to build a model. Let’s build some decision trees with these samples!
Here are the features: Size, Number of Sides, Number of Colors Used, and Text or Symbol.
For each split of the tree building, we compute the best splitting using only a randomly selected subset of the features. This is another way to ensure that the decision trees are as different as possible.
Let's build the first tree with the first sample! For the first split, three features are selected to find the best split.
The best feature to split is "Number of Colors used."
Likewise, for the second split, another set of three features are selected to find the best split. The best feature to split this time is "Text or Symbol."
This process continues until the tree is constructed. The other samples will build other trees using the same process.
Here are trees built from each sample.
Notice how these trees are quite different from each other.
Let's bring some test data to see the random forest in action!
Each tree produces a prediction.
Can a small circular sign be a crossings sign? Let's ask the three trees!
The first tree thinks that the sign is not a crossings sign. However, the other trees voted yes. By the majority vote, the prediction is "Yes"!
It's your turn
Click the remaining data points to see how the random forest makes a prediction.
Did you notice that even though the trees produce the same prediction, the decision path - the reason why each thinks the sign is a crossing sign or not - is different from each other? This is good news: The random forest models perform better when the trees in the forest are different. Let's dive deep into this!
Variance in Composition
We previously discussed how decision tree model suffers from high variance. However, this variance among trees is employed in the random forest as a feature, not a bug. The inventor of the random forest model Leo Breiman says in his paper "[o]ur results indicate that better (lower generalization error) random forests have lower correlation between classifiers and higher strength." [Random Forest Article]
The high variance of the decision tree model can help keep the correlation among trees low. The Bagging Method as well as the Feature Selection are the key innovations to keep correlation low.
We've just shown how to construct random forests for a given dataset, but how different are our trees from one another in reality? To find out, we've trained a nine-tree random forest on our sign dataset and plotted it below.
The outer enclosing circle represents the random forest. Each inner circle represents a unique decision tree in the forest. Hovering over a tree will highlight the tree's classification accuracy, and its feature importances* for four features.
Observe that each tree has a fairly unique combination of feature importances and accuracy and there is no obvious pattern. Some trees only use two out of four features, and the tree with similar feature importances as the random forest model still performs much worse than the forest. As discussed in Condorcet's theorem, the key takeaway is the power these models employ when aggregated together in a smart way, as shown by the random forest model having the highest accuracy.
As expected, the random forest model (the red dots on the right) performs better than any individual tree. Notice the wide range of the feature importance scores across our trees, which can contribute to low correlation. In the very first random forest visualization tool developed by Breiman, he also attempted to show the range in feature importances.
Variance in Predictions
If each tree produces the same prediction, then the accuracy can not improve. Below, each circle represents a prediction from a model. Circles in the same row share the same model, while circles in the same column share the same test data. Blue means "Yes" and pink means "No". Solid color means that the prediction is correct, while the strip means that the prediction is incorrect. Not all test data points are shown.
The irregular pattern in the grid shows how the trees are different in where they make mistakes. As expected, the random forest model performs the best overall even if there are trees with very low accuracy. Note that for the first data point (first column), there are still three trees with the correct prediction while the majority is incorrect. This inspired people to consider other methods, such as Boosting, which is very popular today.
Random Forest models are a popular model for a large number of tasks. In short, it's a method to produce aggregated predictions using the predictions from several decision trees. The old theorem of Condorcet suggests that the majority vote from several weak models with more than 50% accuracy may do the trick. Later, Breiman came up with Bagging and Feature Selection that helps to keep correlation among the trees low which is a key to success.
There has been some theory built around the bias and variance of random forest model in relation to the trees that it has. Please check out this article on the bias-variance tradeoff to learn more about these concepts.