Keywords: decision trees, random forests, gradient boosting, XGBoost, LightGBM, bagging.
3 Classification and regression trees (CART)
A CART model is a tree structure consisting of a set of nested decision rules. At each node \(i\), the feature dimension \(d_i\) of the input vector \(\mathbf x\) is compared to a threshold value \(t_i\), and the input is then passed down to the left or right branch, depending on whether it is above or below the threshold. (For categorical features, we instead check whether \(x_{d_i}\) is equal to the target value \(t_i\) or not.) At the leaves of the tree, the model specifies a distribution over the output for points that fall in that part of the input space.
Formally, the model can be written as
\[
f(\mathbf x; \boldsymbol \theta) = \sum_{j=1}^{J} w_j \, \mathbb I\left(\mathbf x \in R_j\right),
\]
where \(R_j\) is the region specified by the \(j\)'th leaf node, \(w_j\) is the predicted output for that node, and \(\boldsymbol \theta = \{ (R_j, w_j) : j = 1:J \}\). (For regression, the predicted output \(w_j\) for each leaf is a scalar; for classification, it can be the logits or class probabilities.)
Tree-based models are non-parametric, and the discrete tree structure makes the training objective non-differentiable, so we cannot fit them with gradient-based optimization. The standard practice is instead to use a greedy procedure, in which we iteratively grow the tree one node at a time. This approach is used by CART, C4.5, and ID3, three popular implementations of the method.
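To make this concrete, here is a minimal sketch using rpart, an R implementation of CART; the built-in iris data set is used purely for illustration.

# Fit a classification tree with rpart (an R implementation of CART);
# the iris data set is used purely for illustration
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Print the nested decision rules: each node shows the chosen feature,
# the threshold, and the class distribution at the leaves
print(fit)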
For classification, we first compute the empirical distribution over class labels for this node,
\[
\widehat \pi_{i,c} = \frac{1}{|\mathcal D_i|} \sum_{n \in \mathcal D_i} \mathbb I\left(y_n = c\right),
\]
where \(\mathcal D_i\) is the set of training examples that reach node \(i\). A common measure of node impurity is the Gini index,
\[
G_i = \sum_{c=1}^{C} \widehat \pi_{i,c} \left(1 - \widehat \pi_{i,c}\right).
\]
This is the expected error rate. To see this, note that \(\widehat \pi_{i,c}\) is the probability a random entry in the leaf belongs to class \(c\), and \(1 - \widehat \pi_{i,c}\) is the probability it would be misclassified.
Each node in the tree corresponds to a region of the input space. If we want to further partition that region into left and right subtrees, we choose the best feature \(j_i\) to split on, and the best value \(t_i\) for that feature, as follows:
\[
(j_i, t_i) = \arg\min_{j} \min_{t} \;
\frac{|\mathcal D_i^{L}(j,t)|}{|\mathcal D_i|} \, c\!\left(\mathcal D_i^{L}(j,t)\right)
+ \frac{|\mathcal D_i^{R}(j,t)|}{|\mathcal D_i|} \, c\!\left(\mathcal D_i^{R}(j,t)\right),
\]
where \(c(\cdot)\) is a cost (impurity) measure such as the Gini index, and \(\mathcal D_i^{L}(j,t)\) and \(\mathcal D_i^{R}(j,t)\) are the subsets of \(\mathcal D_i\) sent to the left and right children when splitting feature \(j\) at threshold \(t\).
Alternatively, we can measure impurity with the entropy (deviance) of the node,
\[
H_i = - \sum_{c=1}^{C} \widehat \pi_{i,c} \log \widehat \pi_{i,c}.
\]
A node that is pure (i.e., only has examples of one class) will have 0 entropy.
According to the above rule, we can pick the best feature and the best threshold at each node. We then partition the data and call the fitting algorithm recursively on each subset of the data.
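As a sketch of one step of this greedy search, the function below scans the candidate thresholds for a single numeric feature and returns the split minimizing the weighted Gini index of the two children. The function names are hypothetical and the iris example is illustrative; real implementations also loop over all features and typically use midpoints between sorted values as thresholds.

# Gini index of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

# Sketch of one greedy step: best threshold for a single numeric feature x,
# chosen to minimize the weighted Gini index of the two children
best_split <- function(x, y) {
  # drop the largest value so both children are non-empty
  thresholds <- head(sort(unique(x)), -1)
  cost <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(cost)], cost = min(cost))
}

# Illustrative example: best threshold on Petal.Length for the iris species
best_split(iris$Petal.Length, iris$Species)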
If we let the tree become deep enough, it can achieve 0 error on the training set (assuming no label noise), by partitioning the input space into sufficiently small regions where the output is constant. However, this will typically result in overfitting. To prevent this, there are two main approaches: The first is to stop the tree growing process according to some heuristic. The second approach is to grow the tree to its maximum depth, where no more splits are possible, and then to prune it back.
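As an example of the second approach, rpart can grow a deep tree and then prune it back using the cost-complexity parameter cp; the settings below are illustrative choices, not recommendations.

library(rpart)

# Grow a deliberately deep tree (illustrative settings)
deep_fit <- rpart(
  Species ~ ., data = iris, method = "class",
  control = rpart.control(cp = 0, minsplit = 2)
)

# Inspect the cross-validated error for each subtree size ...
printcp(deep_fit)

# ... then prune back to a larger cost-complexity value
pruned_fit <- prune(deep_fit, cp = 0.01)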
Advantages of trees
Trees are easy to interpret.
Some people believe that decision trees more closely mirror human decision-making.
Trees can be displayed graphically.
Trees can easily handle qualitative predictors.
Disadvantages of trees
Trees generally do not have the same level of predictive accuracy as other regression and classification approaches.
Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.
4 Bagging, random forests, boosting
4.1 Bagging
Decision trees suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results that we get could be quite different.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
Since averaging a set of observations reduces variance, we generate \(B\) different bootstrapped training data sets, train our method on the \(b\)th bootstrapped training set to obtain a prediction \(\hat f^{*b}(x)\), and finally average all the predictions to obtain
\[
\hat f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x).
\]
Bagging has been demonstrated to give impressive improvements in accuracy by combining hundreds or even thousands of trees into a single procedure.
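Here is a minimal sketch of bagging by hand, fitting rpart trees on bootstrap resamples of the iris data (the number of trees B and the data set are illustrative assumptions; in practice a dedicated implementation would be used).

library(rpart)

set.seed(1)
B <- 100
n <- nrow(iris)

# Fit one tree per bootstrap resample of the training data
bagged_trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Average the B class-probability predictions, then take the most
# probable class for each observation
prob_avg <- Reduce(`+`, lapply(bagged_trees, predict, newdata = iris, type = "prob")) / B
bagged_pred <- colnames(prob_avg)[max.col(prob_avg)]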
4.2 Random forests
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. The split is allowed to use only one of those \(m\) predictors. Using a small value of \(m\) in building a random forest will typically be helpful when we have a large number of correlated predictors.
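A quick standalone example with the ranger package (the same engine used in the workflow below); the data set and parameter values are illustrative, with a common rule of thumb being m roughly equal to the square root of p for classification.

library(ranger)

# Random forest considering m = 2 randomly sampled predictors at each split
# (illustrative values; iris has p = 4 predictors)
rf_fit <- ranger(
  Species ~ ., data = iris,
  num.trees = 500,
  mtry = 2
)

rf_fit$prediction.error  # out-of-bag estimate of the error rate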
4.3 Boosting
Boosting works in a similar way as bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
For a binary classification problem, each classifier outputs a label \(F_m(\mathbf x) \in \{-1, +1\}\). In particular, we first fit \(F_1\) on the original data, and then we weight the data samples by the errors made by \(F_1\), so that misclassified examples get more weight. Next we fit \(F_2\) to this weighted data set. We keep repeating this process until we have fit the desired number \(M\) of components.
It can be shown that, as long as each \(F_m\) has an accuracy that is better than chance (even on the weighted dataset), the final ensemble of classifiers will have higher accuracy than any given component. That is, if \(F_m\) is a weak learner (its accuracy is only slightly better than 50%), we can boost its performance using the above procedure so that the final ensemble \(f\) becomes a strong learner.
Note that boosting reduces the bias of the strong learner, by fitting trees that depend on each other, whereas bagging and RF reduce the variance by fitting independent trees. In many cases, boosting can work better (but take more time to train).
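The reweighting scheme described above is the AdaBoost idea; modern libraries such as XGBoost and LightGBM instead fit each new tree to the gradient of the loss, but the sequential principle is the same. A minimal, illustrative sketch with the xgboost R package follows (the mtcars data and all settings are arbitrary choices).

library(xgboost)

# Illustrative binary problem: predict transmission type (am) from the
# remaining mtcars variables
X <- as.matrix(mtcars[, setdiff(names(mtcars), "am")])
y <- mtcars$am

dtrain <- xgb.DMatrix(data = X, label = y)

boost_fit <- xgb.train(
  params = list(
    objective = "binary:logistic",
    max_depth = 2,  # shallow trees act as weak learners
    eta = 0.1       # learning rate: shrink each tree's contribution
  ),
  data = dtrain,
  nrounds = 50      # number of sequentially grown trees
)

head(predict(boost_fit, dtrain))  # predicted probabilities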
5 Random forest workflow
We load the Food Security Supplement household data we curated earlier. Our goal is to predict food insecurity status from each household's socioeconomic status.
# A tibble: 30,162 × 13
HRFS12M1 HRNUMHOU HRHTYPE GEREG PRCHLD HRPOOR PRTAGE PEEDUCA PEMLR PTDTRACE
<fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
1 Food Secu… 1 Indivi… West NoChi… NotPo… 35 HighSc… Empl… AIAN
2 Food Secu… 1 Indivi… South NoChi… NotPo… 36 Colleg… Empl… White
3 Food Secu… 3 Marrie… South NoChi… NotPo… 55 HighSc… Empl… White
4 Food Secu… 2 Marrie… South NoChi… NotPo… 85 Colleg… NotI… White
5 Food Secu… 2 Marrie… West NoChi… Poor 69 HighSc… NotI… AIAN
6 Food Secu… 1 Indivi… Nort… NoChi… NotPo… 51 HighSc… Empl… White
7 Low Food … 2 Unmarr… Midw… Child… Poor 54 HighSc… NotI… White
8 Food Secu… 3 Marrie… South Child… NotPo… 46 Colleg… Empl… White
9 Food Secu… 2 Marrie… Midw… NoChi… NotPo… 69 HighSc… NotI… White
10 Food Secu… 2 Marrie… South NoChi… NotPo… 75 HighSc… NotI… Black
# ℹ 30,152 more rows
# ℹ 3 more variables: HHSUPWGT <dbl>, PEHSPNON <fct>, HRFS12M1_binary <fct>
5.1 Initial split into test and non-test sets
# For reproducibility
set.seed(2024)
data_split <- initial_split(
  data_clean,
  # stratify by HRFS12M1
  strata = "HRFS12M1_binary",
  prop = 0.75
)
data_other <- training(data_split)
dim(data_other)
[1] 22621 13
data_test <- testing(data_split)
dim(data_test)
[1] 7541 13
5.2 Recipe
recipe <- recipe(
  HRFS12M1_binary ~ .,
  data = data_other
) |>
  # remove the weights and original HRFS12M1
  step_rm(HHSUPWGT, HRFS12M1) |>
  # create dummy variables for categorical predictors
  step_dummy(all_nominal_predictors()) |>
  # zero-variance filter
  step_zv(all_numeric_predictors()) |>
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) |>
  # estimate the means and standard deviations
  print()
5.3 Model
rf_mod <- rand_forest(
  mode = "classification",
  # Number of predictors randomly sampled in each split
  mtry = tune(),
  # Number of trees in ensemble
  trees = tune()
) |>
  set_engine("ranger")
rf_mod
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
Computational engine: ranger
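The hyperparameters marked tune() above would typically be resolved by cross-validation. The following sketch, which assumes the tidymodels packages are loaded and uses illustrative grid ranges, fold count, and metrics, shows how the recipe and model could be combined into a workflow and tuned.

# Combine the recipe and model specification into a workflow
rf_wf <- workflow() |>
  add_recipe(recipe) |>
  add_model(rf_mod)

# 5-fold cross-validation on the non-test data (illustrative settings)
set.seed(2024)
folds <- vfold_cv(data_other, v = 5, strata = "HRFS12M1_binary")

# Tune mtry and trees over a small regular grid
rf_tuned <- rf_wf |>
  tune_grid(
    resamples = folds,
    grid = grid_regular(
      mtry(range = c(2, 8)),
      trees(range = c(200, 800)),
      levels = 3
    ),
    metrics = metric_set(roc_auc, accuracy)
  )

show_best(rf_tuned, metric = "roc_auc")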