Keywords: decision trees, random forests, gradient boosting, XGBoost, LightGBM, bagging.
3 Classification and regression trees (CART)
A CART model is a tree structure consisting of a set of nested decision rules. At each node \(i\), the feature dimension \(d_i\) of the input vector \(\mathbf x\) is compared to a threshold value \(t_i\), and the input is then passed down to the left or right branch, depending on whether it is above or below the threshold. (For categorical features, we instead check whether \(x_{d_i}\) is equal to the target value \(t_i\) or not.) At the leaves of the tree, the model specifies a distribution over the output for points that fall in that part of the input space.
Formally, the model can be written as
\[
f(\mathbf x; \boldsymbol \theta) = \sum_{j=1}^{J} w_j \, \mathbb I\left(\mathbf x \in R_j\right),
\]
where \(R_j\) is the region specified by the \(j\)'th leaf node, \(w_j\) is the predicted output for that node, and \(\boldsymbol \theta = \{ (R_j, w_j) : j = 1:J \}\). (For regression, the predicted output \(w_j\) for each leaf is a scalar; for classification, it can be the logits or class probabilities.)
Tree-based models are non-parametric, and the discrete tree structure makes the training objective non-differentiable, so we cannot fit them with gradient-based optimization. The standard practice is instead to use a greedy procedure, in which we iteratively grow the tree one node at a time. This approach is used by CART, C4.5, and ID3, three popular implementations of the method.
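To make this concrete, here is a minimal sketch using rpart, an R implementation of CART; the built-in iris data set is used purely for illustration.

# Fit a classification tree with rpart (an R implementation of CART);
# the iris data set is used purely for illustration
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Print the nested decision rules: each node shows the chosen feature,
# the threshold, and the class distribution at the leaves
print(fit)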
For classification, we first compute the empirical distribution over class labels for this node,
\[
\widehat \pi_{i,c} = \frac{1}{|\mathcal D_i|} \sum_{n \in \mathcal D_i} \mathbb I\left(y_n = c\right),
\]
where \(\mathcal D_i\) is the set of training examples that reach node \(i\). A common measure of node impurity is the Gini index,
\[
G_i = \sum_{c=1}^{C} \widehat \pi_{i,c} \left(1 - \widehat \pi_{i,c}\right).
\]
This is the expected error rate. To see this, note that \(\widehat \pi_{i,c}\) is the probability a random entry in the leaf belongs to class \(c\), and \(1 - \widehat \pi_{i,c}\) is the probability it would be misclassified.
Each node in the tree corresponds to a region of the input space. If we want to further partition that region into left and right subtrees, we choose the best feature \(j_i\) to split on, and the best value \(t_i\) for that feature, as follows:
\[
(j_i, t_i) = \arg\min_{j} \min_{t} \;
\frac{|\mathcal D_i^{L}(j,t)|}{|\mathcal D_i|} \, c\!\left(\mathcal D_i^{L}(j,t)\right)
+ \frac{|\mathcal D_i^{R}(j,t)|}{|\mathcal D_i|} \, c\!\left(\mathcal D_i^{R}(j,t)\right),
\]
where \(c(\cdot)\) is a cost (impurity) measure such as the Gini index, and \(\mathcal D_i^{L}(j,t)\) and \(\mathcal D_i^{R}(j,t)\) are the subsets of \(\mathcal D_i\) sent to the left and right children when splitting feature \(j\) at threshold \(t\).
Alternatively, we can measure impurity with the entropy (deviance) of the node,
\[
H_i = - \sum_{c=1}^{C} \widehat \pi_{i,c} \log \widehat \pi_{i,c}.
\]
A node that is pure (i.e., only has examples of one class) will have 0 entropy.
According to the above rule, we can pick the best feature and the best threshold at each node. We then partition the data and call the fitting algorithm recursively on each subset of the data.
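As a sketch of one step of this greedy search, the function below scans the candidate thresholds for a single numeric feature and returns the split minimizing the weighted Gini index of the two children. The function names are hypothetical and the iris example is illustrative; real implementations also loop over all features and typically use midpoints between sorted values as thresholds.

# Gini index of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

# Sketch of one greedy step: best threshold for a single numeric feature x,
# chosen to minimize the weighted Gini index of the two children
best_split <- function(x, y) {
  # drop the largest value so both children are non-empty
  thresholds <- head(sort(unique(x)), -1)
  cost <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(cost)], cost = min(cost))
}

# Illustrative example: best threshold on Petal.Length for the iris species
best_split(iris$Petal.Length, iris$Species)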
If we let the tree become deep enough, it can achieve 0 error on the training set (assuming no label noise), by partitioning the input space into sufficiently small regions where the output is constant. However, this will typically result in overfitting. To prevent this, there are two main approaches: The first is to stop the tree growing process according to some heuristic. The second approach is to grow the tree to its maximum depth, where no more splits are possible, and then to prune it back.
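As an example of the second approach, rpart can grow a deep tree and then prune it back using the cost-complexity parameter cp; the settings below are illustrative choices, not recommendations.

library(rpart)

# Grow a deliberately deep tree (illustrative settings)
deep_fit <- rpart(
  Species ~ ., data = iris, method = "class",
  control = rpart.control(cp = 0, minsplit = 2)
)

# Inspect the cross-validated error for each subtree size ...
printcp(deep_fit)

# ... then prune back to a larger cost-complexity value
pruned_fit <- prune(deep_fit, cp = 0.01)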
Advantages of trees
Trees are easy to interpret.
Some people believe that decision trees more closely mirror human decision-making.
Trees can be displayed graphically.
Trees can easily handle qualitative predictors.
Disadvantages of trees
Trees generally do not have the same level of predictive accuracy as other regression and classification approaches.
Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.
4 Bagging, random forests, boosting
4.1 Bagging
Decision trees suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results that we get could be quite different.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
Since averaging a set of observations reduces variance, we generate \(B\) different bootstrapped training data sets, train our method on the \(b\)th bootstrapped training set to obtain a prediction \(\hat f^{*b}(x)\), and finally average all the predictions to obtain
\[
\hat f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x).
\]
Bagging has been demonstrated to give impressive improvements in accuracy by combining hundreds or even thousands of trees into a single procedure.
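Here is a minimal sketch of bagging by hand, fitting rpart trees on bootstrap resamples of the iris data (the number of trees B and the data set are illustrative assumptions; in practice a dedicated implementation would be used).

library(rpart)

set.seed(1)
B <- 100
n <- nrow(iris)

# Fit one tree per bootstrap resample of the training data
bagged_trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Average the B class-probability predictions, then take the most
# probable class for each observation
prob_avg <- Reduce(`+`, lapply(bagged_trees, predict, newdata = iris, type = "prob")) / B
bagged_pred <- colnames(prob_avg)[max.col(prob_avg)]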
4.2 Random forests
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. The split is allowed to use only one of those \(m\) predictors. Using a small value of \(m\) in building a random forest will typically be helpful when we have a large number of correlated predictors.
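A quick standalone example with the ranger package (the same engine used in the workflow below); the data set and parameter values are illustrative, with a common rule of thumb being m roughly equal to the square root of p for classification.

library(ranger)

# Random forest considering m = 2 randomly sampled predictors at each split
# (illustrative values; iris has p = 4 predictors)
rf_fit <- ranger(
  Species ~ ., data = iris,
  num.trees = 500,
  mtry = 2
)

rf_fit$prediction.error  # out-of-bag estimate of the error rate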
4.3 Boosting
Boosting works in a similar way as bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
For a binary classification problem, each classifier outputs a label \(F_m(\mathbf x) \in \{-1, +1\}\). In particular, we first fit \(F_1\) on the original data, and then we weight the data samples by the errors made by \(F_1\), so that misclassified examples get more weight. Next we fit \(F_2\) to this weighted data set. We keep repeating this process until we have fit the desired number \(M\) of components.
It can be shown that, as long as each \(F_m\) has an accuracy that is better than chance (even on the weighted dataset), the final ensemble of classifiers will have higher accuracy than any given component. That is, if \(F_m\) is a weak learner (its accuracy is only slightly better than 50%), we can boost its performance using the above procedure so that the final ensemble \(f\) becomes a strong learner.
Note that boosting reduces the bias of the strong learner, by fitting trees that depend on each other, whereas bagging and RF reduce the variance by fitting independent trees. In many cases, boosting can work better (but take more time to train).
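The reweighting scheme described above is the AdaBoost idea; modern libraries such as XGBoost and LightGBM instead fit each new tree to the gradient of the loss, but the sequential principle is the same. A minimal, illustrative sketch with the xgboost R package follows (the mtcars data and all settings are arbitrary choices).

library(xgboost)

# Illustrative binary problem: predict transmission type (am) from the
# remaining mtcars variables
X <- as.matrix(mtcars[, setdiff(names(mtcars), "am")])
y <- mtcars$am

dtrain <- xgb.DMatrix(data = X, label = y)

boost_fit <- xgb.train(
  params = list(
    objective = "binary:logistic",
    max_depth = 2,  # shallow trees act as weak learners
    eta = 0.1       # learning rate: shrink each tree's contribution
  ),
  data = dtrain,
  nrounds = 50      # number of sequentially grown trees
)

head(predict(boost_fit, dtrain))  # predicted probabilities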
5 Random forest workflow
We load the Food Security Supplement household data we curated earlier. Our goal is to predict food insecurity status from each household's socioeconomic status.
# A tibble: 30,162 × 13
HRFS12M1 HRNUMHOU HRHTYPE GEREG PRCHLD HRPOOR PRTAGE PEEDUCA PEMLR PTDTRACE
<fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
1 Food Secu… 1 Indivi… West NoChi… NotPo… 35 HighSc… Empl… AIAN
2 Food Secu… 1 Indivi… South NoChi… NotPo… 36 Colleg… Empl… White
3 Food Secu… 3 Marrie… South NoChi… NotPo… 55 HighSc… Empl… White
4 Food Secu… 2 Marrie… South NoChi… NotPo… 85 Colleg… NotI… White
5 Food Secu… 2 Marrie… West NoChi… Poor 69 HighSc… NotI… AIAN
6 Food Secu… 1 Indivi… Nort… NoChi… NotPo… 51 HighSc… Empl… White
7 Low Food … 2 Unmarr… Midw… Child… Poor 54 HighSc… NotI… White
8 Food Secu… 3 Marrie… South Child… NotPo… 46 Colleg… Empl… White
9 Food Secu… 2 Marrie… Midw… NoChi… NotPo… 69 HighSc… NotI… White
10 Food Secu… 2 Marrie… South NoChi… NotPo… 75 HighSc… NotI… Black
# ℹ 30,152 more rows
# ℹ 3 more variables: HHSUPWGT <dbl>, PEHSPNON <fct>, HRFS12M1_binary <fct>
5.1 Initial split into test and non-test sets
# For reproducibility
set.seed(2024)
data_split <- initial_split(
  data_clean,
  # stratify by HRFS12M1
  strata = "HRFS12M1_binary",
  prop = 0.75
)
data_other <- training(data_split)
dim(data_other)
[1] 22621 13
data_test <- testing(data_split)
dim(data_test)
[1] 7541 13
5.2 Recipe
recipe <- recipe(
  HRFS12M1_binary ~ .,
  data = data_other
) |>
  # remove the weights and original HRFS12M1
  step_rm(HHSUPWGT, HRFS12M1) |>
  # create dummy variables for categorical predictors
  step_dummy(all_nominal_predictors()) |>
  # zero-variance filter
  step_zv(all_numeric_predictors()) |>
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) |>
  # estimate the means and standard deviations
  print()
5.3 Model
rf_mod <- rand_forest(
  mode = "classification",
  # Number of predictors randomly sampled in each split
  mtry = tune(),
  # Number of trees in ensemble
  trees = tune()
) |>
  set_engine("ranger")
rf_mod
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
Computational engine: ranger
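The hyperparameters marked tune() above would typically be resolved by cross-validation. The following sketch, which assumes the tidymodels packages are loaded and uses illustrative grid ranges, fold count, and metrics, shows how the recipe and model could be combined into a workflow and tuned.

# Combine the recipe and model specification into a workflow
rf_wf <- workflow() |>
  add_recipe(recipe) |>
  add_model(rf_mod)

# 5-fold cross-validation on the non-test data (illustrative settings)
set.seed(2024)
folds <- vfold_cv(data_other, v = 5, strata = "HRFS12M1_binary")

# Tune mtry and trees over a small regular grid
rf_tuned <- rf_wf |>
  tune_grid(
    resamples = folds,
    grid = grid_regular(
      mtry(range = c(2, 8)),
      trees(range = c(200, 800)),
      levels = 3
    ),
    metrics = metric_set(roc_auc, accuracy)
  )

show_best(rf_tuned, metric = "roc_auc")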