Predictive Modeling - Logistic Regression and Regularization

Author

Dr. Roch Nianogo, Bowen Zhang, Dr. Hua Zhou

Code
library(tidyverse)
library(tidycensus)
census_api_key("YOUR_CENSUS_API_KEY",  # replace with your own Census API key
               install = TRUE, overwrite = TRUE)
Sys.setenv("CENSUS_KEY" = "YOUR_CENSUS_API_KEY")

library(censusapi)
library(glmnet)
library(gtsummary)
library(knitr)
library(maps)
library(pROC)
library(tidymodels)
library(tigris)

1 Roadmap

Data Science Diagram

In the last lecture, we focused on data wrangling (import, tidy, transform, visualize). Now we progress to modeling. In this lecture, we focus on predictive modeling using machine learning methods (logistic regression, random forests, neural networks, …). In the next lecture, we focus on policy evaluation using double machine learning.

Slido

2 Learning objectives

Supervised vs unsupervised learning, logistic regression, ROC curve, overfitting, regularization, L1 and L2 penalties, elastic net, cross-validation, hyperparameter tuning, model evaluation, and interpretation.

3 Machine learning overview

Machine Learning

3.1 Supervised vs unsupervised learning

  • Supervised learning: input(s) -> output.

    • Prediction (or regression): the output is continuous (income, weight, BMI, …).
    • Classification: the output is categorical (disease or not, pattern recognition, …).
  • Unsupervised learning: no output. We learn relationships and structure in the data.

    • Clustering.
    • Dimension reduction.
    • Embedding.
  • In modern applications, the line between supervised and unsupervised learning is blurred.

    • Matrix completion: the Netflix problem. Both supervised and unsupervised techniques are used.

    • Large language models (LLMs) combine supervised learning and reinforcement learning.

4 Logistic regression

We load the Food Security Supplement household data we curated earlier. Our goal is to predict food insecurity status from households’ socioeconomic characteristics.

data_clean <- read_rds("../02-wrangle/fss21.rds") |>
  print()
# A tibble: 30,162 × 13
   HRFS12M1   HRNUMHOU HRHTYPE GEREG PRCHLD HRPOOR PRTAGE PEEDUCA PEMLR PTDTRACE
   <fct>         <dbl> <fct>   <fct> <fct>  <fct>   <dbl> <fct>   <fct> <fct>   
 1 Food Secu…        1 Indivi… West  NoChi… NotPo…     35 HighSc… Empl… AIAN    
 2 Food Secu…        1 Indivi… South NoChi… NotPo…     36 Colleg… Empl… White   
 3 Food Secu…        3 Marrie… South NoChi… NotPo…     55 HighSc… Empl… White   
 4 Food Secu…        2 Marrie… South NoChi… NotPo…     85 Colleg… NotI… White   
 5 Food Secu…        2 Marrie… West  NoChi… Poor       69 HighSc… NotI… AIAN    
 6 Food Secu…        1 Indivi… Nort… NoChi… NotPo…     51 HighSc… Empl… White   
 7 Low Food …        2 Unmarr… Midw… Child… Poor       54 HighSc… NotI… White   
 8 Food Secu…        3 Marrie… South Child… NotPo…     46 Colleg… Empl… White   
 9 Food Secu…        2 Marrie… Midw… NoChi… NotPo…     69 HighSc… NotI… White   
10 Food Secu…        2 Marrie… South NoChi… NotPo…     75 HighSc… NotI… Black   
# ℹ 30,152 more rows
# ℹ 3 more variables: HHSUPWGT <dbl>, PEHSPNON <fct>, HRFS12M1_binary <fct>

4.1 Why not linear regression?

We are interested in predicting whether a household will be food insecure on the basis of household size, household type, presence of children, geographic region, poverty status, age, education level, employment status, race, and ethnicity.

The response HRFS12M1_binary falls into one of two categories, Food Insecure (1) or Food Secure (0). Rather than modeling this response \(Y\) directly, logistic regression models the probability that \(Y\) belongs to a particular category.


\[ Y_i = \begin{cases} 1 & \text{with probability } p_i \\ 0 & \text{with probability } 1 - p_i \end{cases} \]

The parameter \(p_i = \mathbb{E}(Y_i)\) will be related to the predictors \(\mathbf{x}_i\) via

\[ p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}, \] where \(\eta_i\) is the linear predictor (or systematic component)

\[ \eta_i = \mathbf{x}_i^T \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_q x_{iq}. \] In other words, logistic regression models the log-odds of the probability of success as a linear function of the predictors

\[ \log \left( \frac{p}{1-p} \right) = \log(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q. \]

Therefore \(\beta_1\) can be interpreted as follows: a unit increase in \(x_1\), with the other predictors held fixed, increases the log-odds of success by \(\beta_1\), or equivalently multiplies the odds of success by a factor of \(e^{\beta_1}\).
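As a quick numerical illustration of the logit link (toy coefficient values chosen for illustration, not estimates from the model fitted below):

# toy values: beta0 = -2, beta1 = 0.5 (hypothetical, for illustration only)
beta0 <- -2
beta1 <- 0.5
x1 <- c(0, 1, 2)

eta <- beta0 + beta1 * x1   # linear predictor
p   <- plogis(eta)          # inverse logit: exp(eta) / (1 + exp(eta))

# each unit increase in x1 multiplies the odds p / (1 - p) by exp(beta1)
cbind(x1, eta, p, odds = p / (1 - p))
exp(beta1)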

4.2 Logistic regression on food security data

To further investigate the factors associated with food insecurity, we can use logistic regression to model the probability that a household is food insecure.

# Fit logistic regression
logit_model <- glm(
  # All predictors except HRFS12M1 and HHSUPWGT
  HRFS12M1_binary ~ . - HRFS12M1 - HHSUPWGT,
  data = data_clean, 
  family = "binomial"
  )
logit_model

Call:  glm(formula = HRFS12M1_binary ~ . - HRFS12M1 - HHSUPWGT, family = "binomial", 
    data = data_clean)

Coefficients:
                       (Intercept)                            HRNUMHOU  
                          -2.40169                             0.01544  
            HRHTYPEUnmarriedFamily                   HRHTYPEIndividual  
                           0.58782                             0.65878  
              HRHTYPEGroupQuarters                        GEREGMidwest  
                           1.90448                             0.08662  
                        GEREGSouth                           GEREGWest  
                           0.11764                             0.09240  
                    PRCHLDChildren                          HRPOORPoor  
                           0.24396                             1.28567  
                            PRTAGE  PEEDUCAHighSchoolOrAssociateDegree  
                          -0.01514                            -0.34774  
            PEEDUCACollegeOrHigher                    PEMLRNotEmployed  
                          -1.05583                             0.86870  
              PEMLRNotInLaborForce                       PTDTRACEBlack  
                           0.38743                             0.58287  
                      PTDTRACEAIAN                       PTDTRACEAsian  
                           0.53993                            -0.04968  
                       PTDTRACEHPI                       PTDTRACEOther  
                           0.16870                             0.75211  
                  PEHSPNONHispanic  
                           0.22406  

Degrees of Freedom: 30161 Total (i.e. Null);  30141 Residual
Null Deviance:      19230 
Residual Deviance: 16270    AIC: 16310

The gtsummary package offers a nicely formatted summary of the model:

logit_model |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
Characteristic                      log(OR)¹  95% CI¹        p-value
HRNUMHOU                            0.02      -0.03, 0.06    0.5
HRHTYPE
    MarriedFamily
    UnmarriedFamily                 0.59      0.47, 0.70     <0.001
    Individual                      0.66      0.53, 0.79     <0.001
    GroupQuarters                   1.9       0.32, 3.2      0.008
GEREG
    Northeast
    Midwest                         0.09      -0.06, 0.23    0.2
    South                           0.12      -0.01, 0.25    0.075
    West                            0.09      -0.04, 0.23    0.2
PRCHLD
    NoChildren
    Children                        0.24      0.12, 0.37     <0.001
HRPOOR
    NotPoor
    Poor                            1.3       1.2, 1.4       <0.001
PRTAGE                              -0.02     -0.02, -0.01   <0.001
PEEDUCA
    LessThanHighSchool
    HighSchoolOrAssociateDegree     -0.35     -0.47, -0.22   <0.001
    CollegeOrHigher                 -1.1      -1.2, -0.90    <0.001
PEMLR
    Employed
    NotEmployed                     0.87      0.66, 1.1      <0.001
    NotInLaborForce                 0.39      0.29, 0.49     <0.001
PTDTRACE
    White
    Black                           0.58      0.47, 0.70     <0.001
    AIAN                            0.54      0.25, 0.82     <0.001
    Asian                           -0.05     -0.29, 0.18    0.7
    HPI                             0.17      -0.46, 0.73    0.6
    Other                           0.75      0.49, 1.0      <0.001
PEHSPNON
    NonHispanic
    Hispanic                        0.22      0.10, 0.34     <0.001
¹ OR = Odds Ratio, CI = Confidence Interval
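Because the coefficients are on the log-odds scale, it is often convenient to report odds ratios instead. A minimal sketch of two equivalent routes (exponentiate = TRUE is an argument of tbl_regression(); the base-R line uses profile-likelihood confidence intervals and can take a moment on 30,000 rows):

# odds ratios with 95% confidence intervals via gtsummary
logit_model |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels()

# base-R alternative: exponentiate the coefficients and their profile CIs
exp(cbind(OR = coef(logit_model), confint(logit_model)))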

The importance of each predictor can be assessed with an analysis of deviance (ANOVA) table using chi-square tests.

anova(logit_model, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: HRFS12M1_binary

Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                     30161      19229              
HRNUMHOU  1     8.10     30160      19221 0.0044308 ** 
HRHTYPE   3   755.27     30157      18465 < 2.2e-16 ***
GEREG     3    43.42     30154      18422 2.005e-09 ***
PRCHLD    1    51.41     30153      18370 7.502e-13 ***
HRPOOR    1  1498.77     30152      16872 < 2.2e-16 ***
PRTAGE    1    61.32     30151      16810 4.855e-15 ***
PEEDUCA   2   304.66     30149      16506 < 2.2e-16 ***
PEMLR     2   105.77     30147      16400 < 2.2e-16 ***
PTDTRACE  5   121.52     30142      16278 < 2.2e-16 ***
PEHSPNON  1    13.30     30141      16265 0.0002657 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.3 ROC curve

If we want to use this model for prediction, we need to evaluate its performance. There are many metrics related to classification models, such as accuracy, precision, recall, sensitivity, specificity, …

\[ \begin{aligned} \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\ \text{Precision} &= \frac{TP}{TP + FP} \\ \text{Recall} &= \frac{TP}{TP + FN} \\ \text{Sensitivity} &= \frac{TP}{TP + FN} \\ \text{Specificity} &= \frac{TN}{TN + FP} \end{aligned} \]

pred_prob <- predict(logit_model, type = "response")

# Set the threshold 0.5
threshold <- 0.5

predicted_class <- ifelse(pred_prob > threshold, 1, 0) |> as.factor()
actual_class <- data_clean$HRFS12M1_binary

ctb_50 <- table(Predicted = predicted_class, Actual = actual_class)

# Set the threshold 0.1
threshold <- 0.1

predicted_class <- ifelse(pred_prob > threshold, 1, 0) |> as.factor()
actual_class <- data_clean$HRFS12M1_binary

ctb_10 <- table(Predicted = predicted_class, Actual = actual_class)

list(threshold50 = ctb_50, threshold10 = ctb_10)
$threshold50
         Actual
Predicted Food Security Food Insecure
        0         27132          2858
        1           100            72

$threshold10
         Actual
Predicted Food Security Food Insecure
        0         20314           845
        1          6918          2085

If we set different thresholds, the confusion matrix will change, thus the metrics will change.

# Calculate metrics under different thresholds
calc_metrics <- function(ct) {
  c(Accuracy = (ct[1, 1] + ct[2, 2]) / sum(ct),
    Sensitivity = ct[2, 2] / sum(ct[, 2]),
    Specificity = ct[1, 1] / sum(ct[, 1]),
    Precision = ct[2, 2] / sum(ct[2, ]))
}

metrics50 <- calc_metrics(ctb_50)
metrics10 <- calc_metrics(ctb_10)

data.frame(threshold50 = metrics50, threshold10 = metrics10) |>
  t() |>
  round(2) |>
  kable()
Accuracy Sensitivity Specificity Precision
threshold50 0.90 0.02 1.00 0.42
threshold10 0.74 0.71 0.75 0.23

If we set the threshold to 0.5, the accuracy is 0.90, noticeably higher than with a threshold of 0.1. However, the sensitivity is only 0.02, meaning the model captures just 2% of the households with low food security. In contrast, with a threshold of 0.1 the sensitivity is 0.71, so the model captures 71% of the households with low food security, but the specificity drops to 0.75. This trade-off is common in classification models.

Therefore, we need a metric that can evaluate the model’s performance under different thresholds. The ROC curve is a good choice. The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. The name “ROC” is historic, and comes from communications theory. It is an acronym for receiver operating characteristics.

The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve hugs the top left corner, so the larger the AUC, the better the classifier. A classifier that performs no better than chance has an AUC of 0.5.

There is a related plot called the precision-recall (PR) curve, which puts recall on the x-axis and precision on the y-axis. A classifier whose ROC curve dominates another's also dominates in PR space, and the PR curve is especially informative when the classes are imbalanced, as they are here (a minimal sketch follows).
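A minimal sketch of the PR curve, computed directly from the fitted probabilities (reusing pred_prob from above; the threshold grid and the divide-by-zero guard are illustrative choices):

# precision and recall over a grid of thresholds
pr_points <- map_dfr(seq(0.01, 0.99, by = 0.01), function(t) {
  pred_pos   <- pred_prob > t
  actual_pos <- data_clean$HRFS12M1_binary == "Food Insecure"
  tibble(
    threshold = t,
    recall    = sum(pred_pos & actual_pos) / sum(actual_pos),
    # max(., 1) guards against dividing by zero when nothing is predicted positive
    precision = sum(pred_pos & actual_pos) / max(sum(pred_pos), 1)
  )
})

ggplot(pr_points, aes(x = recall, y = precision)) +
  geom_path() +
  labs(title = "Precision-Recall Curve for Logistic Regression Model",
       x = "Recall", y = "Precision") +
  theme_minimal()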

data_clean <- data_clean |>
  mutate(prob = predict(logit_model, type = "response"))

roc_data <- roc(data_clean$HRFS12M1_binary, data_clean$prob)

ggroc(roc_data, legacy.axes = TRUE) +
  labs(title = "ROC Curve for Logistic Regression Model",
       x = "1 - Specificity",
       y = "Sensitivity") +
  theme_minimal() +
  annotate("text", x = 0.5, y = 0.5,
           label = paste("AUC =", round(auc(roc_data), 3)))

The logistic regression model has an AUC of 0.794, which indicates reasonably good discriminative ability. However, this AUC was computed on the same data used to fit the model; to honestly evaluate predictive performance, an in-sample AUC is not enough.

5 Assessing model accuracy

5.1 Measuring the quality of fit

In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2. \] The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.

The MSE is computed using the training data that was used to fit the model. But in general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. We would like to select the model for which the test MSE is as small as possible.

[Figure: training MSE and test MSE as a function of model flexibility]

As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.
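A small simulation sketch of this phenomenon (toy data generated here, not the food security data): we fit polynomials of increasing degree to a training half and track training versus test MSE.

# simulate noisy data from a smooth curve, then split into training and test halves
set.seed(2024)
n_sim <- 200
x_sim <- runif(n_sim, -2, 2)
y_sim <- sin(2 * x_sim) + rnorm(n_sim, sd = 0.5)
train_id <- sample(n_sim, n_sim / 2)

mse <- function(obs, pred) mean((obs - pred)^2)

# fit polynomials of degree 1 to 10 on the training half and evaluate both halves
map_dfr(1:10, function(d) {
  fit <- lm(y ~ poly(x, d),
            data = tibble(x = x_sim[train_id], y = y_sim[train_id]))
  tibble(
    degree    = d,
    train_mse = mse(y_sim[train_id], predict(fit)),
    test_mse  = mse(y_sim[-train_id],
                    predict(fit, newdata = tibble(x = x_sim[-train_id])))
  )
})

Typically the training MSE keeps falling as the degree grows, while the test MSE levels off and eventually worsens.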

Tip
  • Does this mean simpler models are always better?

No Free Lunch Theorem [David Wolpert, William Macready]: Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.

[Figure: training data (black points) with fitted models A and B]

The black points represent the training data. There are two models, A and B, where model B is more flexible than model A.

[Figure: test data (white points) with models A and B under two scenarios (left and right panels)]

The white points represent the test data. In the left panel, model A has a smaller test MSE than model B. In the right panel, model B has a smaller test MSE than model A. Therefore, we cannot say that simpler models are always better.

In practice, one can usually compute the training MSE with relative ease, but estimating the test MSE is considerably more difficult because usually no test data are available. The flexibility level corresponding to the minimal test MSE can vary considerably among data sets. One important approach is cross-validation, a resampling method that estimates the test MSE using only the training data.

5.2 Cross-validation

K-fold cross-validation randomly divides the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, \(\text{MSE}_1\), is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, \(\text{MSE}_1\), \(\text{MSE}_2\),…, \(\text{MSE}_k\). The k-fold CV estimate is computed by averaging these values,

\[ \text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i. \]

[Figure: schematic of k-fold cross-validation, with each fold serving once as the validation set]
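A minimal hand-rolled sketch of 5-fold CV for our classifier (using AUC rather than MSE as the fold-level metric, since the outcome is binary; the tidymodels workflow in Section 8 automates this):

set.seed(2024)
k <- 5
# randomly assign each household to one of k folds
fold_id <- sample(rep(1:k, length.out = nrow(data_clean)))

cv_auc <- sapply(1:k, function(j) {
  fold_train <- data_clean[fold_id != j, ]
  fold_valid <- data_clean[fold_id == j, ]
  # refit the logistic regression on the other k - 1 folds
  fit <- glm(HRFS12M1_binary ~ . - HRFS12M1 - HHSUPWGT - prob,
             data = fold_train, family = "binomial")
  # evaluate AUC on the held-out fold
  pred <- predict(fit, newdata = fold_valid, type = "response")
  as.numeric(auc(roc(fold_valid$HRFS12M1_binary, pred)))
})
mean(cv_auc)  # CV estimate of test AUC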

6 Tidymodels overview

  • tidymodels is an ecosystem for:

    1. Feature engineering: encoding qualitative predictors, transforming predictors (e.g., log), extracting key features from raw variables (e.g., the day of the week from a date), interaction terms, … (recipes package);
    2. Building and fitting a model (parsnip package);
    3. Evaluating models using resampling, such as cross-validation (tune and dials packages);
    4. Tuning model parameters.

Supported models in tidymodels ecosystem. link

scikit-learn in Python. link

Supported models in Julia MLJ ecosystem. link

7 Elastic-net (enet) regularization and shrinkage methods

Subset selection methods such as best subset selection, forward stepwise selection, and backward stepwise selection have limitations: they are computationally expensive and can lead to overfitting. Shrinkage methods are an alternative approach: we fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks them towards zero. It may not be immediately obvious why such a constraint should improve the fit, but shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

  • In logistic regression, for ridge regression (\(L_2\) penalty), we need to optimize the following objective function: \[ \ell^*(\boldsymbol\beta) = \ell(\boldsymbol\beta) - \lambda \sum_{j=1}^{p} \beta_j^2, \] where the penalty term is \(\lambda \sum_{j=1}^{p} \beta_j^2\).

  • For the lasso (\(L_1\) penalty), we need to optimize the following objective function: \[ \ell^*(\boldsymbol\beta) = \ell(\boldsymbol\beta) - \lambda \sum_{j=1}^{p} |\beta_j|, \] where the penalty term is \(\lambda \sum_{j=1}^{p} |\beta_j|\).

  • The elastic net combines the ridge and lasso penalties, and the penalty term is \(\lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right)\), where \(\alpha\) controls the relative weight of the two penalties.

Implementing ridge regression and the lasso requires a method for selecting a value of the tuning parameter \(\lambda\) (and \(\alpha\) if we use the elastic net). Cross-validation provides a simple way to tackle this problem: we choose a grid of tuning parameter values and compute the cross-validation error for each. We then select the tuning parameter values for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected tuning parameter values.
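Outside of tidymodels, the same idea can be sketched directly with glmnet's built-in cross-validation; a minimal illustration at a fixed \(\alpha = 0.5\) (the workflow below tunes \(\lambda\) and \(\alpha\) jointly):

# design matrix of predictors (drop the outcome, the original HRFS12M1,
# the survey weight, and the fitted probability column added earlier)
x_mat <- model.matrix(
  HRFS12M1_binary ~ . - HRFS12M1 - HHSUPWGT - prob,
  data = data_clean
  )[, -1]
y_vec <- data_clean$HRFS12M1_binary

set.seed(2024)
cv_enet <- cv.glmnet(x_mat, y_vec, family = "binomial", alpha = 0.5,
                     type.measure = "auc", nfolds = 5)
cv_enet$lambda.min   # lambda achieving the best cross-validated AUC
plot(cv_enet)        # cross-validated AUC over the lambda path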

8 Logistic regression (with enet regularization) workflow

[Diagram: all data split into training and testing sets; the training set is further resampled into analysis/assessment splits (Resample 1, …, Resample B)]

Machine learning workflow

8.1 Initial split into test and non-test sets

We randomly split the data into 25% test data and 75% non-test data, stratifying on food security status.

# For reproducibility
set.seed(2024)

data_split <- data_clean |>
  initial_split(
  # stratify by HRFS12M1_binary
  strata = "HRFS12M1_binary", 
  prop = 0.75
  )
data_split
<Training/Testing/Total>
<22621/7541/30162>
data_other <- training(data_split)
dim(data_other)
[1] 22621    14
data_test <- testing(data_split)
dim(data_test)
[1] 7541   14

8.2 Recipe

Recipe for preprocessing the data:

recipe <- recipe(
    HRFS12M1_binary ~ .,
    data = data_other
  ) |>
  # remove the weights and original HRFS12M1
  step_rm(HHSUPWGT, HRFS12M1) |>
  # create dummy variables for categorical predictors
  step_dummy(all_nominal_predictors()) |>
  # zero-variance filter
  step_zv(all_numeric_predictors()) |> 
  # center and scale numeric data (means and SDs are estimated from the
  # training set when the recipe is prepped)
  step_normalize(all_numeric_predictors()) |>
  print()

8.3 Model

logit_mod <- logistic_reg(
    penalty = tune(), # \lambda
    mixture = tune()  # \alpha
  ) |> 
  set_engine("glmnet", standardize = FALSE) |>
  print()
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = tune()

Engine-Specific Arguments:
  standardize = FALSE

Computational engine: glmnet 

8.4 Workflow

# Survey weights rounded to the nearest thousand (floored at 1), prepared for
# optional case weighting; add_case_weights() is left commented out below
train_weight <- round(data_other$HHSUPWGT / 1000, 0)
train_weight <- ifelse(train_weight == 0, 1, train_weight)

logit_wf <- workflow() |>
  #add_case_weights(train_weight) |>
  add_recipe(recipe) |>
  add_model(logit_mod) |>
  print()
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_rm()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = tune()

Engine-Specific Arguments:
  standardize = FALSE

Computational engine: glmnet 

8.5 Tuning grid

param_grid <- grid_regular(
  penalty(range = c(-3, 3)), # \lambda
  mixture(), # \alpha
  levels = c(1000, 5)
  ) |>
  print()
# A tibble: 5,000 × 2
   penalty mixture
     <dbl>   <dbl>
 1 0.001         0
 2 0.00101       0
 3 0.00103       0
 4 0.00104       0
 5 0.00106       0
 6 0.00107       0
 7 0.00109       0
 8 0.00110       0
 9 0.00112       0
10 0.00113       0
# ℹ 4,990 more rows

8.6 Cross-validation (CV)

Set cross-validation partitions.

set.seed(2024)

(folds <- vfold_cv(data_other, v = 5))
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits               id   
  <list>               <chr>
1 <split [18096/4525]> Fold1
2 <split [18097/4524]> Fold2
3 <split [18097/4524]> Fold3
4 <split [18097/4524]> Fold4
5 <split [18097/4524]> Fold5

Fit cross-validation.

(logit_fit <- logit_wf |>
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
    )) |>
  system.time()
   user  system elapsed 
 85.124   7.405  92.932 

Visualize CV results:

logit_fit |>
  # aggregate metrics from K folds
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(
    x = penalty, 
    y = mean, 
    color = factor(mixture)
    )) +
  geom_point() +
  labs(x = "Penalty", y = "CV AUC") +
  scale_x_log10()
# A tibble: 10,000 × 8
   penalty mixture .metric  .estimator  mean     n std_err
     <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
 1 0.001         0 accuracy binary     0.901     5 0.00225
 2 0.001         0 roc_auc  binary     0.792     5 0.00541
 3 0.00101       0 accuracy binary     0.901     5 0.00225
 4 0.00101       0 roc_auc  binary     0.792     5 0.00541
 5 0.00103       0 accuracy binary     0.901     5 0.00225
 6 0.00103       0 roc_auc  binary     0.792     5 0.00541
 7 0.00104       0 accuracy binary     0.901     5 0.00225
 8 0.00104       0 roc_auc  binary     0.792     5 0.00541
 9 0.00106       0 accuracy binary     0.901     5 0.00225
10 0.00106       0 roc_auc  binary     0.792     5 0.00541
   .config                
   <chr>                  
 1 Preprocessor1_Model0001
 2 Preprocessor1_Model0001
 3 Preprocessor1_Model0002
 4 Preprocessor1_Model0002
 5 Preprocessor1_Model0003
 6 Preprocessor1_Model0003
 7 Preprocessor1_Model0004
 8 Preprocessor1_Model0004
 9 Preprocessor1_Model0005
10 Preprocessor1_Model0005
# ℹ 9,990 more rows

Show the top 5 models.

logit_fit |>
  show_best(metric = "roc_auc")
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config                
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                  
1  0.0515    1    roc_auc binary     0.794     5 0.00542 Preprocessor1_Model4286
2  0.352     0.25 roc_auc binary     0.794     5 0.00542 Preprocessor1_Model1425
3  0.357     0.25 roc_auc binary     0.794     5 0.00542 Preprocessor1_Model1426
4  0.362     0.25 roc_auc binary     0.794     5 0.00542 Preprocessor1_Model1427
5  0.367     0.25 roc_auc binary     0.794     5 0.00542 Preprocessor1_Model1428

Let’s select the best model.

best_logit <- logit_fit |>
  select_best(metric = "roc_auc")
best_logit
# A tibble: 1 × 3
  penalty mixture .config                
    <dbl>   <dbl> <chr>                  
1  0.0515       1 Preprocessor1_Model4286

8.7 Final model

# Final workflow
final_wf <- logit_wf |>
  finalize_workflow(best_logit)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_rm()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = 0.0514886745013749
  mixture = 1

Engine-Specific Arguments:
  standardize = FALSE

Computational engine: glmnet 
# Fit the whole training set, then predict the test cases
final_fit <- final_wf |>
  last_fit(data_split)
final_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits               id              .metrics .notes   .predictions .workflow 
  <list>               <chr>           <list>   <list>   <list>       <list>    
1 <split [22621/7541]> train/test spl… <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit |> 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.902 Preprocessor1_Model1
2 roc_auc  binary         0.797 Preprocessor1_Model1
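To see which predictors the selected lasso model actually keeps, one can pull the underlying glmnet fit out of the final workflow (a minimal sketch using extract_workflow() and extract_fit_engine() from the tidymodels ecosystem):

# coefficients of the fitted glmnet model at the selected penalty;
# terms shrunk exactly to zero have been dropped by the lasso
final_fit |>
  extract_workflow() |>
  extract_fit_engine() |>
  coef(s = best_logit$penalty)

Note that these coefficients are on the standardized scale, because the recipe normalizes the predictors before the model is fit.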

9 Feedback

Slido