You’ve done tremendous work carrying out an intervention program. You want to demonstrate to stakeholders (government, funding agencies, etc.) that your program “works.”
You compared the average Food Security Rasch Scale Score over the past 30 days (a higher score indicates lower food security) between program participants and non-participants, and found a significant difference in the averages 😄 But people criticize your analysis as biased 🙁 What could go wrong?
In another try, you ran a multiple linear regression that included program participation status and many socioeconomic variables (race, age, income, region, …), hoping to adjust for confounding. This time you found that program participation does not significantly improve the food security Rasch score 🙁
Causal inference methods can help more reliably estimate the “treatment effect.”
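One way a comparison of raw averages goes wrong can be seen in a small simulation. The sketch below is purely illustrative: the data are synthetic, and the confounder (`low_income`), the enrollment probabilities, and the effect size of -50 points are all assumptions made up for the demonstration. Households that are worse off are more likely to enroll, so the naive difference in means even gets the sign wrong, while comparing within strata of the confounder recovers the assumed effect.

```python
import random
from statistics import fmean

random.seed(0)

TRUE_EFFECT = -50.0  # assumed for illustration: the program lowers the score by 50

rows = []
for _ in range(20000):
    low_income = random.random() < 0.5                 # confounder
    # Low-income households are more likely to enroll ...
    enrolled = random.random() < (0.7 if low_income else 0.2)
    # ... and also start from a higher (worse) score.
    baseline = 600.0 if low_income else 300.0
    score = baseline + TRUE_EFFECT * enrolled + random.gauss(0.0, 100.0)
    rows.append((low_income, enrolled, score))

def mean_diff(data):
    """Average score of participants minus average score of non-participants."""
    treated = [y for _, d, y in data if d]
    control = [y for _, d, y in data if not d]
    return fmean(treated) - fmean(control)

# Naive comparison: ignores who selects into the program.
naive = mean_diff(rows)

# Adjusted comparison: difference within each income stratum,
# averaged with weights equal to the stratum shares.
strata = [[r for r in rows if r[0] == level] for level in (False, True)]
adjusted = sum(mean_diff(s) * len(s) / len(rows) for s in strata)

print(f"naive difference:    {naive:8.1f}")
print(f"adjusted difference: {adjusted:8.1f}")
print(f"assumed true effect: {TRUE_EFFECT:8.1f}")
```

Stratification works here only because the single confounder is observed and discrete; with many, possibly continuous, confounders a more flexible adjustment is needed.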
Main approaches for causal inference:
RCT (randomized controlled trials): gold standard, but impractical in most Community-Based Participatory Research (CBPR) with multi-level interventions.
Observational studies under unconfoundedness: matching, regression estimators, inverse-propensity-score weighting.
3 Partially linear regression and interactive regression model
3.1 Partially linear regression (PLR)
The partially linear regression (PLR) model was introduced by Robinson (1988):
\[
\begin{aligned}
Y &= D \theta_0 + g_0(X) + U, &E[U | X, D] = 0, \\
D &= m_0(X) + V, &E[V | X] = 0.
\end{aligned}
\]
Here, \(Y\) is the outcome variable, \(D\) is the policy/treatment variable of interest, vector
\[
X = (X_1, \ldots, X_p)
\]
consists of other controls, and \(U\) and \(V\) are disturbance terms. The first equation is the main equation, and \(\theta_0\) is the main regression coefficient that we would like to infer. If \(D\) is exogenous conditional on the controls \(X\), \(\theta_0\) has the interpretation of the treatment effect parameter, or the ‘lift’ parameter in business applications. The second equation keeps track of confounding, namely the dependence of the treatment variable on the controls.
The confounding factors \(X\) affect the policy variable \(D\) via the function \(m_0(X)\) and the outcome variable via the function \(g_0(X)\). These functions do not need to be linear! We can take advantage of the strong predictive performance of machine learning techniques to estimate them.
Naively applying machine learning methods directly to the two equations can lead to a heavily biased estimate of \(\theta_0\), because regularization and overfitting errors in the estimated nuisance functions carry over to the estimate.
Remarks:
The policy variable \(D\) can be either binary or continuous (dosage).
The policy variable \(D\) can be multi-dimensional (multi-level intervention: food pantries, mobile clinics, referral systems, education classes, ads).
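To see why the functional form of \(g_0\) and \(m_0\) matters, the following sketch simulates a PLR model and compares two estimates of \(\theta_0\). Everything in it is an assumption chosen for illustration: the value \(\theta_0 = 0.5\), the quadratic nuisance functions, and the noise levels. An ordinary linear regression that controls for \(X\) linearly misses the curvature and is biased, whereas partialling out the true (in practice unknown) functions recovers \(\theta_0\).

```python
import random
from statistics import fmean

random.seed(1)
THETA0 = 0.5  # true effect, chosen for the simulation

def g0(x):
    return x * x          # nonlinear effect of X on the outcome

def m0(x):
    return x * x          # nonlinear effect of X on the treatment

def ell0(x):
    return THETA0 * m0(x) + g0(x)   # E[Y | X] implied by the PLR model

n = 20000
X = [random.uniform(-1.0, 1.0) for _ in range(n)]
D = [m0(x) + random.gauss(0.0, 0.5) for x in X]
Y = [THETA0 * d + g0(x) + random.gauss(0.0, 0.5) for x, d in zip(X, D)]

def slope(u, v):
    """OLS slope from regressing v on u (with an intercept)."""
    ubar, vbar = fmean(u), fmean(v)
    num = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    den = sum((a - ubar) ** 2 for a in u)
    return num / den

def residualize_linear(v, x):
    """Residuals of v after a linear regression on x."""
    b = slope(x, v)
    a = fmean(v) - b * fmean(x)
    return [vi - a - b * xi for vi, xi in zip(v, x)]

# Coefficient on D in the linear regression of Y on (1, D, X),
# computed via the Frisch-Waugh-Lovell theorem.
theta_linear = slope(residualize_linear(D, X), residualize_linear(Y, X))

# Oracle benchmark: partial out the true nuisance functions instead.
theta_oracle = slope([d - m0(x) for d, x in zip(D, X)],
                     [y - ell0(x) for y, x in zip(Y, X)])

print(f"linear adjustment: {theta_linear:.3f}")
print(f"oracle partialling out: {theta_oracle:.3f}")
print(f"true theta_0: {THETA0}")
```

The oracle is not available in practice; it only marks the target that a good estimate of the nuisance functions should approach.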
3.2 Interactive regression model (IRM)
We consider estimation of average treatment effects when treatment effects are fully heterogeneous, i.e., the response curves under control and treatment can be different nonparametric functions, and the treatment variable is binary, \(D \in \{0,1\}\). We consider vectors \((Y,D,X)\) such that
\[
\begin{aligned}
Y &= g_0(D,X) + U, &E[U | X, D] = 0, \\
D &= m_0(X) + V, &E[V | X] = 0.
\end{aligned}
\]
Since \(D\) is not additively separable, this model is more general than the partially linear model for the case of binary \(D\). A common target parameter in this model is the average treatment effect (ATE):
\[
\theta_0 = E[g_0(1,X) - g_0(0,X)].
\]
The confounding factors \(X\) affect the policy variable via the propensity score \(m_0(X)\) and the outcome variable via the function \(g_0(D, X)\). Both functions are unknown and potentially complex, and we can employ ML methods to learn them.
The general idea for identifying \(\theta_0\) in the IRM is similar to that in the PLR: once we account for all confounding variables \(X\), we can consistently estimate the causal parameter \(\theta_0\). The difference lies in the functional form assumed for the main regression equation. Whereas the PLR assumes that the effect of \(D\) on \(Y\) is additively separable, the IRM imposes no such restriction.
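A minimal sketch of ATE estimation in the IRM, on synthetic data, may help fix ideas. The setup is assumed for illustration only: a single discrete confounder with three levels, made-up propensities and response surfaces, and cell averages standing in for the machine learning estimates of \(g_0\) and \(m_0\). The estimate averages the doubly robust (augmented inverse probability weighting) score, which is the score DoubleML uses for the ATE in the IRM; sample splitting is left out here for brevity.

```python
import random
from statistics import fmean, pstdev

random.seed(2)

# Assumed data-generating process (illustration only)
PROPENSITY = {0: 0.2, 1: 0.5, 2: 0.8}   # m_0(x) = P(D = 1 | X = x)
BASELINE = {0: 1.0, 1: 2.0, 2: 3.0}     # g_0(0, x)
EFFECT = {0: 0.0, 1: 1.0, 2: 2.0}       # g_0(1, x) - g_0(0, x), heterogeneous
TRUE_ATE = fmean(EFFECT.values())       # X is uniform on {0, 1, 2}

data = []
for _ in range(30000):
    x = random.choice((0, 1, 2))
    d = 1 if random.random() < PROPENSITY[x] else 0
    y = BASELINE[x] + EFFECT[x] * d + random.gauss(0.0, 1.0)
    data.append((y, d, x))

# Nuisance estimates: cell means for g_0(d, x), treated shares for m_0(x)
g_hat = {(d, x): fmean(y for y, dd, xx in data if dd == d and xx == x)
         for d in (0, 1) for x in (0, 1, 2)}
m_hat = {x: fmean(d for _, d, xx in data if xx == x) for x in (0, 1, 2)}

def aipw_score(y, d, x):
    """Doubly robust score: outcome-model contrast plus weighted residuals."""
    g1, g0, m = g_hat[(1, x)], g_hat[(0, x)], m_hat[x]
    return g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)

scores = [aipw_score(*row) for row in data]
ate_hat = fmean(scores)
se_hat = pstdev(scores) / len(scores) ** 0.5

# Naive contrast of raw group means, for comparison
naive = (fmean(y for y, d, _ in data if d)
         - fmean(y for y, d, _ in data if not d))

print(f"true ATE: {TRUE_ATE:.2f}")
print(f"AIPW estimate: {ate_hat:.3f} (s.e. {se_hat:.3f})")
print(f"naive difference in means: {naive:.3f}")
```

With richer covariates the cell averages would be replaced by flexible learners, and the nuisance functions would be fitted on held-out folds.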
3.3 Basic Idea behind Double Machine Learning for the PLR Model
The basic idea behind the double machine learning method is to use machine learning to estimate the nuisance functions \(m_0(X)\) and \(g_0(X)\), and then plug these estimates into a doubly robust estimator of the target parameter \(\theta_0\). The doubly robust estimator is consistent if either the estimator of \(m_0(X)\) or the estimator of \(g_0(X)\) is consistent.
A Neyman orthogonal score is a moment condition for \(\theta_0\) whose expectation is zero at the true parameter values and which is, to first order, insensitive to small errors in the nuisance functions: its derivative with respect to the nuisance parameters vanishes at their true values. This is what keeps errors in the machine learning estimates from passing through to the estimate of \(\theta_0\), and it is a key concept in the theory of efficient estimation in semiparametric models.
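For concreteness, the orthogonality condition can be written out for the PLR model. The notation here is introduced for illustration and is not taken from the text above: \(Z = (Y, D, X)\) denotes the data, \(\eta = (\ell, m)\) collects candidate nuisance functions for \(E[Y | X]\) and \(E[D | X]\), and \(\psi\) is the partialling-out score:

\[
\psi(Z; \theta, \eta) = \bigl(Y - \ell(X) - \theta \, (D - m(X))\bigr) \, (D - m(X)).
\]

At the true values \(\eta_0 = (\ell_0, m_0)\), with \(\ell_0(X) = E[Y | X]\), the score satisfies

\[
E[\psi(Z; \theta_0, \eta_0)] = 0, \qquad \partial_\eta E[\psi(Z; \theta_0, \eta)] \big|_{\eta = \eta_0} = 0.
\]

The first condition identifies \(\theta_0\); the second is Neyman orthogonality. It holds because perturbing \(\ell\) or \(m\) in a direction \(h(X)\) changes the expected score by terms of the form \(E[h(X) V]\) and \(E[h(X) U]\), which are zero since \(E[V | X] = 0\) and \(E[U | X, D] = 0\).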
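The PLR model can be rewritten in the following residualized form:
\[
\begin{aligned}
W &= V \theta_0 + \zeta, &E[\zeta | D, X] = 0, \\
W &= Y - \ell_0(X), &\ell_0(X) = E[Y | X] = m_0(X) \theta_0 + g_0(X), \\
V &= D - m_0(X), &m_0(X) = E[D | X].
\end{aligned}
\]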
The variables \(W\) and \(V\) are the original variables \(Y\) and \(D\) after taking out, or partialling out, the effect of \(X\). Given identification, double machine learning for a PLR proceeds as follows:
Estimate \(\ell_0\) and \(m_0\) by solving the two problems of predicting \(Y\) and \(D\) from \(X\) with any generic machine learning method, giving us the estimated residuals \[
\begin{aligned}
\hat{W} &= Y - \hat{\ell}_0(X), \\
\hat{V} &= D - \hat{m}_0(X).
\end{aligned}
\]
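The residuals should be computed out of sample (cross-fitted), so that each observation's prediction comes from a model that was not trained on that observation. This guards against overfitting.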
Estimate \(\theta_0\) by regressing the residuals \(\hat{W}\) on \(\hat{V}\). Use the conventional inference for this regression estimator, ignoring the estimation error in \(\hat{W}\) and \(\hat{V}\).
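The two steps above can be sketched end to end on simulated data. This is a toy version under stated assumptions, not the DoubleML implementation: the data-generating process and \(\theta_0 = 0.5\) are made up, the "machine learning method" is a simple binned-mean regression on a scalar \(X\) (the helper `fit_binned_mean` is hypothetical), and residuals are cross-fitted over five folds.

```python
import math
import random
from statistics import fmean

random.seed(3)
THETA0 = 0.5               # true effect, chosen for the simulation
N, K, BINS = 6000, 5, 20   # sample size, number of folds, bins of the toy learner

def m0(x):
    return math.sin(2 * math.pi * x)

def g0(x):
    return 2 * math.sin(2 * math.pi * x) + x

X = [random.random() for _ in range(N)]
D = [m0(x) + random.gauss(0.0, 1.0) for x in X]
Y = [THETA0 * d + g0(x) + random.gauss(0.0, 1.0) for x, d in zip(X, D)]

def fit_binned_mean(x_train, t_train):
    """Toy learner: predict the training mean of the bin that x falls into."""
    sums, counts = [0.0] * BINS, [0] * BINS
    for x, t in zip(x_train, t_train):
        b = min(int(x * BINS), BINS - 1)
        sums[b] += t
        counts[b] += 1
    overall = fmean(t_train)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * BINS), BINS - 1)]

def cross_fitted_residuals(target):
    """target minus a prediction from a model that never saw that observation."""
    resid = [0.0] * N
    for k in range(K):
        train = [i for i in range(N) if i % K != k]
        predict = fit_binned_mean([X[i] for i in train],
                                  [target[i] for i in train])
        for i in range(k, N, K):          # fold k: every K-th observation
            resid[i] = target[i] - predict(X[i])
    return resid

# Step 1: residualize Y and D on X
W_hat = cross_fitted_residuals(Y)   # Y - l_hat(X)
V_hat = cross_fitted_residuals(D)   # D - m_hat(X)

# Step 2: regress W_hat on V_hat (no intercept) and use conventional inference
svv = sum(v * v for v in V_hat)
theta_hat = sum(v * w for v, w in zip(V_hat, W_hat)) / svv
errors = [w - theta_hat * v for v, w in zip(V_hat, W_hat)]
se_hat = math.sqrt(sum((v * e) ** 2 for v, e in zip(V_hat, errors))) / svv

print(f"theta_hat = {theta_hat:.3f} (s.e. {se_hat:.3f}), true value {THETA0}")
```

The standard error is the usual heteroskedasticity-robust formula for a regression without intercept, computed as if the residuals were observed rather than estimated.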
3.3.2 Idea 2: Sample splitting
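The key idea behind sample splitting is to split the sample into two parts, one for estimating the nuisance functions and the other for estimating the causal parameter, so that overfitting in the first step does not contaminate the second. Swapping the roles of the two parts and combining the resulting estimates, so that no observations are wasted, is known as cross-fitting.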
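The following sketch illustrates what goes wrong without sample splitting. It is an intentionally extreme, assumed example: the learner is a 1-nearest-neighbour regressor that memorizes its training data, and the data are synthetic. Fitted and evaluated on the same sample, it leaves residuals that are exactly zero, so there is nothing left for the final regression to use. With two-fold cross-fitting, each residual comes from a model trained on the other half and retains the variation in \(D\) that \(X\) does not explain.

```python
import random
from statistics import fmean

random.seed(4)
N = 400
X = [random.random() for _ in range(N)]
D = [x + random.gauss(0.0, 1.0) for x in X]   # assumed m_0(x) = x, unit noise

def one_nn(x_train, t_train):
    """1-nearest-neighbour regressor: returns the target of the closest x."""
    pairs = list(zip(x_train, t_train))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# (a) No sample splitting: fit and predict on the same observations.
predict = one_nn(X, D)
in_sample = [d - predict(x) for x, d in zip(X, D)]

# (b) Two-fold cross-fitting: predict each half with a model fit on the other.
half = N // 2
folds = [range(0, half), range(half, N)]
cross_fitted = [0.0] * N
for test, train in (folds, folds[::-1]):
    predict = one_nn([X[i] for i in train], [D[i] for i in train])
    for i in test:
        cross_fitted[i] = D[i] - predict(X[i])

print("mean squared residual, in-sample:   ",
      fmean(r * r for r in in_sample))
print("mean squared residual, cross-fitted:",
      fmean(r * r for r in cross_fitted))
```

Real learners overfit less drastically than this, but the same mechanism biases the residuals whenever the nuisance model has seen the observation it is predicting.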
4 Example: causal effect of WIC on food security
4.1 Background
The Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) is a federally funded nutrition program that provides grants to States to support the distribution of supplemental foods, health care referrals, and nutrition education. It aims to safeguard the health of low-income pregnant, breastfeeding, and non-breastfeeding postpartum women, of infants in low-income families, and of children younger than age 5 in low-income families who are found to be at nutritional risk.
In 2020, WIC served over 6.2 million participants per month at an average monthly food cost (after rebates to WIC from manufacturers) of about $38 per person. Many households with low food security benefit from WIC. We want to estimate the causal effect of WIC on food security status.
4.2 Data preparation
We use the 2020 Current Population Survey (CPS) Food Security Supplement (FSS) data to estimate the causal effect of WIC on food security status. We restrict the sample to eligible households (below 185 percent of the poverty threshold, with children under age 5 or women aged 15-45). In addition, we require that the household had at least one food-insecure event in the past 30 days, because we are interested in the causal effect of WIC among households that experienced food insecurity.
We include the following variables in our analysis:
HRFS30D4: Food Security Rasch Scale Score in the past 30 days (100 - 1400). Higher score indicates lower food security.
HESP8: WIC participation status (1: Yes, 2: No).
HRNUMHOU: Number of people in the household.
HRHTYPE: Household type.
GEREG: Geographic region.
PRCHLD: Presence of children in the household.
PRTAGE: Age of the reference person.
PEEDUCA: Education level.
PEMLR: Employment status.
RACE: recoded from PTDTRACE and PEHSPNON.
HEFAMINC: Family income (take median of each class).
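As a small illustration of the recoding involved, the sketch below turns the survey coding of HESP8 (1: Yes, 2: No) into a 0/1 treatment indicator. The records are invented toy values, the helper `recode_wic` and the column name `D` are hypothetical, and treating any other HESP8 code as "no valid answer" is an assumption that should be checked against the CPS documentation.

```python
# Toy records with invented values; only the column names and the
# HESP8 coding (1 = Yes, 2 = No) come from the variable list above.
records = [
    {"HRFS30D4": 420, "HESP8": 1, "HRNUMHOU": 4},
    {"HRFS30D4": 610, "HESP8": 2, "HRNUMHOU": 3},
    {"HRFS30D4": 350, "HESP8": -1, "HRNUMHOU": 2},  # assumed: not a valid answer
]

def recode_wic(row):
    """Return a copy with a 0/1 treatment column D, or None if HESP8 is not 1 or 2."""
    if row["HESP8"] not in (1, 2):
        return None
    return {**row, "D": 1 if row["HESP8"] == 1 else 0}

# Keep only households with a valid participation answer
analysis = [r for r in map(recode_wic, records) if r is not None]

for row in analysis:
    print(row)
```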
library(gtsummary)  # provides tbl_regression(), bold_labels(), bold_p()
fit |> tbl_regression() |> bold_labels() |> bold_p(t = 0.05)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| factor(HESP8) | | | |
| NoWIC | — | — | |
| WIC | -32 | -75, 11 | 0.14 |
| HRNUMHOU | -7.6 | -18, 2.2 | 0.13 |
| HRHTYPE | | | |
| MarriedFamily | — | — | |
| UnmarriedFamily | 31 | 0.59, 61 | 0.046 |
| Individual | 40 | -13, 92 | 0.14 |
| GEREG | | | |
| Northeast | — | — | |
| Midwest | 29 | -19, 77 | 0.2 |
| South | 13 | -29, 54 | 0.6 |
| West | -4.5 | -49, 40 | 0.8 |
| PRTAGE | 1.2 | -0.13, 2.5 | 0.077 |
| PEEDUCA | | | |
| LessThanHighSchool | — | — | |
| HighSchoolOrAssociateDegree | -34 | -72, 4.4 | 0.083 |
| CollegeOrHigher | -34 | -84, 17 | 0.2 |
| PRCHLD | | | |
| NoChildren | — | — | |
| Children | -60 | -96, -24 | 0.001 |
| PEMLR | | | |
| Employed | — | — | |
| NotEmployed | 87 | 45, 128 | <0.001 |
| NotInLaborForce | 44 | 13, 75 | 0.005 |
| HEFAMINC | 0.00 | 0.00, 0.00 | 0.2 |
| PEHSPNON | | | |
| Non-Hispanic | — | — | |
| Hispanic | -17 | -49, 15 | 0.3 |

¹ CI = Confidence Interval
4.5 DoubleML workflow
The Python and R packages DoubleML provide an implementation of the double/debiased machine learning framework of Chernozhukov et al. (2018). The R package is built on top of mlr3 and the mlr3 ecosystem (Lang et al., 2019).
Data-backend
We initialize the data-backend and thereby declare the roles of the outcome, the treatment, and the confounding variables.
There are several models currently implemented in DoubleML which differ in terms of the underlying causal structure. In this example, we use the interactive regression model (IRM).
ML Methods
We can specify the machine learning tools used to estimate the nuisance parts. We can generally choose any learner from the mlr3 ecosystem in R. In this example, we use random forests for both the main equation and the confounding equation.
We initialize and parametrize the model object, which will later be used to perform the estimation. We specify the resampling, the DML algorithm, and the score function.