
[Stanford Causal Inference Course Collection] 2. Unconfoundedness and the Propensity Score (Part 1)

2024-07-12


Table of contents

Beyond a single randomized controlled trial

Aggregating difference-in-means estimators

Continuous X and the propensity score


One of the simplest extensions of randomized trials is the estimation of treatment effects under unconfoundedness. Qualitatively, unconfoundedness is relevant when we want to estimate the effect of a treatment whose assignment is not random, but is as good as random once we control for a set of covariates $X_i$.

The purpose of this lecture is to discuss the identification and estimation of average treatment effects under this unconfoundedness assumption. As before, we adopt a nonparametric approach: we will not assume that any parametric model is well specified, and identification of the average treatment effect will be driven entirely by the design (i.e., by conditional independence statements relating the potential outcomes and the treatment).

Beyond a single randomized controlled trial

We define the causal effect of a treatment in terms of potential outcomes. For a binary treatment $W_i \in \{0, 1\}$, we define potential outcomes $Y_i(1)$ and $Y_i(0)$ corresponding to the outcome the $i$-th subject would experience if they did or did not receive the treatment, respectively. We assume SUTVA, $Y_i = Y_i(W_i)$, and wish to estimate the average treatment effect

$$
\text{ATE} = \mathbb{E}\left[Y_i(1) - Y_i(0)\right].
$$

In the first lecture, we assumed random treatment assignment, $\{Y_i(0),\, Y_i(1)\} \perp W_i$, and studied several $\sqrt{n}$-consistent estimators of the ATE.

The simplest way to move beyond a single RCT is to consider two RCTs. As a concrete example, suppose we are interested in giving teenagers cash incentives to deter them from smoking, and that roughly 5% of teenagers in Palo Alto, California, and 20% of teenagers in Geneva, Switzerland, receive the incentive.

Within each city, we have a randomized controlled trial, and it is easy to see that the incentive helps. However, the pooled data are misleading: they make it look as if the incentive hurts. This is an example of what is sometimes called Simpson's paradox. Once we pool the data, we no longer have an RCT, since teenagers in Geneva are both more likely to receive the treatment and more likely to smoke whether or not they receive it. To get a consistent estimate of the ATE, we need to estimate the treatment effect in each city separately and then aggregate:

$$
\begin{aligned}
\hat{\tau}_{\mathrm{PA}} &= \frac{5}{152+5} - \frac{122}{2362+122} \approx -1.7\%, \\
\hat{\tau}_{\mathrm{GVA}} &= \frac{350}{350+581} - \frac{1979}{2278+1979} \approx -8.9\%, \\
\hat{\tau} &= \frac{2641}{2641+5188}\,\hat{\tau}_{\mathrm{PA}} + \frac{5188}{2641+5188}\,\hat{\tau}_{\mathrm{GVA}} \approx -6.5\%.
\end{aligned}
$$
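As a sanity check, the numbers above can be reproduced directly from the cell counts; here is a minimal Python sketch (the layout of the `data` dictionary and all variable names are our own):

```python
# Cell counts from the example above: (smokers, non-smokers) among
# treated (W=1) and control (W=0) teenagers in each city.
data = {
    "Palo Alto": {1: (5, 152),   0: (122, 2362)},
    "Geneva":    {1: (350, 581), 0: (1979, 2278)},
}

def rate(smokers, nonsmokers):
    return smokers / (smokers + nonsmokers)

# Naive pooled difference-in-means: ignoring the city makes the
# incentive look harmful (Simpson's paradox).
pooled = {w: [sum(c[w][j] for c in data.values()) for j in (0, 1)] for w in (0, 1)}
print(f"pooled: {rate(*pooled[1]) - rate(*pooled[0]):+.1%}")  # about +1.5%

# Stratified estimator: within-city effects, weighted by city size.
n = {city: sum(s + ns for s, ns in cells.values()) for city, cells in data.items()}
tau = {city: rate(*cells[1]) - rate(*cells[0]) for city, cells in data.items()}
tau_hat = sum(tau[c] * n[c] for c in data) / sum(n.values())
for c in data:
    print(f"{c}: {tau[c]:+.1%}")          # -1.7% and -8.9%
print(f"aggregated: {tau_hat:+.1%}")      # about -6.5%
```

The pooled difference comes out positive (more smoking among the treated), even though the treatment effect is negative within each city.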

What are the statistical properties of this estimator, and how does this idea generalize to continuous $X$?

Aggregating difference-in-means estimators

Assume that the covariate $X_i$ takes values in a discrete space $\mathcal{X}$ with $|\mathcal{X}| = p < \infty$. Assume again that treatment assignment is random conditionally on $X_i$ (i.e., each level of $x$ defines its own RCT):

$$
\{Y_i(0),\, Y_i(1)\} \perp W_i \mid X_i = x, \quad \text{for all } x \in \mathcal{X}. \quad (2.1)
$$

The average treatment effect within a group is then defined as

$$
\tau(x) = \mathbb{E}\left[Y_i(1) - Y_i(0) \mid X_i = x\right]. \quad (2.2)
$$

We can then estimate ATE τ by aggregating the group-level treatment effect estimates, as described above,

$$
\hat{\tau}_{\mathrm{AGG}} = \sum_{x \in \mathcal{X}} \frac{n_x}{n}\,\hat{\tau}(x), \qquad \hat{\tau}(x) = \frac{1}{n_{x1}} \sum_{\{i:\, X_i = x,\, W_i = 1\}} Y_i - \frac{1}{n_{x0}} \sum_{\{i:\, X_i = x,\, W_i = 0\}} Y_i, \quad (2.3)
$$

where $n_x = |\{i : X_i = x\}|$ and $n_{xw} = |\{i : X_i = x,\, W_i = w\}|$. How good is this estimator? Intuitively, we need to estimate $|\mathcal{X}| = p$ "parameters", so we might expect its variance to scale linearly in $p$.
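For concreteness, here is a minimal numpy sketch of the estimator in (2.3); the function name and interface are our own:

```python
import numpy as np

def tau_agg(X, W, Y):
    """Aggregated difference-in-means estimator for discrete X.

    X: (n,) array of discrete covariate levels
    W: (n,) array of binary treatment indicators
    Y: (n,) array of outcomes
    Assumes every group contains both treated and control units
    (n_x1 > 0 and n_x0 > 0), otherwise the group mean is undefined.
    """
    n = len(Y)
    estimate = 0.0
    for x in np.unique(X):
        group = X == x
        # tau_hat(x): within-group difference in means ...
        tau_x = Y[group & (W == 1)].mean() - Y[group & (W == 0)].mean()
        # ... weighted by the group's share of the sample, n_x / n.
        estimate += group.sum() / n * tau_x
    return estimate
```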

To analyze this estimator, we can proceed as follows. First, for any group with covariate level $x$, define $e(x)$ as the probability of being treated in that group, $e(x) = \mathbb{P}\left[W_i = 1 \mid X_i = x\right]$, and note that

$$
\sqrt{n_x}\left(\hat{\tau}(x) - \tau(x)\right) \Rightarrow \mathcal{N}\left(0,\; \frac{\operatorname{Var}\left[Y_i(0) \mid X_i = x\right]}{1 - e(x)} + \frac{\operatorname{Var}\left[Y_i(1) \mid X_i = x\right]}{e(x)}\right).
$$

Furthermore, under the simplifying assumption that $\operatorname{Var}\left[Y_i(w) \mid X_i = x\right] = \sigma^2(x)$ does not depend on $w$, we obtain

$$
\sqrt{n_x}\left(\hat{\tau}(x) - \tau(x)\right) \Rightarrow \mathcal{N}\left(0,\; \frac{\sigma^2(x)}{e(x)\left(1 - e(x)\right)}\right).
$$

Next, for the aggregated estimator, define $\hat{\pi}(x) = n_x / n$ as the proportion of observations with $X_i = x$, and $\pi(x) = \mathbb{P}\left[X_i = x\right]$ as its expectation. The error of the aggregated estimator then decomposes as

$$
\sqrt{n}\left(\hat{\tau}_{\mathrm{AGG}} - \tau\right) = \sum_{x \in \mathcal{X}} \hat{\pi}(x)\, \sqrt{n}\left(\hat{\tau}(x) - \tau(x)\right) + \sqrt{n} \sum_{x \in \mathcal{X}} \left(\hat{\pi}(x) - \pi(x)\right) \tau(x),
$$

where the first term collects the within-group sampling noise and the second, by the central limit theorem applied to $\tau(X_i)$, contributes a $\operatorname{Var}\left[\tau(X_i)\right]$ term. Putting these pieces together, we get

$$
\sqrt{n}\left(\hat{\tau}_{\mathrm{AGG}} - \tau\right) \Rightarrow \mathcal{N}\left(0,\, V_{\mathrm{AGG}}\right),
$$

$$
\begin{aligned}
V_{\mathrm{AGG}} &= \operatorname{Var}\left[\tau(X_i)\right] + \sum_{x \in \mathcal{X}} \pi^2(x)\, \frac{1}{\pi(x)}\, \frac{\sigma^2(x)}{e(x)\left(1 - e(x)\right)} \\
&= \operatorname{Var}\left[\tau(X_i)\right] + \mathbb{E}\left[\frac{\sigma^2(X_i)}{e(X_i)\left(1 - e(X_i)\right)}\right].
\end{aligned}
$$

It is worth noting that the asymptotic variance $V_{\mathrm{AGG}}$ does not depend on the number of groups $|\mathcal{X}| = p$. As we will see later, this fact plays a key role in enabling efficient semiparametric inference about average treatment effects in observational studies.
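To build intuition for the formula, one can check it by simulation. Below is a small sketch under an assumed data-generating process (two groups, constant within-group effects, Gaussian noise; all numbers are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 1_000

pi1 = 0.4                     # P[X_i = 1]
e = np.array([0.3, 0.6])      # propensities e(x)
tau = np.array([1.0, 3.0])    # group effects tau(x)
sigma = np.array([1.0, 2.0])  # noise scales sigma(x)

estimates = []
for _ in range(reps):
    X = (rng.random(n) < pi1).astype(int)
    W = (rng.random(n) < e[X]).astype(int)
    Y = W * tau[X] + sigma[X] * rng.standard_normal(n)
    # Aggregated difference-in-means estimator, as in (2.3).
    est = 0.0
    for x in (0, 1):
        g = X == x
        est += g.mean() * (Y[g & (W == 1)].mean() - Y[g & (W == 0)].mean())
    estimates.append(est)

# Theoretical V_AGG = Var[tau(X)] + E[sigma^2(X) / (e(X)(1 - e(X)))].
pi = np.array([1 - pi1, pi1])
v_agg = pi1 * (1 - pi1) * (tau[1] - tau[0]) ** 2 \
        + np.sum(pi * sigma**2 / (e * (1 - e)))
print(n * np.var(estimates), "vs", v_agg)  # the two should roughly agree
```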

Continuous X and the propensity score

Above, we considered the case where $X$ is discrete with a finite number of levels and the treatment $W_i$ is randomized conditionally on $X_i = x$ as in (2.1). In that setting, we found that we can still accurately estimate the ATE by aggregating within-group treatment effect estimates, and that the number of groups $|\mathcal{X}| = p$ does not affect the accuracy of the inference. However, if $X$ is continuous (or the cardinality of $\mathcal{X}$ is very large), this result no longer applies directly, since we cannot collect enough samples at every possible value $x \in \mathcal{X}$ to estimate $\hat{\tau}(x)$ as in (2.3).

To generalize our analysis beyond the discrete-$X$ case, we can no longer simply estimate $\tau(x)$ at each value of $x$ by simple averaging; instead, we must use a more indirect argument. To do so, we first generalize the assumption that there is an RCT within each group. Formally, we simply write the same condition without restricting $X_i$ to a discrete space:

$$
\{Y_i(0),\, Y_i(1)\} \perp W_i \mid X_i, \quad (2.6)
$$

but since $X_i$ may now be an arbitrary random variable, this statement needs to be interpreted more carefully. Qualitatively, one way to read (2.6) is that we have measured enough covariates to capture any dependence between $W_i$ and the potential outcomes, so that given $X_i$, $W_i$ cannot "peek" at $\{Y_i(0),\, Y_i(1)\}$. We call this assumption unconfoundedness.

Assumption (2.6) may seem difficult to work with in practice, since it involves conditioning on continuous random variables. However, as Rosenbaum and Rubin (1983) pointed out, it becomes manageable once we consider the propensity score

$$
e(x) = \mathbb{P}\left[W_i = 1 \mid X_i = x\right]. \quad (2.7)
$$

From a statistical point of view, the key property of the propensity score is that it is a balancing score: if (2.6) holds, then in fact

$$
\{Y_i(0),\, Y_i(1)\} \perp W_i \mid e(X_i), \quad (2.8)
$$

that is, we can eliminate the bias associated with non-random treatment assignment by controlling for $e(X)$ alone rather than for all of $X$. We can verify this claim as follows:

$$
\begin{aligned}
\mathbb{P}&\left[W_i = w \mid \{Y_i(0),\, Y_i(1)\},\, e(X_i)\right] \\
&= \int_{\mathcal{X}} \mathbb{P}\left[W_i = w \mid \{Y_i(w)\},\, X_i = x\right] \mathbb{P}\left[X_i = x \mid \{Y_i(w)\},\, e(X_i)\right] dx \\
&= \int_{\mathcal{X}} \mathbb{P}\left[W_i = w \mid X_i = x\right] \mathbb{P}\left[X_i = x \mid \{Y_i(w)\},\, e(X_i)\right] dx \quad \text{(unconfoundedness)} \\
&= e(X_i)\,\mathbf{1}_{\{w=1\}} + \left(1 - e(X_i)\right)\mathbf{1}_{\{w=0\}}.
\end{aligned}
$$

Equation (2.8) means that if we can partition the observations into groups within which the propensity score $e(x)$ is (almost) constant, then we can consistently estimate the average treatment effect using variants of $\hat{\tau}_{\mathrm{AGG}}$.
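As a preview of how this gets used, here is a hedged sketch of propensity stratification: fit a model for $e(x)$ (a logistic regression here, purely as an illustration), bin observations into quantile strata of the fitted score, and aggregate within-stratum difference-in-means estimates just as in $\hat{\tau}_{\mathrm{AGG}}$. The function name and all implementation choices are our own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tau_ps_strata(X, W, Y, n_bins=10):
    """Propensity-score stratification: a variant of the aggregated
    estimator where groups are quantile bins of the estimated e(x).

    X: (n, d) covariate matrix; W: (n,) binary treatments; Y: (n,) outcomes.
    """
    # Estimate the propensity score (model choice is illustrative only).
    e_hat = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]
    # Quantile bins give strata with (almost) constant propensity score.
    edges = np.quantile(e_hat, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, e_hat, side="right") - 1, 0, n_bins - 1)
    estimate = 0.0
    for b in range(n_bins):
        g = bins == b
        if (W[g] == 1).any() and (W[g] == 0).any():
            estimate += g.mean() * (Y[g & (W == 1)].mean() - Y[g & (W == 0)].mean())
    return estimate
```

Strata containing no treated or no control units are skipped above, so in degenerate samples the weights may not sum to one; a careful implementation would treat this overlap issue explicitly.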