All about influence functions

Kyle Butts

Note that this material is mainly a summary of the math in Ben Jann's notes, compiled together with additional writing of my own.

Setup

Let ${\bf z}_n = \{z_i\}_{i=1}^n$ be a random sample of data that comes from some true underlying distribution $F$. We take this data and compute some estimator with it: $\hat{\theta}({\bf z}_n)$ (scalar or vector). Note that this is a function of the sample we observe, ${\bf z}_n$. Some examples:

  1. Sample Mean: $z_i$ is a scalar and $\hat{\theta}({\bf z}_n)$ is the sample mean of the data, the sample analogue of
$$\mu = \mathbb{E}_F\left[ z_i \right]$$
  2. Regression: $\{z_i \equiv (x_i, y_i)\}_{i=1}^n$ is a random sample of data, where $x_i$ is a $1\times k$ vector of covariates and $y_i$ is a scalar outcome. $\hat{\theta}({\bf z}_n)$ would be the OLS coefficients $\beta$, the sample analogue of
$$\beta = \arg\min_{\beta}\ \mathbb{E}_F\left[ (y-X\beta)'(y-X\beta) \right]$$
  3. Treatment Effect: $\{z_i \equiv (y_i, D_i)\}_{i=1}^n$ where $y_i$ is the outcome and $D_i$ is the treatment indicator. $\hat{\theta}({\bf z}_n)$ would be the treatment effect on the treated, the sample analogue of
$$\tau = \mathbb{E}_F\left[ y_i \mid D_i = 1\right] - \mathbb{E}_F\left[ y_i \mid D_i = 0\right]$$

The influence function

First, we define a contaminated distribution function, $F_\epsilon(z_i)$, as:

$$F_\epsilon(z_i) = (1-\epsilon)F + \epsilon\, \delta_{z_i}$$

where $\delta_{z_i}$ is the probability measure that assigns probability 1 to the point $z_i$ and 0 to everything else. In effect, $F_\epsilon(z_i)$ makes the data point $z_i$ slightly more likely in the population. To be concrete, if $\epsilon = 0.5$, then a random draw from $F_\epsilon(z_i)$ equals $z_i$ with probability (at least) $1/2$.

Our goal is to see what happens to our estimator when we increase the probability of seeing $z_i$ in the population. This gives us a sense of how $z_i$ influences the sampling distribution of the estimator $\hat{\theta}({\bf z}_n)$.

To build intuition, let's think about outliers in regression. If one observation, $z_i$, is a high-leverage outlier, then intuitively its presence has a lot of influence on the regression coefficients. Formally, the influence function asks: if you make this $z_i$ slightly more likely, how much does it move the estimated coefficients $\hat{\beta}$?

To formalize this, we will use what's called a "Gateaux derivative", which is just the fancy version of a derivative. The influence function of $\hat{\theta}$ at $F$, $IF_{\hat{\theta}, F}(z_i)$, is defined as:

$$IF_{\hat{\theta}, F}(z_i) = \lim_{\epsilon \to 0} \frac{\hat{\theta}(F_\epsilon(z_i)) - \hat{\theta}(F)}{\epsilon}$$

This is a slight change of notation: we are no longer specifying a particular sample ${\bf z}_n$, but instead the distribution it is drawn from. The influence function is worked out from the population moments that give rise to the sample estimates. This will hopefully make more sense when we work out some examples below.
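To build intuition numerically, the Gateaux derivative can be approximated by a finite difference: replace $F$ with the empirical distribution of a sample, tilt a small amount of probability $\epsilon$ toward one observation, and recompute any estimator that accepts observation weights. A minimal sketch (plain numpy; the helper name `empirical_influence` and the toy data are my own, not from the notes):

```python
import numpy as np

def empirical_influence(estimator, data, i, eps=1e-4):
    """Finite-difference approximation of the influence function at observation i.

    `estimator(data, weights)` must return the estimate computed under the given
    observation weights (which sum to one).
    """
    n = len(data)
    base_w = np.full(n, 1.0 / n)           # the empirical distribution F_n
    theta_base = estimator(data, base_w)

    contaminated_w = (1 - eps) * base_w    # (1 - eps) * F_n ...
    contaminated_w[i] += eps               # ... + eps * point mass at z_i
    theta_eps = estimator(data, contaminated_w)

    return (theta_eps - theta_base) / eps  # approximate Gateaux derivative

# Example: for the weighted mean this recovers IF(z_i) = z_i - mean(z)
z = np.array([1.0, 2.0, 3.0, 10.0])
weighted_mean = lambda data, w: np.sum(w * data)
print(empirical_influence(weighted_mean, z, i=3))  # ~ 10 - 4 = 6
print(z[3] - z.mean())                             # 6.0
```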

Influence function and Variance of Estimator

What's helpful about knowing the influence function is that we can think of our sample estimator as being equal to the true value (so long as we have unbiasedness) plus $n$ disturbances of the distribution, each with weight $\epsilon = \frac{1}{n}$. Each disturbance moves the estimate by approximately $\frac{1}{n} \cdot IF_{\hat{\theta}, F}(z_i)$ (the derivative times the step size). Since we are extrapolating the derivative over a distance of $\frac{1}{n}$, this gives rise to higher order terms from the Taylor expansion:

$$\hat{\theta}({\bf z}_n) = \underbrace{\theta_0}_{\text{unbiasedness}} + \sum_{i=1}^n \underbrace{\frac{1}{n} \cdot IF_{\hat{\theta}, F}(z_i)}_{\text{approx. influence of } z_i} + \text{higher order terms}$$

An important thing to know is that $\mathbb{E}_F \left[ IF_{\hat{\theta}, F}(z_i) \right] = 0$. Therefore, by a law of large numbers the second term above is approximately zero in large samples, so the mean of $\hat{\theta}({\bf z}_n) - \theta_0$ is approximately zero. (Note this doesn't prove unbiasedness because we are assuming that already.)

This is similar to OLS asymptotics: $\sqrt{n}\left(\hat\beta - \beta_0\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \mathbb{E}_F\left[X'X\right]^{-1} x_i'\varepsilon_i + \sqrt{n} \cdot \text{higher order terms}$. Just like in the OLS case, having our Taylor expansion is helpful because then the asymptotics come easy:

$$\implies \sqrt{n}\left( \hat{\theta}({\bf z}_n) - \theta_0 \right) = \frac{1}{\sqrt{n}} \sum_{i=1}^n IF_{\hat{\theta}, F}(z_i) + \sqrt{n} \cdot \text{higher order terms}$$

Under some assumptions, the higher order terms go to zero faster than $\sqrt{n}$ goes to infinity, so the product is approximately zero in large samples.

Therefore, from some central limit theorem:

$$\sqrt{n}\left( \hat{\theta}({\bf z}_n) - \theta_0 \right) \to^d N\left(0,\ \mathbb{E}_F \left[ IF_{\hat{\theta}, F}(z_i)\, IF_{\hat{\theta}, F}(z_i)' \right]\right)$$

So, if you know the influence function, then you get large-sample asymptotics for free! You can just plug in the sample estimate of $IF_{\hat{\theta}, F}(z_i)$ for each $z_i$, average the outer products, and you have the variance-covariance matrix for your estimator.
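As a concrete recipe, once you have an estimated influence value $\widehat{IF}_{\hat{\theta}, F}(z_i)$ for every observation, the plug-in variance of $\hat{\theta}$ is the average outer product divided by $n$. A minimal sketch (plain numpy; the helper name is my own):

```python
import numpy as np

def vcov_from_influence(IF_hat):
    """Plug-in variance-covariance of theta-hat from estimated influence values.

    IF_hat is an (n,) or (n, k) array with one row per observation. Since
    sqrt(n) * (theta-hat - theta_0) ~ N(0, E[IF IF']), the variance of theta-hat
    itself is estimated by (1/n^2) * sum_i IF_i IF_i'.
    """
    IF_hat = np.asarray(IF_hat, dtype=float)
    if IF_hat.ndim == 1:
        IF_hat = IF_hat[:, None]
    n = IF_hat.shape[0]
    return IF_hat.T @ IF_hat / n**2
```

Standard errors are then `np.sqrt(np.diag(vcov_from_influence(IF_hat)))`; the examples below apply this same recipe inline.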

Example 1: Mean

$z_i$ is a scalar and our estimator is the sample mean of the data, i.e. the sample analogue of $\hat\theta(F) = \mathbb{E}_F\left[ z_i \right]$.

Let's think about what happens when we use the contaminated distribution function:

$$\begin{align*} \hat{\theta}(F_{\epsilon}(z_i)) &= \mathbb{E}_{F_{\epsilon}(z_i)}\left[ z_i \right] \\ &= (1 - \epsilon)\, \mathbb{E}_F\left[ z_i \right] + \epsilon\, \mathbb{E}_{\delta_{z_i}} \left[ z_i \right] \\ &= (1 - \epsilon)\, \hat{\theta}(F) + \epsilon z_i \end{align*}$$

Note that the last equality comes from the fact that we are taking the expectation over the degenerate distribution $\delta_{z_i}$, which has mean $z_i$.

Now let's plug that back into the influence function equation:

$$\begin{align*} IF_{\hat{\theta}, F}(z_i) &= \lim_{\epsilon \to 0} \frac{\hat{\theta}(F_\epsilon(z_i)) - \hat{\theta}(F)}{\epsilon} \\ &= \lim_{\epsilon \to 0} \frac{(1 - \epsilon)\, \hat{\theta}(F) + \epsilon z_i - \hat{\theta}(F)}{\epsilon} \\ &= \lim_{\epsilon \to 0} \frac{-\epsilon \hat{\theta}(F) + \epsilon z_i}{\epsilon} \\ &= z_i - \mathbb{E}_F\left[ z_i \right] \end{align*}$$

This makes intuitive sense. Observations $z_i$ are more influential on the sample estimator the further they are from the population mean.

We can estimate $\mathbb{E}_F\left[ z_i \right]$ with the sample mean, and

$$\widehat{IF}_{\hat{\theta}, F}(z_i) = z_i - \bar{z}$$

Let $\theta_0$ be the true population mean, $\mathbb{E}_F\left[ z_i \right]$. From above and our calculated influence function, we know that

$$\sqrt{n}(\bar{z} - \theta_0) \to^d N\left(0,\ \mathbb{E}_F\left[ IF_{\hat{\theta}, F}(z_i)^2\right]\right) = N\left(0,\ \mathbb{E}_F\left[ \left(z_i- \mathbb{E}_F\left[z_i\right]\right)^2\right]\right),$$

which we can estimate with $\frac{1}{n} \sum_{i=1}^n (z_i - \bar{z})^2$, the sample variance.
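A quick numerical check of this result (plain numpy; the simulated data and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=10_000)

IF_hat = z - z.mean()                            # estimated influence function: z_i - z-bar
se_if = np.sqrt(np.sum(IF_hat**2) / len(z)**2)   # plug-in SE of the sample mean
se_classic = z.std(ddof=0) / np.sqrt(len(z))     # usual SE of the sample mean

print(se_if, se_classic)                         # identical by construction
```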

Example 2: Regression coefficients

$\{z_i \equiv (x_i, y_i)\}_{i=1}^n$ where $x_i$ is a $1\times k$ vector of covariates and $y_i$ is a scalar outcome. The estimator is

$$\hat{\beta}(F) = \arg\min_{\beta}\ \mathbb{E}_F\left[ (y-X\beta)'(y-X\beta) \right]$$

Let's think about what happens when we use the contaminated distribution function:

$$\hat{\beta}(F_{\epsilon}(z_i)) = \arg\min_{\beta} \left( (1-\epsilon)\, \mathbb{E}_F\left[(y-X\beta)'(y-X\beta)\right] + \epsilon\, (y_i-x_i\beta)'(y_i-x_i\beta) \right)$$

The first term is the average squared error under the distribution $F$ times $(1-\epsilon)$, and the second term is the squared error at the point $z_i$ times $\epsilon$.

Expanding terms:

$$\begin{align*} \hat{\beta}(F_{\epsilon}(z_i)) &= \arg\min_{\beta} \Big\{ (1-\epsilon)\, \mathbb{E}_F\left[y'y - 2 y'X\beta + \beta'X'X\beta\right] \\ &\quad\quad + \epsilon \cdot \left(y_i'y_i - 2 y_i'x_i\beta + \beta'x_i'x_i\beta\right)\Big\} \end{align*}$$

This gives the first-order condition:

$$(1-\epsilon)\, \mathbb{E}_F\left[ -2X'y + 2X'X \hat{\beta}_{\epsilon}\right] + \epsilon\left(-2x_i'y_i + 2 x_i'x_i \hat{\beta}_{\epsilon}\right) = 0$$

$$\implies (1-\epsilon)\, \mathbb{E}_F\left[ X'y - X'X \hat{\beta}_{\epsilon}\right] = - \epsilon\left(x_i'y_i - x_i'x_i \hat{\beta}_{\epsilon}\right)$$

Now we are going to use a common trick: take the derivative of the first-order condition with respect to $\epsilon$. This will give us a bunch of terms, but one of them is $\frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon}$, which (as $\epsilon \to 0$) is the influence function! So we can solve the total derivative of the first-order condition as $\epsilon \to 0$ to get the influence function.

Taking the total derivative of the first-order condition with respect to $\epsilon$:

$$\begin{align*} &\overbrace{- \mathbb{E}_F\left[ X'y - X'X \hat{\beta}_{\epsilon}\right] - (1-\epsilon) \underbrace{\mathbb{E}_F\left[ \frac{\partial\, X'X \hat{\beta}_{\epsilon} }{\partial \hat{\beta}_{\epsilon}} \right] \frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon}}_{\text{chain rule}}}^{\text{product rule}} \\ &\quad= -\left(x_i'y_i - x_i'x_i \hat{\beta}_{\epsilon}\right) + \epsilon\, \frac{\partial\, x_i' x_i \hat{\beta}_\epsilon}{\partial \hat{\beta}_{\epsilon}} \cdot \frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon} \end{align*}$$

To simplify a bunch, note that as $\epsilon \to 0$:

(i) The first term, $\mathbb{E}_F\left[ X'y - X'X \hat{\beta}_{\epsilon}\right] \to 0$ from the first-order condition.

(ii) The last term, $\epsilon\, \frac{\partial\, x_i' x_i \hat{\beta}_\epsilon}{\partial \hat{\beta}_{\epsilon}} \cdot \frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon} \to 0$ because $\epsilon \to 0$.

(iii) $\frac{\partial\, X'X \hat{\beta}_{\epsilon} }{\partial \hat{\beta}_{\epsilon}} = X'X$

Therefore, as $\epsilon \to 0$:

$$\begin{align*} &- \underbrace{\mathbb{E}_F\left[ X'y - X'X \hat{\beta}_{\epsilon}\right]}_{=\,0} - \underbrace{(1-\epsilon)}_{\to\, 1} \underbrace{\mathbb{E}_F\left[ \frac{\partial\, X'X \hat{\beta}_{\epsilon} }{\partial \hat{\beta}_{\epsilon}} \right]}_{\mathbb{E}_F\left[ X'X \right]} \frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon} \\ &\quad= -\left(x_i'y_i - x_i'x_i \hat{\beta}_{\epsilon}\right) + \underbrace{\epsilon\, \frac{\partial\, x_i' x_i \hat{\beta}_\epsilon}{\partial \hat{\beta}_{\epsilon}} \cdot \frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon}}_{\to\, 0} \end{align*}$$

$$\implies \mathbb{E}_F\left[ X'X \right] \frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon} = x_i'y_i - x_i'x_i \hat{\beta}_{\epsilon}$$

Since $\frac{\partial \hat{\beta}_{\epsilon}}{\partial \epsilon}$ (as $\epsilon \to 0$) is the definition of the influence function and $\hat{\beta}_{\epsilon} \to \beta$ as $\epsilon \to 0$:

$$IF_{\hat{\beta}, F}(z_i) = \mathbb{E}_F\left[X'X\right]^{-1} x_i' (y_i - x_i \beta)$$

This matches the summands in the OLS asymptotic expansion mentioned earlier, with $\varepsilon_i = y_i - x_i\beta$.
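The plug-in variance built from this influence function reproduces the familiar HC0 heteroskedasticity-robust (sandwich) variance for OLS. A quick numerical sketch (plain numpy; the simulated data and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n) * (1 + np.abs(X[:, 1]))  # heteroskedastic errors

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Estimated influence function E_F[X'X]^{-1} x_i' (y_i - x_i beta), one row per observation
XtX_inv = np.linalg.inv(X.T @ X / n)
IF_hat = (resid[:, None] * X) @ XtX_inv

vcov_if = IF_hat.T @ IF_hat / n**2          # plug-in variance from the influence function
bread = np.linalg.inv(X.T @ X)
vcov_hc0 = bread @ (X.T * resid**2) @ X @ bread

print(np.allclose(vcov_if, vcov_hc0))       # True
```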

Example 3: Treatment Effect

For the last example, we are looking at the average treatment effect among the treated, which assumes that counterfactual untreated outcomes are independent of treatment, i.e. $y_i(0) \perp D_i$. Our data is $z_i = (y_i, D_i)$, which is iid. Our estimator is the sample analogue of:

$$\hat{\tau}(F) = \mathbb{E}_F\left[ y_i \mid D_i = 1\right] - \mathbb{E}_F\left[ y_i \mid D_i = 0\right]$$

Note that this is the difference of two estimators from the data: $\tau_1(F) = \mathbb{E}_F\left[y_i \mid D_i = 1\right]$ and $\tau_0(F) = \mathbb{E}_F\left[y_i \mid D_i = 0\right]$.

Since influence functions can be expressed as derivatives, we get to use the usual chain rule.

Chain Rule: If $\hat{\theta} = T(\hat{\theta}_1, \dots, \hat{\theta}_k)$, then the influence function of $\hat{\theta}$ is:

$$IF_{\hat{\theta}, F}(z_i) = \frac{\partial T}{\partial \hat\theta_1} IF_{\hat{\theta}_1, F}(z_i) + \dots + \frac{\partial T}{\partial \hat\theta_k} IF_{\hat{\theta}_k, F}(z_i)$$

In the context of our estimator $\hat{\tau}(F) = \tau_1(F) - \tau_0(F)$, the influence function is

$$IF_{\hat{\tau}, F}(z_i) = 1 \cdot IF_{\hat{\tau}_1, F}(z_i) - 1 \cdot IF_{\hat{\tau}_0, F}(z_i)$$

Influence function of conditional mean

Let's compute the influence function for $\hat{\tau}_1(F) = \mathbb{E}_F\left[y_i \mid D_i = 1\right]$. For an observation $i$ with $D_i = 0$, the point mass at $z_i$ contributes nothing to the $D_i = 1$ subpopulation: both $\mathbb{E}_{F_\epsilon(z_i)}\left[ y_i \mathbf{1}(D_i = 1) \right]$ and $\Pr_{F_\epsilon(z_i)}(D_i = 1)$ are simply scaled by $(1-\epsilon)$, so their ratio is unchanged, $\hat{\tau}_1(F_{\epsilon}(z_i)) = \hat{\tau}_1(F)$, and hence the influence function is $0$.

If $D_i = 1$, then, writing $p = \Pr_F(D_i = 1)$,

$$\begin{align*} \hat{\tau}_1(F_{\epsilon}(z_i)) &= \mathbb{E}_{F_{\epsilon}(z_i)}\left[ y_i \mid D_i = 1 \right] = \frac{\mathbb{E}_{F_{\epsilon}(z_i)}\left[ y_i \mathbf{1}(D_i = 1) \right]}{\Pr_{F_{\epsilon}(z_i)}(D_i = 1)} \\ &= \frac{(1 - \epsilon)\, p\, \mathbb{E}_F\left[ y_i \mid D_i = 1 \right] + \epsilon y_i}{(1-\epsilon)\, p + \epsilon} \end{align*}$$

Again, plugging into the influence function derivative formula (differentiating the ratio with respect to $\epsilon$ at $\epsilon = 0$ via the quotient rule):

$$\lim_{\epsilon \to 0} \frac{\hat{\tau}_1(F_\epsilon(z_i)) - \hat{\tau}_1(F)}{\epsilon} = \frac{\left(y_i - p\, \hat{\tau}_1(F)\right) - \hat{\tau}_1(F)\,(1 - p)}{p} = \frac{y_i - \hat{\tau}_1(F)}{p}$$

Therefore,

$$IF_{\hat{\tau}_1, F}(z_i) = \frac{\mathbf{1}(D_i = 1)}{\Pr_F(D_i = 1)} \left(y_i - \mathbb{E}_F\left[y_i \mid D_i = 1\right]\right)$$

Similarly, for $\hat{\tau}_0(F)$,

$$IF_{\hat{\tau}_0, F}(z_i) = \frac{\mathbf{1}(D_i = 0)}{\Pr_F(D_i = 0)} \left(y_i - \mathbb{E}_F\left[y_i \mid D_i = 0\right]\right)$$

Influence function of ATT

From the above two derivations, we have:

$$\begin{align*} IF_{\hat{\tau}, F}(z_i) &= 1 \cdot IF_{\hat{\tau}_1, F}(z_i) - 1 \cdot IF_{\hat{\tau}_0, F}(z_i) \\ &= \frac{\mathbf{1}(D_i = 1)}{\Pr_F(D_i = 1)} \left(y_i - \mathbb{E}_F\left[y_i \mid D_i = 1\right]\right) \\ &\quad\quad - \frac{\mathbf{1}(D_i = 0)}{\Pr_F(D_i = 0)} \left(y_i - \mathbb{E}_F\left[y_i \mid D_i = 0\right]\right) \end{align*}$$
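As a check, the plug-in standard error from this influence function matches the usual unpooled two-sample standard error for a difference in means. A minimal numerical sketch (plain numpy; the simulated data and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
D = rng.binomial(1, 0.4, size=n)                 # treatment indicator
y = 1.0 + 1.5 * D + rng.normal(size=n)           # outcome with a true effect of 1.5

y1_bar, y0_bar = y[D == 1].mean(), y[D == 0].mean()
p_hat = D.mean()
tau_hat = y1_bar - y0_bar

# Estimated influence function of the difference in means
IF_hat = D / p_hat * (y - y1_bar) - (1 - D) / (1 - p_hat) * (y - y0_bar)

se_if = np.sqrt(np.sum(IF_hat**2) / n**2)        # plug-in SE
se_classic = np.sqrt(y[D == 1].var() / (D == 1).sum() + y[D == 0].var() / (D == 0).sum())
print(tau_hat, se_if, se_classic)                # se_if matches se_classic
```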

Appendix: Matrix Algebra rules

For help working through the OLS section:

  1. For $k\times 1$ vectors $a$ and $b$:
$$\frac{\partial a'b}{\partial b} = \frac{\partial b'a}{\partial b} = a$$
  2. For a $k\times 1$ vector $b$ and a symmetric $k \times k$ matrix $A$:
$$\frac{\partial b'Ab}{\partial b} = 2Ab = 2 b'A$$
  3. For an $m \times 1$ vector $y$, an $n \times 1$ vector $x$, and an $m \times n$ matrix $A$:
$$\frac{\partial y'Ax}{\partial x} = y'A \qquad\qquad \frac{\partial y'Ax}{\partial y} = x'A'$$
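A quick finite-difference sanity check of the first two rules (plain numpy; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
a, b = rng.normal(size=k), rng.normal(size=k)
A = rng.normal(size=(k, k))
A = (A + A.T) / 2                       # make A symmetric for rule 2

def num_grad(f, v, h=1e-6):
    """Central finite-difference gradient of a scalar function f at the vector v."""
    g = np.zeros_like(v)
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        g[j] = (f(v + e) - f(v - e)) / (2 * h)
    return g

print(np.allclose(num_grad(lambda v: a @ v, b), a))              # rule 1: d(a'b)/db = a
print(np.allclose(num_grad(lambda v: v @ A @ v, b), 2 * A @ b))  # rule 2: d(b'Ab)/db = 2Ab
```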