Note that this material is mainly a summary of the math in Ben Jann’s notes, compiled together with additional writing of my own.
Setup
Let $z^n = \{z_i\}_{i=1}^n$ be a random sample of data that comes from some true underlying distribution $F$. We take this data and compute some estimator with it: $\hat{\theta}(z^n)$ (scalar or vector). Note that this is a function of the sample we observe, $z^n$. Some examples:
Sample Mean: $z_i$ is a scalar and $\hat{\theta}(z^n)$ is the sample mean of the data. The estimand is
$$\mu = \mathbb{E}_F[z_i]$$
Regression: Let $\{z_i \equiv (x_i, y_i)\}_{i=1}^n$ be a random sample of data, where $x_i$ is a $1 \times k$ vector of covariates and $y_i$ is a scalar outcome. $\hat{\theta}(z^n)$ would be the OLS coefficients $\beta$:
$$\beta = \arg\min_{\beta} \; \mathbb{E}_F[(y - X\beta)'(y - X\beta)]$$
Treatment Effect: $\{z_i \equiv (y_i, D_i)\}_{i=1}^n$ where $y_i$ is the outcome and $D_i$ is the treatment indicator. $\hat{\theta}(z^n)$ would be the treatment effect on the treated. The estimate would be the sample analogue of
$$\tau = \mathbb{E}_F[y_i \mid D_i = 1] - \mathbb{E}_F[y_i \mid D_i = 0]$$
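To make these objects concrete, here is a minimal sketch (the simulated data-generating processes are my own choices, not from the notes) that computes the three sample analogues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Sample mean: z_i is a scalar
z = rng.normal(loc=2.0, size=n)
theta_mean = z.mean()

# Regression: x_i is a 1 x k covariate vector (here k = 2, with an intercept)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients

# Treatment effect: difference in means between treated and untreated
D = rng.integers(0, 2, size=n)
y_d = 1.0 + 0.3 * D + rng.normal(size=n)      # true effect of 0.3
tau_hat = y_d[D == 1].mean() - y_d[D == 0].mean()

print(theta_mean, beta_hat, tau_hat)
```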
The influence function
First, we define a contaminated distribution function, $F_\epsilon(z_i)$, as:
$$F_\epsilon(z_i) = (1 - \epsilon) F + \epsilon \, \delta_{z_i}$$
where $\delta_{z_i}$ is the probability measure which assigns probability 1 to $z_i = (x_i, y_i)$ and 0 to all other elements. In effect, $F_\epsilon(z_i)$ makes the data point $z_i = (x_i, y_i)$ slightly more likely in the population. To make this concrete, if $\epsilon = 0.5$, then a random draw from $F_\epsilon(z_i)$ is $z_i$ with (at least) probability $1/2$.
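As a sketch of this mixture interpretation (the helper `draw_from_contaminated` and the choice of $F$ are hypothetical, for illustration only): a draw from $F_\epsilon(z_i)$ returns the fixed point $z_i$ with probability $\epsilon$ and otherwise draws from $F$.

```python
import numpy as np

def draw_from_contaminated(draw_from_F, z_i, eps, rng):
    """Draw from F_eps(z_i) = (1 - eps) * F + eps * delta_{z_i}:
    with probability eps return the fixed point z_i, otherwise draw from F."""
    if rng.random() < eps:
        return z_i
    return draw_from_F(rng)

rng = np.random.default_rng(0)
draws = [draw_from_contaminated(lambda r: r.normal(), z_i=5.0, eps=0.5, rng=rng)
         for _ in range(10_000)]
# With eps = 0.5, (at least) half of the draws are exactly z_i = 5.0
print(np.mean(np.array(draws) == 5.0))  # ~0.5
```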
Our goal is to see what happens to our estimator when we increase the probability of seeing $z_i$ in the population. This gives us a sense of how $z_i$ influences the sampling distribution of the estimator $\hat{\theta}(z^n)$.
To build intuition, let’s think about outliers in regressions. If one observation, $z_i$, is a high-leverage outlier, then intuitively its presence has a lot of influence on the regression coefficients. Formally, the influence function asks: if you make this $z_i$ slightly more likely, how much does it move the estimated coefficients, $\hat{\beta}$?
To formalize this, we will use what’s called a “Gâteaux derivative”, which is just the fancy version of a derivative (a directional derivative in the space of distributions). The influence function of $\hat{\theta}$ at $F$, $\mathrm{IF}_{\hat{\theta},F}(z_i)$, is defined as:
$$\mathrm{IF}_{\hat{\theta},F}(z_i) = \lim_{\epsilon \to 0} \frac{\theta(F_\epsilon(z_i)) - \theta(F)}{\epsilon}$$
This is a slight change of notation: we are now not specifying a particular sample $z^n$, but instead the distribution it is drawn from. The influence function is worked out based on the actual population moments that give rise to the sample estimates. This will hopefully make more sense when we work through some examples below.
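One way to see the definition in action is to approximate the Gâteaux derivative by a finite difference on a discrete distribution. The sketch below (representing $F$ by an empirical distribution and picking the contamination point myself) does this for the mean functional:

```python
import numpy as np

def mean_functional(F_weights, support):
    """theta(F) = E_F[z] for a discrete distribution with given weights/support."""
    return np.sum(F_weights * support)

def contaminate(F_weights, at_index, eps):
    """F_eps = (1 - eps) * F + eps * delta_{z_i}."""
    delta = np.zeros_like(F_weights)
    delta[at_index] = 1.0
    return (1 - eps) * F_weights + eps * delta

# Represent F by the empirical distribution of a sample (a choice made for illustration)
rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, size=500)
support = np.append(z, 10.0)                # append the contamination point z_i = 10
F = np.append(np.full(500, 1 / 500), 0.0)   # z_i gets zero weight under F itself

eps = 1e-6
fd = (mean_functional(contaminate(F, -1, eps), support)
      - mean_functional(F, support)) / eps
print(fd, 10.0 - mean_functional(F, support))  # finite difference ≈ z_i - E_F[z]
```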
Influence function and variance of the estimator
What’s helpful about knowing the influence function is that we can think of our sample estimator as being equal to the true value (so long as we have unbiasedness) plus $n$ disturbances of the distribution, each with weight $\epsilon = \frac{1}{n}$. Each disturbance causes the true estimate to be influenced (or moved) by approximately $\frac{1}{n} \cdot \mathrm{IF}_{\hat{\theta},F}(z_i)$ (a derivative times a change in $x$). Since we are extrapolating the derivative over a distance of $\frac{1}{n}$, this gives rise to higher order terms from the Taylor expansion:
$$\hat{\theta}(z^n) = \underbrace{\theta_0}_{\text{unbiasedness}} + \sum_{i=1}^{n} \underbrace{\frac{1}{n} \cdot \mathrm{IF}_{\hat{\theta},F}(z_i)}_{\text{approx. influence of } z_i} + \text{higher order terms}$$
An important thing to know is that $\mathbb{E}_F[\mathrm{IF}_{\hat{\theta},F}(z_i)] = 0$. Therefore, the second term above is going to be approximately zero in large samples, so the mean of $\hat{\theta}(z^n) - \theta_0$ is zero. (Note that this doesn’t prove unbiasedness, because we are assuming that already.)
This is similar to OLS asymptotics: $\sqrt{n}(\hat{\beta} - \beta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathbb{E}_F[X'X]^{-1} x_i' \varepsilon_i + \sqrt{n} \cdot \text{higher order terms}$. Just like in the OLS case, having our Taylor expansion is helpful because then the asymptotics come easily:
$$\implies \sqrt{n}\left(\hat{\theta}(z^n) - \theta_0\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathrm{IF}_{\hat{\theta},F}(z_i) + \sqrt{n} \cdot \text{higher order terms}$$
Under some assumptions, the higher order terms vanish faster than $1/\sqrt{n}$, so the product $\sqrt{n} \cdot \text{higher order terms}$ is approximately zero in large samples.
So, if you know the influence function, then you have large-sample asymptotics for free! Then you can just plug in the sample estimates of $\mathrm{IF}_{\hat{\theta},F}(z_i)$ for each $z_i$ and you have the variance-covariance matrix for your estimator.
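Concretely (a standard CLT step, spelled out here for completeness): since the $\mathrm{IF}_{\hat{\theta},F}(z_i)$ are iid with mean zero, the central limit theorem applied to the expansion above gives
$$\sqrt{n}\left(\hat{\theta}(z^n) - \theta_0\right) \xrightarrow{d} N\left(0, \; \mathbb{E}_F\left[\mathrm{IF}_{\hat{\theta},F}(z_i)\,\mathrm{IF}_{\hat{\theta},F}(z_i)'\right]\right)$$
so a plug-in estimate of the variance of $\hat{\theta}$ is $\frac{1}{n^2}\sum_{i=1}^{n} \widehat{\mathrm{IF}}_{\hat{\theta},F}(z_i)\,\widehat{\mathrm{IF}}_{\hat{\theta},F}(z_i)'$.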
Example 1: Mean
Here $z_i$ is a scalar and our estimator is the sample mean of the data. It is the sample analogue of:
$$\theta(F) = \mathbb{E}_F[z_i]$$
Let’s think about what happens when we use the contaminated distribution function:
$$\theta(F_\epsilon(z_i)) = \mathbb{E}_{F_\epsilon}[z] = (1 - \epsilon)\,\mathbb{E}_F[z] + \epsilon \, z_i$$
Plugging this into the definition of the influence function:
$$\mathrm{IF}_{\hat{\theta},F}(z_i) = \lim_{\epsilon \to 0} \frac{(1 - \epsilon)\,\mathbb{E}_F[z] + \epsilon \, z_i - \mathbb{E}_F[z]}{\epsilon} = z_i - \mathbb{E}_F[z]$$
So the influence of an observation on the mean is just its deviation from the population mean.
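As a quick numerical check (a minimal sketch with simulated data of my own choosing), plugging $\widehat{\mathrm{IF}}_i = z_i - \bar{z}$ into the variance recipe above reproduces the textbook standard error of the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=3.0, size=1_000)
n = z.size

IF_hat = z - z.mean()                    # plug-in influence function: z_i - mu_hat
se_if = np.sqrt(np.sum(IF_hat**2)) / n   # sqrt of (1/n^2) * sum of IF_i^2
se_classic = z.std(ddof=0) / np.sqrt(n)  # usual standard error of the mean
print(se_if, se_classic)                 # identical
```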
Example 2: Regression
Now consider the regression example. Under the contaminated distribution, the OLS coefficient solves:
$$\hat{\beta}_\epsilon = \arg\min_{\beta} \; (1 - \epsilon)\,\mathbb{E}_F[(y - X\beta)'(y - X\beta)] + \epsilon\,(y_i - x_i\beta)^2$$
The first term is the average squared error under the distribution $F$ times $(1 - \epsilon)$, and the second term is the squared error at the point $z_i$ times $\epsilon$.
Now we are going to use a common trick, which is to take the derivative with respect to $\epsilon$ of the first order condition. This will give us a bunch of terms, but also $\frac{\partial \hat{\beta}_\epsilon}{\partial \epsilon}$, which is the influence function! So we can solve the total derivative of the first order condition as $\epsilon \to 0$ to get the influence function. The first order condition is:
$$(1 - \epsilon)\,\mathbb{E}_F\left[X'(y - X\hat{\beta}_\epsilon)\right] + \epsilon\, x_i'(y_i - x_i\hat{\beta}_\epsilon) = 0$$
Taking the total derivative of the first order condition with respect to $\epsilon$:
$$-\mathbb{E}_F\left[X'(y - X\hat{\beta}_\epsilon)\right] - (1 - \epsilon)\,\mathbb{E}_F[X'X]\,\frac{\partial \hat{\beta}_\epsilon}{\partial \epsilon} + x_i'(y_i - x_i\hat{\beta}_\epsilon) - \epsilon\, x_i' x_i\, \frac{\partial \hat{\beta}_\epsilon}{\partial \epsilon} = 0$$
Since $\frac{\partial \hat{\beta}_\epsilon}{\partial \epsilon}$ is the definition of the influence function, $\hat{\beta}_\epsilon \to \beta$ as $\epsilon \to 0$, and $\mathbb{E}_F[X'(y - X\beta)] = 0$ by the population first order condition, we get:
$$\mathrm{IF}_{\hat{\beta},F}(z_i) = \mathbb{E}_F[X'X]^{-1}\, x_i'(y_i - x_i\beta)$$
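To see the “asymptotics for free” point in practice, here is a minimal sketch (the simulated heteroskedastic data are my own choice): the plug-in variance $\frac{1}{n^2}\sum_i \widehat{\mathrm{IF}}_i\,\widehat{\mathrm{IF}}_i'$ built from this influence function coincides with the familiar HC0 robust variance for OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))  # heteroskedastic

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# Plug-in influence function for each observation: E[X'X]^{-1} x_i' e_i
Exx_inv = np.linalg.inv(X.T @ X / n)
IF = e[:, None] * (X @ Exx_inv.T)       # n x k matrix; row i is IF(z_i)'

vcov_if = IF.T @ IF / n**2              # (1/n^2) * sum of IF_i IF_i'

# Compare with the HC0 ("robust") sandwich variance
XtX_inv = np.linalg.inv(X.T @ X)
vcov_hc0 = XtX_inv @ (X.T * e**2) @ X @ XtX_inv
print(np.allclose(vcov_if, vcov_hc0))   # True
```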
Example 3: Treatment Effect
For the last example, we are looking at the average treatment effect among the treated, which assumes that counterfactual untreated outcomes are independent of treatment, i.e. $y_i(0) \perp D_i$. Our data is $z_i = (y_i, D_i)$, which is iid. Our estimator is the sample analogue of:
$$\tau(F) = \mathbb{E}_F[y_i \mid D_i = 1] - \mathbb{E}_F[y_i \mid D_i = 0]$$
Note that this is the difference of two estimands computed from the data: $\tau_1(F) = \mathbb{E}_F[y_i \mid D_i = 1]$ and $\tau_0(F) = \mathbb{E}_F[y_i \mid D_i = 0]$.
Since influence functions can be expressed as derivatives, we get to use the usual chain rule.
Chain Rule: If $\hat{\theta} = T(\hat{\theta}_1, \ldots, \hat{\theta}_k)$, then the influence function of $\hat{\theta}$ is:
$$\mathrm{IF}_{\hat{\theta},F}(z_i) = \sum_{j=1}^{k} \frac{\partial T}{\partial \theta_j}\, \mathrm{IF}_{\hat{\theta}_j,F}(z_i)$$
In the context of our function $\theta(F) = \tau_1(F) - \tau_0(F)$, the influence function is
$$\mathrm{IF}_{\hat{\theta},F}(z_i) = 1 \cdot \mathrm{IF}_{\hat{\tau}_1,F}(z_i) - 1 \cdot \mathrm{IF}_{\hat{\tau}_0,F}(z_i)$$
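As a numerical illustration of the chain rule (a sketch; the nonlinear $T$ and the functionals $\theta_1 = \mathbb{E}[y]$, $\theta_2 = \mathbb{E}[y^2]$ are my own example, chosen so the check is not trivially linear):

```python
import numpy as np

# Chain rule demo with T(theta_1, theta_2) = theta_1 / theta_2 (a made-up example)
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, size=400)
n = y.size
w = np.full(n, 1 / n)                  # empirical distribution F

t1 = lambda w: np.sum(w * y)           # theta_1(F) = E_F[y]
t2 = lambda w: np.sum(w * y**2)        # theta_2(F) = E_F[y^2]
T = lambda w: t1(w) / t2(w)

# Contaminate F toward observation j and take finite differences
j, eps = 7, 1e-7
w_eps = (1 - eps) * w
w_eps[j] += eps
fd = lambda f: (f(w_eps) - f(w)) / eps  # ≈ influence function at z_j

# Chain rule: IF_T = (1/theta_2) * IF_1 - (theta_1/theta_2^2) * IF_2
chain = (1 / t2(w)) * fd(t1) - (t1(w) / t2(w)**2) * fd(t2)
print(fd(T), chain)                     # approximately equal
```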
Influence function of conditional mean
Let’s compute the influence function for $\tau_1(F) = \mathbb{E}_F[y_i \mid D_i = 1]$. For observation $i$, if $D_i = 0$, then $\tau_1(F_\epsilon(z_i)) = \tau_1(F)$ (contaminating with an untreated point scales both $\mathbb{E}_F[y_i \mathbb{1}\{D_i = 1\}]$ and $P_F(D_i = 1)$ by $(1 - \epsilon)$, leaving their ratio unchanged), and hence the influence function is 0.
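This zero-influence claim is easy to verify numerically (a minimal sketch; the weighted conditional-mean helper `tau1` and the simulated data are my own construction): contaminating the empirical distribution toward an untreated observation leaves $\mathbb{E}[y \mid D = 1]$ exactly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
D = rng.integers(0, 2, size=n)
y = 1.0 + 0.3 * D + rng.normal(size=n)
w = np.full(n, 1 / n)                      # empirical distribution F

def tau1(weights, y, D):
    """E[y | D = 1] under a discrete distribution with the given weights."""
    return np.sum(weights * y * D) / np.sum(weights * D)

# Contaminate F toward an untreated observation j (D_j = 0)
j = np.flatnonzero(D == 0)[0]
eps = 0.1
w_eps = (1 - eps) * w
w_eps[j] += eps

print(tau1(w, y, D), tau1(w_eps, y, D))    # identical: the IF is 0 when D_i = 0
```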