An Interesting Aspect of the Omitted Variable Bias
Econometrics never ceases to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model:
\[ x = \alpha z + \varepsilon_x \]
\[ y = \beta x + \gamma z + \varepsilon_y \]
Assume we want to estimate the causal effect $\beta$ of $x$ on $y$. However, we have an unobserved confounder $z$ that affects both $x$ and $y$. If we don’t add the confounder $z$ as a control variable in the regression of $y$ on $x$, the OLS estimator of $\beta$ will be biased. That is the so-called omitted variable bias.
Let’s simulate a data set and illustrate the omitted variable bias:
n = 10000
alpha = beta = gamma = 1
# no seed is set, so the exact estimates below vary slightly across runs
z = rnorm(n,0,1)     # unobserved confounder
eps.x = rnorm(n,0,1) # exogenous noise in x
eps.y = rnorm(n,0,1) # exogenous noise in y
x = alpha*z + eps.x          # z affects x
y = beta*x + gamma*z + eps.y # z also directly affects y
# Estimate short regression with z omitted
coef(lm(y~x))[2]
## x
## 1.486573
While the true causal effect $\beta$ is equal to 1, our OLS estimator where we omit $z$ is around 1.5. This means it has a positive bias of roughly 0.5.
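As a quick sanity check, the long regression that includes $z$ as a control should recover an estimate close to the true $\beta = 1$:
coef(lm(y~x+z))[2]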
Let’s see what happens if we increase the impact of the confounder $z$ on $x$, say to $\alpha=1000$.
alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]
## x
## 1.000983
The bias is almost gone!
This result surprised me at first. I previously had the following intuition: An omitted variable is only a problem if it affects both $y$ and $x$. Thus the omitted variable bias probably becomes worse if the confounder $z$ affects $y$ or $x$ more strongly. While this intuition is correct for small $\alpha$, it is wrong once $\alpha$ is sufficiently large.
For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:
\[ asy.\; Bias(\hat \beta) = \gamma\alpha\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)} \]
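Plugging our two simulations into this formula (with $\gamma = Var(z) = Var(\varepsilon_x) = 1$) reproduces both results from above:
gamma * 1 * 1/(1^2*1 + 1)       # alpha = 1: asymptotic bias of 0.5
gamma * 1000 * 1/(1000^2*1 + 1) # alpha = 1000: asymptotic bias of about 0.001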
Let’s plot the bias for different values of $\alpha$:
Var.z = Var.eps.x = 1
alpha = seq(0,10,by=0.1)
asym.bias = gamma*alpha * Var.z /
(alpha^2*Var.z+Var.eps.x)
plot(alpha,asym.bias)
For small $\alpha$ the bias of $\hat \beta$ first quickly increases in $\alpha$. But it decreases in $\alpha$ once $\alpha$ is larger than 1. Indeed the bias then slowly converges back to 0.
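To pin down the peak exactly (a small extra computation, not strictly needed for the argument): setting the derivative of the bias formula with respect to $\alpha$ to zero yields a maximum at $\alpha = \sqrt{Var(\varepsilon_x)/Var(z)}$, here equal to 1, with a maximal bias of $\frac{\gamma}{2}\sqrt{Var(z)/Var(\varepsilon_x)} = 0.5$. The grid from the plot confirms this:
alpha[which.max(asym.bias)]  # 1
max(asym.bias)               # 0.5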
Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder affects $y$ mainly through $x$. The larger $\alpha$ is, the less important the direct effect of $z$ on $y$ becomes relative to the effect that runs through $x$. The direct effect of $z$ on $y$ therefore biases the OLS estimator $\hat \beta$ of the short regression less and less.
Typical presentation of the omitted variable bias formula
Note that the omitted variable bias formula is usually presented as follows:
\[ Bias(\hat \beta) = \gamma \hat \delta \] where $\hat \delta$ is the OLS estimate of the linear regression \[ z = const + \delta x + u \]
(This bias formula is derived under the assumption that $x$ and $z$ are fixed. This allows one to compute the exact bias, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as
\[ x=\tilde{const} + \frac{1}{\delta} z + \tilde u \]
suggesting $\alpha \approx \frac{1}{\delta}$ and thus an approximate bias of $\frac{\gamma}{\alpha}$. (This argument is only suggestive, not fully correct. The effects of swapping $y$ and $x$ in a simple linear regression can be a bit surprising; see my previous post.)
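We can check this presentation of the formula in our simulated data; with $\alpha$ reset to 1, $\gamma \hat \delta$ should be close to the empirical bias of the short regression (both roughly 0.5):
# estimate delta by regressing z on x
alpha = 1
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
delta.hat = coef(lm(z~x))[2]
gamma*delta.hat          # predicted bias
coef(lm(y~x))[2] - beta  # empirical bias of the short regression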
If we look at our previous formula for the asymptotic bias and consider the limit of no exogenous variation in $x$, i.e. $Var(\varepsilon_x) = 0$, we indeed get \[ \lim_{Var(\varepsilon_x)\rightarrow 0} asy.\; Bias(\hat \beta) = \frac{\gamma}{\alpha} \]
However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.
Appendix: Derivation of the asymptotic bias formula
Here is a short derivation of the first asymptotic bias formula. We estimate a simple regression (just one explanatory variable): \[ y=const+\beta x+\eta \]
For example, the introductory textbook by Wooldridge shows in the chapter on OLS asymptotics that, under relatively weak assumptions, the asymptotic bias of the OLS estimator $\hat{\beta}$ in such a simple regression is given by \[ asy.\; Bias(\hat{\beta})=\frac{Cov(x,\eta)}{Var(x)} \]
In our simulation, the error term of the short regression is given by \[ \eta=\gamma z+\varepsilon_{y} \] and $x$ is given by \[ x=\alpha z+\varepsilon_{x} \] where $\varepsilon_{y}$ and $\varepsilon_{x}$ are iid errors. We thus have \[ Cov(x,\eta)=\alpha\gamma Var(z) \] and \[ Var(x)=\alpha^{2}Var(z)+Var(\varepsilon_{x}) \]
Hence we get the asymptotic bias formula \[ asy.\; Bias(\hat{\beta})=\alpha\gamma\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_{x})} \]
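To close the loop, the sample analogues of these quantities can be computed directly from the simulated data (again with $\alpha = 1$), and they come out close to the analytic bias of 0.5:
alpha = 1
x = alpha*z + eps.x
eta = gamma*z + eps.y  # error term of the short regression
cov(x,eta)/var(x)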