title |
author |
date |
output |
Lecture 15 Instrumental Variables |
Nick Huntington-Klein |
March 20, 2019 |
revealjs::revealjs_presentation |
theme |
transition |
self_contained |
smart |
fig_caption |
reveal_options |
solarized |
slide |
true |
true |
true |
|
|
|
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(dagitty)
library(ggdag)
library(gganimate)
library(ggthemes)
library(Cairo)
library(fixest)
library(modelsummary)
theme_set(theme_gray(base_size = 15))
```
## Recap
- We've covered quite a few methods for isolating causal effects!
- Controlling for variables to close back doors (explain X and Y with the control, remove what's explained)
- Matching on variables to close back doors (find treated and non-treated observations with )
- Using a control group to control for time (before/after difference for treated and untreated, then difference them)
- Using a cutoff to construct a very good control group (treated/untreated difference near a cutoff)
## Today
- We've got ONE LAST METHOD to go deep on!
- Today we'll be covering *instrumental variables*
- The basic idea is that we have some variable - the instrumental variable - that causes `X` but has no other open back doors!
## Natural Experiments
- This calls back to our idea of trying to mimic an experiment without having an experiment. In fact, let's think about an actual randomized experiment.
- We have some random assignment `R` that determines your `X`. So even though we have back doors between `X` and `Y`, we can identify `X -> Y`
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=3}
dag % tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Natural Experiments
- The idea of instrumental variables is this:
- What if we can find a variable that can take the place of R in the diagram despite not actually being something we randomized in an experiment?
- If we can do that, we've clearly got a "natural experiment"
- When we find a variable that can do that, we call it an "instrument" or "instrumental variable"
- Let's call it `Z`
## Instrumental Variable
So, for `Z` take the place of `R` in the diagram, what do we need?
- `Z` must be related to `X` (typically `Z -> X` but not always)
- There must be *no open paths* from `Z` to `Y` *except for ones that go through `X`*
In other words "`Z` is related to `X`, and all the effect of `Z` on `Y` goes THROUGH `X`"
## Instrumental Variable
- This doesn't relieve us of the duty of identifying a causal effect by closing back doors
- But it *moves* that duty from the endogenous variable to the instrument, which potentially is easier to identify
- (and then adds on the additional requirement that there are also no open *front* doors from $Z$ to $Y$ except through $X$ )
## Instrumental Variable
How?
- Explain `X` with `Z`, and keep only what *is* explained, `X'`
- Explain `Y` with `Z`, and keep only what *is* explained, `Y'`
- [If `Z` is logical/binary] Divide the difference in `Y'` between `Z` values by the difference in `X'` between `Z` values
- [If `Z` is not logical/binary] Get the correlation between `X'` and `Y'`
## Estimation
- We will be doing this mostly by hand today (until the end part) but most commonly this is estimated using *two stage least squares*
- We basically just do what we described on the last slide:
1. Use the instruments and controls to explain $X$ in the first stage
1. Use the controls and the predicted (explained) part of $X$ in place of $X$ in the second stage
1. (do some standard error adjustments)
Many ways to do this in R, we'll be doing 2SLS with `feols()` from **fixest**
## Graphically
```{r, dev='CairoPNG', echo=FALSE, fig.width=8,fig.height=7}
df 100),
W = rnorm(200)) %>%
mutate(X = .5+2*W +2*Z+ rnorm(200)) %>%
mutate(Y = -X + 4*W + 1 + rnorm(200),time="1") %>%
group_by(Z) %>%
mutate(mean_X=mean(X),mean_Y=mean(Y),YL=NA,XL=NA) %>%
ungroup()
#Calculate correlations
before_cor % mutate(mean_X=NA,mean_Y=NA,time=before_cor),
#Step 2: Add x-lines
df %>% mutate(mean_Y=NA,time='2. What differences in X are explained by Z?'),
#Step 3: X de-meaned
df %>% mutate(X = mean_X,mean_Y=NA,time="3. Remove everything in X not explained by Z"),
#Step 4: Remove X lines, add Y
df %>% mutate(X = mean_X,mean_X=NA,time="4. What differences in Y are explained by Z?"),
#Step 5: Y de-meaned
df %>% mutate(X = mean_X,Y = mean_Y,mean_X=NA,time="5. Remove everything in Y not explained by Z"),
#Step 6: Raw demeaned data only
df %>% mutate(X = mean_X,Y =mean_Y,mean_X=NA,mean_Y=NA,YL=mean_Y,XL=mean_X,time=afterlab))
#Get line segments
endpts %
group_by(Z) %>%
summarize(mean_X=mean(mean_X),mean_Y=mean(mean_Y))
p Y, With Binary Z as an Instrumental Variable \n{next_state}')+
transition_states(time,transition_length=c(6,16,6,16,6,6),state_length=c(50,22,12,22,12,50),wrap=FALSE)+
ease_aes('sine-in-out')+
exit_fade()+enter_fade()
animate(p,nframes=175)
```
## Instrumental Variables
- Notice that this whole process is like the *opposite* of controlling for a variable
- We explain `X` and `Y` with the variable, but instead of tossing out what's explained, we ONLY KEEP what's explained!
- Instead of saying "you're on a back door, I want to close you" we say "you have no back doors! I want my `X` to be just like you! I'm only keeping that part of `X` that's explained by you!"
- Since `Z` has no back doors, the part of `X` explained by `Z` has no back doors to the part of `Y` explained by `Z`
## Imperfect Assignment
- Let's apply one of the common uses of instrumental variables, which actually *is* when you have a randomized experiment
- In normal circumstances, if we have an experiment and assign people with `R`, we just compare `Y` across values of `R`:
```{r, echo=TRUE}
df %
mutate(X = R, Y = 5*X + rnorm(500))
#The truth is a difference of 5
df %>% group_by(R) %>% summarize(Y=mean(Y))
```
## Imperfect Assignment
- But what happens if you run a randomized experiment and assign people with `R`, but not everyone does what you say? Some "treated" people don't get the treatment, and some "untreated" people do get it
- When this happens, we can't just compare `Y` across `R`
- But `R` is still a valid instrument!
## Imperfect Assignment
```{r, echo=TRUE}
df %
#We tell them whether or not to get treated
mutate(X = R) %>%
#But some of them don't listen! 20% do the OPPOSITE!
mutate(X = ifelse(runif(500) > .8,1-R,R)) %>%
mutate(Y = 5*X + rnorm(500))
#The truth is a difference of 5
df %>% group_by(R) %>% summarize(Y=mean(Y))
```
## Imperfect Assignment
- So let's do IV (instrumental variables); `R` is the IV.
```{r, echo=TRUE}
iv % group_by(R) %>% summarize(Y = mean(Y), X = mean(X))
iv
#Remember, since our instrument is binary, we want the slope
(iv$Y[2] - iv$Y[1])/(iv$X[2]-iv$X[1])
#Truth is 5!
```
## Another Example
- Justifying that an IV has no back doors can be hard!
- Usually things aren't as clean-cut as having actual randomization
- And sometimes we may have to add controls in order to justify the IV
- Think hard - are there really no other paths from `Z` to `Y`?
- This will often require *detailed contextual knowledge* of the data generating process
## Pollution and Driving
- If air quality is really bad, you may choose to drive instead of walk/bike/bus in order to avoid breathing it
- So do particularly smoggy days lead people to drive more?
- Pan He and Cheng Xu ask this question using Shanghai as an example!
## Pollution and Driving
- Plenty of back doors - seasons, whether factories are running, smog levels last week...
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4.5}
dag % tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Pollution and Driving
- The *direction of the wind* could be an IV - Shanghai faces the water, and so when the wind blows West, it brings pollution into the city
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4.5}
dag % tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Pollution and Driving
- This gives us an IV we can use!
- Of course, we need to control for Season to block out the back door.
- The authors do indeed find that additional smog, brought in by the wind, increases the number of people who choose to drive - making the problem worse later!!
## Trade and Manufacturing
- Another example: did Chinese imports reduce US manufacturing employment?
- Employment in the US manufacturing sector has been dropping for decades
- (note - *manufacturing itself* isn't dropping, we're manufacturing more than ever, we're just doing it without as many actual people)
## Trade and Manufacturing
- The timing of the drop in manufacturing jobs coincides with us importing a lot more Chinese stuff
- But did the Chinese imports *cause* the decline or was it a coincidence? Automation is another good explanation!
- Or general declining US competitiveness in the global market, vs. everyone (not just China)
## Trade and Manufacturing
- Autor, Dorn, & Hanson use Chinese exports *to other countries* (`CEXoth`) as an IV for Chinese exports *to the US* (`CEXus`) in order to estimate the impact of Chinese exports *to the US* on US manufacturing employment (`mfg`)
- Let's think about whether this makes sense as an IV - any back doors from `CEXoth` to `mfg` we can imagine? Or front doors that don't go through `CEXus`?
- Also, important, do we think that the arrow from `CEXoth` to `CEXus` is actually there?
## Trade and Manufacturing
`D` is global demand for US manufactures, `L` is US labor supply, `Close` measures how similar the kinds of things the US manufactures are to China manufactures
```{r, dev='CairoPNG', echo=FALSE, fig.width=8,fig.height=4}
dag % tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Trade and Manufacturing
- So we need to control for `D` in some way to close `CEXoth D -> mfg` but other than that we have a good instrument
- (they do this, and also use information like "what does `Close` look like on a regional level?" to improve their estimate)
- Autor, Dorn, & Hanson found that Chinese exports elsewhere predicted them in the US (China was opening up and becoming more effective as a producer, making their products attractive everywhere)
## Trade and Manufacturing
- And when you limit `mfg` and `CEXus` to just what's explained by `CEXoth`, you do see that some decline in `mfg` is because of Chinese imports
![Direct effect of instruments on `mfg`](Lecture_15_AutorDornHanson.png)
## Practice
- Does the price of cigarettes affect smoking? Get AER package and data(CigarettesSW). Examine with help().
- Get JUST thecigarette taxes `cigtax` from `taxs-tax`
- Draw a causal diagram using `packs`, `price`, `cigtax`, and some back door `W`. What might `W` be?
- Adjust `price` and `cigtax` for inflation: divide them by `cpi`
- Explain `price` and `packs` with `cigtax` using `cut(,breaks=7)` for `cigtax`
- Get correlation between the explained parts and plot the explained parts - does price reduce packs smoked?
## Practice Answers
```{r, echo=TRUE}
library(AER)
data(CigarettesSW)
CigarettesSW %
mutate(cigtax = taxs-tax) %>%
mutate(price = price/cpi,
cigtax = cigtax/cpi) %>%
group_by(cut(cigtax,breaks=7)) %>%
summarize(priceexp = mean(price),
packsexp = mean(packs)) %>%
ungroup()
cor(CigarettesSW$priceexp,CigarettesSW$packsexp)
```
## Practice Answers Plot
```{r, echo=TRUE, fig.width=6,fig.height=4}
plot(CigarettesSW$priceexp,CigarettesSW$packsexp)
```
## Practice Diagram Answers
```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=4}
dag % tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Practice - Doing it with Regression!
- Common 2SLS estimators: `ivreg` in **AER**, `iv_robust` in **estimatr**, and `feols()` in **fixest**. We'll use the latter since it's fast easy to combine with fixed effects and all kinds of error adjustments
```{r, echo = TRUE, eval = FALSE}
m %
mutate(cigtax = taxs-tax) %>%
mutate(price = price/cpi,
cigtax = cigtax/cpi)
first_stage