---
title: "Lecture 6: Back Doors"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:
  revealjs::revealjs_presentation:
    theme: solarized
    transition: slide
    self_contained: true
    smart: true
    fig_caption: true
    reveal_options:
      slideNumber: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(dagitty)
library(ggdag)
library(gganimate)
library(ggthemes)
library(Cairo)
library(modelsummary)
library(wooldridge)
theme_set(theme_gray(base_size = 15))
```
## Recap
- We've now covered how to create causal diagrams
- (aka Directed Acyclic Graphs or DAGs)
- We simply write out the list of the important variables, and draw causal arrows indicating what causes what
- This allows us to figure out what we need to do to *identify* our effect of interest
## Today
- But HOW? How do we know?
- Today we'll be covering the *process* that lets you figure out whether you can identify your effect of interest, and how
- What do we need to condition the data on to limit ourselves just to the variation that identifies our effect of interest?
- It turns out, once we have our diagram, to be pretty straightforward
- So easy a computer can do it!
## The Back Door and the Front Door
- The basic way we're going to be thinking about this is with a metaphor
- When you do data analysis, it's like observing that someone left their house for the day
- When you do causal inference, it's like asking *how they left their house*
- You want to *make sure* that they came out the *front door*, and not out the back door, not out the window, not out the chimney
## The Back Door and the Front Door
- Let's go back to this example
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
dag <- dagify(profit~IP.spend+tech,
              IP.spend~tech) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## The Back Door and the Front Door
- We're interested in the effect of IP spend on profits. That means that our *front door* is the ways in which IP spend *causally affects* profits
- Our *back door* is any other thing that might drive a correlation between the two - the way that tech affects both
## Paths
- In order to formalize this a little more, we need to think about the various *paths*
- We observe that you got out of your house, but we want to know the paths you might have walked to get there
- So, what are the paths we can walk to get from IP.spend to profits?
## Paths
- We can go `IP.spend -> profit`
- Or `IP.spend <- tech -> profit`
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
dag <- dagify(profit~IP.spend+tech,
              IP.spend~tech) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## The Back Door and the Front Door
- One of these paths is the one we're interested in!
- `IP.spend -> profit` is a *front door path*
- One of them is not!
- `IP.spend <- tech -> profit` is a *back door path*
## Now what?
- Now, it's pretty simple!
- In order to make sure you came through the front door...
- We must *close the back door*
- We can do this by *controlling/adjusting* for things that will block that door!
- We can close `IP.spend <- tech -> profit` by adjusting for `tech`
## So?
- We already knew that we could get our desired effect in this case by controlling for `tech`.
- But this process lets us figure out what we need to do in a *much wider range of situations*
- All we need to do is follow the steps!
- List all the paths
- See which are back doors
- Adjust for a set of variables that closes all the back doors!
- (orrrr use a method that singles out the front door - we'll get there)
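## So?

- In fact, here's a sketch of handing the job to the computer, using the `dagitty` package we loaded earlier (the diagram is our IP spend example; `adjustmentSets()` is dagitty's back-door solver):

```{r, echo=TRUE, eval=FALSE}
library(dagitty)
# Write down the diagram: tech causes both IP.spend and profit
g <- dagitty('dag { tech -> IP.spend -> profit ; tech -> profit }')
# Ask which sets of controls close every back door from IP.spend to profit
adjustmentSets(g, exposure = 'IP.spend', outcome = 'profit')
# Should return: { tech }
```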
## Example
- How does wine affect your lifespan?
```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
dag <- dagify(life~wine+drugs+health+income,
              drugs~wine,
              wine~health+income,
              health~U1,
              income~U1) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Paths
- Paths from `wine` to `life`:
- `wine -> life`
- `wine -> drugs -> life`
- `wine <- health -> life`
- `wine <- income -> life`
- `wine <- health <- U1 -> income -> life`
- `wine <- income <- U1 -> health -> life`
- Don't leave any out, even the ones that seem redundant!
## Paths
| Front doors | Back doors |
|-------------|------------|
| `wine -> life` | `wine <- health -> life` |
| `wine -> drugs -> life` | `wine <- income -> life` |
| | `wine <- health <- U1 -> income -> life` |
| | `wine <- income <- U1 -> health -> life` |
## Adjusting
- By adjusting/controlling for variables we close these back doors
- If an adjusted variable appears anywhere along the path, we can close that path off
- Once *ALL* the back door paths are closed, we have blocked all the other ways that a correlation COULD appear except through the front door! We've identified the causal effect!
- This is "the back door method" for identifying the effect. There are other methods; we'll get to them.
## Adjusting for Health
| Front doors | Open back doors | Closed back doors |
|-------------|-----------------|-------------------|
| `wine -> life` | `wine <- income -> life` | `wine <- health -> life` |
| `wine -> drugs -> life` | | `wine <- health <- U1 -> income -> life` |
| | | `wine <- income <- U1 -> health -> life` |
## Adjusting for Health
- Clearly, adjusting for health isn't ENOUGH to identify
- We need to adjust for health AND income
- Conveniently, regression makes it easy to add additional controls
## Adjusting for Health and Income
| Front doors | Open back doors | Closed back doors |
|-------------|-----------------|-------------------|
| `wine -> life` | | `wine <- health -> life` |
| `wine -> drugs -> life` | | `wine <- income -> life` |
| | | `wine <- health <- U1 -> income -> life` |
| | | `wine <- income <- U1 -> health -> life` |
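## Adjusting for Health and Income

- As a sanity check on our by-hand path list, here's a sketch of the same question posed to `dagitty` (writing out the wine diagram's arrows, including `U1`):

```{r, echo=TRUE, eval=FALSE}
g <- dagitty('dag { wine -> life ; wine -> drugs -> life ;
                    health -> wine ; health -> life ;
                    income -> wine ; income -> life ;
                    U1 -> health ; U1 -> income }')
adjustmentSets(g, exposure = 'wine', outcome = 'life')
# Should return: { health, income }
```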
## How about Drugs?
- Should we adjust for drugs?
- No! This whole procedure makes that clear
- It's on a *front door path*
- If we adjusted for that, that's shutting out part of the way that `wine` *DOES* affect `life`
## Practice
- We want to know how `X` affects `Y`. Find all paths and make a list of what to adjust for to close all back doors
```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
dag <- dagify(Y~X+E+A+B+C,
              X~A+B,
              E~X,
              A~B+C) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Practice Answers
- Front door paths: `X -> Y`, `X -> E -> Y`
- Back doors: `X <- A -> Y`, `X <- A <- B -> Y`, `X <- A <- C -> Y`, `X <- B -> Y`, `X <- B -> A -> Y`, `X <- B -> A <- C -> Y`
- (that last back door is actually pre-closed, we'll get to that later)
- We can close all back doors by adjusting for `A` and `B`.
## Controlling
- So... what does it actually mean to control for something?
- Often the way we *will do it* is just by adding a control variable to a regression. In $Y = \beta_0 + \beta_1X + \beta_2Z + \varepsilon$, the $\hat{\beta}_1$ estimate gives the effect of $X$ on $Y$ *while controlling for $Z$*, and if adjusting for $Z$ closes all back doors, we've identified the effect of $X$ on $Y$!
- But what does it *mean*?
## Controlling
- The *idea* of controlling for a variable is that we want to *remove all parts of the $X$/$Y$ relationship that are related to that variable*
- I.e. we want to *remove all variation related to that variable*
- A regression control will do this (although it will only do it *linearly*), but anything that achieves this goal will work!
- For example, if you want to "control for income", we could add income as a regression control, *or* we could pick a sample only made up of people with very similar incomes
- No variation in $Z$: $Z$ is controlled for!
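## Controlling

- As a sketch of that sampling idea (hypothetical data - imagine a tibble `df` with `wine`, `life`, and `income` columns):

```{r, echo=TRUE, eval=FALSE}
# Pick a sample with (nearly) no variation in income;
# income can no longer drive the wine/life relationship
df %>%
  filter(income >= 40000, income <= 45000) %>%
  summarize(correlation = cor(wine, life))
```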
## The Two Main Approaches to Controlling
Predicting Variation (what regression does):
- Use $Z$ (and the other controls) to predict both $X$ and $Y$ as best you can
- Remove all the predictable parts, and use only remaining variation, which is unrelated (orthogonal) to $Z$
Selecting Non-Variation (what "matching" does):
- Choose observations that have different values of $X$ but have values of $Z$ that are as similar as possible
- With multiple controls, this requires some way of combining them together to get a single "similarity" value
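## The Two Main Approaches to Controlling

- Here's a minimal sketch of the matching approach with a single control (simulated data; real matching estimators, e.g. in the `MatchIt` package, are much more careful):

```{r, echo=TRUE, eval=FALSE}
set.seed(1000)
d <- tibble(z = rnorm(200)) %>%
  mutate(x = z + rnorm(200) > 0,
         y = 2*x + 3*z + rnorm(200))
treated <- d %>% filter(x)
untreated <- d %>% filter(!x)
# For each treated observation, find the untreated observation
# with the most similar value of the control z
match_index <- sapply(treated$z, function(zi) which.min(abs(untreated$z - zi)))
# Compare y between treated observations and their matched controls
mean(treated$y - untreated$y[match_index])
# Should be close to the true effect of 2
```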
## The Two Main Approaches to Controlling
- In this class we'll be focusing mostly on regression
- Purely because that's what economists do most of the time
- Regression and matching rely on slightly different assumptions to work, but neither is better than the other
- Newfangled "doubly-robust" methods do both regression AND matching, so that the model only fails if the assumptions of BOTH methods fail
- So then, focusing on the "predicting variation" approach...
## Controlling
- Up to now, here's how we've been getting the relationship between `X` and `Y` while controlling for `W`:
1. See what part of `X` is explained by `W`, and subtract it out. Call the result the residual part of `X`.
2. See what part of `Y` is explained by `W`, and subtract it out. Call the result the residual part of `Y`.
3. Get the relationship between the residual part of `X` and the residual part of `Y`.
- With the last step including things like getting the correlation, plotting the relationship, calculating the variance explained, or comparing mean `Y` across values of `X`
## In code
```{r, echo=TRUE, eval = FALSE}
df <- tibble(w = rnorm(100)) %>%
  mutate(x = 2*w + rnorm(100)) %>%
  mutate(y = 1*x + 4*w + rnorm(100))
df <- df %>%
  mutate(x.resid = x - predict(lm(x~w)),
         y.resid = y - predict(lm(y~w)))
m1 <- lm(y ~ x, data = df)
m2 <- lm(y.resid ~ x.resid, data = df)
```

```{r, echo=FALSE, eval=TRUE}
df <- tibble(w = rnorm(100)) %>%
  mutate(x = 2*w + rnorm(100)) %>%
  mutate(y = 1*x + 4*w + rnorm(100))
df <- df %>%
  mutate(x.resid = x - predict(lm(x~w)),
         y.resid = y - predict(lm(y~w)))
m1 <- lm(y ~ x, data = df)
m2 <- lm(y.resid ~ x.resid, data = df)
msummary(list(m1, m2), stars = TRUE, gof_omit = 'Adj|AIC|BIC|Lik|F')
```

## In a Diagram

- Two paths: `X -> Y` and `X <- W -> Y`
- We remove the parts of `X` and `Y` that `W` explains to get rid of `X <- W -> Y`, blocking the back door and leaving `X -> Y`
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4}
dag <- dagify(Y~X+W,
              X~W) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Graphically
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
df <- data.frame(W = as.integer((1:200>100))) %>%
  mutate(X = .5+2*W + rnorm(200)) %>%
  mutate(Y = -.5*X + 4*W + 1 + rnorm(200),time="1") %>%
  group_by(W) %>%
  mutate(mean_X=mean(X),mean_Y=mean(Y)) %>%
  ungroup()
#Calculate correlations
before_cor <- paste("1. Start with raw data. Correlation between X and Y: ",round(cor(df$X,df$Y),3),sep='')
after_cor <- paste("6. Analyze what's left! Correlation between X and Y: ",round(cor(df$X-df$mean_X,df$Y-df$mean_Y),3),sep='')
dffull <- rbind(
  #Step 1: Raw data only
  df %>% mutate(mean_X=NA,mean_Y=NA,time=before_cor),
  #Step 2: Add x-lines
  df %>% mutate(mean_Y=NA,time='2. Figure out what differences in X are explained by W'),
  #Step 3: X de-meaned
  df %>% mutate(X = X - mean_X,mean_X=0,mean_Y=NA,time="3. Remove differences in X explained by W"),
  #Step 4: Remove X lines, add Y
  df %>% mutate(X = X - mean_X,mean_X=NA,time="4. Figure out what differences in Y are explained by W"),
  #Step 5: Y de-meaned
  df %>% mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=0,time="5. Remove differences in Y explained by W"),
  #Step 6: Raw demeaned data only
  df %>% mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=NA,time=after_cor))
p <- ggplot(dffull,aes(y=Y,x=X,color=as.factor(W)))+geom_point()+
  geom_vline(aes(xintercept=mean_X,color=as.factor(W)))+
  geom_hline(aes(yintercept=mean_Y,color=as.factor(W)))+
  guides(color=guide_legend(title="W"))+
  scale_color_colorblind()+
  labs(title = 'The Relationship between X and Y, Controlling for W \n{next_state}')+
  transition_states(time,transition_length=c(12,32,12,32,12,12),state_length=c(160,100,75,100,75,160),wrap=FALSE)+
  ease_aes('sine-in-out')+
  exit_fade()+enter_fade()
animate(p,nframes=200)
```

## Wages and Job Training

- Let's apply this to some real data on wages and job training from the `wooldridge` package
- `jtrain2` is from an actual randomized experiment; `jtrain3` is observational data where people *chose* whether to get training

```{r, echo=TRUE, eval=FALSE}
#EXPERIMENT
data(jtrain2)
jtrain2 %>% group_by(train) %>% summarize(wage = mean(re78))
```
```{r, echo=FALSE, eval=TRUE}
#EXPERIMENT
data(jtrain2)
jtrain2 %>% group_by(train) %>% summarize(wage = mean(re78))
```
```{r, echo=TRUE, eval=TRUE}
#BY CHOICE
data(jtrain3)
jtrain3 %>% group_by(train) %>% summarize(wage = mean(re78))
```
## Hmm...
- What back doors might the `jtrain3` analysis be facing?
- People who need training want to get it but are likely to get lower wages anyway!
```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=4.5}
set.seed(1000)
dag <- dagify(wage~train+need.tr+U,
              train~need.tr+U) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Apples to Apples
- The two data sets are looking at very different groups of people!
```{r, echo=TRUE}
library(vtable)
sumtable(select(jtrain2,re75,re78), out = 'return')
sumtable(select(jtrain3,re75,re78), out = 'return')
```
## Controlling
- We can't measure "needs training" directly, but we can sort of control for it by limiting ourselves solely to the kind of people who need it - those who had low wages in 1975
```{r, echo=FALSE, eval=TRUE}
jtrain2 %>% group_by(train) %>% summarize(wage = mean(re78))
jtrain3 %>% filter(re75 <= 1.2) %>% group_by(train) %>% summarize(wage = mean(re78))
```
## Controlling
- Not exactly the same (not surprising - we were pretty arbitrary in how we controlled for `need.tr`, and we never closed `train <- U -> wage`, oh and we left out plenty of other back doors: race, age, etc.) but an improvement
- This is a demonstration of controlling by choosing a sample; we could also just control for 1975 wages
## Controlling
```{r, echo = FALSE}
msummary(list(lm(re78~train, data = jtrain3), lm(re78~train+re75, data = jtrain3)),
stars = TRUE, gof_omit = 'Adj|AIC|BIC|Lik|F')
```
## Bad Controls
- So far so good - we have the concept of what it means to control and some ways we can do it, so we can get apples-to-apples comparisons
- But what should we control for?
- Everything, right? We want to make sure our comparison is as apple-y as possible!
- Well, no, not actually
## Bad Controls
- Some controls can take you away from showing you the front door
- We already discussed how it's not a good idea to block a front-door path.
- An increase in the price of cigarettes might improve your health, but not if we control for the number of cigarettes you smoke!
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=2.5}
dag <- dagify(health~cigs,
              cigs~price) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
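## Bad Controls

- A quick simulated sketch of that cigarette example (made-up numbers: price reduces smoking, smoking hurts health):

```{r, echo=TRUE, eval=FALSE}
set.seed(1000)
cigs_df <- tibble(price = rnorm(500)) %>%
  mutate(cigs = -2*price + rnorm(500),
         health = -1*cigs + rnorm(500))
# Price affects health only through cigs; coefficient should be near +2
lm(health ~ price, data = cigs_df)
# Controlling for cigs blocks the front door - the price effect vanishes!
lm(health ~ price + cigs, data = cigs_df)
```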
## Bad Controls
- There is another kind of bad control - a *collider*
- Basically, if you're listing out paths, and you see a path where the arrows *collide* by both pointing at the same variable, **that path is already blocked**
- Like this: `X <- A -> C <- B -> Y`
- Note the `-> C <-` - the arrows *collide* at `C`, so the path is *already blocked* without us controlling for anything

```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4}
m_bias() %>%
ggdag(node_size=20) +
theme_dag_blank()
```
## Colliders
- How could this be?
- Because even if two variables *cause* the same thing (`a -> m`, `b -> m`), that doesn't make them related. Your parents both caused your genetic makeup, that doesn't make *their* genetics related. Knowing dad's eye color tells you nothing about mom's.
- But *within given values of the collider*, they ARE related. If you're brown-eyed, then observing that your dad has blue eyes tells us that your mom is brown-eyed
## Colliders
- So here, `x <- a -> m <- b -> y` is pre-blocked, no problem. `a` and `b` are unrelated, so no back door issue!
- Control for `m` and now `a` and `b` are related, back door path open.
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4}
m_bias(x_y_associated=TRUE) %>%
ggdag(node_size=20) +
theme_dag_blank()
```
## Example
- You want to know if programming skills reduce your social skills
- So you go to a tech company and test all their employees on programming and social skills
- Let's imagine that the *truth* is that programming skills and social skills are unrelated
- But you find a negative relationship! What gives?
## Example
- Oops! By surveying only the tech company, you controlled for "works in a tech company"
- To do that, you need programming skills, social skills, or both! It's a collider!
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5}
dag <- dagify(hired~prog+social) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Example
```{r, echo=TRUE, eval = FALSE}
set.seed(14233)
survey <- tibble(prog = rnorm(1000), social = rnorm(1000)) %>%
  mutate(hired = (prog + social > .25))
# Truth
m1 <- lm(social~prog, data = survey)
# Controlling for hired by only surveying those who were hired
m2 <- lm(social~prog, data = survey %>% filter(hired == 1))
#Surveying everyone and controlling with our normal method
m3 <- lm(social~prog+hired, data = survey)
```

```{r, echo=FALSE, eval=TRUE}
set.seed(14233)
survey <- tibble(prog = rnorm(1000), social = rnorm(1000)) %>%
  mutate(hired = (prog + social > .25))
# Truth
m1 <- lm(social~prog, data = survey)
# Controlling for hired by only surveying those who were hired
m2 <- lm(social~prog, data = survey %>% filter(hired == 1))
#Surveying everyone and controlling with our normal method
m3 <- lm(social~prog+hired, data = survey)
msummary(list(m1, m2, m3), stars = TRUE, gof_omit = 'Adj|AIC|BIC|Lik|F')
```

## Graphically

```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
df <- survey %>%
transmute(time="1",
X=prog,Y=social,C=hired) %>%
group_by(C) %>%
mutate(mean_X=mean(X),mean_Y=mean(Y)) %>%
ungroup()
#Calculate correlations
before_cor <- paste("1. Start with raw data. Correlation between prog and social: ",round(cor(df$X,df$Y),3),sep='')
after_cor <- paste("7. Analyze what's left! Correlation between prog and social: ",round(cor(df$X-df$mean_X,df$Y-df$mean_Y),3),sep='')
dffull <- rbind(
  #Step 1: Raw data only
  df %>% mutate(mean_X=NA,mean_Y=NA,C=0,time=before_cor),
#Step 2: Raw data only
df %>% mutate(mean_X=NA,mean_Y=NA,time='2. Separate data by the values of hired.'),
#Step 3: Add x-lines
df %>% mutate(mean_Y=NA,time='3. Figure out what differences in prog are explained by hired'),
#Step 4: X de-meaned
df %>% mutate(X = X - mean_X,mean_X=0,mean_Y=NA,time="4. Remove differences in prog explained by hired"),
#Step 5: Remove X lines, add Y
df %>% mutate(X = X - mean_X,mean_X=NA,time="5. Figure out what differences in social are explained by hired"),
#Step 6: Y de-meaned
df %>% mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=0,time="6. Remove differences in social explained by hired"),
#Step 7: Raw demeaned data only
df %>% mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=NA,time=after_cor))
p <- ggplot(dffull,aes(y=Y,x=X,color=as.factor(C)))+geom_point()+
  geom_vline(aes(xintercept=mean_X,color=as.factor(C)))+
  geom_hline(aes(yintercept=mean_Y,color=as.factor(C)))+
  guides(color=guide_legend(title="hired"))+
  scale_color_colorblind()+
  labs(title = 'The Relationship between prog and social, Controlling for hired \n{next_state}')+
  transition_states(time,transition_length=c(12,32,12,32,12,12,12),state_length=c(160,125,100,75,100,75,160),wrap=FALSE)+
  ease_aes('sine-in-out')+
  exit_fade()+enter_fade()
animate(p,nframes=200)
```

## Colliders in the Gender Wage Gap

- Let's apply colliders to the gender wage gap: one way gender affects wages is discrimination, `gender -> discrim -> wage`; our treatment is `gender -> discrim`, the discrimination caused by your gender
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
dag <- dagify(wage~discrim+occup+abil,
              occup~gender+discrim+abil,
              discrim~gender) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Colliders in the Gender Wage Gap
| Front doors | Open back doors | Closed back doors |
|-------------|-----------------|-------------------|
| `gender -> discrim -> wage` | `discrim <- gender -> occup -> wage` | `discrim <- gender -> occup <- abil -> wage` |
| `gender -> discrim -> occup -> wage` | | `gender -> discrim -> occup <- abil -> wage` |
## Colliders in the Gender Wage Gap
- No `occup` control? Ignore nondiscriminatory reasons to choose different occupations by gender
- Control for `occup`? Open both back doors, create a correlation between `abil` and `discrim` where there wasn't one
- And also close a FRONT door, `gender -> discrim -> occup -> wage`: discriminatory reasons for gender diffs in `occup`
- We actually *can't* identify the effect we want in this diagram by controlling. It happens!
- Suggests this question goes beyond just controlling for stuff. Real research on this topic gets clever.
## Next Time
- Perhaps one of the ways we could get at the problem is by isolating *front doors* instead of focusing on closing *back doors*
- Many common causal inference methods combine the two!
- Next time we'll look at the concept of isolating a front door path, usually using "natural experiments"
## Practice
- We want to know how `X` affects `Y`. Find all paths and make a list of what to adjust for to close all back doors
```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
dag <- dagify(Y~X+E+A+B+C,
              X~A+B,
              E~X,
              A~B+C) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Practice Answers
- Front door paths: `X -> Y`, `X -> E -> Y`
- Back doors: `X <- A -> Y`, `X <- A <- B -> Y`, `X <- A <- C -> Y`, `X <- B -> Y`, `X <- B -> A -> Y`, `X <- B -> A <- C -> Y`
- (`X <- B -> A <- C -> Y` is pre-closed by a collider)
- We can close all back doors by adjusting for `A` and `B`.
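## Practice Answers

- And a sketch of checking that answer by computer, writing out the practice diagram's arrows for `dagitty`:

```{r, echo=TRUE, eval=FALSE}
g <- dagitty('dag { X -> Y ; X -> E -> Y ;
                    A -> X ; A -> Y ; B -> X ; B -> Y ;
                    B -> A ; C -> A ; C -> Y }')
adjustmentSets(g, exposure = 'X', outcome = 'Y')
# Should return: { A, B }
```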