Thank you so much for this! My question is: why didn't you run a complete regression model for house price? Why only a bivariate regression (i.e., the dependent variable and dummies)?
The purpose of this video is to demonstrate the basic technique of differences in differences estimation. You can certainly add controls to the basic model, but that is outside the scope of this video. You can search my channel for other videos on that.
I watched the video 2 years ago, and it helped me understand the DID model and Stata so that I could finish my graduation dissertation on time. After graduation, I published another paper using the same model. Thank you sooooooooo much!!!!!!!!
Thanks, very well explained. Can I get this dummy data set, or can you please point me to where I can get such a data set for educational/learning purposes only?
Thanks for the video! If you don't have an ideal counterfactual control group (i.e. there are some slight differences between the treatment and control groups in the pre-treatment period), can you add other independent variables to the diff n diff when running the regression in Stata?
Thanks for the detailed info. What if my dependent variable is categorical, like anemia (yes/no)? Should I report the B coefficient or Exp(B)? And how can I cross-check it in Excel?
Hello sir, please provide a video on reshaping from wide to long, particularly when the data set is very large, i.e., how to organise the variables before the reshape. Please, sir.
Phenomenal explanation, thank you. If you wanted to include more prior years and a few years after, would you have to make a dummy variable for each year?
Thanks for this video! One question: how would you proceed if you are comparing the difference between control and treated group across a 4 week period, testing whether the difference is bigger in the beginning and decreases?
Thank you for the video! Btw, is there any way that we can also see the trends of both groups by drawing a line graph in Stata? If the trends are the same before the treatment period, we should be able to see that, right?
Hi Sebastian, very useful video at a great pace ;). In this example you compare the differences in price, how would you interpret the results if the variable is categorical (eg. completed studies, married, etc). Many thanks!
You can only do this if the categorical variable is binary (eg. married and not married). Assign a 1 to married and 0 to unmarried. We now have a linear probability model (see my video on binary choice models). The interpretation of the diff-in-diff is now the difference in probability of being married.
What a clear explanation! I'm working on my own DD regression, and it really helped. The dependent variable 'price' covers prices both before and after the treatment here, right?
I wanted to keep things simple and focus just on the diff in diff technique. However, you can certainly add more variables to the regression as controls.
Dr., thanks for your excellent explanation. Are the steps the same for panel data? I am planning to run DID on a panel (2000-2019). I would appreciate your suggestion.
Hey, thanks for the great content here. QUESTION: How can I test for the "common trend" assumption of the DiD-estimator in Stata or in general? Thanks in advance!
Usually, this is done informally by comparing the dependent variable movement across groups in an extended period of time before and after the treatment goes into effect. You need a lot more data than I have in this example.
Hi Sebastian, your video helped me a lot in understanding DID estimation. I have a follow-up question. Is it possible to estimate difference in differences for survey data analysis? I tried it on my survey data. However, the DID from the regression and the DID from manual collapse calculations show different results.
Hi, I'm doing a DiD for my thesis, but I'm dealing with panel data. Do you know what I should do differently compared to the regression you show in this video? I noticed that there is a Stata command for a fixed effects DiD regression, for example.
Hi Sebastian, thank you for the great, very informative content. I have a question. My research looks at the impact of a bank regulation implemented in 2014, which only affects the bigger banks within my population: banks with population of 25b and over. I have gathered panel data from 2010 to 2019. I intend to use performance ratios as dependent variables and variables that determine profitability as control variables. I am using DID in an FE model in Gretl to run the regression. I have generated some dummy variables: a time dummy for before and after, a group dummy with the banks impacted by the regulation as the treatment group and the rest as the control group, and a regulatory dummy, which I am not sure is necessary. Three questions: 1. Is this research feasible in terms of parallel trends? 2. Will I need to interact all the other variables in my model with time, or does the interaction only need to be between the time and group dummies? If yes, do I need to add the group dummy to every interaction I do? 3. Is there a need to add individual time effects since I am running the regression in an FE model? Many thanks in advance.
1. I have no idea, but it sounds like you have enough data to make that determination yourself. 2. You should think about this on a case-by-case basis. Think about what you're trying to accomplish and whether or not interactions would help with that. 3. Time dummy variables are an important component in FE. I have some videos on FE and panel data on my channel.
Hi, many thanks for the video. When I try to do DID for my panel data set, Stata says that my treatment group dummy and DID variable are omitted due to collinearity. Do you know why this would be and how I could fix it?
Most likely what happened is that you made a mistake creating your dummy variables. Click the magnifying glass button to look at your data to check what went wrong.
Hi, thanks for the clear explanation. Is it possible to do a DID in percentage terms, so that I come up with a % increase/decrease in the treatment group? Thanks!!
If you want to know about p-values, I suggest taking a look at my video on hypothesis testing: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-lhoqZjQHHjk.html
A very useful video. Thank you so much. I have a question. I created 3 columns similar to y81, nearinc, and y81nrinc. I am running a two-part logit and GLM model. Since the values of y81 and the other two are either 0 or 1, should we write i.y81, etc.? I mean, aren't we supposed to put i. before a binary variable?
For a binary variable, you will get the same result just putting the variable in or using the i. structure. If you have a categorical variable with more than two possible values, then you need to use i.
Have you ever done a coefplot to test the treatment effect? I get a positive but not significant coefficient for my treat dummy variable. Does this mean that the treatment group actually saw an increase in fatalities (my y variable), or does it mean my treatment effect is positive? It is confusing because if I do a lowess plot on just the different states, fatalities drop over time. However, in the coefplot the graph is trending upwards.
Hi, I would like to know whether difference in differences analysis is suitable for a small data set that contains only 2 years of data and has only 168 samples (84 control and 84 treatment). Thank you so much.
I don't see any reason why not. However, with only 2 years of data, you have no idea of how the outcomes have been trending over time, and you may have a hard time justifying your counterfactual.
Thanks. Another question: is it not necessary to tell Stata we have panel data when we have already created the dummy variables that differentiate the control and treatment groups, and the pre and post periods? No need to run a fixed effects regression either, I guess. I'm just learning about the subject :)
For a simple DD like this, you don't need to use xtset, if that's what you're asking. You can actually think of a DD as a very simple sort of FE model that only has two groups and two periods. If you want to see more about FE, I also have a video on it.
Sebastian, thank you so much for this video. Does the data have to be in long shape? Is there a way to run the diff in diff regression on a wide dataset? Thank you.
Thank you very much! What about the interaction dummy between year and dummy? Given that my dataset is a balanced panel of 400 firms observed in both 2008 and 2013? Thanks again
@@sebastianwaiecon Just to follow up on this, if you do have the same units before and after, the paired difference test gives a different result than the regression you discuss in the video: Y = b1 + b2*treat + b3*time + b4*treat*time, which assumes independent samples, does it not?
Hi Sebastian, thanks a lot for the clean explanation! Could you tell me why you were including post-treatment levels of your covariates? Aren't they endogenous, and don't they thus result in bias? Thanks in advance!
I don't understand the question. What I showed here is the most basic version of diff in diff, with the bare minimum amount of variables needed. Even if I had added more variables, that would not have created any bias -- bias happens because you left variables out.
@@jackgandhi Thank you for the fast reply! Sorry, I meant the covariate data structure. I recently did a DiD setup making use of this video's data structure, and got the criticism that, since I included covariates with a time index for the post-treatment period in the regression, these were endogenous and would thus impose bias.
@@ssjvegeto4ever What you are describing is a common and valid criticism of time series analysis. The purpose of diff in diff is, if the data allows, solving this problem using a control and treatment group. The "post" dummy (y81 in the video) is not enough to establish a causal relationship. This is why we have the interaction term (y81nearinc in the video). In this video, y81 controls for effects over time that are constant across groups while nearinc controls for group effects that are constant over time. The interaction pulls out the estimated effect. This is not to say this method is perfect as there could still be endogeneity due to variables that are constant neither across groups nor across time, so you still may need to think about controls. The diff in diff method is just one tool in the analyst's toolbox.
Hello sir! I have a question. It looks like you first run a simple OLS regression and then compute the differences using the collapse command. I do not understand whether I should just use the OLS regression and report the difference estimator (-18824) as the DID estimator. Please guide me.
The number you gave estimates the difference between the treatment and control group before the treatment. We need to use the coefficient estimate for the interaction term to get the DID estimator.
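To see why the interaction coefficient is the DID estimator: in a regression with only the time dummy, the group dummy, and their interaction, each coefficient is a difference of group averages. A minimal sketch in Python with made-up prices (not the video's data), borrowing the video's variable names:

```python
from statistics import mean

# Hypothetical toy data, invented for illustration: (price, y81, nearinc)
rows = [
    (80, 0, 0), (84, 0, 0),    # control group, before
    (62, 0, 1), (66, 0, 1),    # treatment group, before
    (100, 1, 0), (104, 1, 0),  # control group, after
    (70, 1, 1), (74, 1, 1),    # treatment group, after
]

def cell_mean(y81, nearinc):
    """Average price for one time/group cell."""
    return mean(p for p, t, g in rows if t == y81 and g == nearinc)

# What each coefficient in price = b0 + b1*y81 + b2*nearinc + b3*y81*nearinc equals
b_cons = cell_mean(0, 0)                        # control group, before
b_y81 = cell_mean(1, 0) - cell_mean(0, 0)       # change over time in the control group
b_nearinc = cell_mean(0, 1) - cell_mean(0, 0)   # group gap BEFORE the treatment
b_did = (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

print(b_cons, b_y81, b_nearinc, b_did)
```

Here b_nearinc is only the pre-treatment gap between the groups, while b_did (the interaction) is the difference in differences.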
Dear Sebastian, I am working on my dissertation using DiD, and I included additional control variables in my model. However, the model suffers from heteroskedasticity and autocorrelation. How should I deal with them?
Hi, Sebastian, thank you so much for your video. I was wondering if it's possible to do propensity score matching and difference in differences when my dependent variable is dichotomous?
I can't comment on specifics as I've never combined all of these myself. However, both diff in diff and propensity score matching can be done with dichotomous dependent variables. You just need to be careful about the issues inherent in linear probability. See my video on binary choice models for details.
Hi SebastianWaiEcon, I am a student at Morehouse College, and I really enjoyed watching your video. I need help running a diff in diff regression for my research paper. For context, I am using Stata to analyze NAFTA's impact on GDP and trade flow for its member nations. To facilitate this process, I will be running an individual diff in diff analysis for each country. My dummy variable will be years before 1994 (when NAFTA was signed) and after 1994. My DV will be GDP growth. My extra variables will cover human capital, agriculture industry growth percentage, manufacturing growth percentage, and others. However, I struggle with the Stata platform and would like your advice to ensure this regression runs smoothly.
The most important thing for diff in diff is to identify a control and treatment group. In your case, that might be countries that were part of NAFTA and countries that were not.
@@sebastianwaiecon Enjoying your video, but I need help. I have 25 countries and data from 1960-2020. How can I specify a single point in time, 2012, while comparing across 2010-2016? Please help me.
@@amnashaukat7827 A fixed effects model may be more appropriate: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-H95BHswbT3w.html&ab_channel=SebastianWaiEcon
Could you please tell me what to do in the following case? We are using, for example, DHS data, which has data on the demographics and health of a nation, but we want to see the effect of an external policy, like NREGA, on the labour force participation of females (the data for which is available in DHS). Should we merge the NREGA data with the DHS data and then apply matching techniques to determine treatment and control groups? If not, how should we measure the impact? Thanks
I note that you have large Standard Errors in your findings. Does this in any way have an impact on the reliability of the findings or the interpretation of the overall impact of the program (or incinerator in this case)?
It's all relative when it comes to standard errors. You could say an SE of about 8000, as it is here, is large, but the estimate is -20,000. Standard errors are always going to be big numbers when dealing with things like the prices of homes, which are in the tens of thousands. All other things being equal, larger standard errors mean less precision in the estimates. Here, we can still be quite confident the incinerator did decrease property values.
Hi! Nice video, thank you very much! I have a question: what do you do if there is time-varying treatment? In your example, imagine the incinerator got built in one neighbourhood (1) in '81 but in another neighbourhood (2) in '82. Would it be reg price y81 y82 nearincneighbourdhood1 nearincneighorhood2 y81* nearincneighbourdhood1 y82*nearincneighorhood2? Something like that?
You could also consider including interactions between y81 and neighborhood 2 and y82 and neighborhood 1. Once we get into more than 2 periods you should also be thinking of this as a fixed effects model. You may find my video on that helpful.
Sir, what is the difference between xtreg and reg? If I use data from the years 2007 and 2014, should I use reg or xtreg? My dataset doesn't have the same units across 2007 and 2014.
Reg is the basic regression command and xtreg is used for panel data methods such as within estimation and random effects. If you don't have the same units across years (pooled cross section), then you probably want to use reg.
Do you mean you have multiple periods before and after the change? It functions the same as this, but you need to define your "post" variable to include all periods after the change.
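A minimal sketch in Python of a "post" variable that covers every period after the change. The years and the treatment year are hypothetical, and counting the treatment year itself as "post" is an assumption you would adjust to your setting:

```python
# Hypothetical setup: yearly data and a treatment that starts in 2014
TREATMENT_YEAR = 2014
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016]

def make_post(year, treatment_year=TREATMENT_YEAR):
    """Return 1 for every period from the treatment year onward, else 0."""
    return 1 if year >= treatment_year else 0

post = [make_post(y) for y in years]

# The interaction term for a treated unit (treated = 1) is post * treated
treated = 1
did_term = [p * treated for p in post]

print(post)
```

For an untreated unit (treated = 0) the interaction term is 0 in every year, which is what lets the regression separate the common time effect from the treatment effect.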
That would be the simplest way to do it. I'm not promising this is the perfect solution as you may need to think about more sophisticated ways to handle your specific data, but it is a good starting point.
Can you do a DD with logistic regression? Say I have a dichotomous outcome; for this example, it could be something like house sold (yes/no). Would it be similar Stata code, just changing "regress" to "logistic", or are there considerations within DD that might limit the statistical validity of that sort of analysis?
The principles which drive DD -- controlling for time trends and cross sectional trends -- are still useful for logits (and probits also). However, you need to be careful about the coefficient interpretations, as it's not as clean as in the least squares DD. I would suggest looking at my video on binary choice models for details.
Unlike FE models, diff in diff does not necessarily have the same cross-sectional units across time periods. In my example, it's not the same houses in '78 and '81. As such, ID-based FE won't work. Here, the nearinc variable plays the same role as the FE. Your time dummy is already in there in DD.
Yes, I get that. I have unbalanced panel data and I want to conduct a Difference-in-Differences with id and time fixed effects. Is // xtreg DepVar i.treated##i.during controls i.month , fe cluster(id) // the correct model to achieve that? Or do you think that it would be better to exclude the fixed effects?
Firstly, thank you for your video, which is very helpful. As you mentioned in your comment, it was not the same houses in '78 and '81. Does that mean your treatment and control groups are not the same pre- and post-treatment?
In this dataset, that is y81 -- a dummy variable with a 1 for 1981 and 0 otherwise. I have another video with some examples of how to create dummy variables: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-DuAhUpM-56E.html
You don't "fix" it, it's just the result you got. It tells you that you can't reject the hypothesis that your treatment had no effect. Now, it could be that you have some endogeneity that you need to control for, but statistical significance, or lack thereof, is not (by itself) a problem to be fixed.
@@sebastianwaiecon Hi, if the interaction term is insignificant, will adding more variables help make the result significant? The results show that the constant term is highly significant, which means that there is an omitted variable bias. I guess adding more controls can help solve the problem of the insignificant interaction term.
@@consultingfaqs It bears repeating that the treatment not being significant is not a "problem" to be solved unless you think this is because of an omitted variable. Tinkering around with different models with the explicit purpose of finding a significant effect is not an ethical use of data. The constant term being highly significant is also not evidence of omitted variables. I'm not sure where you got that idea. Adding more variables might or might not result in existing terms being more significant. It all depends on the direction of the bias, if there is one.
nearinc indicates whether the house is within 3 miles of the incinerator. There is a variable called "dist" which is the distance from the incinerator in feet. To create the dummy, we would use the command: gen nearinc = dist
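The command as quoted above stops at `dist`, so the cutoff and the comparison operator are not shown. A minimal sketch of the same idea in Python, assuming "within 3 miles" means a distance at or below 3 × 5,280 = 15,840 feet (whether the boundary is inclusive is an assumption, and the sample distances are made up):

```python
FEET_PER_MILE = 5280
CUTOFF_FEET = 3 * FEET_PER_MILE  # 15,840 feet; assumed cutoff for "within 3 miles"

def near_incinerator(dist_feet):
    """Return 1 if the house is within the assumed cutoff, else 0."""
    return 1 if dist_feet <= CUTOFF_FEET else 0

# Hypothetical distances in feet
dists = [5000, 15840, 20000]
nearinc = [near_incinerator(d) for d in dists]

print(nearinc)
```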
Hi Sebastian, thank you so much. Quick question. Is this dataset a panel, or two separate cross section datasets? I am assuming it is two separate cross section, right?
@@sebastianwaiecon Good point and thank you so much for the quick reply! I am working on a thesis and realised that I was supposed to be doing DiD when I had been using a different methodology for the past few weeks. Your video is incredible. Big thanks from Australia!
Hi what if I want to control for additional variables? Then the command "collapse (mean) y, by(after treatment) " is not sufficient. Please tell me what to do to control for variables.
Hey, I kind of understand diff-in-diff, but now I am dealing with a problem: what if the control group is at much larger levels than the treatment group? Let's say control before: 100, after: 200 = 100% increase; treatment before: 5, after: 9. If I calculate the DID effect using the standard table, i.e., the difference between the differences, I get in this case 100-4 = 96! So the counterfactual state of the world for the treatment group would be 105?! That does not make sense, no? Even R with OLS gives me these results. What am I doing wrong? Thank you
Thank you for sharing this. When I run the command "corr(y81 nearinc y81nrinc)" to test the autocorrelation between variables, the result shows there is an autocorrelation between the "nearinc" and "y81nrinc" variables. The correlation coefficient is 0.5776. So my question is: what should we do in this situation?
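One common way to handle groups at very different levels is to run the DiD on the log of the outcome, so that parallel trends is assumed in proportional rather than absolute terms. This is not something the video covers; the sketch below only redoes the arithmetic from the comment above both ways:

```python
import math

# Numbers taken from the comment above
control_before, control_after = 100, 200
treat_before, treat_after = 5, 9

# DiD in levels: the treatment group is assumed to follow the control group's absolute change
did_levels = (treat_after - treat_before) - (control_after - control_before)
counterfactual_levels = treat_before + (control_after - control_before)

# DiD in logs: the treatment group is assumed to follow the control group's proportional change
did_logs = (math.log(treat_after) - math.log(treat_before)) - (
    math.log(control_after) - math.log(control_before)
)
counterfactual_logs = treat_before * (control_after / control_before)

print(did_levels, counterfactual_levels)
print(round(did_logs, 4), counterfactual_logs)
```

In levels the implied counterfactual for the treatment group is 105. In logs it is 10, so the observed 9 is roughly 10% below the counterfactual. Which assumption is more plausible depends on the setting.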
First of all, "autocorrelation" is a very specific term, which you are using incorrectly. In time series data, this refers to a variable correlating with itself across time. In any case, you've pointed out that an interaction term is correlated with one of the variables you are interacting. This is true by definition. There isn't anything you do about that -- it would be strange if it were not the case. In a more general sense, there is nothing wrong with two variables in a regression being correlated with each other. That is completely normal and probably the case in most regressions.
Thank you for pointing out my problem. You are right, it was my fault for using the term "autocorrelation". What I really meant was "multicollinearity", but I made a typing mistake. Anyway, according to the data in the video, "multicollinearity" really does occur in the regression because the coefficient of correlation between the "nearinc" and "y81nrinc" variables is 0.5776. Usually, when we encounter "multicollinearity", we omit one of the two variables from the model. However, it is impossible to omit either of these two variables under the "difference in differences" method because they must be included together to show the effect of the construction of the incinerator. That is why I asked "what should we do in this situation". This problem does not only happen in this example; it occurs in every "DID" model because we usually create a "did" variable by multiplying the "time" and "treated" variables (did = time * treated). As a consequence, there is always "multicollinearity" in a "DID" model. Can you help me solve this issue?
Multicollinearity is not a big deal. Getting into the practice of dropping variables because they are correlated with another variable in the model will lead you quickly into omitted variable bias. There is a simple test where you regress the one variable you are concerned about on all the other explanatory variables. If the R-squared is under 0.9, don't worry about it. As I explained previously, it is mathematically impossible for a variable and an interaction term involving it to be uncorrelated. The interaction term is absolutely key to a diff in diff regression.
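A minimal sketch of that auxiliary-regression check in Python, using only the standard library. The data are a made-up balanced 2x2 design (one row per time/group cell), and the 0.9 cutoff is the rule of thumb from the comment above, not a formal test:

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(y, xs):
    """R-squared from an OLS regression of y on the columns in xs plus a constant."""
    cols = [[1.0] * len(y)] + xs
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical balanced design: one observation per time/group cell
time = [0.0, 0.0, 1.0, 1.0]
treated = [0.0, 1.0, 0.0, 1.0]
did = [t * g for t, g in zip(time, treated)]

# Regress the interaction term on the other explanatory variables
r2 = r_squared(did, [time, treated])
print(round(r2, 3))
```

In this toy design the interaction is clearly correlated with its components, yet the auxiliary R-squared stays well below 0.9.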
@@sebastianwaiecon Hello, I found this video very helpful. However, when running my model, my DID variable keeps getting dropped because of collinearity. Is there a fix to that?
Hi. My data ranges from 2009 to 2018, and I have both treatment and comparison groups. I just want to ask whether DID, as you did it in the video, is applicable. I am not very familiar with the method or with Stata, actually.
You can do DID if you set up a dummy variable to indicate when the treatment went into effect. Once this is in place, you can create the interaction term.
Hi Sebastian, thank you for your video! I've two questions: 1) What should I do if the FE variables (time and individual) are not significant? (I mean p-value > 0.1) 2) Do I have to take care of R squared in this case? Thank you!
1) If what you're after is measuring the treatment effect, this doesn't matter. 2) I don't know what you mean by "take care," but R squared is not particularly relevant in DID estimation.
Sir, another question in this regard, and I humbly request your attention at the earliest: suppose I have a panel data set of 75 banks for 5 years (pre-merger) which have merged to become 30 banks (also for 5 years post-merger), and I have been able to establish using all the standard panel data tests, viz. the F-test, BP-LM test, and Hausman (1978), that it is a fixed effects model. My dependent variable is an index of inclusion (whose values lie between 0 and 1), while all other independent variables are metric data from the balance sheets of banks, with a time dummy (0 for pre-and post merger). Can I run a panel Tobit model knowing that it is a fixed effects model? I use Stata 14 for my econometric model testing. I have been told that panel Tobit can be used only with a random effects model. My problem is that my dependent variable has a truncated range. Please guide asap
Mechanically, you can do it with dummy variables (see my fixed effects video). While I am not aware of a specific reason you should not do so, I don't know enough to definitively tell you one way or another.
Hi Sebastian, I wonder what we have to do if the effect is spread over the years, say, treatment was implemented in one year for the firms in one industry and the next year for another. Say, over three decades, the U.S. authorities have gradually cut import tariffs on a large variety of goods and services. CUT=1 if this happened, 0 otherwise. The equation will have the form Investment=b1*tariff CUT + b2*lagged controls + industry FE etc., clustered by industry-year. I do not understand what I have to add to a simple regression to make it diff-in-diffs in this case... Dummy CUT interacted with what?
or, like in your example, incinerator would have been installed for one neighborhood in 1981, for another in 1985 etc, for another in 2005... y81 time dummy won't work anymore, so what do we have to interact?
You'll need a dummy variable that "turns on" from a 0 to a 1 once the treatment is active. You won't be able to do this by building an interaction term, as it's more complex than that now. I'm not sure there's a better way than putting in the 1s on a case by case basis.
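If you know the year each unit's treatment started, one possible way to avoid entering the 1s by hand is a simple lookup. A minimal sketch in Python; the neighborhood labels and start years are hypothetical, and treating the start year itself as "on" is an assumption:

```python
# Hypothetical treatment start years per neighborhood; None means never treated
start_year = {"A": 1981, "B": 1985, "C": None}

def treated_now(unit, year):
    """Return 1 once the unit's treatment is active, else 0."""
    start = start_year[unit]
    return 1 if start is not None and year >= start else 0

# Build the dummy for every unit-year combination in a small panel
panel = [(unit, year, treated_now(unit, year))
         for unit in start_year
         for year in (1978, 1981, 1985)]

for row in panel:
    print(row)
```

The dummy stays at 0 for the never-treated unit and switches on at a different year for each treated unit, so it cannot be written as a single time dummy multiplied by a group dummy.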
Good morning. I am a student working with the DID model. Thanks to your DID explanation, I was able to complete my assignment smoothly. But yesterday, the professor asked, "Why was the control variable excluded?", and I couldn't actually answer. After class, the professor gave me a separate assignment: put the control variable in and analyze it again. I want to use Stata again. But how do I add a control variable to the model in the video? Could you please advise which code to enter?
What if your data have multiple units treated and untreated at the same time? There, a clean post period makes no sense. If city 1, for example, is being treated at time t, but cities 2 and 4 aren't, and the next year city 3 is being treated, and so on, wouldn't you just use a treatment##time variable?
Hello, I ran into a problem when running my regression. My regression looks like this: regress DepVar post_tr_yr treat_group treat_groupXpost_tr_yr, where post_tr_yr is a dummy for year>2007. However, my interaction term (treat_groupXpost_tr_yr) gets omitted due to collinearity. Is this a problem?
I can't think of a theoretical reason why you couldn't do that. To be honest, I think most people just use robust all the time and don't really think about it.
Yes. You would do this after running the collapse to get all the averages. The "classic" diff in diff graph has the outcome on the vertical axis and time on the horizontal axis. There are three lines: the treated group, the untreated group, and a counterfactual with the same starting point as the treated group but the same slope as the untreated group. See my video on graphing for how to use the twoway command.
Hi professor, I hope you are doing well. I'm a follower on RU-vid. Can you help me with an assignment on the difference in differences method? I couldn't find a subject or data to do it with. I must do it, otherwise I will have to repeat the year, and I have slept only 3 hours a night for more than 3 weeks just because of this project. Can you help me? If you want, I can pay you for your help.
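A minimal sketch in Python of the points behind such a graph, assuming the usual parallel-trends reading in which the counterfactual starts at the treated group's pre-period average and then follows the untreated group's change. The group averages are made up, standing in for what a collapse would give you:

```python
# Hypothetical group averages by period (made-up numbers)
means = {
    ("untreated", "before"): 82.0, ("untreated", "after"): 102.0,
    ("treated", "before"): 64.0, ("treated", "after"): 72.0,
}

# Change over time in the untreated group
control_change = means[("untreated", "after")] - means[("untreated", "before")]

# Counterfactual: where the treated group would have ended up without treatment
counterfactual_after = means[("treated", "before")] + control_change

# The three lines to plot, each as (before, after)
lines = {
    "treated": (means[("treated", "before")], means[("treated", "after")]),
    "untreated": (means[("untreated", "before")], means[("untreated", "after")]),
    "counterfactual": (means[("treated", "before")], counterfactual_after),
}

# The gap between the treated and counterfactual lines in the after period is the DID
did = means[("treated", "after")] - counterfactual_after
print(lines, did)
```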