Difference in Difference Interpretation

Gonzalo Etchart

Join Date: Jul 2018

Posts: 15
#1


29 Aug 2018, 11:31

Greetings,

I have 350 observations. Agricultural products affected by the tax are my treatment group, and other export products not affected by it are my control. It is a standard DiD with 2 time periods and treatment and control groups.

I tried the DiD with both reg and the diff command and honestly don't know which one to use.

I also don't know how to interpret my results.

They are posted below.

Any help with this would be greatly appreciated.

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

29 Aug 2018, 12:51

The screenshots of the output are not readable on my setup. Please re-post by copy/pasting directly from your output log or Stata log file into the Forum editor, between code delimiters. If you are not familiar with code delimiters, please read Forum FAQ #12.

Also be sure to post the actual commands along with the output. (Maybe you did that in the screenshots--I can't tell.)

Gonzalo Etchart

Join Date: Jul 2018
Posts: 15

29 Aug 2018, 13:08

I will admit this took me far longer to figure out than expected! Baby steps I guess.

Please let me know if you can see it now

Code:

 reg Exports y2017 Agri y2017Agri

      Source |       SS       df       MS              Number of obs =     349
-------------+------------------------------           F(  3,   345) =    4.70
       Model |  6.8382e+15     3  2.2794e+15           Prob > F      =  0.0031
    Residual |  1.6737e+17   345  4.8512e+14           R-squared     =  0.0393
-------------+------------------------------           Adj R-squared =  0.0309
       Total |  1.7420e+17   348  5.0059e+14           Root MSE      =  2.2e+07

------------------------------------------------------------------------------
     Exports |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       y2017 |  -114500.3    2767485    -0.04   0.967     -5557767     5328766
        Agri |   1.15e+07    3904335     2.95   0.003      3829322    1.92e+07
   y2017Agri |   -3205138    5346472    -0.60   0.549    -1.37e+07     7310644
       _cons |    1455040    2053879     0.71   0.479     -2584660     5494741
------------------------------------------------------------------------------

Code:

diff Exports, t(Agri) p(y2017) 

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 350
            Before         After    
   Control: 115            141         256
   Treated: 44             50          94
            159            191
--------------------------------------------------------
 Outcome var.   | Exports | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         | 
   Control      |  1.5e+06|         |         | 
   Treated      |  1.3e+07|         |         | 
   Diff (T-C)   |  1.2e+07|  3.9e+06| 2.95    | 0.003***
After           |         |         |         | 
   Control      |  1.3e+06|         |         | 
   Treated      |  9.5e+06|         |         | 
   Diff (T-C)   |  8.1e+06|  3.6e+06| 2.24    | 0.026**
                |         |         |         | 
Diff-in-Diff    | -3.4e+06|  5.3e+06| 0.64    | 0.524
--------------------------------------------------------
R-square:    0.04
* Means and Standard Errors are estimated by linear regression
**Inference: *** p<0.01; ** p<0.05; * p<0.1


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

29 Aug 2018, 13:32

Well, something is wrong here. Your -reg- output shows 349 observations, but the -diff- command shows 350. That shouldn't be. Did you do any manipulations on the data set in between running those commands that might have inadvertently dropped or added an observation?

Try running the -reg- command again, then run -tab y2017 Agri if e(sample)-, and then compare that to the numbers shown in

Code:

Number of observations in the DIFF-IN-DIFF: 350
            Before         After    
   Control: 115            141         256
   Treated: 44             50          94
            159            191

That will at least let you know which group is one short (or one over). Then you can do some detective work to figure out what went wrong. You have to fix this discrepancy before you go any farther.
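The detective work described above could look something like the following. This is only a sketch on made-up toy data (the six rows typed in below are not the poster's data), assuming the discrepancy comes from a missing value in one of the model variables, which is a common cause but not the only possible one.

```stata
* Toy data standing in for the real file; all values are invented
clear
input double Exports byte(y2017 Agri)
1200000 0 0
.       0 1
3400000 1 0
9100000 1 1
1500000 0 0
8700000 0 1
end
generate byte y2017Agri = y2017 * Agri

regress Exports y2017 Agri y2017Agri

* Cell counts for the observations -regress- actually used
tabulate y2017 Agri if e(sample)

* Observations that were left out of the estimation sample
list if !e(sample)

* Summary of missing values in the model variables
misstable summarize Exports y2017 Agri y2017Agri
```

Comparing the -tabulate- cell counts with the Before/After table from -diff- shows which cell is off, and -list if !e(sample)- shows the excluded rows themselves.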

Once you have fixed that, for the kind of simple regression you are doing, -diff- and -reg- should give the same results. They will not look all that similar, because the outputs report different things and are arranged differently. But the coefficient, standard error, t, and P of the y2017Agri variable in the -reg- output should match the Diff-in-Diff row of the -diff- output, at least to the number of decimal places shown by -diff-. And the other outputs from -diff- can, in principle, be calculated from the outputs of -reg-.

You can improve both commands' outputs by changing the units for your Exports outcome variable. I guess they are denominated in dollars or something like that? If you were to denominate them in millions of dollars instead (-replace Exports = Exports/1000000-) the outputs of both analyses would be numbers that are easier to read and understand. You would just have to bear in mind that they refer to millions of dollars instead of dollars (or Euro, or whatever it is.)

Next, you can improve your -regress- code by using factor variable notation. (Read -help fvvarlist- for more information.) Eliminate the y2017Agri variable and re-do it as:

Code:

reg Exports i.y2017##i.Agri

Then run:

Code:

margins y2017#Agri
margins y2017, dydx(Agri)

Assuming you have fixed up the discrepancy in which observations are included so they are both running on the same data, you will now find that the Diff-in-Diff row of the -diff- output matches the 1.y2017#1.Agri row of the -reg- output, and the other outputs from -diff- match up with what is shown in the -margins- outputs.

You won't have to decide which ones to use: they will be the same. The output of -diff- is perhaps more intuitively organized and easier to read. But other than that, there should be nothing to choose.
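To see why the two approaches have to agree, it may help to check on simulated data that the interaction coefficient is nothing more than a difference of differences of the four cell means. The sketch below uses invented data (the seed, sample size, and coefficients are arbitrary assumptions, not anything from this thread); only the variable names follow the posts above.

```stata
* Simulated data; every number here is an arbitrary illustration
clear
set seed 2018
set obs 400
generate byte Agri  = runiform() < 0.3
generate byte y2017 = runiform() < 0.5
generate double Exports = 1.5 + 11*Agri - 0.1*y2017 - 3*Agri*y2017 + rnormal(0, 5)

* DiD estimate from the regression with factor-variable notation
regress Exports i.y2017##i.Agri
scalar did_reg = _b[1.y2017#1.Agri]

* Same quantity from the four cell means; m`a'`t' is the mean for Agri == a, y2017 == t
forvalues a = 0/1 {
    forvalues t = 0/1 {
        quietly summarize Exports if Agri == `a' & y2017 == `t'
        scalar m`a'`t' = r(mean)
    }
}

* (Treated - Control, after) minus (Treated - Control, before)
scalar did_means = (m11 - m01) - (m10 - m00)
display "regress: " did_reg "   cell means: " did_means
```

Because the model is saturated (one parameter per cell), the two numbers coincide up to floating-point error.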
Gonzalo Etchart

Join Date: Jul 2018

Posts: 15
#5

29 Aug 2018, 14:34

Dear Clyde,

Thank you for your swift response. It is quite late here in Mozambique but I will make sure to go through your recommendations first thing tomorrow.

Have a good day,

Kind regards
Gonzalo

Gonzalo Etchart

Join Date: Jul 2018
Posts: 15

29 Aug 2018, 15:38

Dear Clyde,

I managed to find the error: there was a missing data value. I also followed your advice and denominated Exports in millions of dollars. This made the output much clearer.

I will most likely be using the diff output to present my results, as it provides the simplest overview. However, I'm curious to know what advantage there is in using factor-variable notation in this case.

Code:

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 350
            Before         After    
   Control: 115            141         256
   Treated: 44             50          94
            159            191
--------------------------------------------------------
 Outcome var.   | Exports | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         | 
   Control      | 1.455   |         |         | 
   Treated      | 12.964  |         |         | 
   Diff (T-C)   | 11.509  | 3.900   | 2.95    | 0.003***
After           |         |         |         | 
   Control      | 1.341   |         |         | 
   Treated      | 9.451   |         |         | 
   Diff (T-C)   | 8.111   | 3.621   | 2.24    | 0.026**
                |         |         |         | 
Diff-in-Diff    | -3.398  | 5.322   | 0.64    | 0.524
--------------------------------------------------------
R-square:    0.04
* Means and Standard Errors are estimated by linear regression
**Inference: *** p<0.01; ** p<0.05; * p<0.1

Above is my -diff- output. Since this is my first time doing a DiD, which parts of my results do you think are worth mentioning in my paper? Are the p-values significant? Is R-square important to mention here? The paper looks at the impact of a change in tax policy.

Does Stata support a graphical representation, perhaps? (I already have a visual illustration of parallel trends between treatment and control.)

And am I correct in stating that this intervention had an estimated effect of -3.398 Million dollars in Exports?

I apologise for the barrage of questions- you have been incredibly helpful and I really appreciate it.

Many Thanks,

Gonzalo


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#7

29 Aug 2018, 16:28

In this case there is probably no advantage to using -reg- with factor variable notation over using -diff-. But -diff- has its limitations. As far as I know, it cannot handle generalized DID models with varying times of onset of the intervention, or other more complicated situations. I'm also unclear whether -diff- works with longitudinal data. For those you have to go back to -regress- (or some other estimation command). And when you do that, you need factor-variable notation so that you can use -margins-. As you saw in your own example, many of the interesting statistics you would want to report are in the -margins- output but not the -regress- output. And you can only use -margins- after -regress- (or another estimation command).

So, for just this particular problem, you don't need factor-variable notation, and the -diff- command does all the hard work for you and wraps up the results with a pretty bow on top. But for the more general situation, that won't be the case.
Gonzalo Etchart

Join Date: Jul 2018

Posts: 15
#8

30 Aug 2018, 02:08

Thank you for your clarification Clyde. I will make sure to keep this in mind the next time I perform such an analysis.

As for my -diff- output, I'm confused as to what P>|t| means for the significance of my results.

1. My standard error, 5.322, seems to be larger than my estimated effect. Isn't this troubling?
2. Is my R-squared value even worth mentioning in the discussion? Does it even make sense to look at best fit when there are only 2 observations?
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#9

30 Aug 2018, 08:42

The large standard error means that the data are not able to give a precise estimate of the effect. Since the sample size seems moderate, that is part of the problem. The other part of the problem is probably that the outcome you are looking at shows a lot of variability. Between the two, you end up without a sharp estimate of the effect. In presenting these results, I would not emphasize "significant" vs "not significant," but rather speak in terms of an effect which, as best we can tell, is small, but we don't have a lot of information about it. It might be positive, it might be negative. It isn't too far from zero. But that's all we can say. (By the way, you may already know this, but you can calculate a 95% confidence interval for the estimate by adding and subtracting 1.96 times the standard error. Or you can pull the confidence interval off of the -regress- output.)
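As a rough illustration of that confidence interval arithmetic, here is a small sketch using the Diff-in-Diff row from the 350-observation -diff- output earlier in the thread. The scalar names are just illustrative, and the degrees of freedom (346 = 350 observations minus 4 estimated coefficients) are an assumption about how the t-based interval would be formed for that model.

```stata
* Values copied from the Diff-in-Diff row (Exports in millions)
scalar b_did  = -3.398
scalar se_did = 5.322

* Normal approximation: estimate plus or minus 1.96 standard errors
display "z-based 95% CI: " %8.3f (b_did - 1.96*se_did) " to " %8.3f (b_did + 1.96*se_did)

* t-based interval, the kind -regress- reports; assumes 346 residual degrees of freedom
scalar tcrit = invttail(346, 0.025)
display "t-based 95% CI: " %8.3f (b_did - tcrit*se_did) " to " %8.3f (b_did + tcrit*se_did)
```

With a few hundred degrees of freedom the t critical value is only slightly above 1.96, so the two intervals are close; either way the interval comfortably spans zero.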

Your R square value is rather small. I would say that reporting it is pretty much a standard practice. It means that, although there is a lot of variability in the outcome, very little of that variability is related to treatment vs control group status or the passage of time from the pre- to post-intervention era.

I don't quite understand what you are getting at when you refer to only 2 observations. You have 350 observations. You may have only 2 observations on each entity (one pre- and the other post-), but the R² is not about the fit of the line between a single entity's pre- and post- observation. It's about the fit of a line going through all 350 points overall.

I should mention one more thing. You did not, in #1, indicate whether your pre- and post- observations are on the same entities (longitudinal data, aka panel data) or different entities (serial cross-sections). It makes a difference. The analysis you have done is a between-groups analysis and is appropriate to serial cross-sections. But your remark about "only 2 observations," if I have interpreted it correctly, suggests to me that you actually have paired pre- and post- Export measurements on the same entities. If that is the case, you can get a sharper, more precise estimate of the effect by using a within-entity estimator. As far as I know, that cannot be done in -diff-, but you can do it directly with official Stata commands. The only ingredient needed that you have not mentioned is a variable identifying the entities. So, I'm going to assume that the unit of analysis here is country, and that you have a variable, called Country, in your data set. So then you would do this:

Code:

xtset Country
xtreg Exports i.Agri##i.y2017, fe vce(cluster Country)
margins Agri#y2017, noestimcheck
margins y2017, dydx(Agri) noestimcheck

Now, you will get a warning message from Stata that the variable Agri is omitted from the -xtreg- results due to collinearity. That's not a problem--it's expected. (In fact, if you don't get that message then something has gone wrong.) It will also be the case that any entities for which you have only a pre- or only a post- observation will not contribute information to this analysis, so your effective sample size will be somewhat reduced. But if the bulk of your entities have both pre- and post-data, you will probably get a more precise (i.e. smaller standard error) effect estimate with this approach.

The DID effect estimate comes from the -xtreg- output in the row for 1.Agri#1.y2017. The output of the first -margins- command will give you the predicted values of Exports in both groups at each time. And the output of the second -margins- command will give you the difference between the Agri and non-Agri groups in each time period.

Again, this within-country analysis is only possible if what you have is observations on the same entities (countries) at two different time points. If the countries in the pre- and post- groups are different, you cannot do it this way. When I say the same countries, it is OK if there are some countries where you have pre- data but no post-data or the other way around. But it needs to be the case that the bulk of the countries have both pre- and post- data.
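One way to check whether the bulk of the entities really do have both pre- and post- data is sketched below. It uses a tiny invented data set; Country is the assumed identifier name from the post above, and has_pre, has_post, and one_per are hypothetical helper variables introduced only for this illustration.

```stata
* Toy data: entities 1 and 2 have both periods, 3 has pre only, 4 has post only
clear
input byte(Country y2017) double Exports
1 0 2.1
1 1 1.8
2 0 5.0
2 1 4.2
3 0 0.7
4 1 3.3
end

* Flag, for every entity, whether it has any pre and any post observation
bysort Country: egen byte has_pre  = max(y2017 == 0)
bysort Country: egen byte has_post = max(y2017 == 1)

* Count each entity once, then cross-tabulate the two flags
egen byte one_per = tag(Country)
tabulate has_pre has_post if one_per
```

Entities in the (1, 1) cell of that table are the ones that contribute to the within-entity estimate.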

Gonzalo Etchart

Join Date: Jul 2018
Posts: 15

#10

30 Aug 2018, 10:03

Dear Clyde,

As always thank you for your response- a statue should be erected in your name sworn by the stat gods.

I should clarify that I am looking at a single country- and comparing the export values for different product types.

Here, agricultural products affected by the tax change are my treatment, and other commodity types not affected by the tax change are my control. (I have visually identified the parallel trends of these two groups.)

The tax rate was changed at the end of 2015, so I have added more observations to my analysis: 2014 and 2015 are the pre period, and 2016 and 2017 are the post period.

I ran a diff command and got the following.

Code:

. diff Exports, t(Agri) p(Post) 

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 678
            Before         After    
   Control: 245            252         497
   Treated: 84             97          181
            329            349
--------------------------------------------------------
 Outcome var.   | Exports | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         | 
   Control      | 1.567   |         |         | 
   Treated      | 14.680  |         |         | 
   Diff (T-C)   | 13.113  | 2.753   | 4.76    | 0.000***
After           |         |         |         | 
   Control      | 1.394   |         |         | 
   Treated      | 9.470   |         |         | 
   Diff (T-C)   | 8.075   | 2.602   | 3.10    | 0.002***
                |         |         |         | 
Diff-in-Diff    | -5.038  | 3.788   | 1.33    | 0.184
--------------------------------------------------------
R-square:    0.05
* Means and Standard Errors are estimated by linear regression
**Inference: *** p<0.01; ** p<0.05; * p<0.1

Would this still require modification of my approach? Is it even remotely fair for me to say that the tax change resulted in an estimated decrease of 5.038 million in export value?

I await your response,

Kind regards,

Gonzalo


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#11

30 Aug 2018, 10:28

Well, if you are looking at the same products both before and after the tax increase, then, yes, I would use the within-product estimator as it will be more precise. The code and interpretation would be the same as in #9, just replacing country by product (or whatever the name of the variable that identifies the products is) throughout.

That said, I think it is seldom, if ever, appropriate to just present a point estimate without an estimate of the uncertainty of that estimate. So I would say that the effect of the tax increase on agricultural exports was an estimated decrease of 5.038 million, with a margin of error of + or - 7.424 million (calculated as 1.96 * the standard error). Or, equivalently, you could say that the best estimate of the effect of the tax increase on export value is a decrease of 5.038 million, with a 95% confidence interval from a decrease of 12.462 million to an increase of 2.386 million.

Gonzalo Etchart

Join Date: Jul 2018
Posts: 15

#12

30 Aug 2018, 10:56

Dear Clyde, I ran your commands. To avoid confusion, note that in my data y2017 is called Post, Agri is called Agri1, and the product identifier is var1.

Code:

. xtset var1
       panel variable:  var1 (unbalanced)

. xtreg Exports i.Agri1##i.Post, fe vce(cluster var1)
note: 1.Agri1 omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =       678
Group variable: var1                            Number of groups   =       257

R-sq:  within  = 0.0419                         Obs per group: min =         1
       between = 0.0233                                        avg =       2.6
       overall = 0.0084                                        max =         4

                                                F(2,256)           =      1.96
corr(u_i, Xb)  = -0.1569                        Prob > F           =    0.1431

                                  (Std. Err. adjusted for 257 clusters in var1)
------------------------------------------------------------------------------
             |               Robust
     Exports |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     1.Agri1 |          0  (omitted)
      1.Post |  -.1706095   .4855758    -0.35   0.726    -1.126841    .7856223
             |
  Agri1#Post |
        1 1  |  -3.895873    2.14345    -1.82   0.070    -8.116913    .3251673
             |
       _cons |   4.903167    .348979    14.05   0.000     4.215932    5.590402
-------------+----------------------------------------------------------------
     sigma_u |  17.959716
     sigma_e |  5.9845688
         rho |  .90006022   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. margins Agri1#Post, noestimcheck

Adjusted predictions                              Number of obs   =        678
Model VCE    : Robust

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  Agri1#Post |
        0 0  |   4.903167    .348979    14.05   0.000     4.219181    5.587153
        0 1  |   4.732557   .4269628    11.08   0.000     3.895726    5.569389
        1 0  |   4.903167    .348979    14.05   0.000     4.219181    5.587153
        1 1  |   .8366844   1.798119     0.47   0.642    -2.687564    4.360933
------------------------------------------------------------------------------

. margins Post, dydx(Agri1) noestimcheck

Conditional marginal effects                      Number of obs   =        678
Model VCE    : Robust

Expression   : Linear prediction, predict()
dy/dx w.r.t. : 1.Agri1

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.Agri1      |
        Post |
          0  |          0  (omitted)
          1  |  -3.895873    2.14345    -1.82   0.069    -8.096958     .305212
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

.

However, I am completely baffled by what is relevant here to my problem, and I do not understand why this provides a more precise estimate. (I will google this, but a brief sentence would help.)

Which part of the table do I have to interpret? I apologise if this sounds naive.


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#13

30 Aug 2018, 11:46

So the first thing to look at is the Agri1#Post row of the -xtreg- output, reproduced here and edited for improved readability:

Code:

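     Exports |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  Agri1#Post |
        1 1  |  -3.895873    2.14345    -1.82   0.070    -8.116913    .3251673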

So your DID estimate of the effect is -3.896 (to 3 decimal places) with a standard error of 2.143. Notice that the standard error is considerably smaller than before; that is, this estimate is more precise. The 95% confidence interval goes from -8.117 to +0.325 (again, to 3 decimal places). So there is a smaller range of uncertainty attached to this estimate.

The first -margins- output gives you the model's predicted (or fitted, if you prefer) Exports in each group before and after the tax went into effect:

Code:

Adjusted predictions                              Number of obs   =        678
Model VCE    : Robust

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  Agri1#Post |
        0 0  |   4.903167    .348979    14.05   0.000     4.219181    5.587153
        0 1  |   4.732557   .4269628    11.08   0.000     3.895726    5.569389
        1 0  |   4.903167    .348979    14.05   0.000     4.219181    5.587153
        1 1  |   .8366844   1.798119     0.47   0.642    -2.687564    4.360933
------------------------------------------------------------------------------

So among the non-agricultural products before the tax, expected exports were 4.903 (95% CI 4.219 to 5.587)--first row. In the same group after the tax, we see that the expected exports were 4.733 (95% CI 3.896 to 5.569)--second row. In the group of agricultural products, we see that expected exports went from 4.903 (95% CI 4.219 to 5.587) before the tax to 0.837 (95% CI -2.688 to 4.361) after the tax.

There are two things about these results that strike me as odd. The first is that these results come from 678 observations, whereas before you had only 350. What's going on? The second is that the standard error for the expected exports of agricultural products after the tax is so much larger than any of the other standard errors in that table. That suggests to me that the number of observations of agricultural products after tax is much smaller than the number of observations in the other three situations. (Alternatively, it may be that the exports are much more variable for agricultural products after tax than for the other three situations.) Why would that be? Both of these issues make me worry that something is wrong with your data.
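Both possibilities (a thin cell, or an unusually variable one) can be checked by tabulating the count and spread of Exports in each of the four Agri1-by-Post cells. The sketch below runs on simulated data so that it is self-contained; the data-generating lines are invented, cell is a hypothetical helper variable, and on the real data only the last two commands would be needed.

```stata
* Simulated stand-in data; the real data set would already be in memory
clear
set seed 1
set obs 200
generate byte Agri1 = mod(_n, 4) == 0
generate byte Post  = _n > 100
generate double Exports = exp(rnormal(0, 1.5)) * (1 + 8*Agri1)

* One group per Agri1-by-Post combination, labelled with the two values
egen cell = group(Agri1 Post), label

* Count, mean, and spread of Exports in each of the four cells
tabstat Exports, by(cell) statistics(n mean sd min max)
```

A cell whose n is much smaller, or whose sd is much larger, than the others would account for a standard error that stands out.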
Gonzalo Etchart

Join Date: Jul 2018

Posts: 15
#14

30 Aug 2018, 12:10

Dear Clyde,

I increased my sample size to include the years 2014-2015(pre) and 2016-2017 (post) which explains the 678 observations.

With regards to the higher standard error of agricultural products after tax- This is quite interesting as in fact there seem to be more observations for post-tax Agri products (not by a lot tho, all situations have around 36-49 observations for agri)

Perhaps a tax increase like the one experienced has varying levels of impact on different agricultural product types? This is only a guess- but perhaps worth testing. Especially with regard to cash crops as they tend to be the most elastic to price spikes.

Very interesting indeed.

I'm not sure how to upload a dataset on here, so I'm sharing a Google Drive link. https://drive.google.com/open?id=1mn...VnhMteF3fqYhTZ
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#15

30 Aug 2018, 12:56

Well, to show example data, we use the -dataex- command. (See FAQ #12 for details). But this Forum isn't suitable for showing a complete data set this large.

I have a few generic thoughts, because I am still somewhat disturbed by the results.

1. Is the distribution of Exports reasonably tame? Or is it very skewed; does it have some extreme outliers that might be driving your results?

2. Perhaps you need to more finely subdivide your products into other groups whose behavior is theoretically expected to be more homogeneous. This is basically picking up on your conjecture that some crops may respond differently to taxes or other price shocks than others, and model different effects for different groups of crops. At the extreme end of this idea, one might consider a random slopes model, where, in effect, each product has its own response. I'm not going to suggest any specific code to pursue these, because I think you need to first consult with somebody who is well grounded in the science underlying your study to see if these ideas seem reasonable, and whether your discipline has standard ways of looking into them. I am not an economist, and I don't feel qualified to really guide you on these aspects of modeling crop exports. I'm always happy to offer code, and help interpret results, once the model is decided upon, but I don't want to propose models in disciplines I only know superficially.
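For the first question, a quick look at the distribution might go along these lines. This is a sketch on simulated, deliberately right-skewed data (the generating lines are assumptions for illustration only); on the real data one would run just the -summarize- and -graph box- commands.

```stata
* Simulated stand-in data, right-skewed by construction
clear
set seed 7
set obs 300
generate byte Agri1 = runiform() < 0.25
generate byte Post  = runiform() < 0.5
generate double Exports = exp(rnormal(0, 1.5))

* Percentiles, skewness, and the largest and smallest values
summarize Exports, detail
display "skewness = " r(skewness) "   max/median = " r(max)/r(p50)

* Box plots by group and period make extreme observations easy to spot
graph box Exports, over(Post) over(Agri1)
```

A skewness far from zero, or a maximum that is many multiples of the median, would suggest that a handful of products could be driving the estimates.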
