  • No Parallel Trend before Difference-in-Difference estimation

    Dear all,

    Before running my Difference-in-Differences estimation, I decided to match my data with replacement. After the matching, I obtained the frequency with which each control unit was matched and used the Stata command -expand- to duplicate those control units in my data set. I then end up with a data set where each control unit appears as many times as its frequency indicates.
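    To make the duplication step concrete, this is roughly what I ran (a sketch only; freq is a hypothetical variable holding each control unit's match frequency):

    Code:
    * duplicate each matched control unit as many times as it was used
    expand freq if treatment == 0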

    The issue is that I thought matching the data would make everything much nicer before the difference-in-differences estimation, but when I plot my outcome in the new, matched data set, no parallel trend shows (before the matching, I could see one!).

    Am I doing something wrong? Is it normal that the Parallel Trend assumption is not fulfilled after matching?



    Thank you very much in advance,
    Ferran

  • #2
    You're not necessarily doing anything wrong. Of course, without seeing the code and output, nobody can assure you that what you've done is correct, either.

    Nature does not always cooperate with our research plans. It may just be that in the real world, the treatment and control groups did indeed "behave" differently before the intervention era began. It is also possible that other confounding variables obscured the difference when you examined this prior to the matching, and the matching unmasked the difference.

    Now, if the parallel trends assumption is not fulfilled, it certainly weakens the persuasiveness of a DID analysis. Nevertheless, if the changes in outcome between the before- and after-intervention eras differ in the two groups, even if the two groups were on different courses beforehand, it lends some credibility to the notion that the intervention altered the trajectory of the targeted group.

    Again, though, nobody can really say whether you are doing something wrong unless you show what you've actually done. What you're saying could be quite correct, or it could be due to error on your part.



    • #3
      Thanks for the quick reply, Clyde. So I have firms belonging to two countries (treatment and control group), and I wanted to match said firms on sales and on year (the latter with exact matching). Here is the code that I used for the matching:

      Code:
      teffects nnmatch (lev_w sales_w) (treatment), generate(match) osample(newvar) ematch(year) metric(euclidean)


      Here is the dataset once the matching is finished and control units, treatment==0, have been duplicated as many times as needed:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double obs float weight str50 company int year float treatment double(lev_w mb_w) long sales_w double(roa_w tang_w) float(d treatmentxd finan_inst) byte newvar long match1 byte _merge
       57  1 "ANDRITZ AG - TOTAL DEBT % TOTAL ASSETS"          2008 0 14.35 1.86  3609812   6.09   .1088797380778757 0 0 0 0 687 3
       58  1 "ANDRITZ AG - TOTAL DEBT % TOTAL ASSETS"          2009 0 13.22 3.37  3197517   3.61   .1074465732868919 0 0 0 0 492 3
       59  1 "ANDRITZ AG - TOTAL DEBT % TOTAL ASSETS"          2010 0 11.17 4.56  3553787   5.41  .10345242876383344 0 0 0 0 353 3
       60  2 "ANDRITZ AG - TOTAL DEBT % TOTAL ASSETS"          2011 0  9.78 3.51  4595993   5.79  .09718053146797949 1 0 0 0 802 3
       63  1 "ANDRITZ AG - TOTAL DEBT % TOTAL ASSETS"          2014 0 11.42 4.62  5859269   4.14  .12427937147135495 1 0 0 0 105 3
      141 29 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2008 0 44.27  .25   265856  -3.23   .8395542578492284 0 0 1 0  29 3
      142 24 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2009 0 45.99  .44   245271    .68    .790700935408036 0 0 1 0 912 3
      143 25 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2010 0 48.68  .62   246872   2.62   .8380136209633636 0 0 1 0 395 3
      144 29 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2011 0 55.28  .42   407962    3.2   .8710030437758062 1 0 1 0 172 3
      145 27 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2012 0 57.49  .54   409216   2.61   .8785364957481591 1 0 1 0 929 3
      146 27 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2013 0 49.47  .61   440305   2.55    .741309741166034 1 0 1 0 174 3
      147 21 "CA IMMOBILIEN AG - TOTAL DEBT % TOTAL ASSETS"    2014 0 33.52  .77   224597   3.55   .7085781532956603 1 0 1 0 560 3
      183  5 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2008 0 51.49  .27   402852   1.73   .7104788651486381 0 0 1 0 190 3
      184 11 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2009 0  50.4  .56   583700   2.64   .7251157603001318 0 0 1 0 891 3
      185  7 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2010 0 56.44   .7   568600   3.51   .7651172007700148 0 0 1 0 150 3
      186  5 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2011 0 53.53  .57   879300    2.2   .7689186190279935 1 0 1 0 543 3
      187  8 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2012 0 53.84  .78   638800  -3.75   .7841752156472969 1 0 1 0 264 3
      188  1 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2013 0 55.73  .72   516400   1.28     .83307352145834 1 0 1 0 125 3
      189  5 "CONWERT IMMOBIL INV - TOTAL DEBT % TOTAL ASSETS" 2014 0 53.28  .76   381200   2.32   .8446271523401961 1 0 1 0 721 3
      288  3 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2008 0  11.6 1.33   354625   4.36    .247731685990393 0 0 0 0 148 3
      289  7 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2009 0  9.21   .9   387775   1.64   .3485011808877854 0 0 0 0 107 3
      290  4 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2010 0     0 2.24   352744    5.8   .3425561244584482 0 0 0 0 682 3
      291  7 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2011 0     0 1.77   426068   7.32  .23550463563433732 1 0 0 0 123 3
      292  4 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2012 0     0 1.96   466355   7.24   .2362217406070452 1 0 0 0 649 3
      293 11 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2013 0  4.73 2.18   576191   7.65  .36552865404685436 1 0 0 0 503 3
      294 13 "DO & CO AG - TOTAL DEBT % TOTAL ASSETS"          2014 0 29.38 2.71   636140   6.96  .26025770441185203 1 0 0 0 931 3
      323  5 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2008 0 35.49  .59 14555300   1.01   .0118919457735247 0 0 1 0 435 3
      324  3 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2009 0 30.83  .95 13267800    .96   .0116467998628619 0 0 1 0 443 3
      325  6 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2010 0 27.88 1.07 11892800    .96 .011950175165434021 0 0 1 0  10 3
      326  6 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2011 0 28.83  .45 11926900    .27 .011490463631846502 1 0 1 0 431 3
      327  7 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2012 0 26.54  .81 12267800    .65 .010451948246906166 1 0 1 0 579 3
      328 10 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2013 0 25.33  .91 10203555   -.01 .010326359046141492 1 0 1 0 804 3
      329  1 "ERSTE GROUP BANK AG - TOTAL DEBT % TOTAL ASSETS" 2014 0 23.96  .87  9171152   -.73 .011552062695950037 1 0 1 0 735 3
      477 15 "IMMOFINANZ AG - TOTAL DEBT % TOTAL ASSETS"       2008 0 38.23  .25   769683   3.52   .6776126162810079 0 0 1 0 722 3
      478  7 "IMMOFINANZ AG - TOTAL DEBT % TOTAL ASSETS"       2009 0 48.39  .25   888945 -12.28   .7388885149341323 0 0 1 0 254 3
      479  9 "IMMOFINANZ AG - TOTAL DEBT % TOTAL ASSETS"       2010 0 46.09   .6   775832   2.62   .7558607945564433 0 0 1 0  31 3
      480  8 "IMMOFINANZ AG - TOTAL DEBT % TOTAL ASSETS"       2011 0 45.41  .47   870452    4.6   .7690743782486568 1 0 1 0 543 3

      So, because I have matched with replacement, in my matched dataset I have the same number of control units as treated units. After that, I just plot the outcome, leverage, over time:
      [Attachment: Graph.png, leverage over time in the matched dataset]

      So clearly, the parallel trends assumption is not met (Year==0 corresponds to the event date). If I, however, look at this graph before doing the matching, I get:
      [Attachment: Graph2.png, leverage over time before matching]


      where the parallel trends assumption clearly appears to be met. So, am I missing something?


      Best,
      Ferran



      • #4
        Well, this seems like an odd kind of matching. -teffects- does not understand or respect panel structure. So you have a matching where in one year Company A is matched to Company X, but in another year it is matched to a different Company Y. So your matched pairs are now scrambling lots of other variables (both observed and unobserved) that may be relevant here. I don't think this is a useful way to match panel data. And I don't see how you could use these matched pairs in panel-data analyses and get valid results.

        I think you need to find some different approach that matches each company in the treatment group to a company in the control group consistently over time. That means that you will probably not be able to get the match to be the nearest neighbor in sales in every year. I don't know what the best approach is here, and it probably depends in part on the nature of the relationship between sales and leverage.



        • #5
          Thanks Clyde, I totally get your point. I do not know why I thought that matching Company A one year with B and another with C would make sense.

          I was thinking about changing my panel data into wide format, so that I have just one observation per company, and then matching on, for example, the average of sales over time. That is one thing that comes to mind. After matching and constructing the matched dataset, I would change back to panel format. Does this whole approach make sense to you?
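          In code, the plan would look roughly like this (a sketch only, with hypothetical variable names; years before 2011 are pre-treatment in my data):

          Code:
          * one observation per company: pre-treatment averages of sales and
          * leverage, then nearest-neighbour matching on the average of sales
          preserve
          collapse (mean) mean_sales = sales_w mean_lev = lev_w ///
              (first) treatment if year < 2011, by(company)
          teffects nnmatch (mean_lev mean_sales) (treatment), ///
              generate(match) metric(euclidean)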


          Best,
          Ferran



          • #6
            I think the approach you outline in your second paragraph of #5 makes sense.

            Whether average sales (as opposed to some other summary statistic, a weighted average, a least sum of simple or weighted squared differences, or whatever) is the best way to handle the time series, I don't know; that is a content issue, and we're outside my discipline here.



            • #7
              Thanks again, Clyde. So I implemented the approach already mentioned:

              - I first converted my panel data into wide format, then proceeded to do the matching with the pre-treatment averages of sales and leverage (IV and DV, respectively):

              Code:
              teffects nnmatch (mymean_leverage mymean_sales) (treatment), generate(match) osample(newvar) metric(euclidean)
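              The wide-format conversion mentioned above was along these lines (a sketch; variable names as in the earlier -dataex- listing):

              Code:
              * one row per company, with yearly values side by side
              reshape wide lev_w sales_w, i(company) j(year)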
              - Then I obtained the weights for each control unit, changed the data back to panel format, and finally duplicated those control units that were used more than once. Of course, those control units are clusters in the panel data.

              - Once the matched dataset is obtained, I graph leverage over time, and it looks much better than before! The parallel trends assumption now looks plausible.
              [Attachment: Graph.png, leverage over time after matching on pre-treatment averages]

              There are two issues still remaining:

              1. I have matched my companies on pre-treatment sales (and used pre-treatment leverage as the outcome of the -teffects nnmatch- command). Is this right? Or should I match on sales in general, both pre-treatment and post-treatment?

              2. After the matching, the Difference-in-Differences estimate increases a little bit in value and becomes insignificant (it was significant before matching). Does this sound plausible to you?


              Best,
              Ferran
              Last edited by Ferran Franquesa; 05 Apr 2017, 11:55.



              • #8
                1. I have matched my companies on pre-treatment sales (and used pre-treatment leverage as the outcome of the -teffects nnmatch- command). Is this right? Or should I match on sales in general, both pre-treatment and post-treatment?
                I would consider what you did the correct approach. If you start matching on the post-treatment outcome, then you would be, in effect, constraining the treatment and control groups to follow identical trajectories after treatment and any effects of interventions (or other things) would be obscured. If anything, I would lean in the other direction and match on the pre-treatment sales only, not the leverage. By matching on leverage, you are restricting the generalizability of your results to those firms for which there are matchable controls whose leverage-sales relationship in the pre-intervention period is similar to that of the case. That could be construed as over-matching, depending on the circumstances.

                2. After the matching, the Difference-in-Differences estimate increases a little bit in value and becomes insignificant (it was significant before matching). Does this sound plausible to you?
                Yes, that's quite possible. The impact of matching on an analysis is not predictable: it can increase or decrease the apparent magnitude of effects.

                I'll resist the temptation to go into my usual rant against interpreting these models based on p-values. (If you want to see it, there are plenty of examples of it on this Forum that you can probably easily find with a search.) I'll just say this: you state that the DID estimator changes "a little bit" and becomes "insignificant." If the estimator change is really only a little bit, then I'd imagine that your pre-matching p-value was only slightly below 0.05. So this is the kind of thing that happens when you take the p < 0.05 convention and treat it as reality rather than a rule of thumb. Dichotomously classifying results as "significant" vs. "not significant," common though it is in the literature, is really a matter of sloppy thinking. A change in p-value from 0.049 to 0.051 is meaningless; so is a change from 0.04 to 0.06. It's the general problem you get from taking any continuous variable and dichotomizing it at a completely arbitrary, artificial cutoff. OK, I promised not to rant on about p-values, so I'll cut it here.



                • #9
                  Many thanks, Clyde!

                  If anything, I would lean in the other direction and match on the pre-treatment sales only, not the leverage. By matching on leverage, you are restricting the generalizability of your results to those firms for which there are matchable controls whose leverage-sales relationship in the pre-intervention period is similar to that of the case
                  But I have not matched on leverage (only pre-treatment sales), have I? If I understood the -teffects nnmatch- command properly, I use the pre-treatment leverage as the outcome variable only.



                  Best,
                  Ferran



                  • #10
                    But I have not matched on leverage (only pre-treatment sales), have I? If I understood the -teffects nnmatch- command properly, I use the pre-treatment leverage as the outcome variable only.
                    Yes, the code you show is matching on sales only. I thought I saw you say somewhere that you had matched on both sales and leverage, but reviewing the thread I can't find that. I must have misread something along the way. Sorry!



                    • #11
                      You are right, I said it here (beginning of post #7):

                      - I first converted my panel data into wide format, then proceeded to do the matching with the pre-treatment averages of sales and leverage (IV and DV, respectively):
                      But it was definitely a wrong choice of words, because I clearly did not mean I was matching on leverage. Anyway, sorry for the confusion!


                      Many thanks again for taking the time to clear all these things up for me, much appreciated Clyde!



                      • #12
                        Hello, I have a similar issue to Ferran's, with panel data from one country but 7 sites within the country. The data set has the same variables collected for 8 years (2008-2015), but I cannot get past the problem of graphing this data so that I can show the trend per variable, by site and year, and compare the variables within each site by year. I have tried my level best, but it just isn't good enough to resolve this. Sample data:
                        Zones Year u5_HS a5_HS All_HS u5_HHC
                        1 2008 14 231 245 5
                        1 2009 8 225 233 3
                        1 2010 47 1006 1053 28
                        1 2011 41 1232 1273 28
                        1 2012 63 1668 1731 19
                        1 2013 43 1140 1183 8
                        1 2014 57 1764 1821 14
                        1 2015 34 975 1009 9
                        2 2008 220 1455 1675 53
                        2 2009 404 2822 3226 116
                        2 2010 804 6862 7666 221
                        2 2011 602 5062 5664 140
                        2 2012 435 4359 4794 143
                        2 2013 322 3850 4172 151
                        2 2014 329 3327 3656 138
                        2 2015 213 1637 1850 67
                        3 2008 7 16 23 9
                        3 2009 20 66 86 3
                        3 2010 12 8 20
                        3 2011 1 2 3
                        3 2012 0 0 0
                        3 2013 0 0 0
                        3 2014 0 0 0
                        3 2015 0 1 1 0
                        4 2008 2310 6320 8630 1445
                        4 2009 1416 5278 6694 1047
                        4 2010 2761 15376 18137 1897
                        4 2011 2664 13939 16603 1312
                        4 2012 2489 16859 19348 1044
                        4 2013 1687 13693 15380 1149
                        4 2014 2021 15473 17494 783
                        4 2015 1418 8856 10274 534

                        Kindly help me with how I can proceed.



                        • #13
                          Well, first you need some variable that distinguishes which are the treatment zones and which are the control zones. You don't have that. You also need a variable that distinguishes the years before intervention from those after. You don't have that either. It's also not clear which of your variables is the outcome you want to graph and compare.

                          Just to illustrate how you might proceed, I'll pretend that zones 1 and 2 are the treatment group, the intervention begins in 2013, and the outcome of interest is All_HS

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input float(Zones Year u5_HS a5_HS All_HS u5_HHC)
                          1 2008   14   231   245    5
                          1 2009    8   225   233    3
                          1 2010   47  1006  1053   28
                          1 2011   41  1232  1273   28
                          1 2012   63  1668  1731   19
                          1 2013   43  1140  1183    8
                          1 2014   57  1764  1821   14
                          1 2015   34   975  1009    9
                          2 2008  220  1455  1675   53
                          2 2009  404  2822  3226  116
                          2 2010  804  6862  7666  221
                          2 2011  602  5062  5664  140
                          2 2012  435  4359  4794  143
                          2 2013  322  3850  4172  151
                          2 2014  329  3327  3656  138
                          2 2015  213  1637  1850   67
                          3 2008    7    16    23    9
                          3 2009   20    66    86    3
                          3 2015    0     1     1    0
                          4 2008 2310  6320  8630 1445
                          4 2009 1416  5278  6694 1047
                          4 2010 2761 15376 18137 1897
                          4 2011 2664 13939 16603 1312
                          4 2012 2489 16859 19348 1044
                          4 2013 1687 13693 15380 1149
                          4 2014 2021 15473 17494  783
                          4 2015 1418  8856 10274  534
                          end
                          
                          gen byte treatment = inlist(Zones, 1, 2)
                          gen byte pre_post = (Year > 2012)
                          collapse (mean) All_HS (first) pre_post, by(treatment Year)
                          separate All_HS, by(treatment)
                          graph twoway line All_HS? Year if pre_post == 0
                          In the future, when posting example data, please use the -dataex- command, as I have done here. Run -ssc install dataex- and then run -help dataex- to read the simple instructions for using it. The way you posted your data, it was not particularly hard to import into Stata, but the result may not be truly faithful to your data configuration, because important details such as storage types, labeling, etc., are missing. By using -dataex- you enable those who want to help you to create a completely faithful replica of your Stata example with a simple copy/paste operation.



                          • #14
                            In my Difference-in-Differences example I would like to generate a trendline for the treatment group (Leverage_w1) after the year 2014 (because that is when the treatment occurs), which should look like the trend line of the control group (Leverage_w0). Is there a code I can enter? Or something in the Graph Editor?

                            At the moment I have this code

                            Code:
                            preserve
                            collapse (mean) Leverage_w, by (treated year)
                            reshape wide Leverage_w, i(year) j(treated)
                            graph twoway line Leverage_w0 Leverage_w1 year, ytitle(%)  xtitle(year end) xline(2014) sort
                            restore
                            Many thanks for the answer.


                            [Attachment: Verschuldung.png, leverage over time for treatment and control groups]

