
  • #46
    Hi Isabella
The equivalent of R's unbalanced-panel option is to request the repeated cross-section estimator.
    Instead of
    csdid pea nao_emp_Chefe, ivar(id) time(time_calendar) gvar(first_treat)
    write
    csdid pea nao_emp_Chefe, cluster(id) time(time_calendar) gvar(first_treat)
    HTH



    • #47
Thanks a lot, Fernando!! But I still have a doubt: my panel is at the individual level, can I still cluster? I saw this option in the csdid help, but I didn't use cluster precisely because I want the effect at the individual level. How do I interpret the effect using cluster if my data is at the individual level?

Also, what is the difference between removing the "ivar" option and using "cluster" to deal with unbalanced panels? From the help, I understood that removing "ivar" would also be one kind of solution when my data is unbalanced. Can I use this alternative in my case?

      Thanks again!



      • #48
Something that may not be well understood: when you use panel estimators with csdid, standard errors are obtained by clustering at the individual level.
If your panel is fully balanced, for example, and all covariates are time fixed, using ivar(id) or cluster(id) should produce the same result.
Differences arise if the data is unbalanced or if characteristics change over time.

        Now, when you say your data is at the individual level, do you mean you do not observe the same individuals across time? In that case, cluster will not be useful.
        In any case, cluster only modifies how standard errors are estimated.

Now, using ivar forces csdid to use panel-data estimators. In the first step of these estimators, you always compute the within-unit change across time: Dy = y - L.y. This is the reason for the message.
When you do not use ivar, you request repeated cross-section estimators. In this case the DID estimator basically focuses on estimating E(Dy) = E(y|t) - E(y|t-1). So rather than taking first differences, you first get the conditional means and then estimate the changes across time.

        If you are interested in the exact formulas, you can check Pedro's original paper, or take a look at my reinterpretation of those values here: https://friosavila.github.io/playing...did_csdid.html
This is what R's did package does when you allow for unbalanced panels, except that in Stata you need to be explicit about the clustering too.
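For intuition, the first step of each estimator family can be sketched by hand (illustrative only; pea, id, and time_calendar are the variable names used earlier in the thread, and this is not code csdid itself runs):

```stata
* Panel route (what ivar() does internally): within-unit first difference.
* This needs both periods for each id, hence the balanced-panel requirement.
xtset id time_calendar
generate Dy = pea - L.pea

* Repeated cross-section route: no per-unit first difference.
* The estimator works from conditional means, E(y|t) - E(y|t-1),
* so units observed in only one period still contribute.
tabstat pea, by(time_calendar) statistics(mean)
```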
        HTH



        • #49
          thanks again, now I understand!

I have one more doubt, now about the speed of the package in Stata. My model has been running for more than a day and doesn't finish; is this normal? Here is the pdf image...
          Attached Files



          • #50
From what I see, you have a lot of data, and much of it is not even being used. As long as you see "." or "x" appearing, there is progress.
I wonder, however, whether there is some problem with your variable selection that is creating trouble for the model estimation.

So, one way to figure that out:
First, tab tempo_calendario primeiro_choque and show me how much data you have there.
Then select one cohort plus the never treated, and two years (right before and right after treatment, based on your cohort variable).

Finally, run a logit model for the chances of belonging to the cohort.
If that gives you any problems, it means you may be overfitting your model.
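A sketch of these diagnostic steps (illustrative; variable names follow post #51, $x1list is the covariate macro that appears in post #53, and cohort 6 with periods 5 and 6 are arbitrary picks):

```stata
* Step 1: see how the sample is distributed across cohorts and periods.
tab tempo_calendario primeiro_choque

* Step 2: keep one cohort plus the never treated, and the periods
* right before and right after that cohort's treatment.
keep if inlist(primeiro_choque, 0, 6) & inlist(tempo_calendario, 5, 6)

* Step 3: check whether a logit for cohort membership estimates cleanly.
* (Stata's logit treats any nonzero value of the depvar as "success".)
logit primeiro_choque $x1list
```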
            HTH



            • #51
              Ok, that's the result of:
              Code:
              tab tempo_calendario primeiro_choque
              Total: 751,537.

Just to explain: this dataset originated from the largest household survey in Brazil. It has a quarterly frequency and runs from 2012 to 2019. People are followed for at most 5 quarters. To create the variable tempo_calendario I did:
              Code:
              egen tempo_calendario = group(year quarter)
The treatment happens when the head of the household loses his job in one of the quarters, and the idea is to see the effect of this shock on the likelihood that a child starts working to compensate for the loss of income.
Code:
tempo_calendario | primeiro_choque: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | Total
              1 13,865 932 418 190 82 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15,487
              2 21,590 932 1,075 455 238 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24,383
              3 22,225 654 1,075 1,012 478 263 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,772
              4 21,871 390 736 1,012 1,050 556 195 56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,866
              5 21,780 179 428 691 1,050 1,110 408 180 60 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,886
              6 22,017 0 184 444 737 1,110 886 368 205 61 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26,012
              7 21,205 0 0 169 419 675 886 863 400 189 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24,869
              8 21,171 0 0 0 204 401 547 863 901 378 159 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24,687
              9 21,338 0 0 0 0 178 350 574 901 888 365 184 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24,872
              10 21,574 0 0 0 0 0 158 357 598 888 817 388 244 87 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,111
              11 21,916 0 0 0 0 0 0 170 398 618 817 833 481 218 79 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,530
              12 21,938 0 0 0 0 0 0 0 159 415 545 833 995 448 186 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,613
              13 21,452 0 0 0 0 0 0 0 0 198 342 543 995 929 367 249 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25,165
              14 21,005 0 0 0 0 0 0 0 0 0 149 335 638 929 804 484 269 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24,677
              15 20,551 0 0 0 0 0 0 0 0 0 0 136 404 591 804 1,083 565 284 137 0 0 0 0 0 0 0 0 0 0 0 0 0 24,555
              16 20,308 0 0 0 0 0 0 0 0 0 0 0 167 364 554 1,083 1,146 604 368 125 0 0 0 0 0 0 0 0 0 0 0 0 24,719
              17 20,078 0 0 0 0 0 0 0 0 0 0 0 0 178 362 739 1,146 1,264 683 305 114 0 0 0 0 0 0 0 0 0 0 0 24,869
              18 19,869 0 0 0 0 0 0 0 0 0 0 0 0 0 175 477 816 1,264 1,370 590 289 113 0 0 0 0 0 0 0 0 0 0 24,963
              19 19,740 0 0 0 0 0 0 0 0 0 0 0 0 0 0 237 511 987 1,370 1,335 648 297 80 0 0 0 0 0 0 0 0 0 25,205
              20 19,160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 220 624 957 1,335 1,403 549 222 80 0 0 0 0 0 0 0 0 24,550
              21 18,528 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 277 580 912 1,403 1,165 455 200 98 0 0 0 0 0 0 0 23,618
              22 18,349 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 268 588 1,007 1,165 1,053 452 289 86 0 0 0 0 0 0 23,257
              23 18,109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 279 658 835 1,053 1,162 626 261 80 0 0 0 0 0 23,063
              24 17,759 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 322 523 782 1,162 1,305 562 210 76 0 0 0 0 22,701
              25 17,437 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 248 509 812 1,305 1,137 437 207 88 0 0 0 22,180
              26 17,246 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 233 530 906 1,137 965 433 268 107 0 0 21,825
              27 17,196 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 249 599 806 965 1,027 553 282 88 0 21,765
              28 16,846 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 251 497 702 1,027 1,154 526 243 61 21,307
              29 16,466 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 224 422 746 1,154 1,085 491 197 20,785
              30 16,644 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 185 482 798 1,085 1,076 416 20,686
              31 17,235 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 232 511 769 1,076 1,059 20,882
              32 14,186 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 228 449 755 1,059 16,677
              Total 620,654 3,087 3,916 3,973 4,258 4,386 3,495 3,431 3,622 3,635 3,257 3,315 4,018 3,744 3,331 4,446 4,763 5,368 5,733 5,469 5,844 4,895 4,387 4,647 5,379 4,710 3,966 4,230 4,754 4,303 3,729 2,792 751,537
              Last edited by Isabella Helter; 07 Dec 2021, 15:08.



              • #52
                Dear Fernando,

                csdid is a great command. I have been using it for several weeks now, and it worked well. However, I have received the following error message today:

                csdid outcome, cluster(state) time(cohort) gvar(staggered_cohort) method(drimp) agg(simple)
                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                xxxxxxxxxxxxxxestimates post: matrix has missing values
                r(504);

                Any help with resolving this issue would be greatly appreciated.

                Thank you!

                Best,
                Iryna



                • #53
Dear Isabella
So there are two reasons why you have so many X's and it is taking a while to estimate the model.
At any point, you are using the full dataset to estimate 5 models: 4 for the outcome and 1 for the propensity score. In all cases, you are using a large dataset (751k observations). Even using "if's" (as csdid does when calling drdid), Stata starts from the full dataset, and that takes time.
The other reason, as I suggested, could be overfitting, especially for those cases where you have fewer than 100 obs.
For example, try running the following:
drdid pea $x1list [w=peso] if inlist(tempo_calendario,2,6) & inlist(primeiro_choque,0,3), tvar(tempo_calendario) tr(gvar)

And see what happens. If it takes too long, I would also run a logit model using the same sample:
logit gvar $x1list [w=peso] if inlist(tempo_calendario,2,6) & inlist(primeiro_choque,0,3)

That will give you a better idea of whether or not you are overfitting.




                  • #54
Hi Iryna
My first reaction was going to be to ask whether you have -drdid- installed.
What the output is telling you is that nothing was estimated. That is why you have only "X" as iterations.
It is possible that your gvar/year setup is not correct, so can you show me what happens when you do
tab year gvar

The other alternative is that you have very few observations per cohort and year. This makes drimp very difficult to estimate. In this case I would try method(reg). Without covariates, they will give you the same results.

Finally, it's better if you do not request agg(simple). Just let it provide all the ATTGTs and then request the simple average using "estat simple".
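That workflow, applied to Iryna's command from #52 with method(reg) in place of method(drimp), would look roughly like this (a sketch, not a verified run):

```stata
* Estimate all the ATT(g,t) first, without any aggregation...
csdid outcome, cluster(state) time(cohort) gvar(staggered_cohort) method(reg)

* ...then aggregate afterwards, as needed:
estat simple    // simple weighted average of the ATT(g,t)
estat event     // event-study style aggregation, if useful
```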

                    HTH
                    Fernando



                    • #55
                      Dear Fernando,

                      Thank you so much for your response!

You are right, my gvar is not coded correctly, but I am not sure how to fix this issue.

Here is the output of:

                      tab year gvar
                      | gvar
                      year | 0 23 24 25 26 27 28 | Total
                      1 | 616 0 0 0 0 0 0 | 616
                      2 | 617 0 0 0 0 0 0 | 617
                      3 | 622 0 0 0 0 0 0 | 622
                      4 | 775 0 0 0 0 0 0 | 775
                      5 | 694 0 0 0 0 0 0 | 694
                      6 | 756 0 0 0 0 0 0 | 756
                      7 | 1,142 0 0 0 0 0 0 | 1,142
                      8 | 1,178 0 0 0 0 0 0 | 1,178
                      9 | 1,027 0 0 0 0 0 0 | 1,027
                      10 | 1,048 0 0 0 0 0 0 | 1,048
                      11 | 961 0 0 0 0 0 0 | 961
                      12 | 1,081 0 0 0 0 0 0 | 1,081
                      13 | 1,132 0 0 0 0 0 0 | 1,132
                      14 | 1,165 0 0 0 0 0 0 | 1,165
                      15 | 1,482 0 0 0 0 0 0 | 1,482
                      16 | 1,255 0 0 0 0 0 0 | 1,255
                      17 | 1,598 0 0 0 0 0 0 | 1,598
                      18 | 1,168 0 0 0 0 0 0 | 1,168
                      19 | 1,584 0 0 0 0 0 0 | 1,584
                      20 | 1,251 0 0 0 0 0 0 | 1,251
                      21 | 1,362 0 0 0 0 0 0 | 1,362
                      22 | 1,150 0 0 0 0 0 0 | 1,150
                      23 | 974 288 0 0 0 0 0 | 1,262
                      24 | 697 198 173 0 0 0 0 | 1,068
                      25 | 876 216 72 125 0 0 0 | 1,289
                      26 | 822 360 100 90 143 0 0 | 1,515
                      27 | 666 270 108 180 126 54 0 | 1,404
                      28 | 646 198 90 158 54 54 72 | 1,272
                      29 | 612 192 144 125 88 54 35 | 1,250
                      30 | 562 270 86 78 155 36 0 | 1,187
                      31 | 816 144 112 104 129 72 0 | 1,377
                      32 | 538 198 72 178 201 36 0 | 1,223
                      33 | 405 210 54 115 152 53 0 | 989
                      34 | 391 180 64 67 52 34 42 | 830
                      35 | 481 72 132 67 109 15 18 | 894
                      36 | 607 102 53 28 102 15 46 | 953
                      37 | 338 190 114 58 100 27 0 | 827
                      38 | 273 204 78 79 113 26 16 | 789
                      39 | 280 132 68 38 35 23 0 | 576
-----------+----------------------------------------------------+----------
                      Total | 33,648 3,424 1,520 1,490 1,559 499 229 | 42,369

If I add any constant to the positive values of gvar, the csdid command works (but, of course, this solution doesn't make any sense). However, the p-value for the pretrend test is zero and Stata drops about 60% of the observations. The p-value is also 0 for any subsample, which is very strange.

                      . estat all
                      Pretrend Test. H0 All Pre-treatment are equal to 0
                      chi2(25) = 20934.78777012576
                      p-value = 0

                      I would greatly appreciate any help with resolving these issues.

                      Thank you,
                      Iryna
                      Last edited by Iryna Hayduk; 13 Dec 2021, 12:34.



                      • #56
Can you contact me via email? I think I'll need more information than what you have here to help you with the problem.
                        F



                        • #57
Hello Fernando! I have a question about the csdid Stata package and R: can I add a time fixed effect? I saw you discussing this in one of the topics and I was confused.



                          • #58
No, you can't, and there is no need.
Time fixed effects are used to "take care of differences across time". But with Callaway and Sant'Anna, you use the same years (pre and post) for the treated and control groups to obtain a given estimator, so there is no need to control for that.
Also, keep in mind that every time drdid is used (behind all the operations), you are only using 2 periods of time, so using trends would make little sense.
                            HTH
                            Fernando



                            • #59
                              Hello Fernando
                              Thank you for the great package. I have three simple questions:
Suppose a simple DD model with fixed effects (and only one treatment shock) in Stata with clustered SEs: reghdfe Y treatXpost, absorb(panel_id period) vce(cluster panel_id)

1- Using csdid with ivar already provides the clustered SEs, right (similar to the code above)? I am asking because it does not allow both cluster and ivar in the command [id may not be both target and by()].
On the other hand, when using ivar, the output table does not explicitly say that the SEs are adjusted for clustering, or for how many clusters.

2- Regardless of the SEs, when we have multiple periods but only one treated group (i.e., all treated units are treated at the same time vs. never-treated units), should the ATT from the first line below be exactly the same as the coefficient from the second line, or not? The coefficients are not the same at all (tested in a few datasets, even in a balanced panel).

csdid Y, time(period) gvar(first_treated) method(reg) ivar(panel_id)
reghdfe Y treatXpost, absorb(panel_id period)

where first_treated is zero for all never-treated units and equals the treatment shock period (i.e., the start of the post-treatment periods, which is the same for all; say t5) for all treated units.

3- I wonder why the tables report z-tests rather than t-tests.

                              Thank you!



                              • #60
                                Hi Mahdi
That is a good point. I describe in the paper (still being edited) and in the helpfile that when you request panel estimators (using ivar), the standard errors are implicitly clustered at the panel-id level.
One way of seeing this: if you use repeated cross-sections with only time-constant covariates and a fully balanced panel, using -ivar- (panel) or cluster (repeated cross-section) will give you the same results.
I'll add that to the output next time the program is updated!
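A quick way to check this equivalence on your own data (a sketch; Y, X_fixed, panel_id, period, and first_treated are placeholder names from the question, and the comparison assumes a fully balanced panel):

```stata
* With a fully balanced panel and time-constant covariates, these two
* calls should deliver (essentially) identical ATTs and standard errors.
csdid Y X_fixed, ivar(panel_id) time(period) gvar(first_treated)
csdid Y X_fixed, cluster(panel_id) time(period) gvar(first_treated)
```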

2. No, they will not be the same, but they should be very similar.
The reason is that with the regression approach you are forcing all "treatXpost" effects to be constant.
csdid, however, assumes the effects vary across time and groups. Then, when you request the "simple" aggregation, it takes the average of the individual effects across time. In principle these should be similar to the regression approach, but not the same.
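A side-by-side comparison along these lines (a sketch using the placeholder names from question #59):

```stata
* Heterogeneity-robust: all ATT(g,t), then their simple average.
csdid Y, ivar(panel_id) time(period) gvar(first_treated) method(reg)
estat simple

* TWFE: one constant treatXpost effect.
reghdfe Y treatXpost, absorb(panel_id period) vce(cluster panel_id)

* Expect the two point estimates to be close but not identical:
* TWFE imposes a single constant effect, csdid averages varying ones.
```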

3. It reports z-stats because, by default, csdid uses GMM to estimate standard errors, which are valid asymptotically. Thus, like with -ml- estimators, all reported statistics are really z-stats, not t-stats.
                                Best wishes
                                Fernando

