
  • #16
    Thanks for the helpful tips, Ben! Obtaining the correct ATT with weights is no problem, but I am struggling with where to put the weights into the G matrices so that the IF is calculated correctly. I saw that on p. 60 of the aforementioned paper you explain the use of survey weights for the logistic regression example. This clarifies that one should use the sum of weights (W) instead of N, and that in the simple case the weight w just goes into the cross product. I am not sure exactly how this applies to the various G_subscripts: G11 is the simplest case, but I am unsure about the others. I tried out several things but didn't get the right ATT using csdid_rif, so I would be grateful for further guidance.

    Code:
    preserve
     keep if year==2022 & female==0
    mata
    Dnm = "treat"; Xnm = "$controls"
    Ynm = "outcome" ; Znm = "$controls"
    N = st_nobs()
    D = st_data(., Dnm); X = st_data(., Xnm), J(N, 1, 1)
    Y = st_data(., Ynm); Z = st_data(., Znm), J(N, 1, 1)
    w = st_data(., "weight")
    W = sum(w)  // sum of the sampling weights; used in place of N below
    
    // estimate logit and create weights
    stata("quietly logit " + Dnm + " " + Xnm + " [pw=weight]")
    p = invlogit(X * st_matrix("e(b)")')
    w0 = p :/ (1 :- p) :* !D
    st_store(., st_addvar("double", "w0"), w0)
    stata("gen est_weight=w0*weight")
    w0_weight = "est_weight"
    est_weight = st_data(., w0_weight)  
    
    // estimate regression model
    stata("quietly regress " + Ynm + " " + Znm + " if " + Dnm + "==0 [iw=est_weight]")  
    Zg0 = Z * st_matrix("e(b)")'  
    // compute IF for eta01  
    h1 = X :* (D - p)  
    G11inv = invsym(cross(X, w :* p :* (1 :- p), X) / W) //I think this is now ok after adding w and W  
    h2 = Z :* w0 :* (Y :- Zg0) //Should this use est_weight instead of w0?  
    G21 = cross(-h2, X) / W //Needs editing  
    G22inv = invsym(cross(Z, w0, Z) / W) //Needs editing, perhaps using est_weight instead of w0 or equivalently w:* w0?  
    eta01 = mean(Zg0, D:* w)  
    
    h3 = D :* (Zg0 :- eta01)
    G32 = colsum(-D :* Z) / W //Needs editing  
    IF_eta01 = W/sum(D:* w) * (h3 - (h2 - h1 * G11inv' * G21') * G22inv' * G32')  //Not sure here
    
    // compute IF for eta11  
    eta11 = mean(Y, D:* w)  
    IF_eta11 = W/sum(D:* w) * D:* w :* (Y :- eta11) //Needs editing  
    
    // compute IF for ATT  
    ATT = eta11 - eta01  
    ATT  
    st_local("att", strofreal(ATT)) //added to store the att in a local to calculate the RIF from the IF    
    
    IF_ATT = IF_eta11 - IF_eta01  
    st_store(., st_addvar("double", "if_att"), IF_ATT) //added to store the IF for all observations  
    // display results (point estimate, mean of IF, standard error)  
    (ATT, eta11, eta01)', mean((IF_ATT, IF_eta11, IF_eta01))', sqrt(diagonal(variance((IF_ATT, IF_eta11, IF_eta01)) / W)) * sqrt((W-1)/W)  
    end  
    
    sum if_att //check whether the IF is on average 0  
    *Calculate the RIF from IF produced above
    cap drop RIF_att
    gen RIF_att=if_att+`att'  
    
    *Now obtain the ATT with clustered standard errors
    csdid_rif RIF_att, cluster(cluster_var)  
    
    *Save for male population
    keep id RIF_att year
    rename RIF_att RIF_male
    tempfile male_merge
    save `male_merge'  
    restore
    Then repeat for female==1 and merge that onto the main dataset too and use csdid_rif followed by the test command.
    Last edited by Nick Barton; 02 Mar 2023, 18:21. Reason: Fixing formatting of code



    • #17
      Upon further review, I edited the G_subscripts to the following:

      Code:
      G11inv = invsym(cross(X, w :* p :* (1 :- p), X) / W)
      G21 =  cross( -h2, X :* w ) / W
      G22inv = invsym(cross( Z, w :* w0, Z) / W)
      G32 = colsum( -D:* Z :* w) / W
      I also dropped a "w" from here, since it didn't make sense to weight the dummy for an observation being treated or not in this outcome:

      Code:
       IF_eta11 = W/sum(D:* w) * D:* (Y :- eta11)
      While it is much closer to the teffects ipwra estimate, it is still not quite there
      Last edited by Nick Barton; 03 Mar 2023, 07:16.



      • #18
        Sorry for spamming the thread, but I now realised that if I run
        Code:
        sum RIF_att if_att [iw=weight1]
        Then I actually get exactly what I would expect, i.e. the RIF mean is the ATT and the IF mean is close to zero.

        Furthermore, to get a RIF value that can be used by csdid_rif, it is necessary to reweight the RIF

        Code:
        sum weight1 if !mi(RIF_att)
        gen RIF_att_weighted=RIF_att*weight1/`r(mean)'
        
        csdid_rif RIF_att_weighted, cluster(clustervar)
        This then gives the correct ATT from the csdid command, allowing for merging back together and testing.
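        The rescaling works for the point estimate because the unweighted mean of RIF*w/mean(w) is algebraically equal to the weighted mean of RIF. A minimal check of that identity on made-up data is sketched below; x and w are hypothetical stand-ins for RIF_att and the sampling weight, and the sketch says nothing about whether the resulting standard errors are correct.

        ```stata
        clear
        set seed 12345
        set obs 100
        generate double x = rnormal()          // stand-in for RIF_att (simulated)
        generate double w = runiform() + 0.5   // stand-in for the sampling weight (simulated)

        * weighted mean of x
        summarize x [aweight = w], meanonly
        scalar wmean = r(mean)

        * rescale x by w / mean(w), then take the unweighted mean
        summarize w, meanonly
        generate double x_rw = x * w / r(mean)
        summarize x_rw, meanonly

        display "weighted mean of x:      " wmean
        display "unweighted mean of x_rw: " r(mean)
        assert reldif(wmean, r(mean)) < 1e-10
        ```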



        • #19
          Dear Fernando, and others.

          I was wondering if there is an example of comparing the ATTs from two different RIFs from csdid. I see that the code from drdid produces some variables that can be compared using the csdid_rif postestimation command, but given the information within the RIFs from csdid, I have a hard time making the analogy.

          I started with the following code:

          use RIF1, clear
          csdid_stats simple, wboot estore(simple1)
          use RIF2, clear
          csdid_stats simple, wboot estore(simple2)
          test simple1= simple2

          And I obtain the error "simple1 not found".


          Sorry if this has already been addressed elsewhere; I came to this thread because of its title.

          Thanks

          Felipe LR
          Last edited by Felipe Lozano; 15 Feb 2024, 11:56.



          • #20
            Hi Felipe,
            I do not have an example at hand. However, I can suggest where to start.
            You already have two RIF files. You need to merge them together based on the panel ID, then use csdid_rif to create the table of interest.
            HTH



            • #21
              Dear FernandoRios ,

              I have been digging into the functioning of csdid_rif.ado a little lately to understand the logic of using RIFs to calculate asymptotic standard errors. I was wondering about the "correct" number of observations (Nobs) to divide by when calculating the standard errors in line 205 of csdid_rif.ado. I see that for every horizon and cohort you divide by the number of units in your sample, regardless of whether or not they form part of the cohort in question. This confuses me a little, since these cohorts have very different sample sizes (both in terms of overall units and of treated units). This number of observations also remains the same if we make the panel unbalanced. Below is some code, based on an example you used in an earlier thread, that I used to play around with the different objects to better understand what is going on:

              Code:
              use "https://friosavila.github.io/playingwithstata/drdid/mpdta.dta", clear
              
              * make an unbalanced panel
              drop if countyreal < 30000 & first_treat > 2005 & year < 2004
              drop if countyreal < 30000 & first_treat > 2006 & year < 2005
              
              csdid  lemp lpop , ivar(countyreal) time(year) gvar(first_treat) clust(countyreal) method(dripw) notyet saverif(rif_temp) replace
              
              clear
              svmat e(gtt), names(col)
              putmata hor_Ntrt =(cohort t0 t1 N N_trt), replace
              
              use rif_temp.dta, clear
              
              putmata rif = (_g*)
              
              mata
                  bb=mean(rif)
                  nobs=rows(rif)
                  VV=quadcrossdev(rif,bb,rif,bb):/ (nobs^2)
                  VV2=quadcrossdev(rif,bb,rif,bb):/ ((hor_Ntrt[.,5])':*nobs)
                  
                  se1 = sqrt(diagonal(VV))
                  se2 = sqrt(diagonal(VV2))
                  
                  out1 = bb', bb':-(se1:*2), bb':+(se1:*2), se1
                  out2 = bb', bb':-(se2:*2), bb':+(se2:*2), se2
              end
              
              csdid_rif _g*
              rm rif_temp.dta
              My question here is: Why are we dividing by the total N (i.e. the number of rows)? I suspect this ties into a fundamental lack of understanding on my part as to why the mean of the RIFs gives us the cohort-specific gATT and more importantly, why then its variance-covariance matrix yields the asymptotic standard errors (a pointer to relevant literature would be much appreciated).
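              For reference, the generic influence-function argument behind this question can be written out as below. This is a sketch for an asymptotically linear estimator, not a statement taken from the csdid documentation; here θ stands for one group-time ATT, n for the number of rows in the RIF file, and IF_i for the influence-function value of unit i (all notation introduced for this illustration).

              ```latex
              \begin{aligned}
              \mathrm{RIF}_i &= \hat\theta + \mathrm{IF}_i, \qquad
                \frac{1}{n}\sum_{i=1}^{n}\mathrm{IF}_i = 0
                \;\Longrightarrow\;
                \frac{1}{n}\sum_{i=1}^{n}\mathrm{RIF}_i = \hat\theta, \\
              \sqrt{n}\,(\hat\theta - \theta) &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathrm{IF}_i + o_p(1)
                \;\xrightarrow{d}\; N\!\bigl(0,\ \mathrm{E}[\mathrm{IF}_i^{2}]\bigr), \\
              \widehat{\mathrm{Var}}(\hat\theta) &= \frac{1}{n^{2}}\sum_{i=1}^{n}
                \bigl(\mathrm{RIF}_i - \overline{\mathrm{RIF}}\bigr)^{2}.
              \end{aligned}
              ```

              The last line is the scalar version of VV = quadcrossdev(rif,bb,rif,bb):/(nobs^2) in the code above.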

              The reason I am interested in this is the following setting: I have an unbalanced panel of firms with staggered treatment. I want to use csdid to compute event-study-type estimators. However, I want to restrict the control group to firms within the same sector. My estimand of interest is an ATT where each treated firm observed at a specific horizon should be weighted equally (i.e. firms and not sectors should be the unit of observation). My strategy so far was to run csdid several times, each time using only observations within the same sector. I save the RIFs and then try to calculate the horizon specific ATT from there. Since the unweighted average of RIFs for each sector returns the sector-specific ATT for a certain cohort, I compute a weighted average of RIFs. My ATT is then the weighted average of these RIFs. I.e., I replace the step bb=mean(rif) with a weighted mean for different sectors. The following pseudo-code illustrates what I do. Let rif be the matrix of variables called _g* produced by the saverif() command.

              Code:
              For sector A that was of cohort t and is observed at t+h:
              weight_between = (number of firms in sector A, cohort t present at t+h) / (number of treated firms observed at the treatment horizon h)
              weight_within = 1/(number of firms ever observed in sector A)
              weight_tot = weight_within:*weight_between
              
              rif_w = rif:*weight_tot
              I then stack those weighted RIFs belonging to the same treatment horizon for all sectors on top of one another (e.g. treated in 2005 and observed in 2007 on top of treated in 2006 and observed at 2008). Call rif_hor the stacked version of rif and rif_w_hor the stacked version of rif_w
              Code:
              bb=colsum(rif_w_hor)
              rif_for_vv=(rif_hor:-bb):*(sqrt(weight_hor))   // rif_hor is the stacked object without weights applied
              
              VV=quadcross(rif_for_vv,rif_for_vv):/ (nobs)
              se = sqrt(diagonal(VV))
              I am however unsure a) which nobs to apply to this last object and b) whether this procedure is valid. I am reasonably confident of the validity of the point-estimates since it should be equivalent to first computing the RIFs and then taking a weighted average of them. However, for inference I am less certain and would much appreciate some insights.

              Tremendous thanks,
              Daniel
              Last edited by Daniel Prosi; 29 May 2024, 10:26.



              • #22
                Hi Daniel
                It's a bit long to explain, so let me provide you with the basics.
                When using RIFs for different subsamples, you need to take care of the "missing" data by replacing them with the mean (for the IFs, that would be zero). Then you need to rescale the data so that the mean and variance of the "completed" vector are the same as those of the "partial" vector.
                After that, one always uses N (the full vector size) to estimate variances.
                You may want to explore how gmm works to get a better grasp of all the details.
                F
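
                One construction consistent with this description is sketched below in Mata, using made-up numbers. It is an assumption about how the rescaling could be done, not a copy of what csdid_rif.ado does: the in-sample deviations from the mean are inflated by N/n_s, the out-of-sample rows are filled with the mean, and the variance is then computed with the full N. The names rif_s, rif_c, n_s and N are hypothetical.

                ```stata
                mata:
                    // toy RIF values for a subsample of n_s units (made-up numbers)
                    rif_s = (1.2 \ -0.4 \ 0.9 \ 2.3)
                    n_s   = rows(rif_s)
                    N     = 10                      // size of the full sample (assumed)

                    // point estimate and standard error from the subsample alone
                    b    = mean(rif_s)
                    se_s = sqrt(quadcrossdev(rif_s, b, rif_s, b) / n_s^2)

                    // "completed" vector: rescaled deviations in-sample, the mean elsewhere
                    rif_c = (b :+ (rif_s :- b) * (N / n_s)) \ J(N - n_s, 1, b)

                    // same formula as before, but now dividing by the full N
                    b_c  = mean(rif_c)
                    se_c = sqrt(quadcrossdev(rif_c, b_c, rif_c, b_c) / N^2)

                    // both rows should agree: (estimate, standard error)
                    (b, se_s) \ (b_c, se_c)
                    assert(reldif(b, b_c) < 1e-12)
                    assert(reldif(se_s, se_c) < 1e-12)
                end
                ```

                With this rescaling, dividing by N^2 on the completed vector reproduces the subsample calculation that divides by n_s^2, which is one way to see why the full N can be used for every cohort and horizon.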



                • #23
                  Originally posted by FernandoRios:
                  Hi Nick
                  Actually, if you are using DRDID, this is far easier (although you will still need the csdid subcommand).
                  Here is an example:

                  Code:
                  use https://friosavila.github.io/playingwithstata/drdid/lalonde.dta, clear
                  keep if treated==0 | sample==2
                  ** Say you estimate ATT for Black and Nonblack
                  drdid re age educ married nodegree hisp re74 if black==0, ivar(id) time(year) tr(experimental) stub(nb)
                  drdid re age educ married nodegree hisp re74 if black==1, ivar(id) time(year) tr(experimental) stub(b)
                  ** Then you can use the RIFs to compare the effects
                  csdid_rif nbatt batt
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                               | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                         nbatt |  -1002.956   574.7723    -1.74   0.081    -2129.489    123.5773
                          batt |  -237.7852    409.969    -0.58   0.562     -1041.31    565.7392
                  ------------------------------------------------------------------------------
                  
                  This is similar to the mean command, but it follows the rules behind CSDID to pull results together and address missing values.
                  From here, just test:
                  test nbatt=batt

                   ( 1)  nbatt - batt = 0

                             chi2(  1) =    1.17
                           Prob > chi2 =    0.2784
                  Hope this helps

                  Also, in this case the RIF works perfectly, because it is estimated separately for black and white (or, in your case, for women and men).

                  Where the RIF fails is when you try to estimate the RIF for everyone and then test for differences across groups.
                  For example:

                  Code:
                  ** for everyone
                  drdid re age educ married nodegree hisp re74 black , ivar(id) time(year) tr(experimental) stub(nbn)
                  
                  
                  . tabstat nbatt batt nbnatt, by(black)

                  Summary statistics: Mean
                  Group variable: black

                     black |     nbatt      batt    nbnatt
                  ---------+------------------------------
                         0 | -1002.956         . -595.4293
                         1 |         . -237.7852  752.8134
                  ---------+------------------------------
                     Total | -1002.956 -237.7852 -428.4786
                  ----------------------------------------
                  As you can see, the effect is different for the two groups, and it is more positive for blacks compared to whites. But nbnatt suggests that blacks actually benefit from the program, which is not true in this case.
                  This is the pitfall of trying to use an "unconditional" RIF and estimate separate means across categorical variables.

                  The first approach, however, is correct, because you are estimating "conditional" RIFs.

                  HTH
                  Fernando
                  Hi Fernando,

                  I am trying to use this when I have staggered implementation. Do you think that is possible? And if so, should the tr variable indicate whether the state is ever treated?



                  • #24
                    For staggered adoption it is more difficult, because the IFs are not kept in the dataset.
                    So it's possible, just harder to implement.



                    • #25
                      Do you have any idea how I can go about this in Stata?

