Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Implementing an Iterative Loop While Updating What We're Looping Over

    Hey everyone. Suppose we're interested in implementing an algorithm which selects the "optimal" set of controls for a certain treated unit for policy analysis. Take the dataset below (by the way I'm working with Stata 17)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(hongkong australia newzealand austria canada denmark)
     .062   .04048913   .04724391  -.01308351   .01006395  -.01229182
     .059   .03785692   .03875869 -.007580798   .02126387 -.003092842
     .058   .02250948   .08991753  .000542671  .018919427 -.007764421
     .062   .02874655   .06975085  .001180751   .02531683 -.004048589
     .079   .03399039   .06019911   .02551085   .04356715    .0310944
     .068   .03791937   .06255518  .019941313   .05022538      .06428
     .046   .05228941   .04292477  .017087875   .06512183   .04595546
     .052  .031070895   .04760897  .023035197   .06733068   .05516641
     .037  .008696091  .022149924  .025292696    .0509212   .04805718
     .029  .006773674  .023302693  .021849955   .03152506  .011953605
     .012   .00302829   .03045487  .018319173  .018179957   .02080968
     .015  .010981606   .03069211   .01345693  .015165864  .008303516
     .025   .03818205   .04089288  .015387368  .007820651  .010101924
     .036   .03452006   .03584749  .017335817  .011510062   .03085883
     .047   .03667319   .03766511  .013595447   .02166007   .04011321
     .059   .03898745  .025593406  .004195063   .02470353   .02569367
     .058  .025748033  .005592703 -.001534697  .035775363  .029461836
     .072   .05186257  .034571752 -.002021203   .03868772    .0398559
     .061   .05928891   .03353216    .0161481   .03672911   .01764238
     .014   .06342068  .018211482  .017984418    .0380428   .03139775
    -.032   .06243018  .018747104   .03047369  .033733726  .033021096
    -.061   .03949186 -.017352613  .032317627  .028866187 -.010453694
    -.081   .04318146 -.019380176   .02719671    .0189473   .02242801
    -.065  .035846256   .02692859   .02283163   .02248398   .01477213
    -.029       .0409   .03889057   .02242066   .03785257  .003396192
     .005   .03857499   .05541473  .028941363   .04818999  .029012986
     .039   .02692009   .06664848   .03839999  .063958384  .010058184
     .083   .03297373   .04101127   .03883952   .06480881  .026324544
     .107   .03959653   .06054202  .036271237   .06725128   .03857407
     .075   .04687658  .034375284  .030074896   .07295554  .030016804
     .076  .024391215    .0207593  .016178379  .065640524   .03717389
     .063  .005838784   .03254855  .011539886   .05321779  .036737546
     .027  .000732203  .015890202  .005441154   .04056079   .01424755
     .015 -.002025588   .05496934 -.005665725  .007522337  .007846387
    -.001  .028212607  .036681067 -.004335946 -.017121905   .01250594
    -.017   .03998247   .04987184 -.000616657 -.014710952  -.00049339
     -.01   .03867529   .05044868  .003416942 -.011813972  -.00188065
     .005   .04083484  .011498967  .010456725   .01313668   .01491368
     .028  .032215476   .04828115   .01278433    .0296833  .002654192
     .048   .03258446  .015707554  .010514015   .03784149 -.001912172
     .041   .02694338   .02053765  .007017912   .03301423  .005125017
    -.009   .02464757    .0400604  .008199689  .015919374  -.01868332
     .038   .03993078   .04220926  .005746135  .025793217 -.005911494
     .047   .05529072   .05431711  .008809593   .02063149  .017165432
     .077   .05989317  .066612795  .013646583   .02710546   .02884403
      .12   .05748491   .06818556  .017229607   .04896003   .03726172
     .066   .04655641   .05111898   .02444402   .05043372  .036128163
     .079  .030096613   .04093498   .02429197   .04781632   .03432884
     .062   .03160237   .01501046   .02370696   .03985261  .020981267
     .071   .04588263   .02387351    .0257305   .03162305   .05209525
     .081   .04553443  .011341385  .026474627   .03574734   .04374102
     .069   .05498263  .008914476  .032616325    .0503335   .02875244
      .09   .04806678  .018647296   .03832044   .04947587   .04931609
     .062   .02698179 -.009260945  .035103742   .04119911   .03880127
     .064  .032730877  .011500795   .03722008  .031677015   .04183601
     .066   .03857545  .036755715   .03898238    .0200051   .02980916
     .055    .0580129   .03946248  .036197655   .03071206  .033133514
     .062   .05951871   .05829326   .03257025   .03982709 -.007168933
     .068   .05664859   .05114701   .03155845   .03474158  .013516975
     .069   .04582468   .04590492   .01909501   .03812844   .02379412
     .073  .027523303  .031214973  .017430725   .02921731 -.005199719
    end
    In our case, hongkong is our treated unit and the other columns are our control group. The way the first step of the algorithm proceeds is as follows:

    Code:
    clear *
    
    
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(hongkong australia newzealand austria canada denmark)
     .062   .04048913   .04724391  -.01308351   .01006395  -.01229182
     .059   .03785692   .03875869 -.007580798   .02126387 -.003092842
     .058   .02250948   .08991753  .000542671  .018919427 -.007764421
     .062   .02874655   .06975085  .001180751   .02531683 -.004048589
     .079   .03399039   .06019911   .02551085   .04356715    .0310944
     .068   .03791937   .06255518  .019941313   .05022538      .06428
     .046   .05228941   .04292477  .017087875   .06512183   .04595546
     .052  .031070895   .04760897  .023035197   .06733068   .05516641
     .037  .008696091  .022149924  .025292696    .0509212   .04805718
     .029  .006773674  .023302693  .021849955   .03152506  .011953605
     .012   .00302829   .03045487  .018319173  .018179957   .02080968
     .015  .010981606   .03069211   .01345693  .015165864  .008303516
     .025   .03818205   .04089288  .015387368  .007820651  .010101924
     .036   .03452006   .03584749  .017335817  .011510062   .03085883
     .047   .03667319   .03766511  .013595447   .02166007   .04011321
     .059   .03898745  .025593406  .004195063   .02470353   .02569367
     .058  .025748033  .005592703 -.001534697  .035775363  .029461836
     .072   .05186257  .034571752 -.002021203   .03868772    .0398559
     .061   .05928891   .03353216    .0161481   .03672911   .01764238
     .014   .06342068  .018211482  .017984418    .0380428   .03139775
    -.032   .06243018  .018747104   .03047369  .033733726  .033021096
    -.061   .03949186 -.017352613  .032317627  .028866187 -.010453694
    -.081   .04318146 -.019380176   .02719671    .0189473   .02242801
    -.065  .035846256   .02692859   .02283163   .02248398   .01477213
    -.029       .0409   .03889057   .02242066   .03785257  .003396192
     .005   .03857499   .05541473  .028941363   .04818999  .029012986
     .039   .02692009   .06664848   .03839999  .063958384  .010058184
     .083   .03297373   .04101127   .03883952   .06480881  .026324544
     .107   .03959653   .06054202  .036271237   .06725128   .03857407
     .075   .04687658  .034375284  .030074896   .07295554  .030016804
     .076  .024391215    .0207593  .016178379  .065640524   .03717389
     .063  .005838784   .03254855  .011539886   .05321779  .036737546
     .027  .000732203  .015890202  .005441154   .04056079   .01424755
     .015 -.002025588   .05496934 -.005665725  .007522337  .007846387
    -.001  .028212607  .036681067 -.004335946 -.017121905   .01250594
    -.017   .03998247   .04987184 -.000616657 -.014710952  -.00049339
     -.01   .03867529   .05044868  .003416942 -.011813972  -.00188065
     .005   .04083484  .011498967  .010456725   .01313668   .01491368
     .028  .032215476   .04828115   .01278433    .0296833  .002654192
     .048   .03258446  .015707554  .010514015   .03784149 -.001912172
     .041   .02694338   .02053765  .007017912   .03301423  .005125017
    -.009   .02464757    .0400604  .008199689  .015919374  -.01868332
     .038   .03993078   .04220926  .005746135  .025793217 -.005911494
     .047   .05529072   .05431711  .008809593   .02063149  .017165432
     .077   .05989317  .066612795  .013646583   .02710546   .02884403
      .12   .05748491   .06818556  .017229607   .04896003   .03726172
     .066   .04655641   .05111898   .02444402   .05043372  .036128163
     .079  .030096613   .04093498   .02429197   .04781632   .03432884
     .062   .03160237   .01501046   .02370696   .03985261  .020981267
     .071   .04588263   .02387351    .0257305   .03162305   .05209525
     .081   .04553443  .011341385  .026474627   .03574734   .04374102
     .069   .05498263  .008914476  .032616325    .0503335   .02875244
      .09   .04806678  .018647296   .03832044   .04947587   .04931609
     .062   .02698179 -.009260945  .035103742   .04119911   .03880127
     .064  .032730877  .011500795   .03722008  .031677015   .04183601
     .066   .03857545  .036755715   .03898238    .0200051   .02980916
     .055    .0580129   .03946248  .036197655   .03071206  .033133514
     .062   .05951871   .05829326   .03257025   .03982709 -.007168933
     .068   .05664859   .05114701   .03155845   .03474158  .013516975
     .069   .04582468   .04590492   .01909501   .03812844   .02379412
     .073  .027523303  .031214973  .017430725   .02921731 -.005199719
    end
    
    
    
    cls
    
    * Creates the r-squared frame indexed to each individual unit
    
    mkf rsquare
    
    * We only need one row
    
    frame rsquare: set obs 1
    
    * Our time variable
    
    g time = _n, b(hongkong)
    
    * Gets the list of variable names
    
    qui ds
    
    ** Our time column
    
    loc temp: word 1 of `r(varlist)'
    
    loc time: disp "`temp'"
    
    ** Our treated unit
    
    loc t: word 2 of `r(varlist)'
    
    loc treated_unit: disp "`t'"
    
    loc a: word 3 of `r(varlist)'
    
    * Our first control unit
    
    loc donor_one: disp "`a'"
    
    
    local nwords :  word count `r(varlist)'
    
    loc b: word `nwords' of `r(varlist)'
    
    * Our last control unit
    
    loc last_donor: disp "`b'"
    
    *** Step 1: Initial Selection Loop
    
    /* We begin by looping over our controls in regression. */
    
    qui foreach i of var `donor_one'-`last_donor' {
    cap drop cfp
    
    constraint define 1 `i' = 1
    
    
    qui cnsreg `treated_unit' `i' if `time' < 45, constraint(1)
    
    // Calculating our rsquared statistic for the i-th model
    
    qui predict cfp if e(sample) 
    qui corr `treated_unit' cfp if e(sample)
    
    frame rsquare: g `i' = r(rho)^2
    
    cap drop cfp
    
    
    }
    
    
    ** Step 1b: now we select the unit with the highest r-squared statistic
    
    frame rsquare {
    
    qui ds
    loc donors `r(varlist)'
    qui egen max_value = rowmax(*)
    
    qui gen max_var = ""
    
    * Loop through each column and update max_var
    qui foreach var of varlist `donors' {
        replace max_var = "`var'" if `var' == max_value
    }
    
    loc colmax : di max_var[1]
    
    loc U: di "`colmax'"
    
    
    di "First selected unit is `U'"
    
    // In this case it's canada
    
    }
    
    frame drop rsquare
    cls
    We begin by looping over our control units in DID regression (outcome only). We then calculate the predicted counterfactual for this pre-intervention period and calculate the R-squared statistic. We save the r2 to a separate frame. The first unit to be selected is the one which maximizes the r-squared statistic, which we may access in the rsquare frame. We then put this unit up to the front of our control group, and proceed.

    Now, using the new control group (which for now consists of just canada), we take the average of canada and the other control units individually, estimate difference-in-differences with each of the remaining control units, and then find the unit which maximizes the r-squared statistic. In this case, it's New Zealand.

    Code:
    // Step 2: DID Step
    
    *******
    *******
    
    order `U', a(`treated_unit')
    
    mkf rsquare
    
    frame rsquare: set obs 1
    
    loc newdonors : list donors - U
    
    di "`newdonors'"
    
    
    local nwords :  word count `newdonors'
    
    loc temp: word 1 of `newdonors' // Time
    
    loc donor_one: disp "`temp'"
    
    
    loc last_donor: word `nwords' of `newdonors'
    di "The last donor in the set is `last_donor'"
    
    /* In this step, we loop through the REMAINING donors.
    In this case, australia and austria. */
    
    
    foreach i of var `donor_one'-`last_donor' {
    
    // These must be created each time
    
    cap drop cfp
    cap drop ym
    
    * We take the average of controls, using the U selected group and the new donor `v'
    egen ym = rowmean(`U' `i')
    
    constraint define 1 ym = 1
    
    qui cnsreg `treated_unit' ym if `time' < 45, constraint(1)
    
    qui predict cfp if e(sample) 
    
    qui corr `treated_unit' cfp if e(sample)
    
    qui frame rsquare: g `i' = r(rho)^2
    
    cap drop cfp
    
        if `i'==`last_donor' {
            frame rsquare {
            cap drop max_value
            cap drop max_var
            qui ds
            loc donors `r(varlist)'
            egen max_value = rowmax(*)
    
            gen max_var = ""
    
            * Loop through each column and update max_var
            qui foreach var of varlist `donors' {
                replace max_var = "`var'" if `var' == max_value
            }
    
            loc colmax : di max_var[1]
            di "Our next optimal donor is: `colmax'"
            local selectedunit `U' `colmax'
            loc newdonors : list donors - selectedunit
            di "Here are our new controls: `newdonors'"
            }
                
                
        }
    
    
    }
    Here's where my question lies: Now, we must implement step 2 for the remaining 3 controls (australia austria denmark). That is, we must see which of these three, when added to canada and new zealand, maximizes the r2, adding it to the macro/set U. Then, we must see which of the 2 maximize the r2 statistic, and then we simply add the final control unit to set U, such that we have all units included in the set "U". How might I do this? Or at the very least, what might a good starting point be? My initial reaction is to use a while loop which continues to loop until some condition is fulfilled. Maybe, I should, at the end of the loop for step 2, check how many words there are in the `newdonors'. Once there's 0 words in `newdonors', this means that there are no more control units and then the while loop can conclude. Is that a reasonable starting point?

  • #2
    Not sure I understood all of that. Try

    Code:
    local treated_unit hongkong
    
    local U // void
    
    local predictors australia newzealand austria canada denmark
    
    tempvar ym cfp
    
    tempname max_r2
    
    while ("`predictors'" != "") {
        
        scalar `max_r2' = 0
        
        foreach var of local predictors {
            
            quietly {
                
                egen `ym' = rowmean(`U' `var')
                
                constraint 1 `ym' = 1
                
                cnsreg `treated_unit' `ym' if _n < 45 , constraint(1)
                
                predict `cfp' if e(sample)
                correlate `treated_unit' `cfp' if e(sample)
                
                if r(rho)^2 > scalar(`max_r2') {
                    
                    scalar `max_r2' = r(rho)^2
                    local new_U `var'
                    
                }
                
                drop `ym'
                drop `cfp'
                
            }
            
        }
        
        local U `U' `new_U'
        
        local predictors : list predictors - new_U
    
    }
    
    display "`U'"

    Comment


    • #3
      Thank you! Sorry for the sort of confusing explanation. Here is the solution I devised. It also comports with my Python code:

      Code:
      clear *
      
      import delim "https://raw.githubusercontent.com/jgreathouse9/GSUmetricspolicy/main/data/GDP.csv", clear
      
      //drop china
      cls
      
      * Creates the r-squared frame indexed to each individual unit
      mkf rsquare
      
      * We only need one row
      frame rsquare: set obs 1
      
      * Our time variable
      gen time = _n
      
      
      * Gets the list of variable names
      qui ds
      
      * Our treated unit
      local treated_unit hongkong
      
      * Get control units list
      local control_units australia austria canada denmark finland france germany italy japan korea mexico netherlands newzealand norway switzerland unitedkingdom unitedstates singapore philippines indonesia malaysia thailand taiwan china
      
      * Initial selection loop
      foreach i of varlist `control_units'{
          cap drop cfp
      cap drop ym
      cap drop tss rss
          constraint define 1 `i' = 1
          qui cnsreg `treated_unit' `i' if time < 45, constraint(1)
          qui predict cfp if e(sample)
          * Calculate the mean of observed values
          qui summarize `treated_unit'
          local mean_observed = r(mean)
      
          * Calculate the Total Sum of Squares (TSS)
          qui generate double tss = (`treated_unit' - `mean_observed')^2
          qui summarize tss
          local TSS = r(sum)
      
          * Calculate the Residual Sum of Squares (RSS)
          qui generate double rss = (`treated_unit' - cfp)^2
          qui summarize rss
          local RSS = r(sum)
      
          * Calculate and display R-squared
          loc r2 = 1 - (`RSS' / `TSS')
          frame rsquare: gen `i' = `r2'
          cap drop cfp
      }
      
      
      frame rsquare {
      
      qui ds
      loc donors `r(varlist)'
      qui egen max_value = rowmax(*)
      
      qui gen max_var = ""
      
      * Loop through each column and update max_var
      qui foreach var of varlist `donors' {
          replace max_var = "`var'" if `var' == max_value
      }
      
      loc optimal_control_1 : di max_var[1]
      
      }
      
      * Initialize U and D
      local U `optimal_control_1'
      local D: list control_units - U
      
      * Iterate through the remaining control units
      qui while `"`D'"' != "" {
          local next_control ""
          local max_r2 -1
          
          foreach j of local D {
              local current_set `U' `j'
              cap drop cfp
          cap drop ym
          cap drop tss rss
          egen ym = rowmean(`U' `j')
          //di "`U' `j'"
              constraint define 1 ym = 1
              qui cnsreg `treated_unit' ym if time < 45, constraint(1)
              qui predict cfp if e(sample)
          * Calculate the mean of observed values
          qui summarize `treated_unit'
          local mean_observed = r(mean)
      
          * Calculate the Total Sum of Squares (TSS)
          qui generate double tss = (`treated_unit' - `mean_observed')^2
          qui summarize tss
          local TSS = r(sum)
      
          * Calculate the Residual Sum of Squares (RSS)
          qui generate double rss = (`treated_unit' - cfp)^2
          qui summarize rss
          local RSS = r(sum)
      
          * Calculate and display R-squared
          loc r2 = 1 - (`RSS' / `TSS')
              if `r2' > `max_r2' {
                  local next_control `j'
                  local max_r2 `r2'
              }
              cap drop cfp
          }
          
          * Add the best control unit to U
          local U `U' `next_control'
          
          * Remove the selected unit from D
          local D: list D - next_control
      
      }
      drop ym
      order time, first
      order `U', a(hongkong)
      cls
      
      * Initialize an empty local to build the variable list
      local varlist
      local best_r2 = -1
      local best_model = ""
      * Loop through each variable in the list
      foreach x of local U {
      * Drop previous variables
          cap drop cf 
          cap drop ymean 
          cap drop tss
          cap drop rss
          * Add the current variable to the list
          local varlist `varlist' `x'
          
          * Display the current variable list (for debugging purposes)
          //display "`varlist'"
       
          
          * Generate the row mean for the current set of variables
          egen ymean = rowmean(`varlist')
          
          * Define the constraint
          constraint define 1 ymean = 1
          
          * Run the constrained regression
          qui cnsreg `treated_unit' ymean if time < 45, constraints(1)
          
          qui predict cf if e(sample)
          
          * Calculate the mean of observed values
          qui summarize `treated_unit'
          local mean_observed = r(mean)
      
          * Calculate the Total Sum of Squares (TSS)
          qui generate double tss = (`treated_unit' - `mean_observed')^2
          qui summarize tss
          local TSS = r(sum)
      
          * Calculate the Residual Sum of Squares (RSS)
          qui generate double rss = (`treated_unit' - cf)^2
          qui summarize rss
          local RSS = r(sum)
      
          * Calculate and display R-squared
          loc r2 = 1 - (`RSS' / `TSS')
          
          * Check if the current R-squared is the highest
          if (`r2' > `best_r2') {
              local best_r2 = `r2'
              local best_model = "`varlist'"
          }
      }
      
      * Display the best model and its R-squared value
      di "Model with the highest R2 is with variables `best_model' having R2 = `best_r2'"
      
      keep time `treated_unit' `best_model'
      
      egen ymean = rowmean(`best_model')
      
      * Define the constraint
      constraint define 1 ymean = 1
      
      * Run the constrained regression
      qui cnsreg `treated_unit' ymean if time < 45, constraints(1)
      
      qui predict cf
      
      keep time `treated_unit' cf
      
      twoway (connected hongkong time, mcolor(black) msymbol(diamond) lcolor(black) lwidth(medthick) connect(direct)) (connected cf time, mcolor(red) msymbol(triangle) lcolor(red) lwidth(medium)), legend(order(1 "Hong Kong" 2 "FDID Hong Kong") ring(0))
      It basically selects the best DID model (using r2 as the determining factor)

      Comment


      • #4
        Out of curiosity, does this rather complex code produce results different from the code I posted in #2? It seems as if the whole frame and variable ordering in the datasets is unnecessary overhead.

        Comment


        • #5
          yeah this worked! The only issue that I see, for some reason, is the r-squared statistic. When I run the full example,

          Code:
          clear *
          
          import delim "https://raw.githubusercontent.com/jgreathouse9/GSUmetricspolicy/main/data/GDP.csv", clear
          
          //drop china
          cls
          
          * Creates the r-squared frame indexed to each individual unit
          mkf rsquare
          
          * We only need one row
          frame rsquare: set obs 1
          
          * Our time variable
          gen time = _n
          
          * Our treated unit
          local treated_unit hongkong
          
          * Get control units list
          local predictors australia austria canada denmark finland france germany italy japan korea mexico netherlands newzealand norway switzerland unitedkingdom unitedstates singapore philippines indonesia malaysia thailand taiwan china
          
          local treated_unit hongkong
          
          local U // void
          
          //local predictors australia newzealand austria canada denmark
          
          tempvar ym cfp
          
          tempname max_r2
          
          while ("`predictors'" != "") {
              
              scalar `max_r2' = 0
              
              foreach var of local predictors {
                  
                  quietly {
                      
                      egen `ym' = rowmean(`U' `var')
                      
                      constraint 1 `ym' = 1
                      
                      cnsreg `treated_unit' `ym' if _n < 45 , constraint(1)
                      
                      predict `cfp' if e(sample)
                      correlate `treated_unit' `cfp' if e(sample)
                      
                      if r(rho)^2 > scalar(`max_r2') {
                          
                          scalar `max_r2' = r(rho)^2
                          local new_U `var'
                          
                      }
                      
                      drop `ym'
                      drop `cfp'
                      
                  }
                  
              }
              
              local U `U' `new_U'
              
              local predictors : list predictors - new_U
          
          }
          
          display "`U'"
          I get "malaysia newzealand norway thailand singapore mexico japan china philippines australia korea indonesia unitedstates unitedkingdom canada switzerland taiwan denmark germany netherlands italy austria france finland", but that's not the right answer (in the sense that it does not match the order of the Python code). This has to do, apparently, with the calculation of the R squared statistic. When I calculate R squared a different way

          Code:
          clear *
          
          import delim "https://raw.githubusercontent.com/jgreathouse9/GSUmetricspolicy/main/data/GDP.csv", clear
          
          //drop china
          cls
          
          * Creates the r-squared frame indexed to each individual unit
          mkf rsquare
          
          * We only need one row
          frame rsquare: set obs 1
          
          * Our time variable
          gen time = _n
          
          * Our treated unit
          local treated_unit hongkong
          
          * Get control units list
          local predictors australia austria canada denmark finland france germany italy japan korea mexico netherlands newzealand norway switzerland unitedkingdom unitedstates singapore philippines indonesia malaysia thailand taiwan china
          
          local treated_unit hongkong
          
          local U // void
          
          //local predictors australia newzealand austria canada denmark
          
          tempvar ym cfp
          
          tempname max_r2
          
          while ("`predictors'" != "") {
              
              scalar `max_r2' = 0
              
              foreach var of local predictors {
                  cap drop rss
              cap drop tss
                  
                  quietly {
                      
                      egen `ym' = rowmean(`U' `var')
                      
                      constraint 1 `ym' = 1
                      
                      cnsreg `treated_unit' `ym' if _n < 45 , constraint(1)
                      
                      predict `cfp' if e(sample)
                  
                      * Calculate the mean of observed values
                  qui summarize `treated_unit'
                  local mean_observed = r(mean)
          
                  * Calculate the Total Sum of Squares (TSS)
                  qui generate double tss = (`treated_unit' - `mean_observed')^2
                  qui summarize tss
                  local TSS = r(sum)
          
                  * Calculate the Residual Sum of Squares (RSS)
                  qui generate double rss = (`treated_unit' - `cfp')^2
                  qui summarize rss
                  local RSS = r(sum)
          
                  * Calculate and display R-squared
                  loc r2 = 1 - (`RSS' / `TSS')
                  
                      
                      if `r2' > scalar(`max_r2') {
                          
                          scalar `max_r2' = `r2'
                          local new_U `var'
                          
                      }
                      
                      drop `ym'
                      drop `cfp'
                      
                  }
                  
              }
              
              local U `U' `new_U'
              
              local predictors : list predictors - new_U
          
          }
          
          display "`U'"
          I get the answer that matches the original matlab code and my Python code.

          The order matters in this instance because, per my rather messy code above, we will iteratively estimate DD over each of these donors in optimal order of assignment. That is to say, we begin with Phillipines, then we add Singapore, then Thailand... until the end. In this case, that's 24 models. We select the model of these 24 with the best R2, and use that as our final model. So, your code did what I attempted, but for some reason the R2 produces different values compared to this one. The correct answer, in this instance, is the first 9 optimal controls

          Comment


          • #6
            I did not want to imply that order is irrelevant. I just did (and do) not see the need to order within the loop(s). Likewise, the values of treated_unit never change so you should not recalculate TSS repeatedly within the loop; and you certainly do not want to do it within the inner loop. Also, you only need one call to summarize if you

            Code:
            summarize `treated_unit'
            scalar TSS = r(Var) * (r(N)-1)
            Note that I use a scalar not a local to maintain precision. I don't think that is necessary for what you are doing but I noticed that you used double precision for the variables. You do not do that with predict, though.

            One more thing: Do you really need the e(sample) restriction inside the loop? Shouldn't the regressions (and sums of squares) not all be based on the same observations?

            Comment


            • #7
              Likewise, the values of treated_unit never change so you should not recalculate TSS repeatedly within the loop; and you certainly do not want to do it within the inner loop... Note that I use a scalar not a local to maintain precision.
              Oh yeah that's pretty true. And yes, the scalar method does maintain precision, so I'll change this then.

              The e(sample) restriction simply means we are calculating for only the sample we just estimated with right? In this case, we are fitting the model in the pre-period only (or, t < 44). If I didn't use e(sample), that would be predicting the post-intervention counterfactual, right?

              Comment


              • #8
                Originally posted by Jared Greathouse View Post
                The e(sample) restriction simply means we are calculating for only the sample we just estimated with right? In this case, we are fitting the model in the pre-period only (or, t < 44). If I didn't use e(sample), that would be predicting the post-intervention counterfactual, right?
                I see. But then you should estimate the TSS and RSS on that sample, too, right? Also, the period is constant over all regressions so I would start with

                Code:
                mark sample if time < 45
                Then, to make sure that all regressions include the same non-missing observations, I would add

                Code:
                markout sample hongkong australia austria canada ...
                After all, you don't want to compare r-squared values for regressions based on different samples because of missing values, do you?

                With my variable sample (which could be a temporary variable if needed) in place, I would

                Code:
                preserve
                keep if sample
                ...
                and then start the algorithm on the reduced dataset, which should also result in slight speed gains compared to repeatedly evaluating the if expression. The cost is preserve, of course.



                EDIT: I just realized that the RSS is calculated for the sample only, because the predictions are restricted to e(sample), leading to out-of-sample missing values for the differences in rss that summarize will then ignore. But that's pretty implicit code. Also, TSS is still based on the whole (wrong) sample.
                Last edited by daniel klein; 11 Jul 2024, 17:36.

                Comment

                Working...
                X