95% Confidence/Prediction Intervals for CVLASSO

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

95% Confidence/Prediction Intervals for CVLASSO

08 Jan 2022, 14:01

I'm writing a synthetic control estimator using LASSO (SCUL) based on the original R code (which we thankfully don't need to get into very much at all). The goal of course is to predict the counterfactual for a given policy. Here I use the pre-intervention outcomes of donor-pool units as predictors of the pre-intervention outcomes for the treated unit. I impute the counterfactual via prediction, and graph the observed versus potential outcome over time, much as we currently do in synthetic control research. The sub-routine of interest in my command is the est_lasso routine, which uses the user-written cvlasso command from the lassopack

I'd like to generate prediction/confidence intervals and graph these as part of the Differences_1 graph. The utility of this is that such intervals allow myself and others to see how uncertain the estimates themselves are. However, doing this would demand a standard error, which doesn't appear to be available in this context.

Does anyone have any immediate reaction as to how I might do this?

Let's look at my full program (don't worry, I install all user written commands I use and so on, so it should work the first time).

Code:

*****************************************************
version 14 // Stata SE
set more off
*****************************************************

* Programmer: Jared A. Greathouse

* Institution:    Georgia State University

* Contact:         [email protected]

* Created on : Jan 2, 2022

* Contents: 1. Purpose

*  2. Program Versions

*****************************************************

* 1. Purpose

/* */


* 2. Program

cap prog drop scul // Drops previous iterations of the program

*! SCUL v1.0.0, Jared Greathouse, 1/2/22
prog scul
    graph close _all
    graph drop _all
    
    * Installs needed commands
    loc package st0594 gr0034 dm0042_3
    
    foreach x of loc package { // begin foreach

        qui: cap which cvlasso
    
            if _rc { // if command is missing

            qui: net inst `x'.pkg, replace
        
            } // ends if
    } // ends foreach
    
    loc comm gtools labvars schemepack
    
    foreach x of loc comm { // begin foreach

        qui: cap which `x'
    
            if _rc { // if command is missing

            qui: ssc inst `x', replace
        
            } // ends if
    } // ends foreach
    
qui: xtset
gl time: disp "`r(timevar)'"

gl panel: disp "`r(panelvar)'"


    syntax anything, ///
        TReated(varname) /// We need a treatment variable as 0 1
        ahead(numlist min=1 max=1 >=1 int) // Number of forecasting periods. 
        
gettoken depvar anything: anything

local y_lab: variable label `depvar'

gl outlab: disp "`y_lab'"

        
/**********************************************************

    
    
    * Pre-Processing*


   
**********************************************************/
preserve

numcheck, unit($panel) time($time) depvar(`depvar') // Routine 1

treatprocess, treat(`treated') time($time) unit($panel) // Routine 2

data_org, unit($panel) time($time) depvar(`depvar') treat(`treated') // Routine 3



/**********************************************************

    
    
    * Estimation*


   
**********************************************************/


est_lasso, time($time) h(`ahead')


restore
end


/**********************************************************

    *Section 1: Data Setup
    
**********************************************************/

cap prog drop numcheck // Subroutine 1.1
prog numcheck
    
syntax, unit(varname) time(varname) depvar(varname)
    
        
/*#########################################################

    * Section 1.1: Extract panel vars

    Before SCM can be done, we need panel data.
    
    
    Along with the R package, I'm checking that
    our main vairables of interest, that is,
    our panel variables and outcomes are all:
    
    a) Numeric
    b) Non-missing and
    c) Non-Constant
    
*########################################################*/
cls
di as txt "{hline}"
di as txt "Algorithm: Synthetic LASSO"
di as txt "{hline}"
di as txt "First Step: Data Setup"
di as txt "{hline}"
di as txt "Checking that setup variables make sense."

tempvar obs_count
by `unit' (`time'), sort: gen `obs_count' = _N
qui su `obs_count'

qui drop if `obs_count' < `r(max)'


    foreach v of var `unit' `time' `depvar' {
    cap {    
        conf numeric v `v', ex
        
        as !mi(`v')
        
        qui: su `v'
        
        as r(sd) ~= 0
    }
    }
    if !_rc {
        
        
        disp as res "Setup successful. All variables `unit' (ID), `time' (Time) and `depvar' (Outcome) pass."
        di as txt ""
        disp as res "All are numeric, not missing and non-constant."
    }
    
    else if _rc {
        
        
        
        disp as err "All variables `unit' (ID), `time' (Time) and `depvar' must be numeric, not missing and non-constant."
        exit 498
    }

end
        
cap prog drop treatprocess // Subroutine 1.2
prog treatprocess
    
syntax, treat(varname) time(varname) unit(varname)

/*#########################################################

    * Section 1.2: Check Treatment Variable

    Before SCM can be done, we need the treatment to make sense.
    
    
    Making sense is simply the treatment being non-missing
    and being either 0 or 1
*########################################################*/

di as txt "{hline}"
di as txt "Checking that the treatment makes sense..."
di as txt "{hline}"

cap {
cap as !mi(`treat') & inlist(`treat',0,1)

qui: su `time' if `treat' ==1

gl int_date = r(min)
loc last_date = r(max)

su `unit' if `treat' ==1, mean

loc treat_id: disp `r(mean)'
}

    if !_rc {
        
        su $panel if `treat' ==1, mean
        loc clab: label ($panel) `r(mean)'
        gl treat_lab: disp "`clab'"
        
        
        qui: levelsof $panel if `treat' == 0 & $time > $int_date, l(labs)

        local lab : value label $panel

        foreach l of local labs {
            local all `all' `: label `lab' `l'',
        }

        loc controls: display "`all'"
        
        
        disp as res "The intervention variable `treat' passes. Continue."        
        di as txt ""
        disp as res "Intervention is measured between $int_date to `last_date'"
        
        qui: distinct `unit' if `treat' == 0
        
        loc dp_num = r(ndistinct) - 1
        
        as `dp_num' > 2
        di as res "{hline}"
        di as res "Treated Unit: $treat_lab"
        di as txt ""
        di as res "Control Units: `dp_num' total donor pool units"
        di as txt ""
        di as res "Specifically: `controls'"

    }



end

cap prog drop data_org // Subroutine 1.3
prog data_org
    
syntax, time(varname) depvar(varname) unit(varname) treat(varname)

/*#########################################################

    * Section 1.3: Reorganizing the Data into Matrix Form

    We need a wide dataset to do what we want.
*########################################################*/

di as txt "{hline}"
di as txt "Reshaping Data..."

su `unit' if `treat' ==1, mean

loc treat_id: disp `r(mean)'


keep `unit' `time' `depvar'

qui: greshape wide `depvar', j(`unit') i(`time')
di as txt "Done"
qui: tsset `time'

order `depvar'`treat_id', a(`time')

end


/**********************************************************

    *Section 2: Estimating Causal Effects
    
**********************************************************/

cap prog drop est_lasso // Subroutine 2.1
prog est_lasso
    
syntax, time(varname) h(numlist min=1 max=1 >=1 int)
di as txt "{hline}"
di as txt "Second Step: Estimation"
di as txt "{hline}"


qui: ds

loc temp: word 1 of `r(varlist)'

loc time: disp "`temp'"

loc t: word 2 of `r(varlist)'

loc treated_unit: disp "`t'"

loc a: word 3 of `r(varlist)'

loc donor_one: disp "`a'"

local nwords :  word count `r(varlist)'

loc b: word `nwords' of `r(varlist)'

loc last_donor: disp "`b'"

di as txt "Estimating... This could take a while..."

cvlasso `treated_unit' `donor_one'-`last_donor' if `time' < $int_date, lse lglmnet roll h(`h')

qui{
cap drop cf
predict double cf, lopt
}

qui: keep `time' `treated_unit' cf

g relative = `time'- $int_date

g diff = `treated_unit'- cf

sa "scul_$treat_lab", replace

qui: esize unpaired `treated_unit' == cf if relative < 0, cohensd

loc D: disp float(`r(d)')

qui: su diff if relative > 0, mean

loc ATT: disp %9.4g `r(mean)'


tw ///
    (line `treated_unit' `time', lcol(black)) /// Real Outcomes
    (line cf `time', lcol("237 41 57")), /// Potential Outcomes
        xli($int_date, lcol("152 152 152")) ///
        scheme(white_tableau) ///
        legend(order(1 "$treat_lab" 2 "SCUL $treat_lab") ///
        color(black) fcolor(white) region(fcolor(white)) position(6) rows(1)) ///
        yti("$outlab") ///
        ylab(, noticks nogrid) xlab(, noticks nogrid) name("Differences_1")
    
qui su relative

loc time_max = `r(max)'

qui su diff

loc effmin = `r(min)'

loc effmax = `r(max)'
    
twoway (line diff relative, lcol(black)), xli(-1, lpat(solid) lcol(black)) yli(0, lcol(red)) ///
    ylab(, noticks nogrid) ///
    xlab(, noticks nogrid) ///
    scheme(white_tableau) ///
    legend(off) ///
    yti("Treatment Effect") ///
    xti("Relative Event Time, $treat_lab") name("Differences_2") ///
    note("Cohen's D: `D'" "ATT: `ATT'", position(3)) 

macro drop treat_lab outlab int_date
end

with the code of interest here being the very last routine. Again, all I'd like to do here is simply generate uncertainty estimates for the counterfactual.

The code which runs the estimator itself for Proposition 99's dataset (publicly available from GitHub) is

Code:

import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear

egen id = group(state) // makes a unique ID


cap which labmask
if _rc {
    
    qui net inst gr0034.pkg, replace
}
labmask id, values(state)

xtset id year, y // We now have yearly panel data

lab var packspercapita "Packs Sold per 100k"
cls

scul packspercapita, tr(treated) ahead(5)

I presume there's a way to do this. Any idea how I'd begin?

Tags: lassopack, panel data, regression, Time Series, treatment-effect

Announcement

95% Confidence/Prediction Intervals for CVLASSO