  • "complete" sample sizes and e(sample) with lags and leads

    When using a time-series operator like "D.var" or "L3.var" in a regression, Stata reports the number of observations in the regression matrix, which differs from the number of observations that feed into the lags and leads (example below). This is of course correct in the context of a regression, as it yields the appropriate degrees of freedom, etc. I really want to stress that I understand why Stata reports the N it reports, and I would not want it any other way.

    However, this also means that e(sample) does not flag every observation needed to perform a time-series regression. Running a regression with a time-series operator, restricting the sample to e(sample)==1, and then rerunning the same regression will change the results. There are cases in which one may be interested in the complete pool of observations needed to run a command. For instance, some data centres prohibit exporting results that are based on fewer than some minimum number of observations per reported result, for privacy reasons. The N reported by a time-series regression understates the number of observations actually used. I assume that there are also other cases in which knowing the "full" sample used by a command is of interest, especially when restricting data based on e(sample).

    What I am trying to figure out is whether there is a (built-in?) option for reporting "complete" sample sizes and sample markers (analogous to e(sample)). I provide an easy illustrative example below, but you can imagine that the task of flagging "missed" observations becomes quite complicated with large data sets and more involved models.

    Consider this easy example:

    Code:
    clear
    set seed 123456789
    set obs 10
    gen y = runiform()
    gen time = runiformint(0,2)
    bys time: gen id = _n
    gen first_treat = id - 1
    gen treat = 0
    replace treat = 1 if first_treat<=time
    
    xtset id time
    
    reg D.y D.treat
    gen sample = e(sample)
    
    keep if sample==1
    reg D.y D.treat
    The first regression reports a sample size of 6. However, looking at the data, information from 3 additional observations is used (id 1, 2, 3 at time==0; see screenshot below). In the second regression these observations are dropped, which changes the results.
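
    For a first difference, the "missed" observations can be flagged by hand. The following is a minimal sketch, not a built-in feature: it assumes the data are still xtset by id and time and that it is run after "gen sample = e(sample)" but before "keep if sample==1". The names contributes_lag and full_sample are hypothetical helpers introduced here for illustration.

    ```stata
    * Sketch: flag observations that supply the lagged value for D.y / D.treat.
    * F.sample is the value of sample one period ahead within the same id.
    * It is missing when no such observation exists, so the comparison is 0 there.
    sort id time
    gen byte contributes_lag = (F.sample == 1)

    * "Full" sample: in the estimation sample itself, or feeding a difference.
    gen byte full_sample = (sample == 1) | contributes_lag

    * Inspect the result panel by panel and count the flagged observations.
    list id time y treat sample contributes_lag full_sample, sepby(id)
    count if full_sample
    ```

    For a longer lag such as L3.var, the same idea would use F3.sample instead of F.sample; models that mix several operators need one such flag per operator.
    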


    [Screenshot: example.PNG]

  • #2
    Unless you are dealing with an estimator that excludes singletons in panel data, the estimation sample is always the set of observations with nonmissing values on all model variables. You could use a regex like the one below to strip the time-series operators from the command line and quietly rerun the regression; e(sample) from that run then marks the sample you want.

    Code:
    clear
    set seed 123456789
    set obs 20
    gen y = runiform()
    gen time = runiformint(0,2)
    bys time: gen id = _n
    gen first_treat = id - 1
    gen treat = 0
    replace treat = 1 if first_treat<=time
    replace treat=. in 15/18
    
    xtset id time
    
    reg D.y D.treat
    di ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")
    quietly `=ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")'
    gen sample= e(sample)
    
    keep if sample
    reg D.y D.treat
    Results:

    Code:
    . reg D.y D.treat
    
          Source |       SS           df       MS      Number of obs   =         9
    -------------+----------------------------------   F(1, 7)         =      5.97
           Model |  .520004991         1  .520004991   Prob > F        =    0.0445
        Residual |  .609597802         7    .0870854   R-squared       =    0.4603
    -------------+----------------------------------   Adj R-squared   =    0.3832
           Total |  1.12960279         8  .141200349   Root MSE        =     .2951
    
    ------------------------------------------------------------------------------
             D.y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           treat |
             D1. |  -.7648566   .3130033    -2.44   0.045    -1.504992   -.0247214
                 |
           _cons |   -.056797   .1043344    -0.54   0.603    -.3035088    .1899147
    ------------------------------------------------------------------------------
    
    .
    . di ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")
    regress y treat
    
    .
    . quietly `=ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")'
    
    .
    . gen sample= e(sample)
    
    .
    .
    .
    . keep if sample
    (4 observations deleted)
    
    .
    . reg D.y D.treat
    
          Source |       SS           df       MS      Number of obs   =         9
    -------------+----------------------------------   F(1, 7)         =      5.97
           Model |  .520004991         1  .520004991   Prob > F        =    0.0445
        Residual |  .609597802         7    .0870854   R-squared       =    0.4603
    -------------+----------------------------------   Adj R-squared   =    0.3832
           Total |  1.12960279         8  .141200349   Root MSE        =     .2951
    
    ------------------------------------------------------------------------------
             D.y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           treat |
             D1. |  -.7648566   .3130033    -2.44   0.045    -1.504992   -.0247214
                 |
           _cons |   -.056797   .1043344    -0.54   0.603    -.3035088    .1899147
    ------------------------------------------------------------------------------
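
    The regex in this post only strips bare operators such as D. or L. Below is a hedged sketch of a variant that also handles numbered or combined operators like L3. or L2D. The pattern and the local names cmd and stripped are assumptions introduced for illustration, and parenthesized forms such as L(1/3). are not covered. It also skips lower(), so mixed-case variable names are left intact.

    ```stata
    * Sketch only: assumes a regression with time-series operators was just run,
    * so that e(cmdline) holds its command line.
    local cmd `"`e(cmdline)'"'

    * Strip prefixes such as D., L3., F2., or L2D.: one or more operator
    * letters, each optionally followed by digits, then a period.
    local stripped = ustrregexra(`"`cmd'"', "\b(?:[LFDSlfds][0-9]*)+\.", "")
    display `"`stripped'"'

    * Rerun without operators; e(sample) now marks every observation with
    * nonmissing values on the base variables.
    quietly `stripped'
    generate byte full_sample = e(sample)
    ```

    It is worth checking the displayed command before relying on the marker, since a pattern like this could also alter an if-condition or option that happens to contain a matching token.
    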



    • #3
      I don't think Stata provides a marker for the observations used to create the leads and lags. You can probably get that manually via something like:

      Code:
      mark full_sample
      tsrevar varlist_with_ts_operators , list
      markout full_sample `r(varlist)'
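
      Applied to the example from the first post, this might look as follows. It is a sketch that assumes the example data are in memory; the local name basevars is hypothetical. Note that markout flags every observation with nonmissing base variables, which can be a superset of the observations that actually feed a lag or lead.

      ```stata
      xtset id time
      regress D.y D.treat

      * tsrevar with the list option creates no variables; it only returns
      * the base variable names (here: y treat) in r(varlist).
      tsrevar D.y D.treat, list
      local basevars `r(varlist)'

      * Start with a marker equal to 1 everywhere, then set it to 0 wherever
      * any base variable is missing.
      mark full_sample
      markout full_sample `basevars'
      count if full_sample
      ```
      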