"complete" sample sizes and e(sample) with lags and leads

Alexander Busch

Join Date: Oct 2022

Posts: 17
#1

18 Jun 2024, 06:05

When using a time-series operator such as "D.var" or "L3.var" in a regression, Stata reports the number of observations in the regression matrix, which differs from the number of observations that the lags and leads draw on (example below). This is of course correct in the context of a regression, since it yields the appropriate degrees of freedom, etc. I want to stress that I understand why Stata reports the N it reports, and I would not want it any other way.

However, this also means that e(sample) does not flag every observation needed to run a time-series regression. Running a regression with a time-series operator, restricting the sample to e(sample)==1, and then rerunning the same regression will change the results. There are cases in which one may be interested in the complete pool of observations needed to run a command. For instance, some data centres prohibit, for privacy reasons, exporting results that are based on fewer than some minimum number of observations. The N reported by a time-series regression understates the number of observations actually used. I assume there are other cases in which knowing the "full" sample used by a command is of interest, especially when restricting data based on e(sample).

What I am trying to figure out is whether there is a (built-in?) way to report "complete" sample sizes and sample markers (an equivalent of e(sample)). I provide a simple illustrative example below, but you can imagine that flagging the "missed" observations becomes quite complicated with large data sets and more involved models.

Consider this simple example:

Code:

clear
set seed 123456789
set obs 10
gen y = runiform()
gen time = runiformint(0,2)
bys time: gen id = _n
gen first_treat = id - 1
gen treat = 0
replace treat = 1 if first_treat<=time
xtset id time
reg D.y D.treat
gen sample = e(sample)
keep if sample==1
reg D.y D.treat

The sample size of the first regression is 6. However, looking at the data, information from 3 additional observations is used (id 1, 2, 3 at time==0, see screenshot below). In the second regression these observations have been dropped, so the estimates change.
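
The extra observations described above can be flagged directly. What follows is only a minimal sketch, not a built-in feature: it assumes a first-order D. operator (so each used observation needs the previous period of the same panel), and the variable names used and needed are illustrative. Other operators (L3., F., D2., ...) would need the lead adjusted accordingly.

```stata
* Rebuild the example data from this post.
clear
set seed 123456789
set obs 10
gen y = runiform()
gen time = runiformint(0,2)
bys time: gen id = _n
gen first_treat = id - 1
gen treat = 0
replace treat = 1 if first_treat<=time
xtset id time

* Observations that enter the regression matrix.
quietly regress D.y D.treat
gen byte used = e(sample)

* D.x at time t needs x at time t-1, so an observation is also needed
* when the next period of the same panel is in the estimation sample.
* F.used is missing where there is no next period; (. == 1) is false.
gen byte needed = used | (F.used == 1)

sort id time
list id time used needed, sepby(id)
count if needed

* Restricting to the wider marker should leave the regression unchanged,
* because only observations that contribute nothing are dropped.
keep if needed
regress D.y D.treat
```

Comparing count if used with count if needed gives the reported N next to the number of observations the differences actually draw on.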
Tags: lags/leads, panel data, regression, sample size, Time Series

Andrew Musau

Join Date: Oct 2014
Posts: 9939

18 Jun 2024, 06:40

Unless you are dealing with an estimator that excludes singletons in panel data, the estimation sample is always the set of observations with nonmissing values. You could use a regex of the form below to strip the time-series operators from the command line and quietly rerun the regression without them; e(sample) from that run then marks the wanted sample.

Code:

clear
set seed 123456789
set obs 20
gen y = runiform()
gen time = runiformint(0,2)
bys time: gen id = _n
gen first_treat = id - 1
gen treat = 0
replace treat = 1 if first_treat<=time
replace treat=. in 15/18

xtset id time

reg D.y D.treat
di ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")
quietly `=ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")'
gen sample= e(sample)

keep if sample
reg D.y D.treat

Res.:

Code:

. reg D.y D.treat

      Source |       SS           df       MS      Number of obs   =         9
-------------+----------------------------------   F(1, 7)         =      5.97
       Model |  .520004991         1  .520004991   Prob > F        =    0.0445
    Residual |  .609597802         7    .0870854   R-squared       =    0.4603
-------------+----------------------------------   Adj R-squared   =    0.3832
       Total |  1.12960279         8  .141200349   Root MSE        =     .2951

------------------------------------------------------------------------------
         D.y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       treat |
         D1. |  -.7648566   .3130033    -2.44   0.045    -1.504992   -.0247214
             |
       _cons |   -.056797   .1043344    -0.54   0.603    -.3035088    .1899147
------------------------------------------------------------------------------

.
. di ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")
regress y treat

.
. quietly `=ustrregexra(lower("`e(cmdline)'"), "\bd\.\b|\bf\.\b|\bs\.\b|\bl\.\b", "")'

.
. gen sample= e(sample)

.
.
.
. keep if sample
(4 observations deleted)

.
. reg D.y D.treat

      Source |       SS           df       MS      Number of obs   =         9
-------------+----------------------------------   F(1, 7)         =      5.97
       Model |  .520004991         1  .520004991   Prob > F        =    0.0445
    Residual |  .609597802         7    .0870854   R-squared       =    0.4603
-------------+----------------------------------   Adj R-squared   =    0.3832
       Total |  1.12960279         8  .141200349   Root MSE        =     .2951

------------------------------------------------------------------------------
         D.y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       treat |
         D1. |  -.7648566   .3130033    -2.44   0.045    -1.504992   -.0247214
             |
       _cons |   -.056797   .1043344    -0.54   0.603    -.3035088    .1899147
------------------------------------------------------------------------------

daniel klein

Join Date: Mar 2014

Posts: 3798
#3

18 Jun 2024, 06:43

I don't think Stata provides a marker for the observations used to create the leads and lags. You can probably get that manually via something like:

Code:

mark full_sample
tsrevar varlist_with_ts_operators , list
markout full_sample `r(varlist)'
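
Applied to the example from #1, the three commands above might look like the sketch below. It assumes the regression is reg D.y D.treat, so the placeholder varlist becomes D.y D.treat. Note that this marks every observation with nonmissing values in the base variables, which can be a superset of the observations the regression actually draws on (for example, a panel observed only once is still marked).

```stata
* Rebuild the example data from post #1.
clear
set seed 123456789
set obs 10
gen y = runiform()
gen time = runiformint(0,2)
bys time: gen id = _n
gen first_treat = id - 1
gen treat = 0
replace treat = 1 if first_treat<=time
xtset id time

* Start with every observation marked (full_sample = 1).
mark full_sample

* With the list option, tsrevar creates no new variables; it returns
* the operator-free base names (here: y treat) in r(varlist).
tsrevar D.y D.treat, list

* Unmark observations with a missing value in any base variable.
markout full_sample `r(varlist)'

count if full_sample
```

If a tighter marker is needed, full_sample could be combined with e(sample) and its leads or lags, depending on the operators in the model.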