When using a time series operator like "D.var" or "L3.var" in a regression, Stata reports the number of observations in the regression matrix, which differs from the number of observations needed to compute the lags and leads (example below). This is of course correct in the context of a regression, as it yields the appropriate degrees of freedom, etc. I really want to stress that I understand why Stata reports the N it reports and I would not want it any other way.
However, this also means that the e(sample) function does not flag every observation necessary for performing a time series regression. Running a regression with a time series operator, restricting the data to e(sample)==1, and then rerunning the same regression will change the results. And there are some cases in which one may be interested in the complete pool of observations necessary to perform a command. For instance, some data centres prohibit exporting results that are based on fewer than some minimum number of observations per reported result for privacy reasons. The N reported by a time series regression is smaller than the number of observations actually used. I assume that there are also other cases in which knowing the "full" sample used by a command is of interest, especially when restricting data based on e(sample).
What I am trying to figure out is whether there is a (built-in?) option of reporting "complete" sample sizes and sample markers (= e(sample) function). I provide an easy illustrative example below, but you can imagine that the task of flagging "missed" observations becomes quite complicated with large data sets and more involved models.
Consider this easy example:
The first regression sample size is 6. However, looking at the data frame, information from 3 additional observations is used (id 1, 2, 3 at time==0, see screenshot below). In the second regression, these observations are omitted, resulting in a different result.
Code:
clear
set seed 123456789
set obs 10
gen y = runiform()
gen time = runiformint(0,2)
bys time: gen id = _n
gen first_treat = id - 1
gen treat = 0
replace treat = 1 if first_treat<=time
xtset id time
reg D.y D.treat
gen sample = e(sample)
keep if sample==1
reg D.y D.treat
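For this simple case, the "missed" observations can be flagged by hand. The following is only a sketch of that manual workaround, not a built-in feature: the variable names in_est and full_sample are made up for illustration, and the logic assumes a single one-period difference (D.) on data that have been xtset. A longer lag such as L3. would need the corresponding longer leads (F1. to F3.) as well.

```stata
clear
set seed 123456789
set obs 10
gen y = runiform()
gen time = runiformint(0,2)
bys time: gen id = _n
gen first_treat = id - 1
gen treat = 0
replace treat = 1 if first_treat<=time
xtset id time

reg D.y D.treat

* in_est (hypothetical name): observations Stata counts in the reported N
gen byte in_est = e(sample)

* full_sample (hypothetical name): additionally flag an observation if the
* next period of the same panel is in e(sample), because D. reads its values.
* F.in_est is missing when there is no next period, so (F.in_est == 1) is 0.
gen byte full_sample = in_est | (F.in_est == 1)

* number of observations that actually feed the regression
count if full_sample

* restricting to the wider marker keeps the lagged observations available
keep if full_sample
reg D.y D.treat
```

Because the observations that supply the lagged values are kept, the second regression should be based on the same differenced data as the first. This becomes tedious quickly with several operators of different lengths, which is why a built-in option would be preferable.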