Estpost summarize missing and whole sample

Tharcisio Leone

Join Date: Sep 2019

Posts: 37
#1


05 Jan 2022, 10:48

Dear Statalisters,

I suspect that the missing values in my dataset are NOT missing completely at random. For this reason, I will check this assumption later with -mcartest-.

But before that, I would like to create a table reporting the means for the whole sample and for the final estimation sample. (You can replicate this yourself in Stata as follows.)

Code:

webuse nhanes2
misstable summarize bpdiast lead fhtatk loglead highlead
qui regress bpdiast lead
eststo whole
qui regress bpdiast lead fhtatk loglead highlead
eststo subsample
esttab whole subsample // Note that N decreased from 4,948 to 2,569

--------------------------------------------
                      (1)             (2)   
                  bpdiast         bpdiast   
--------------------------------------------
lead                0.251***       -0.194   
                   (8.45)         (-0.83)   
fhtatk                              2.648   
                                   (1.69)   
loglead                             6.488*  
                                   (2.43)   
highlead                            2.208   
                                   (0.78)   
_cons               78.11***        66.75***
                 (169.00)         (17.45)   
--------------------------------------------
N                    4948            2569   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

generate missing=0
replace missing = 1 if lead==. | fhtatk==. | loglead==. | highlead==.
eststo nomissing: quietly estpost summarize lead fhtatk loglead highlead if missing == 0
eststo missing: quietly estpost summarize lead fhtatk loglead highlead if missing == 1
eststo diff: quietly estpost ttest lead fhtatk loglead highlead, by(missing) unequal
esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabel

---------------------------------------------------------------------------------------------
                      (1)                       (2)                       (3)                
                     mean           sd         mean           sd            b               t
---------------------------------------------------------------------------------------------
lead               11.968        4.651       16.860        6.581       -4.892***    (-29.982)
fhtatk              0.028        0.164        0.030        0.172       -0.003        (-0.599)
loglead             2.411        0.381        2.758        0.365       -0.347***    (-32.689)
highlead            0.014        0.119        0.108        0.310       -0.093***    (-13.756)
---------------------------------------------------------------------------------------------
N                    2569                      5244                      7813                
---------------------------------------------------------------------------------------------

Note that the table above compares two groups (missing = 0 vs. missing = 1), but this is not exactly what I am looking for.
I would like to create a table where the first group is the whole sample, that is, the sample of the first estimation with N = 4,948, while the second group remains the same (the sample of the second estimation with N = 2,569). The first rows could be similar to:

Code:

eststo nomissing: quietly estpost summarize lead fhtatk loglead highlead
eststo missing: quietly estpost summarize lead fhtatk loglead highlead if missing == 1
eststo diff: quietly estpost ttest ????

Does anyone have any idea how I can change the "eststo diff" line to achieve this goal?

Andrew Musau

Join Date: Oct 2014
Posts: 9950

05 Jan 2022, 13:07

estout is from the Stata Journal/SSC (FAQ Advice #12). The entire sample and the subsample are not disjoint, so a t-test comparing the two makes no sense. I suppose that what you want is to compare the observations in the whole sample that are dropped because of missing values with the observations that have no missing values, as these two groups are disjoint. Note that in your example, the variable "fhtatk" is what selects the samples, so you exclude it from the test.

Code:

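webuse nhanes2, clear
* flag the estimation sample of the short regression (whole sample)
qui regress bpdiast lead
gen whole=e(sample)
* flag the estimation sample of the full regression (subsample)
qui regress bpdiast lead fhtatk loglead highlead
gen subsample= e(sample)
* group = 0 if in the subsample, 1 if in the whole sample only, missing otherwise
gen group= whole & !subsample if whole
generate nomiss= !missing(lead)
* columns (1) and (2): whole sample and subsample; column (3): t-test across the disjoint groups
eststo nomissing: estpost summarize lead fhtatk loglead highlead if whole
eststo missing: estpost summarize lead fhtatk loglead highlead if subsample
eststo diff: estpost ttest lead loglead highlead, by(group) unequal
esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabel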

Res.:

Code:

. esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)
> ) t(pattern(0 0 1) par fmt(3))") nolabel

---------------------------------------------------------------------------------------------
                      (1)                       (2)                       (3)                
                                                                                             
                     mean           sd         mean           sd            b               t
---------------------------------------------------------------------------------------------
lead               14.320        6.166       11.968        4.651       -4.892***    (-29.982)
fhtatk              0.028        0.164        0.028        0.164                             
loglead             2.578        0.412        2.411        0.381       -0.347***    (-32.689)
highlead            0.059        0.236        0.014        0.119       -0.093***    (-13.756)
---------------------------------------------------------------------------------------------
N                    4948                      2569                      4948                
---------------------------------------------------------------------------------------------


Tharcisio Leone

Join Date: Sep 2019

Posts: 37
#3

05 Jan 2022, 13:53

Thank you for your support.
But the "eststo diff" is not working correctly. See the coefficient b in your table.

lead: 14.320 - 11.968 = 2.352 and NOT -4.892
loglead: 2.578 - 2.411 = 0.167 and NOT -0.347
highlead: 0.059 - 0.014 = 0.045 and NOT -0.093
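
A short arithmetic check may help reconcile these numbers. It assumes that column (3) of the table in #2 is the t-test between the two disjoint groups defined by "group": the 2,569 complete observations versus the 4,948 - 2,569 = 2,379 observations dropped from the second regression, whose mean of lead is reported as 16.860 in #1. Under that assumption, the whole-sample mean is a weighted average of the two group means, and b is the difference between the group means rather than between columns (1) and (2):

```stata
* whole-sample mean of lead as a weighted average of the two disjoint group means
display %6.3f (2569*11.968 + 2379*16.860)/4948
* difference between the disjoint group means, which is what ttest reports as b
display %6.3f 11.968 - 16.860
```

The first line displays 14.320, matching column (1) in #2, and the second displays -4.892, matching column (3).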

Andrew Musau

Join Date: Oct 2014
Posts: 9950

05 Jan 2022, 15:29

The variable "group" defines the disjoint groups, so you just change this in the code.

Code:

webuse nhanes2, clear
qui regress bpdiast lead
gen whole=e(sample)
qui regress bpdiast lead fhtatk loglead highlead
gen subsample= e(sample)
gen group= whole & !subsample if whole
generate nomiss= !missing(lead)
eststo nomissing: estpost summarize lead fhtatk loglead highlead if !group
eststo missing: estpost summarize lead loglead highlead if group
eststo diff: estpost ttest lead loglead highlead, by(group) unequal
esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabe

Res.:

Code:

. esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)
> ) t(pattern(0 0 1) par fmt(3))") nolabe

---------------------------------------------------------------------------------------------
                      (1)                       (2)                       (3)                
                                                                                             
                     mean           sd         mean           sd            b               t
---------------------------------------------------------------------------------------------
lead               11.968        4.651       16.860        6.581       -4.892***    (-29.982)
fhtatk              0.028        0.164                                                       
loglead             2.411        0.381        2.758        0.365       -0.347***    (-32.689)
highlead            0.014        0.119        0.108        0.310       -0.093***    (-13.756)
---------------------------------------------------------------------------------------------
N                    2569                      2379                      4948                
---------------------------------------------------------------------------------------------


Tharcisio Leone

Join Date: Sep 2019

Posts: 37
#5

05 Jan 2022, 16:29

Not really. Note that the coefficients b are wrongly calculated when we include additional variables.

Code:

webuse nhanes2, clear
qui regress bpdiast lead
gen whole=e(sample)
qui regress bpdiast lead fhtatk loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult
gen subsample= e(sample)
gen group= whole & !subsample if whole
generate nomiss= !missing(lead)
eststo nomissing: estpost summarize lead fhtatk loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult if !group
eststo missing: estpost summarize lead fhtatk loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult if group
eststo diff: estpost ttest lead loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult, by(group) unequal
esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabe

-------------------------------------------------------------------
                      (1)          (2)          (3)                
                     mean         mean            b               t
-------------------------------------------------------------------
lead               11.968       16.860       -4.892***    (-29.982)
fhtatk              0.028        0.030                             
loglead             2.411        2.758       -0.347***    (-32.689)
highlead            0.014        0.108       -0.093***    (-13.756)
sex                 2.000        1.368        0.999***   (1681.146)
race                1.142        1.144       -0.005        (-0.472)
age                47.568       47.584        0.548         (1.124)
height            161.388      169.718      -13.468***    (-67.742)
weight             66.383       73.718      -11.738***    (-28.963)
bpsystol          128.865      131.547       -3.869***     (-5.905)
heartatk            0.028        0.052       -0.029***     (-5.014)
diabetes            0.051        0.047        0.008         (1.282)
sizplace            5.156        5.169       -0.112        (-1.490)
finalwgt        11354.726    11306.505      -93.414        (-0.456)
leadwt          22944.276     7434.492      257.782         (0.656)
tcresult          220.611      216.699        7.031***      (5.055)
-------------------------------------------------------------------
N                    2569         7782         4948                
-------------------------------------------------------------------

Only some examples:
sex: 2.000 - 1.368 = 0.632 and NOT 0.999
finalwgt: 11354.726 - 11306.505 = 48.221 and NOT -93.414
tcresult: 220.611 - 216.699 = 3.912 and NOT 7.031
Andrew Musau

Join Date: Oct 2014

Posts: 9950
#6

05 Jan 2022, 18:47

Re-read my comment in #4. In that simple example, we cannot include the variable "fhtatk" as its missing values are what select our sample. With more variables, there may be overlapping patterns of missingness, but the important point to keep in mind is that a t-test requires disjoint groups. So the comparison is between jointly missing and jointly nonmissing, excluding variables where these two groups are the same.
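
A possible reason for the mismatch reported in #5, offered as a reading of the posted output rather than something confirmed in the thread: "group" is missing for observations outside the whole sample, and Stata evaluates a missing value as true in an -if- condition. So "if group" also selects those observations, which would explain why column (2) in #5 has N = 7782 rather than 4948 - 2569 = 2379, while -ttest, by(group)- ignores them. The sketch below uses the simpler regression from #2 plus two extra variables (sex and age, chosen only for illustration) and conditions explicitly on group == 0 and group == 1. It assumes estout is installed.

```stata
webuse nhanes2, clear
quietly regress bpdiast lead
generate whole = e(sample)
quietly regress bpdiast lead fhtatk loglead highlead
generate subsample = e(sample)
* group is 0 in the subsample, 1 in the whole sample only, and missing elsewhere
generate group = whole & !subsample if whole

* compare the two conditions: the first also counts observations with missing group
count if group
count if group == 1

* summaries restricted to the same two disjoint groups that ttest compares
eststo clear
eststo nomissing: quietly estpost summarize lead loglead highlead sex age if group == 0
eststo missing: quietly estpost summarize lead loglead highlead sex age if group == 1
eststo diff: quietly estpost ttest lead loglead highlead sex age, by(group) unequal
esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabel
```

With this restriction, column (1) minus column (2) should equal b for every variable, because the summaries and the t-test then use the same observations.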