  • Estpost summarize missing and whole sample

    Dear Statalisters,

    I suspect that the missing values in my dataset are NOT missing completely at random. I will check this assumption later with -mcartest-.

    But before that, I would like to create a table reporting the means for the whole sample and for the final estimation sample. (You can reproduce this yourself in Stata as follows.)
    Code:
    webuse nhanes2
    misstable summarize bpdiast lead fhtatk loglead highlead

    qui regress bpdiast lead
    eststo whole
    qui regress bpdiast lead fhtatk loglead highlead
    eststo subsample
    esttab whole subsample 
    // Note that N decreased from 4,948 to 2,569

    --------------------------------------------
                          (1)             (2)
                      bpdiast         bpdiast
    --------------------------------------------
    lead                0.251***       -0.194
                       (8.45)         (-0.83)

    fhtatk                              2.648
                                       (1.69)

    loglead                             6.488*
                                       (2.43)

    highlead                            2.208
                                       (0.78)

    _cons               78.11***        66.75***
                     (169.00)         (17.45)
    --------------------------------------------
    N                    4948            2569
    --------------------------------------------
    t statistics in parentheses
    * p<0.05, ** p<0.01, *** p<0.001



    generate missing = 0
    replace missing = 1 if lead==. | fhtatk==. | loglead==. | highlead==.

    eststo nomissing: quietly estpost summarize lead fhtatk loglead highlead if missing == 0
    eststo missing: quietly estpost summarize lead fhtatk loglead highlead if missing == 1
    eststo diff: quietly estpost ttest lead fhtatk loglead highlead, by(missing) unequal
    esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabel

    ---------------------------------------------------------------------------------------------
                          (1)                       (2)                       (3)

                         mean           sd         mean           sd            b               t
    ---------------------------------------------------------------------------------------------
    lead               11.968        4.651       16.860        6.581       -4.892***    (-29.982)
    fhtatk              0.028        0.164        0.030        0.172       -0.003        (-0.599)
    loglead             2.411        0.381        2.758        0.365       -0.347***    (-32.689)
    highlead            0.014        0.119        0.108        0.310       -0.093***    (-13.756)
    ---------------------------------------------------------------------------------------------
    N                    2569                      5244                      7813                
    --------------------------------------------------------------------------------------------- 

    Note that the table above compares two groups (missing = 0 vs. missing = 1), which is not exactly what I am looking for.
    I would like to create a table in which the first group is the whole sample, that is, the first estimation with N = 4,948, while the second group remains the same (second estimation with N = 2,569). The first rows could be similar to:

    Code:
    eststo nomissing: quietly estpost summarize lead fhtatk loglead highlead
    eststo missing: quietly estpost summarize lead fhtatk loglead highlead if missing == 1
    eststo diff: quietly estpost ttest ????

    Does anyone have an idea how I can change the "eststo diff" line to achieve this goal?

  • #2
    estout is from the Stata Journal/SSC (FAQ Advice #12). The entire sample and the subsample are not disjoint, so a comparison of the two makes no sense. However, I suppose that you want to compare the observations that have missing values with the observations that have none, as these two groups are disjoint. Note that in your example, the variable "fhtatk" is what selects the samples, so you exclude it from the test.

    Code:
    webuse nhanes2, clear
    qui regress bpdiast lead
    gen whole=e(sample)
    qui regress bpdiast lead fhtatk loglead highlead
    gen subsample= e(sample)
    gen group= whole & !subsample if whole
    generate nomiss= !missing(lead)
    eststo nomissing: estpost summarize lead fhtatk loglead highlead if whole
    eststo missing: estpost summarize lead fhtatk loglead highlead if subsample
    eststo diff: estpost ttest lead loglead highlead, by(group) unequal
    esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabel
    Res.:

    Code:
    . esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)
    > ) t(pattern(0 0 1) par fmt(3))") nolabel
    
    ---------------------------------------------------------------------------------------------
                          (1)                       (2)                       (3)                
                                                                                                 
                         mean           sd         mean           sd            b               t
    ---------------------------------------------------------------------------------------------
    lead               14.320        6.166       11.968        4.651       -4.892***    (-29.982)
    fhtatk              0.028        0.164        0.028        0.164                             
    loglead             2.578        0.412        2.411        0.381       -0.347***    (-32.689)
    highlead            0.059        0.236        0.014        0.119       -0.093***    (-13.756)
    ---------------------------------------------------------------------------------------------
    N                    4948                      2569                      4948                
    ---------------------------------------------------------------------------------------------


    • #3
      Thank you for your support.
      But the "eststo diff" is not working correctly. See the coefficient b in your table.

      lead: 14.320 - 11.968 = 2.352 and NOT -4.892
      loglead: 2.578 - 2.411 = 0.167 and NOT -0.347
      highlead: 0.059 - 0.014 = 0.045 and NOT -0.093


      • #4
        The variable "group" defines the disjoint groups, so you just change this in the code.

        Code:
        webuse nhanes2, clear
        qui regress bpdiast lead
        gen whole=e(sample)
        qui regress bpdiast lead fhtatk loglead highlead
        gen subsample= e(sample)
        gen group= whole & !subsample if whole
        generate nomiss= !missing(lead)
        eststo nomissing: estpost summarize lead fhtatk loglead highlead if !group
        eststo missing: estpost summarize lead loglead highlead if group
        eststo diff: estpost ttest lead loglead highlead, by(group) unequal
        esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabe
        Res.:

        Code:
        . esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(3)
        > ) t(pattern(0 0 1) par fmt(3))") nolabe
        
        ---------------------------------------------------------------------------------------------
                              (1)                       (2)                       (3)                
                                                                                                     
                             mean           sd         mean           sd            b               t
        ---------------------------------------------------------------------------------------------
        lead               11.968        4.651       16.860        6.581       -4.892***    (-29.982)
        fhtatk              0.028        0.164                                                       
        loglead             2.411        0.381        2.758        0.365       -0.347***    (-32.689)
        highlead            0.014        0.119        0.108        0.310       -0.093***    (-13.756)
        ---------------------------------------------------------------------------------------------
        N                    2569                      2379                      4948                
        ---------------------------------------------------------------------------------------------
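
        The b column and the differences computed in #3 can be reconciled with a little arithmetic. The whole-sample mean is the size-weighted average of the two disjoint group means, so "whole minus subsample" is the disjoint-group difference rescaled by the share of observations in the dropped group. A minimal check for a do-file, using the rounded values for lead from the tables in this thread (the scalar names are only illustrative):

        ```stata
        * rounded values for lead, copied from the tables in this thread
        scalar n0 = 2569      // observations in the estimation subsample (group == 0)
        scalar n1 = 2379      // observations dropped from the whole sample (group == 1)
        scalar m0 = 11.968    // mean of lead in the subsample
        scalar m1 = 16.860    // mean of lead among the dropped observations

        * whole-sample mean as a weighted average of the two group means
        display %6.3f (n0*m0 + n1*m1)/(n0 + n1)

        * whole-sample mean minus subsample mean
        display %6.3f (n1/(n0 + n1))*(m1 - m0)
        ```

        Up to rounding, these should reproduce the whole-sample mean of 14.320 reported in #2 and the difference of 2.352 computed in #3, so the two sets of numbers answer different questions rather than contradict each other.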


        • #5
          Not really. Note that the coefficients b are wrongly calculated when we include additional variables.
          Code:
          webuse nhanes2, clear
          qui regress bpdiast lead
          gen whole=e(sample)
          qui regress bpdiast lead fhtatk loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult
          gen subsample= e(sample)
          gen group= whole & !subsample if whole
          generate nomiss= !missing(lead)

          eststo nomissing: estpost summarize lead fhtatk loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult if !group
          eststo missing: estpost summarize lead fhtatk loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult if group
          eststo diff: estpost ttest lead loglead highlead sex race age height weight bpsystol heartatk diabetes sizplace finalwgt leadwt tcresult, by(group) unequal
          esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabe
          -------------------------------------------------------------------
                                (1)          (2)          (3)

                               mean         mean            b               t
          -------------------------------------------------------------------
          lead               11.968       16.860       -4.892***    (-29.982)
          fhtatk              0.028        0.030
          loglead             2.411        2.758       -0.347***    (-32.689)
          highlead            0.014        0.108       -0.093***    (-13.756)
          sex                 2.000        1.368        0.999***   (1681.146)
          race                1.142        1.144       -0.005        (-0.472)
          age                47.568       47.584        0.548         (1.124)
          height            161.388      169.718      -13.468***    (-67.742)
          weight             66.383       73.718      -11.738***    (-28.963)
          bpsystol          128.865      131.547       -3.869***     (-5.905)
          heartatk            0.028        0.052       -0.029***     (-5.014)
          diabetes            0.051        0.047        0.008         (1.282)
          sizplace            5.156        5.169       -0.112        (-1.490)
          finalwgt        11354.726    11306.505      -93.414        (-0.456)
          leadwt          22944.276     7434.492      257.782         (0.656)
          tcresult          220.611      216.699        7.031***      (5.055)
          -------------------------------------------------------------------
          N                    2569         7782         4948                
          ------------------------------------------------------------------- 
          Only some examples:
          sex: 2.000 - 1.368 = 0.632 and NOT 0.999
          finalwgt: 11354.726 - 11306.505 = 48.221 and NOT -93.414
          tcresult: 220.611 - 216.699 = 3.912 and NOT 7.031


          • #6
            Re-read my comment in #4. In that simple example, we cannot include the variable "fhtatk" as its missing values are what select our sample. With more variables, there may be overlapping patterns of missingness, but the important point to keep in mind is that a t-test requires disjoint groups. So the comparison is between jointly missing and jointly nonmissing, excluding variables where these two groups are the same.
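
            A further point that may be worth checking in the code in #5: column (2) there reports N = 7782, more than the 4,948 observations of the whole sample. This is what one would expect if the condition "if group" also selects observations where group is missing, since Stata treats a missing numeric value as nonzero and therefore as true. Variables such as sex that are observed for everyone would then be summarized over a larger set of observations than the one used by ttest. A minimal sketch, assuming that this is the cause; it defines the subsample as in #4 and uses only two of the variables for brevity:

            ```stata
            webuse nhanes2, clear
            quietly regress bpdiast lead
            generate whole = e(sample)
            quietly regress bpdiast lead fhtatk loglead highlead
            generate subsample = e(sample)
            generate group = whole & !subsample if whole

            * "if group" keeps group == 1 and also group == . ; compare the counts
            count if group
            count if group == 1
            count if missing(group)

            * restrict each summary explicitly to one of the two disjoint groups
            eststo clear
            eststo nomissing: quietly estpost summarize age height if group == 0
            eststo missing: quietly estpost summarize age height if group == 1
            eststo diff: quietly estpost ttest age height, by(group) unequal
            esttab nomissing missing diff, cells("mean(pattern(1 1 0) fmt(3)) b(star pattern(0 0 1) fmt(3)) t(pattern(0 0 1) par fmt(3))") nolabel
            ```

            With the explicit conditions, the means in columns (1) and (2) and the b column are computed on the same two groups, so b should equal the difference of the two displayed means up to rounding.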
