  • Why does "Number of obs" in svy: tab differ by how subpop() is set?

    I am using Stata 13.1 and trying to work with survey data that have item nonresponse, but I am having a hard time figuring out how I should set subpop(),
    especially because "Number of obs" and "Population size" change in unexpected ways.

    To demonstrate my problem, I took nhanes2d from the web, modified it, and ran the same program on it. I modified the original data by
    - limiting the data to the first 5000 records,
    - introducing missing data, and
    - pretending it is a stratified random sample by overriding svyset with only strata and weights.
    I then ran "svy: tab" in three ways.

    #1 svy: tab heartatk, ci format(%9.0g) , if female==1
    #2 svy, subpop(female): tab heartatk, ci format(%9.0g)
    #3 svy, subpop(if female==1 & heartatk!=.): tab heartatk, ci format(%9.0g)

    The three commands all gave different
    - Number of obs (top right),
    - Population size, and
    - Confidence intervals (lb and ub).

    The first difference seemed natural at first, but not once I looked at the "Number of obs" in #2. Where does the "Number of obs" of 3301 come from?
    It is not the number of persons with non-missing values of "female", which is 4001.

    I'd appreciate any comments.

    Here are the results from Stata.

    Code:
    . use http://www.stata-press.com/data/r13/nhanes2d, clear
    
    . keep if _n<=5000        //  limit the sample
    (5351 observations deleted)
    
    . replace heartatk=. if _n<1000   // create missing to the analyzed variable
    (999 real changes made, 999 to missing)
    
    . replace female=. if _n>700 & _n<1700    // create missing values in the subpop() variable
    (999 real changes made, 999 to missing)
    
    . ta female heartatk, mis  // show the missing patterns
    
     1=female, |    heart attack, 1=yes, 0=no
        0=male |         0          1          . |     Total
    -----------+---------------------------------+----------
             0 |     1,437        115        329 |     1,881
             1 |     1,700         49        371 |     2,120
             . |       657         43        299 |       999
    -----------+---------------------------------+----------
         Total |     3,794        207        999 |     5,000
    
    
    . svyset  [pweight=finalwgt],strata(strata)  // pretend simple random sampling
    
          pweight: finalwgt
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: <observations>
            FPC 1: <zero>
    
    . ** Now run svy: tab in three ways
    . svy: tab heartatk, ci format(%9.0g) , if female==1
    (running tabulate on estimation sample)
    
    Number of strata   =        11                  Number of obs      =      1749
    Number of PSUs     =      1749                  Population size    =  19445224
                                                    Design df          =      1738
    
    -------------------------------------------------
    heart     |
    attack,   |
    1=yes,    |
    0=no      | proportions           lb           ub
    ----------+--------------------------------------
            0 |    .9788396     .9704661     .9848761
            1 |    .0211604     .0151239     .0295339
              |
        Total |           1                          
    -------------------------------------------------
      Key:  proportions  =  cell proportions
            lb           =  lower 95% confidence bounds for cell proportions
            ub           =  upper 95% confidence bounds for cell proportions
    
    . svy, subpop(female): tab heartatk, ci format(%9.0g)
    (running tabulate on estimation sample)
    
    Number of strata   =        11                  Number of obs      =      3301
    Number of PSUs     =      3301                  Population size    =  36690922
                                                    Subpop. no. of obs =      1749
                                                    Subpop. size       =  19445224
                                                    Design df          =      3290
    
    -------------------------------------------------
    heart     |
    attack,   |
    1=yes,    |
    0=no      | proportions           lb           ub
    ----------+--------------------------------------
            0 |    .9788396     .9704643      .984877
            1 |    .0211604      .015123     .0295357
              |
        Total |           1                          
    -------------------------------------------------
      Key:  proportions  =  cell proportions
            lb           =  lower 95% confidence bounds for cell proportions
            ub           =  upper 95% confidence bounds for cell proportions
    
    Note: 3 strata omitted because they contain no subpopulation members.
    
    . svy, subpop(if female==1 & heartatk!=.): tab heartatk, ci format(%9.0g)
    (running tabulate on estimation sample)
    
    Number of strata   =        11                  Number of obs      =      3375
    Number of PSUs     =      3375                  Population size    =  37657063
                                                    Subpop. no. of obs =      1749
                                                    Subpop. size       =  19445224
                                                    Design df          =      3364
    
    -------------------------------------------------
    heart     |
    attack,   |
    1=yes,    |
    0=no      | proportions           lb           ub
    ----------+--------------------------------------
            0 |    .9788396     .9704638     .9848773
            1 |    .0211604     .0151227     .0295362
              |
        Total |           1                          
    -------------------------------------------------
      Key:  proportions  =  cell proportions
            lb           =  lower 95% confidence bounds for cell proportions
            ub           =  upper 95% confidence bounds for cell proportions
    
    Note: 5 strata omitted because they contain no subpopulation members.

  • #2
    Welcome to Statalist, Takahiro!


    See the manual entry on subpopulation estimation for a description of what is happening. Your sample has 3375 observations, of which 1749 are in the subpopulation. Your "if" code excluded the observations not in the subpopulation. The subpop() option uses all 3375 observations in the sample but, for the standard error calculation, assigns zeros to those not in the subpopulation. Notice the difference in the CIs. The theory behind this is covered in every sampling text, e.g. Lohr (2009). The subpop() option is correct because it acknowledges that the number of subpopulation members in the sample is random, not fixed by design. So, in estimating a mean or proportion, both numerator and denominator are random, and the estimate itself is a ratio.
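
    A minimal sketch of the ratio idea, assuming the modified data and the svyset from #1 are in memory. The indicator names fem and fem_ha and the label p_ha are made up for illustration. The subpopulation proportion is written as a ratio of two estimated totals, so the point estimate should agree with the subpop(female) run; the standard errors and design df are not guaranteed to match exactly.

    Code:
    * indicators are set to missing whenever female or heartatk is missing
    gen byte fem    = (female == 1)                 if !missing(female, heartatk)
    gen byte fem_ha = (female == 1 & heartatk == 1) if !missing(female, heartatk)

    * numerator and denominator are both estimated, so both are random
    svy: ratio (p_ha: fem_ha/fem)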


    Reference: Lohr, Sharon L. 2009. Sampling: Design and Analysis. Boston, MA: Cengage Brooks/Cole.
    Last edited by Steve Samuels; 27 May 2015, 15:31.
    Steve Samuels
    Statistical Consulting

    Stata 14.2



    • #3
      Thank you, Steve, for your prompt response.

      Actually, I have read Lohr and the Stata manual, and I think I understand the reasons for the difference between #1 and #2.

      What I was wondering was:
      - why the number of observations differs between #2 and #3;
      - why the number of obs in #3 is not 5000, given that I limited the sample to 5000 records;
      - where the number 3301 in #2 comes from.

      Thanks.

      Takahiro



      • #4
        Takahiro,

        subpop() treats missing data differently depending on how you specify the subpopulation. Cross-tabulating female and heartatk, you can see that 3,301 is the number of obs with no missing values on either variable, the subpop variable female and the response heartatk (which is where the number of obs in #2 comes from).

        As you have observed, #2 and #3 differ. Probing further, this is because #2 excludes and #3 includes observations with missing values of female (see the cross-tabulations of female with the e(sample) indicators sub2 and sub3 below).

        The number of obs in #3 is not 5,000 because some obs are dropped along with their strata (see "Note: 5 strata omitted because they contain no subpopulation members."). The cross-tabs of heartatk and female by strata show why: those strata have no members in the subpop (as Stata warned), so they were taken out of the computation.
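
        A quick arithmetic check of these counts (a sketch only; it assumes the modified data from #1 are in memory and is meant to be run from a do-file, and the values in the comments are read off the tables in this thread rather than from a fresh run):

        Code:
        * complete cases on female and heartatk: 1,437 + 115 + 1,700 + 49
        count if !missing(female, heartatk)
        display 1437 + 115 + 1700 + 49     // 3301, the "Number of obs" in #2

        * #3 keeps all 5,000 obs except the 1,625 in the omitted strata
        display 5000 - 1625                // 3375, the "Number of obs" in #3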

        - joe

        Code:
        . svy, subpop(female): tab heartatk, ci format(%9.0g)
        (running tabulate on estimation sample)
        
        Number of strata   =        11                  Number of obs      =      3301
        Number of PSUs     =      3301                  Population size    =  36690922
                                                        Subpop. no. of obs =      1749
                                                        Subpop. size       =  19445224
                                                        Design df          =      3290
        
        -------------------------------------------------
        heart     |
        attack,   |
        1=yes,    |
        0=no      | proportions           lb           ub
        ----------+--------------------------------------
                0 |    .9788396     .9704643      .984877
                1 |    .0211604      .015123     .0295357
                  | 
            Total |           1                          
        -------------------------------------------------
          Key:  proportions  =  cell proportions
                lb           =  lower 95% confidence bounds for cell proportions
                ub           =  upper 95% confidence bounds for cell proportions
        
        Note: 3 strata omitted because they contain no subpopulation members.
        
        . gen sub2 = e(sample)
        
        . svy, subpop(if female==1 & heartatk!=.): tab heartatk, ci format(%9.0g)
        (running tabulate on estimation sample)
        
        Number of strata   =        11                  Number of obs      =      3375
        Number of PSUs     =      3375                  Population size    =  37657063
                                                        Subpop. no. of obs =      1749
                                                        Subpop. size       =  19445224
                                                        Design df          =      3364
        
        -------------------------------------------------
        heart     |
        attack,   |
        1=yes,    |
        0=no      | proportions           lb           ub
        ----------+--------------------------------------
                0 |    .9788396     .9704638     .9848773
                1 |    .0211604     .0151227     .0295362
                  | 
            Total |           1                          
        -------------------------------------------------
          Key:  proportions  =  cell proportions
                lb           =  lower 95% confidence bounds for cell proportions
                ub           =  upper 95% confidence bounds for cell proportions
        
        Note: 5 strata omitted because they contain no subpopulation members.
        
        . gen sub3 = e(sample)
        
        . tab female heartatk, missing
        
         1=female, |    heart attack, 1=yes, 0=no
            0=male |         0          1          . |     Total
        -----------+---------------------------------+----------
                 0 |     1,437        115        329 |     1,881 
                 1 |     1,700         49        371 |     2,120 
                 . |       657         43        299 |       999 
        -----------+---------------------------------+----------
             Total |     3,794        207        999 |     5,000 
        
        
        . tab female sub2, missing
        
         1=female, |         sub2
            0=male |         0          1 |     Total
        -----------+----------------------+----------
                 0 |       329      1,552 |     1,881 
                 1 |       371      1,749 |     2,120 
                 . |       999          0 |       999 
        -----------+----------------------+----------
             Total |     1,699      3,301 |     5,000 
        
        
        . tab female sub3, missing
        
         1=female, |         sub3
            0=male |         0          1 |     Total
        -----------+----------------------+----------
                 0 |       329      1,552 |     1,881 
                 1 |       371      1,749 |     2,120 
                 . |       925         74 |       999 
        -----------+----------------------+----------
             Total |     1,625      3,375 |     5,000 
        
        
        . tab heartatk strata if !sub3, missing
        
             heart |
           attack, |
            1=yes, |                stratum identifier, 1-32
              0=no |         1          2          3          4          5 |     Total
        -----------+-------------------------------------------------------+----------
                 0 |         0          0          0        349        239 |       588 
                 1 |         0          0          0         25         13 |        38 
                 . |       380        185        348         86          0 |       999 
        -----------+-------------------------------------------------------+----------
             Total |       380        185        348        460        252 |     1,625 
        
        
        . tab female strata if !sub3, missing
        
         1=female, |                stratum identifier, 1-32
            0=male |         1          2          3          4          5 |     Total
        -----------+-------------------------------------------------------+----------
                 0 |       188         91         50          0          0 |       329 
                 1 |       192         94         85          0          0 |       371 
                 . |         0          0        213        460        252 |       925 
        -----------+-------------------------------------------------------+----------
             Total |       380        185        348        460        252 |     1,625
        Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
        ----
        Research Fellow
        Fors Marsh

        ----
        Version 18.0 MP



        • #5
          Hi guys!

          In line with this thread, I also have a question.
          Suppose one would like to model the relationship between, say, earnings (y) and level of education (z) for men (x) in employment (t).
          The subpop obviously consists of working men.
          However, there will be missing cases on all variables.

          Is it correct to define the subpop so that it also excludes all the missing or nonresponse cases (coded -9 on education):

          Code:
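          gen flag = 1
          replace flag = 0 if y == .              // earnings missing
          replace flag = 0 if x == . | x == 0     // not male, or sex missing
          replace flag = 0 if t == 0 | t == .     // not employed, or employment missing
          replace flag = 0 if z == . | z < 0      // education missing or nonresponse (-9)

          svy, subpop(flag): regress y z

          Or should one define the subpop as working men and leave it at that: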

          Code:
          gen flag = 1
          replace flag = 0 if x == 0    // not male
          replace flag = 0 if t == 0    // not employed

          svy, subpop(flag): regress y z
          Obviously the number of observations in the subpop differs between the two, and so do the estimated SEs. Moreover, the -9 codes affect the estimates in the second scenario.

          I am inclined toward the first option, but on the old Statalist I found a note from Steve saying 'I've never seen a recommendation to consider observations with non-missing values as a subpopulation.' and I don't know how to take it.
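
          One way to see how much the choice matters is to fit both specifications and put the results side by side. This is a sketch only: flag_cc and flag_wm are hypothetical names for the two flags defined above, it assumes only system missing values (.) occur, and it does not settle which definition is appropriate.

          Code:
          * flag_cc: working men with complete data; flag_wm: working men only
          gen byte flag_cc = !missing(y, x, t, z) & x != 0 & t != 0 & z >= 0
          gen byte flag_wm = (x != 0 & t != 0)

          svy, subpop(flag_cc): regress y z
          estimates store cc

          svy, subpop(flag_wm): regress y z
          estimates store wm

          * compare coefficients, SEs, and the sample and subpop counts
          estimates table cc wm, se stats(N N_sub)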

          Best,
          Natalia
          Last edited by natalia malancu; 15 Nov 2016, 03:58.



          • #6
            Hi Joseph!

            What if I don't want Stata to take out the strata with no subpopulation members, as you described in #4: https://www.statalist.org/forums/for...58#post1295958


            Thanks
