  • Finite population correction for PSUs sampled with certainty

    Hi fellow Stata users,

    I’m analyzing a dataset from a complex survey design that samples a large portion of the population, which indicates that I should use a finite population correction (fpc). As I understand it (based on pages 172-173 of Stata’s Survey Data Reference Manual (Release 13)), Stata requires a variable containing either the proportion of PSUs sampled within each stratum or the total number of PSUs in the population within each stratum in order to calculate the fpc.

    Several strata contain only PSUs sampled with certainty, making their fpc equal to 1. When running a variety of models (chi-squared tests, regressions), I find that when I define a new stratum for each site with fpc equal to 1, significance decreases (the p-values get larger). If a PSU is sampled with certainty, I don’t understand why it would matter whether it is alone in a stratum or shares a stratum with other PSUs sampled with certainty.
    Any advice would be appreciated. Some of the significance changes are disconcertingly large.

    I’ve included some jerry-rigged and annotated code that demonstrates my issue on a simple sample dataset. All analysis is done in Stata 13.1.

    Code:
    use http://www.stata-press.com/data/r13/fpc, clear
    list
    gen Nh2=nh/5
    generate double y = runiform()
    
    * I use the variable Nh2 to calculate the finite population correction.
    * This variable contains 5 obs. sampled with certainty and 3 at a rate of .6
    svyset psuid [pweight=weight], strata(stratid) fpc(Nh2)
    svy:reg y x
    
    * Now I create a new strata variable where each obs. sampled with certainty
    * is given its own stratum.
    gen strata2=stratid
    replace strata2=strata2+y if stratid==1
    
    svyset psuid [pweight=weight], strata(strata2) fpc(Nh2)
    svy:reg y x
    
    * Between the first and second model, we see that the p-values are
    * consistently larger when employing the modified strata variable.

  • #2
    I give the results of running your code below. I only altered the name of the fpc variable to "fpc1" and gave your new stratum variable more readable values. Here are my comments:

    1. Estimates and standard errors are identical for the two analyses

    2. The only difference is the "Design df". Your first design (one stratum for 5 certainty units) has 2 strata and 8 PSUs. Stata's degrees of freedom calculator uses the formula:


    df = (# Strata ) - (# PSUs)


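    Your second design, the one specified by the Stata manual, has 6 strata and 8 PSUs, so that df = 2. The 95% t multiplier for 2 degrees of freedom is 4.303; for 6 degrees of freedom it is 2.447. Using the 6-degrees-of-freedom design has therefore apparently shrunk the t multiplier, and with it the confidence-interval half-widths, by 43%. The t-statistics themselves are unchanged; only the reference distribution used for p-values and intervals differs.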
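
    The two multipliers can be checked directly in Stata. This is a minimal sketch, assuming a two-sided 95% interval, so the upper-tail probability passed to invttail() is 0.025:

    Code:
    * 95% two-sided t multipliers for the two reported design df
    display "df = 2: " %6.3f invttail(2, 0.025)
    display "df = 6: " %6.3f invttail(6, 0.025)
    * proportional reduction in the multiplier when df goes from 2 to 6
    display "reduction: " %4.2f (1 - invttail(6, 0.025)/invttail(2, 0.025))

    Dividing the half-width of either confidence interval for x in the output by the standard error .0020237 should recover the corresponding multiplier.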

    However, that decrease is an illusion. The correct degrees of freedom for both designs is 2. By putting multiple certainty units into one stratum, contrary to the Stata manual specification, you've fooled Stata's df calculator.

    Look more closely at the degrees of freedom calculation: Stratum \(h\) contributes \(n_h - 1\) degrees of freedom, where \(n_h\) is the number of PSUs in the stratum. However, that is correct only if the sampling units in the stratum have random variation. But in the stratum of certainty units only, there is no random variation, and the contribution to the overall standard error is zero. The only random variation is in Stratum 2, which contributes 3 - 1 = 2 degrees of freedom to both designs.
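
    The stratum-by-stratum count can be sketched in a few lines. This is only an illustration, assuming one observation per PSU (as in this example dataset) and treating any stratum with a sampling fraction of 1 as contributing zero degrees of freedom; the variable names n_h, first and df_h are made up for the sketch:

    Code:
    use http://www.stata-press.com/data/r13/fpc, clear
    gen fpc1 = nh/5
    * number of sampled PSUs in each stratum (one row per PSU assumed)
    bysort stratid: gen n_h = _N
    * flag one row per stratum so each stratum is counted once
    bysort stratid: gen byte first = (_n == 1)
    * certainty strata (fpc1 == 1) have no sampling variation: contribute 0
    gen df_h = cond(fpc1 < 1, n_h - 1, 0) if first
    quietly summarize df_h
    display "df from non-certainty strata = " r(sum)

    With the original stratid, only Stratum 2 has fpc1 below 1, so the sum is 3 - 1 = 2.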

    svyset has a dof() option, which allows you to specify degrees of freedom by hand. When would one need this? In some multi-stage designs, one might have insufficient degrees of freedom for some analyses. One "fix" is to designate certainty units as "pseudo-strata" only; the second-stage units in those strata are then elevated to "pseudo-PSUs", which will often be much smaller than other, non-certainty PSUs. This is the recommendation of Korn and Graubard, 1999, p. 207. The recommendation is that each of the "pseudo-strata" contribute 1 degree of freedom.
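
    Applied to the example in this thread, the option could be used as follows. This is a sketch only: it keeps the five certainty units in a single stratum and assumes, per the argument above, that 2 is the appropriate value; the seed is arbitrary and is there just to make y reproducible.

    Code:
    use http://www.stata-press.com/data/r13/fpc, clear
    gen fpc1 = nh/5
    set seed 12345
    generate double y = runiform()
    * one stratum of certainty units, with the design df set by hand
    svyset psuid [pweight=weight], strata(stratid) fpc(fpc1) dof(2)
    svy: regress y x

    The intent is that p-values and confidence intervals are then based on 2 degrees of freedom, whichever stratum variable is used.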




    Reference:


    Korn, Edward Lee, and Barry I Graubard. 1999. Analysis of health surveys. New York: Wiley.
    Code:
     
    . use http://www.stata-press.com/data/r13/fpc, clear
    . gen fpc1=nh/5
    . list
         +-------------------------------------------------+
         | stratid   psuid   weight   nh   Nh     x   fpc1 |
         |-------------------------------------------------|
      1. |       1       1        3    5   15   2.8      1 |
      2. |       1       2        3    5   15   4.1      1 |
      3. |       1       3        3    5   15   6.8      1 |
      4. |       1       4        3    5   15   6.8      1 |
      5. |       1       5        3    5   15   9.2      1 |
         |-------------------------------------------------|
      6. |       2       1        4    3   12   3.7     .6 |
      7. |       2       2        4    3   12   6.6     .6 |
      8. |       2       3        4    3   12   4.2     .6 |
         +-------------------------------------------------+
     
    . generate double y = runiform()
     
    . * I use the variable fpc1 to calculate the finite population correction.
    . * This variable contains 5 obs. sampled with certainty and 3 at a rate of .6
    . svyset psuid [pweight=weight], strata(stratid) fpc(fpc1)
    
    . svy:reg y x
    
    
    Number of strata   =         2                  Number of obs     =          8
    Number of PSUs     =         8                  Population size   =         27
                                                    Design df         =          6
                                                    F(   1,      6)   =     967.02
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.2886
    ------------------------------------------------------------------------------
                 |             Linearized
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               x |  -.0629301   .0020237   -31.10   0.000    -.0678819   -.0579784
           _cons |   1.009205   .0242993    41.53   0.000      .949747    1.068663
    ------------------------------------------------------------------------------
    . * Now I create a new strata variable where each obs. sampled with certainty
    . * is given a unique strata.
    
    . gen strata2=stratid
    . bys stratid: replace strata2 = _n + .5 if stratid==1
    . list
    
         +-----------------------------------------------------------------------+
         | stratid   psuid   weight   nh   Nh     x   fpc1           y   strata2 |
         |-----------------------------------------------------------------------|
      1. |       1       1        3    5   15   2.8      1   .79708925       1.5 |
      2. |       1       2        3    5   15   4.1      1   .78358534       2.5 |
      3. |       1       3        3    5   15   6.8      1    .6546342       3.5 |
      4. |       1       4        3    5   15   6.8      1   .09688907       4.5 |
      5. |       1       5        3    5   15   9.2      1   .68850586       5.5 |
         |-----------------------------------------------------------------------|
      6. |       2       1        4    3   12   3.7     .6   .87249602         2 |
      7. |       2       2        4    3   12   6.6     .6   .52963527         2 |
      8. |       2       3        4    3   12   4.2     .6   .83022092         2 |
         +-----------------------------------------------------------------------+
    
    . svyset psuid [pweight=weight], strata(strata2) fpc(fpc1)
    
    . svy:reg y x
    Number of strata   =         6                  Number of obs     =          8
    Number of PSUs     =         8                  Population size   =         27
                                                    Design df         =          2
                                                    F(   1,      2)   =     967.02
                                                    Prob > F          =     0.0010
                                                    R-squared         =     0.2886
    
    ------------------------------------------------------------------------------
                 |             Linearized
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               x |  -.0629301   .0020237   -31.10   0.001    -.0716373   -.0542229
           _cons |   1.009205   .0242993    41.53   0.001     .9046538    1.113757
    ------------------------------------------------------------------------------
    Last edited by Steve Samuels; 24 Aug 2015, 09:10.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      Nick privately pointed out to me that I had the dof formula backwards. It should be:

      df = (# PSUs ) - (# Strata)

      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2
