  • Oaxaca Blinder decomposition - different number of observations with svy

    Dear all,
    I'm using the oaxaca decomposition command in Stata 18 with the svy subpopulation option.
    I noticed that some observations are being excluded from the models within both Group 1 and Group 2. Additionally, I observed that in the decomposition model, the number of observations is the total sample size rather than the intended subpopulation size.
    My subpopulation's sample size is 14,367, with Group 1 (lower education) comprising 2,192 observations and Group 2 constituting 12,175 observations.
    When I run the decomposition without the svy option, I get the correct subpopulation sample size in all the models.
    I would greatly appreciate your guidance and advice regarding this issue.
    Here are the numbers of observations reported by the regression for each group and by the oaxaca decomposition.

    Thanks in advance for your help.


    Total sample

    Code:
    svy, subpop(if subpop==1): logistic self age sex income badl visit eat prot_d2 dent
    Code:
    Survey: Logistic regression

    Number of strata =   574                        Number of obs   =       90,846
    Number of PSUs   = 8,027                        Population size =  168,426,190
                                                    Subpop. no. obs =       14,367
                                                    Subpop. size    = 21,722,187.6
                                                    Design df       =        7,453
                                                    F(8, 7446)      =        75.68
                                                    Prob > F        =       0.0000
    Group 1
    Code:
    svy, subpop(if subpop==1 & ses==0): logistic self age sex income badl visit eat prot_d2 dent
    Code:
    Number of strata =   457                        Number of obs   =       80,899
    Number of PSUs   = 7,085                        Population size =  135,901,805
                                                    Subpop. no. obs =        2,192
                                                    Subpop. size    = 2,505,225.62
                                                    Design df       =        6,628
                                                    F(8, 6621)      =        22.36
                                                    Prob > F        =       0.0000
    Group 2
    Code:
    svy, subpop(if subpop==1 & ses==1): logistic self age sex income badl visit eat prot_d2 dent
    Code:
    Number of strata =   573                         Number of obs   =      90,789
    Number of PSUs   = 8,022                         Population size = 168,374,254
                                                     Subpop. no. obs =      12,175
                                                     Subpop. size    =  19,216,962
                                                     Design df       =       7,449
                                                     F(8, 7442)      =       55.80
                                                     Prob > F        =      0.0000
    Here is the code I used for the decomposition:

    Code:
    oaxaca self age sex income badl visit eat prot_d2 dent, ///
    by(ses) logit weight(0) svy(,subpop(subpop)) noisily cformat(%4.3f)

    Here are the numbers of observations reported by the decomposition:


    Code:
    Model for group 1
    (running logit on estimation sample)
    
    Survey: Logistic regression
    
    Number of strata =   456                        Number of obs   =       80,842
    Number of PSUs   = 7,080                        Population size =  135,849,869
                                                    Subpop. no. obs =        2,183
                                                    Subpop. size    = 2,497,605.89
                                                    Design df       =        6,624
                                                    F(8, 6617)      =        22.28
                                                    Prob > F        =       0.0000
    
    Note: 117 strata omitted because they contain no subpopulation members.
    
    Model for group 2
    (running logit on estimation sample)
    
    Survey: Logistic regression
    
    Number of strata =   456                         Number of obs   =      80,842
    Number of PSUs   = 7,080                         Population size = 135,849,869
                                                     Subpop. no. obs =      10,365
                                                     Subpop. size    =  14,703,327
                                                     Design df       =       6,624
                                                     F(8, 6617)      =       49.14
                                                     Prob > F        =      0.0000
    
    Blinder-Oaxaca decomposition
    
    Number of strata =   456                       Number of obs     =      80,842
    Number of PSUs   = 7,080                       Population size   = 135,849,869
                                                   Design df         =       6,624
                                                   Model             =       logit
    Group 1: ses = 0                               N of obs 1        =       7,353
    Group 2: ses = 1                               N of obs 2        =      73,489
    
        explained: (X1 - X2) * b2
      unexplained: X1 * (b1 - b2)
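
A quick arithmetic check on the counts reported above, written as Stata `display` calls. This is a sketch for orientation only, not a statement about how oaxaca computes these fields: the two stand-alone subpopulation groups add up to the intended subpopulation, while "N of obs 1" and "N of obs 2" add up to the decomposition's full Number of obs.

```stata
* Subpopulation groups from the stand-alone svy models: 2,192 + 12,175
display 2192 + 12175
* "N of obs 1" + "N of obs 2" from the decomposition header: 7,353 + 73,489
display 7353 + 73489
* Subpop. no. obs reported inside the two oaxaca group models: 2,183 + 10,365
display 2183 + 10365
```

The first sum is 14,367 (the intended subpopulation), the second is 80,842 (the decomposition's Number of obs), and the third is 12,548, which is 1,819 fewer than the intended subpopulation.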


  • #2
    If you look at ereturn list, you'll see the actual number of observations in the subpop: e(N_sub)
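
A minimal sketch of that check, assuming the auto.dta example data and a trivial survey design in which every observation is its own PSU. Both are illustrative choices, not the original poster's data or design.

```stata
* Illustrative only: auto.dta with a trivial survey design
sysuse auto, clear
svyset _n                                   // each observation is its own PSU
svy, subpop(foreign): regress price mpg weight

* Stored results: e(N) is the estimation sample, e(N_sub) the subpopulation
ereturn list
display "Number of obs   = " e(N)
display "Subpop. no. obs = " e(N_sub)
```

After a svy estimation command with subpop(), e(N) counts every observation used for variance estimation, while e(N_sub) counts only the subpopulation members.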



    • #3
      Dear George,

      Thank you for your assistance.

      I followed your suggestion and examined the ereturn list results, and the number of observations in the groups corresponds to what is shown in the output. However, the decomposition seems to be using the overall sample instead of the subpopulation, and some observations are being removed from the groups.

      If you have any further guidance, it would be greatly appreciated.

      Best regards



      • #4
        I used the auto.dta file.

        ereturn list

        shows the e(N_sub)

        I see that oaxaca does not provide this though.

        Not sure why observations are being deleted, or if they actually are. Hmmm?
        [Attached image: subpop001.jpg]

        Last edited by George Ford; 03 Sep 2023, 13:08.



        • #5
          What is the definition of "subpop" in the svy(, subpop(subpop)) part?

          g



          • #6
            Dear George,
            Thank you for the follow-up.
            Regarding the "subpop" variable, it indicates the observations to be included in my analyses.
            It appears there is an inconsistency between versions of the oaxaca command. I reran the analyses with a previous version, which yielded the correct number of observations.
            Thank you very much for your assistance.
            Best regards,

            Code:
            Model for group 1
            (running logit on estimation sample)
             
            Survey: Logistic regression
             
            Number of strata =   457                        Number of obs   =       80,899
            Number of PSUs   = 7,085                        Population size =  135,901,805
                                                            Subpop. no. obs =        2,192
                                                            Subpop. size    = 2,505,225.62
                                                            Design df       =        6,628
                                                            F(8, 6621)      =        22.36
                                                            Prob > F        =       0.0000
             
             
            Model for group 2
            (running logit on estimation sample)
             
            Survey: Logistic regression
             
            Number of strata =   573                         Number of obs   =      90,789
            Number of PSUs   = 8,022                         Population size = 168,374,254
                                                             Subpop. no. obs =      12,175
                                                             Subpop. size    =  19,216,962
                                                             Design df       =       7,449
                                                             F(8, 7442)      =       55.80
                                                             Prob > F        =      0.0000
             
            Blinder-Oaxaca decomposition
             
            Number of strata =   574                         Number of obs   =      90,846
            Number of PSUs   = 8,027                         Population size = 168,426,190
                                                             Design df       =       7,453
                                                            Model              =     logit
            Group 1: escol2_oaxaca = 0                      N of obs 1         =      2192
            Group 2: escol2_oaxaca = 1                      N of obs 2         =     12175
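
Since the two sets of output differ only in which version of oaxaca produced them, it may help to record the installed version next to each run. A sketch using Stata's built-in `which` and `ado update` commands; it assumes oaxaca was installed through Stata's package manager (for example from SSC), which the thread does not state.

```stata
* Show the path and version header of the oaxaca.ado that Stata will run
which oaxaca

* Report whether a newer oaxaca is available from its install source
* (reports only; nothing is installed unless the update option is added)
ado update oaxaca
```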



            • #7
              Dear @George Ford,

              I recently updated the oaxaca command and I'm encountering the same issue as before regarding the number of observations in my analyses. I'm using Stata 18.

              My subpopulation has 22,728 observations, with group 1 consisting of 18,011 individuals and group 2 of 4,717.

              However, when I run the Oaxaca decomposition, I observe discrepancies in the number of individuals. Specifically:
              • In the model for group 1, I notice a loss in the number of observations for the subpopulation. The expected count should be 18,011, but I only see 16,990.
              • In the decomposition model, the total number of observations for both groups does not add up correctly. It should reflect 18,011 for group 1 and 4,717 for group 2.
              Could you please provide guidance on how to resolve this issue?

              Thank you for your assistance.

              Here are the subpopulation size and the number of individuals in each of the groups I want to analyze.

              Code:
               svy, subpop(if v6>=60 & v7==1): tab group, obs percent format (%12.1f)
               
              Number of strata =   574                        Number of obs   =       90,846
              Number of PSUs   = 8,027                        Population size =  168,426,190
                                                              Subpop. no. obs =       22,728
                                                              Subpop. size    =   34,398,853
                                                              Design df       =        7,453

              group                         percentage        obs
              Some schooling (group 1)            83.2    18011.0
              No schooling (group 2)              16.8     4717.0
              Total                              100.0    22728.0
              Code:
                 oaxaca outcome v1 v2 v3 v4, by(group) logit weight(1) svy(,subpop(if v6>=60 & v7==1)) noisily
              Code:
                
              Model for group 1
              (running logit on estimation sample)

              Survey: Logistic regression

              Number of strata =   522                        Number of obs   =       87,235
              Number of PSUs   = 7,672                        Population size =  153,627,429
                                                              Subpop. no. obs =       16,990
                                                              Subpop. size    = 25,504,979.2
                                                              Design df       =        7,150
                                                              F(4, 7147)      =       191.74
                                                              Prob > F        =       0.0000

              Model for group 2
              (running logit on estimation sample)

              Survey: Logistic regression

              Number of strata =   522                        Number of obs   =       87,235
              Number of PSUs   = 7,672                        Population size =  153,627,429
                                                              Subpop. no. obs =        4,717
                                                              Subpop. size    = 5,787,182.59
                                                              Design df       =        7,150
                                                              F(4, 7147)      =         8.87
                                                              Prob > F        =       0.0000

              Blinder-Oaxaca decomposition

              Number of strata =   522                        Number of obs   =       87,235
              Number of PSUs   = 7,672                        Population size =  153,627,429
                                                              Design df       =        7,150
                                                              Model           =        logit
              Group 1: group = 0                              N of obs 1      =       79,622
              Group 2: group = 1                              N of obs 2      =        7,613
              If I run the logit model separately for each of the groups, I get the correct numbers:

              Code:
                svy, subpop(if v6>=60 & v7==1 & group==1): logit outcome v1 v2 v3 v4
              Code:
                
              Number of strata =   522                        Number of obs   =       87,235
              Number of PSUs   = 7,672                        Population size =  153,627,429
                                                              Subpop. no. obs =        4,717
                                                              Subpop. size    = 5,787,182.59
                                                              Design df       =        7,150
                                                              F(4, 7147)      =         8.87
                                                              Prob > F        =       0.0000
              Code:
               svy, subpop(if v6>=60 & v7==1 & group==0): logit outcome v1 v2 v3 v4
              Code:
              Number of strata =   574                        Number of obs   =       90,846
              Number of PSUs   = 8,027                        Population size =  168,426,190
                                                              Subpop. no. obs =       18,011
                                                              Subpop. size    = 28,611,670.4
                                                              Design df       =        7,453
                                                              F(4, 7450)      =       172.14
                                                              Prob > F        =       0.0000

              The same is true when I run oaxaca without the svy option: although the number of observations for group 1 is lower, the subpopulation sample is correct.
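
One way to investigate the dropped observations is sketched below. It rests on an assumption that the thread does not confirm: that the decomposition runs both group models on a single shared estimation sample. This is suggested only by both oaxaca group models reporting the same 522 strata and 87,235 observations. The sketch uses the variable names from the commands above, assumes the data are already svyset, and introduces two hypothetical marker variables, in_g0 and in_g1.

```stata
* Stand-alone model for group==0; mark its estimation sample
svy, subpop(if v6>=60 & v7==1 & group==0): logit outcome v1 v2 v3 v4
generate byte in_g0 = e(sample)

* Stand-alone model for group==1; mark its estimation sample
svy, subpop(if v6>=60 & v7==1 & group==1): logit outcome v1 v2 v3 v4
generate byte in_g1 = e(sample)

* Subpopulation members of each group that fall inside BOTH estimation samples
count if v6>=60 & v7==1 & group==0 & in_g0 & in_g1
count if v6>=60 & v7==1 & group==1 & in_g0 & in_g1

* Cross-tabulate the two markers within the subpopulation
tabulate in_g0 in_g1 if v6>=60 & v7==1, missing
```

If the first count comes out near 16,990 rather than 18,011, that would be consistent with the shared-sample reading. It would not by itself show which behaviour of oaxaca is intended.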

