  • svy, subpop and svy with interactions give different results?

    Dear all

    I am having a problem with the svy prefix that I cannot explain. More specifically, I svyset my data and then run a regression of this kind:

    svy: glm outcome i.var1##i.var2 var3 var4 var5, family(gamma) link(identity)
    margins r.var1, over(var2)

    Please note that both var1 and var2 are binary. However, when I run:

    tab var2, gen(var2)

    svy, subpop(var21): glm outcome i.var1 var3 var4 var5, family(gamma) link(identity)
    svy, subpop(var22): glm outcome i.var1 var3 var4 var5, family(gamma) link(identity)

    I get considerably different coefficients for var1. Is this normal? Shouldn't both approaches produce the same results?

    Thank you

    Cynthia

  • #2
    It's hard to answer this without seeing the results. Could you please place your results (including the code) between code tags so that the output can be read easily? See FAQ entry 12 for how to do this if you are not familiar with code tags.
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago



    • #3
      Thank you for your reply, Dick. I am now attaching a summary of the output produced.

      As you can see, the coefficient on 1.var1 is -310 in the interaction model (the var2 = 0 contrast) but -351 in the first subpop model. My question is: shouldn't these two be the same, just as the estimate for the other category is about -340 in both cases?


      Code:
      . svy: glm outcome i.var1##i.var2 $othervars, family (gamma) link (identity)
      (running glm on estimation sample)
      
      Survey: Generalized linear models
      
      Number of strata   =         1                  Number of obs      =     35034
      Number of PSUs     =       177                  Population size    = 242314.21
                                                      Design df          =       176
      
      ----------------------------------------------------------------------------------------
                             |             Linearized
                     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -----------------------+----------------------------------------------------------------
                      1.var1 |  -310.6841   55.57934    -5.59   0.000    -420.3718   -200.9963
                      1.var2 |  -922.5135    32.8154   -28.11   0.000    -987.2759   -857.7512
                             |
                   var1#var2 |
                        1 1  |  -29.72786   50.93206    -0.58   0.560     -130.244    70.78831
                             |
      (rest of table omitted)
      
      . margins r.var1, over(var2)
      
      Contrasts of predictive margins
      
      Number of strata   =         0
      Number of PSUs     =         0
      Model VCE    : Linearized
      
      Expression   : Predicted mean outcome, predict()
      over         : var2
      
      ------------------------------------------------
                   |         df        chi2     P>chi2
      -------------+----------------------------------
         var1@var2 |
       (1 vs 0) 0  |          1       31.25     0.0000
       (1 vs 0) 1  |          1      102.99     0.0000
            Joint  |          2      104.71     0.0000
      ------------------------------------------------
      
      --------------------------------------------------------------
                   |            Delta-method
                   |   Contrast   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
         var1@var2 |
       (1 vs 0) 0  |  -310.6841   55.57934     -419.6176   -201.7506
       (1 vs 0) 1  |   -340.412   33.54348      -406.156   -274.6679
      --------------------------------------------------------------
      
      . tab var2, gen(var2)
      
             var2 |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |     14,577       41.61       41.61
                1 |     20,457       58.39      100.00
      ------------+-----------------------------------
            Total |     35,034      100.00
      
      
      . svy, subpop(var21): glm outcome i.var1 $othervars, family (gamma) link (identity)
      (running glm on estimation sample)
      
      Survey: Generalized linear models
      
      Number of strata   =         1                  Number of obs      =     35034
      Number of PSUs     =       177                  Population size    = 242314.21
                                                      Subpop. no. of obs =     14577
                                                      Subpop. size       =  128538.3
                                                      Design df          =       176
      
      ----------------------------------------------------------------------------------------
                             |             Linearized
                     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -----------------------+----------------------------------------------------------------
                      1.var1 |   -351.876   53.56435    -6.57   0.000    -457.5871    -246.165
                             |
      (rest of table omitted)
      
      
      . svy, subpop(var22): glm outcome i.var1 $othervars, family (gamma) link (identity)
      (running glm on estimation sample)
      
      Survey: Generalized linear models
      
      Number of strata   =         1                  Number of obs      =     35034
      Number of PSUs     =       177                  Population size    = 242314.21
                                                      Subpop. no. of obs =     20457
                                                      Subpop. size       = 113775.91
                                                      Design df          =       176
      
      ----------------------------------------------------------------------------------------
                             |             Linearized
                     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -----------------------+----------------------------------------------------------------
                      1.var1 |   -340.086   34.02309   -10.00   0.000    -407.2318   -272.9403
                             |
      (rest of table omitted)



      • #4
        First, look at your first model, which contains an interaction. Leaving out the constant and the other covariates, the equation is:

        outcome = b1(var1) + b2(var2) + b3(var1*var2).

        When var2 = 0, the effect of var1 is b1 = -310.6841, because the third term drops out.

        When var2 = 1, the effect of var1, after combining terms, is:

        b1 + b3 = -310.6841 - 29.72786 = -340.41196.

        This is very close to the coefficient obtained for 1.var1 in your second subpop estimate.
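
        The hand calculation above can also be done with lincom, which reports a standard error as well. This is a minimal sketch on the NHANES II example data rather than on the data in this thread; the dataset and the variable names (bpsystol, black, male) are stand-ins chosen for illustration, and the dataset is assumed to arrive already svyset.

        ```stata
        * Illustration only: nhanes2 stands in for the poster's data
        webuse nhanes2, clear
        generate byte male = sex == 1

        * Interaction model: the 1.black coefficient is the effect of black when male == 0
        svy: regress bpsystol i.male##i.black

        * Effect of black when male == 1: main effect plus interaction (b1 + b3)
        lincom 1.black + 1.male#1.black
        ```

        With the model in this thread, the analogous command would be lincom 1.var1 + 1.var1#1.var2.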

        However, I am puzzled by your code. The tab statement generates var2, although that variable appears to exist already, as shown in the previous glm statement, so I don't understand why you don't get an error. Then you run two glm commands with the subpop option, referring to variables var21 and var22. The latter model, as I said, produces an estimate for var1 that I would expect given the first glm model. However, I don't understand exactly what is going on because I don't know what var21 and var22 are. This may well be a result of my ignorance of some Stata feature, but you will have to explain it to me.
        Richard T. Campbell
        Emeritus Professor of Biostatistics and Sociology
        University of Illinois at Chicago



        • #5
          Thank you for taking the time to look into this Dick.

          var2 has two categories, say 0 = Manual and 1 = Automatic. The tab statement breaks var2 into two indicator variables: var21, which flags Manual (0 = no, 1 = yes), and var22, which flags Automatic (0 = no, 1 = yes).
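
          A short sketch of what tabulate with generate() creates, and of an alternative way to name the subpopulation. It uses the NHANES II example data as a stand-in (an assumption; the thread's data are not available), where sex is coded 1 and 2.

          ```stata
          * Illustration only: nhanes2 stands in for the poster's data
          webuse nhanes2, clear

          * Creates sex1 (1 if sex == 1) and sex2 (1 if sex == 2)
          tabulate sex, generate(sex)

          * Cross-check the new indicator against the original variable
          tabulate sex sex1

          * Subpopulation given by the generated indicator ...
          svy, subpop(sex1): regress bpsystol i.black

          * ... or directly by an if condition, with no extra variables
          svy, subpop(if sex == 1): regress bpsystol i.black
          ```

          The two svy commands select the same subpopulation, so they should report the same estimates.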

          All results make sense except those from the first glm with the subpop option which produces a coefficient of -351. This should not be happening. The coefficient should be around -310.
          Last edited by Cynthia Inglesias; 16 Mar 2016, 15:15.



          • #6
            Oh my, of course that is how tab, gen() works. My mistake, it's been so long since I used this feature that I had forgotten how it works.

            Here is an example of a simple regression where things work as advertised. However, your setup is not quite the same as mine because you have other variables in the equation, so the interaction model you run is not the same as the two subpop models. In the former, you are assuming that the effects of the other variables are the same within the two groups, since you do not include interaction terms for them. In the subpop models, you are assuming that the effects of all of the variables in the model differ by group; that is, you are assuming a saturated model. If you drop the other variables from the model you should get what you expect to see.

            When I did the calculations in my first note I was a bit puzzled as to why I got -340.41196 from the hand calculation (I should have used lincom) when the subpop analysis gave -340.086. I was in a hurry and I wrote it off to differences in calculation methods or something, but that is not the case. I think that if you drop the other variables you will get exactly what you should see, and that if you include all of the possible interactions in your first model you will replicate the subpop analysis.
            Code:
            . use http://www.stata-press.com/data/r14/nhanes2.dta
            
            . gen male = sex == 1
            
            . svy: reg bpsys i.male##i.black
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =        31                Number of obs     =       10,351
            Number of PSUs     =        62                Population size   =  117,157,513
                                                          Design df         =           31
                                                          F(   3,     29)   =        33.44
                                                          Prob > F          =       0.0000
                                                          R-squared         =       0.0190
            
            ------------------------------------------------------------------------------
                         |             Linearized
                bpsystol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  1.male |   6.006678   .5842816    10.28   0.000     4.815028    7.198328
                 1.black |   3.288238   1.533527     2.14   0.040     .1605885    6.415888
                         |
              male#black |
                    1 1  |   -2.79918   1.778945    -1.57   0.126    -6.427363    .8290032
                         |
                   _cons |   123.8742   .7445479   166.38   0.000     122.3557    125.3927
            ------------------------------------------------------------------------------
            
            . tab male, gen(male)
            
                   male |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      0 |      5,436       52.52       52.52
                      1 |      4,915       47.48      100.00
            ------------+-----------------------------------
                  Total |     10,351      100.00
            
            . svy, subpop(male1): reg bpsys i.black
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =        31                Number of obs     =       10,351
            Number of PSUs     =        62                Population size   =  117,157,513
                                                          Subpop. no. obs   =        5,436
                                                          Subpop. size      =   60,998,033
                                                          Design df         =           31
                                                          F(   1,     31)   =         4.60
                                                          Prob > F          =       0.0400
                                                          R-squared         =       0.0019
            
            ------------------------------------------------------------------------------
                         |             Linearized
                bpsystol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                 1.black |   3.288238   1.533527     2.14   0.040     .1605885    6.415888
                   _cons |   123.8742   .7445479   166.38   0.000     122.3557    125.3927
            ------------------------------------------------------------------------------
            
            . svy, subpop(male2): reg bpsys i.black
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =        31                Number of obs     =       10,351
            Number of PSUs     =        62                Population size   =  117,157,513
                                                          Subpop. no. obs   =        4,915
                                                          Subpop. size      =   56,159,480
                                                          Design df         =           31
                                                          F(   1,     31)   =         0.24
                                                          Prob > F          =       0.6258
                                                          R-squared         =       0.0001
            
            ------------------------------------------------------------------------------
                         |             Linearized
                bpsystol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                 1.black |   .4890588   .9928028     0.49   0.626    -1.535776    2.513894
                   _cons |   129.8809   .6551705   198.24   0.000     128.5446    131.2171
            ------------------------------------------------------------------------------
            Richard T. Campbell
            Emeritus Professor of Biostatistics and Sociology
            University of Illinois at Chicago
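
            The point about the other variables can be sketched by adding one covariate to the example above. Here age is a hypothetical stand-in for $othervars, and the data are again the NHANES II example file (fetched with webuse on the assumption that it matches the file used above and arrives svyset).

            ```stata
            * Illustration only: age stands in for the thread's $othervars
            webuse nhanes2, clear
            generate byte male = sex == 1

            * Pooled model: the age slope is constrained to be equal for men and women
            svy: regress bpsystol i.male##i.black c.age
            lincom 1.black + 1.male#1.black

            * Fully interacted model: every coefficient may differ by group
            svy: regress bpsystol i.male##(i.black c.age)
            lincom 1.black + 1.male#1.black

            * Subpopulation model for men, to compare with the lincom results
            svy, subpop(male): regress bpsystol i.black c.age
            ```

            If the argument in this post holds, the coefficient on 1.black in the subpop model should agree with the lincom estimate from the fully interacted model, and will generally differ from the one based on the pooled model.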



            • #7
              Thanks for this, Dick. You are right: when I remove $othervars the results are the same. However, when I include all interactions with i.var1 I get different results. Should I also include all interactions with i.var2 to get the same results, or even all interactions between covariates (e.g. var4*var6)? I have 9 covariates in total (var1 and var2 inclusive). Any guidance on this would be greatly appreciated.

              PS: I suspect the overall interaction test for i.var1##i.var2 is also misleading?



              • #8
                "You are right, when i remove $othervars the results are the same. However, when i include all interactions with i.var1 i get different results."

                Cynthia, I suspect if you posted your exact code and output we would see that you are still doing something wrong that keeps the results from being identical. But even if you are right we can't comment without seeing exactly what you did and what you got.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  Hi Richard

                  At first I merely ran the same regression with all the interactions between i.var1 and the rest of the variables. But this was wrong, because the interactions need to be with the grouping variable, i.var2:

                  Code:
                  . svy: glm outcome i.var1##($othervars), family (gamma) link (identity)
                  
                  should be
                  
                  . svy: glm outcome i.var1##i.var2 i.var2##($othervars), family (gamma) link (identity)

                  So now I do get the same results. May I ask whether the overall interaction should be tested (testparm i.var1#i.var2) only when the latter, saturated model is used?
                  Last edited by Cynthia Inglesias; 17 Mar 2016, 02:12.
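
                  For reference, a sketch of how the interaction test could be run after each specification so that the two results can be compared side by side. It uses the NHANES II example data with age and weight as hypothetical stand-ins for $othervars; which specification is appropriate is a modelling question that this sketch does not settle.

                  ```stata
                  * Illustration only: age and weight stand in for the thread's $othervars
                  webuse nhanes2, clear
                  generate byte male = sex == 1

                  * Interaction test with covariate effects held equal across groups
                  svy: regress bpsystol i.black##i.male c.age c.weight
                  testparm i.black#i.male

                  * Interaction test with covariate effects allowed to differ by group
                  svy: regress bpsystol i.black##i.male i.male##(c.age c.weight)
                  testparm i.black#i.male
                  ```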

