
  • svy melogit on mi data ignoring survey weights

    I am trying to do a multilevel logit analysis on survey data that is mi set. I have no trouble running regular logistic regressions on this data using several different weights, but Stata seems to be ignoring the weights when I use melogit instead of logit. I have pasted the code below, showing that the output of melogit is identical whether I run it on the data without svy, or on svyset data with three completely different weight variables.

    Am I doing something incorrect in the way that I run the multilevel logit models that could explain why melogit seems to be ignoring the survey weights? Or is there another approach that someone could recommend that might allow me to do this analysis?

    At the moment I am running Stata/MP 14.0 for Mac, in case that is relevant.


    Code:
     mi estimate, or: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Mixed-effects logistic regression               Number of obs     =      9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
    DF adjustment:   Large sample                   DF:     min       =          .
                                                            avg       =          .
                                                            max       =          .
    Model F test:       Equal FMI                   F(   1,      .)   =       3.12
    Within VCE type:          OIM                   Prob > F          =     0.0771
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333      .2025     1.77   0.077     .9708002    1.776723
                 _cons |   17.01076   1.407229    34.26   0.000     14.46463    20.00507
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1253634     6.38   0.000     .5535961    1.045012
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.
    
    
    mi svyset _n [pweight=newweight], vce(linearized) singleunit(missing)
    
          pweight: newweight
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    mi estimate, or: svy linearized: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Survey: Mixed-effects logistic regression       Number of obs     =      9,484
    
    Number of strata  =         1                   Population size   =      9,484
    Number of PSUs    =     9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
                                                    Complete DF       =       9483
    DF adjustment:   Small sample                   DF:     min       =   9,481.00
                                                            avg       =   9,481.00
                                                            max       =   9,481.00
    Model F test:       Equal FMI                   F(   1, 9481.0)   =       2.74
    Within VCE type:   Linearized                   Prob > F          =     0.0977
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333   .2161077     1.66   0.098     .9512453    1.813248
                 _cons |   17.01076   1.451358    33.21   0.000     14.39097    20.10748
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1299294     6.15   0.000     .5446145    1.053993
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.
    
    mi svyset _n [pweight=normweightsubdl], vce(linearized) singleunit(missing)
    
          pweight: normweightsubdl
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    mi estimate, or: svy linearized: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Survey: Mixed-effects logistic regression       Number of obs     =      9,484
    
    Number of strata  =         1                   Population size   =      9,484
    Number of PSUs    =     9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
                                                    Complete DF       =       9483
    DF adjustment:   Small sample                   DF:     min       =   9,481.00
                                                            avg       =   9,481.00
                                                            max       =   9,481.00
    Model F test:       Equal FMI                   F(   1, 9481.0)   =       2.74
    Within VCE type:   Linearized                   Prob > F          =     0.0977
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333   .2161077     1.66   0.098     .9512453    1.813248
                 _cons |   17.01076   1.451358    33.21   0.000     14.39097    20.10748
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1299294     6.15   0.000     .5446145    1.053993
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.
    
    
    mi svyset _n [pweight=newweightmlm], vce(linearized) singleunit(missing)
    
          pweight: newweightmlm
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    mi estimate, or: svy linearized: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Survey: Mixed-effects logistic regression       Number of obs     =      9,484
    
    Number of strata  =         1                   Population size   =      9,484
    Number of PSUs    =     9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
                                                    Complete DF       =       9483
    DF adjustment:   Small sample                   DF:     min       =   9,481.00
                                                            avg       =   9,481.00
                                                            max       =   9,481.00
    Model F test:       Equal FMI                   F(   1, 9481.0)   =       2.74
    Within VCE type:   Linearized                   Prob > F          =     0.0977
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333   .2161077     1.66   0.098     .9512453    1.813248
                 _cons |   17.01076   1.451358    33.21   0.000     14.39097    20.10748
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1299294     6.15   0.000     .5446145    1.053993
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.


  • #2
    In case this helps, here is some more information:

    I've tried these commands on the original data (before multiple imputation) to see if I run into the same issue. When I run svy: melogit on data that is not mi set, the standard errors do change in a way that seems to reflect the weights, but the point estimates are identical (see code and output below). When I run the same models with logit instead of melogit, the point estimates are not identical, so it seems unlikely to me that the point estimates should be completely identical with and without the weights when using melogit. (I have displayed the output with odds ratios below, but I get the same pattern when the or option is not specified.)

    Another issue I noticed is that melogit does not correctly report the population size--it lists the number of observations as the population size, whereas the correct population size (the one correctly displayed by logit) is 144,133. I wonder whether this is related, and also whether it reflects a problem with the way melogit calculates the standard errors (which seem large to me in the weighted example below).

    Code:
    svyset _n [pweight=normweightsubdl], vce(linearized) singleunit(missing)
    
          pweight: normweightsubdl
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    svy linearized: melogit success fully  || course:, or
    (running melogit on estimation sample)
    
    Survey: Mixed-effects logistic regression
    
    Number of strata   =         1                  Number of obs     =      9,484
    Number of PSUs     =     9,484                  Population size   =      9,484
                                                    Design df         =      9,483
                                                    F(   1,   9483)   =       2.84
                                                    Prob > F          =     0.0922
    
    -------------------------------------------------------------------------------
                  |             Linearized
          success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
            fully |   1.415809   .2923007     1.68   0.092     .9445989     2.12208
            _cons |   11.14855   1.111488    24.19   0.000      9.16948    13.55478
    --------------+----------------------------------------------------------------
    coursecollege |
        var(_cons)|   .1528434    .059602                      .0711663    .3282609
    -------------------------------------------------------------------------------
    
    melogit success fully  || course:, or
    
    Fitting fixed-effects model:
    
    Iteration 0:   log likelihood = -2912.1919  
    Iteration 1:   log likelihood = -2864.7363  
    Iteration 2:   log likelihood = -2864.4718  
    Iteration 3:   log likelihood = -2864.4713  
    Iteration 4:   log likelihood = -2864.4713  
    
    Refining starting values:
    
    Grid node 0:   log likelihood = -2812.9432
    
    Fitting full model:
    
    Iteration 0:   log likelihood = -2812.9432  (not concave)
    Iteration 1:   log likelihood = -2808.5592  (not concave)
    Iteration 2:   log likelihood = -2804.8099  
    Iteration 3:   log likelihood =  -2803.315  
    Iteration 4:   log likelihood = -2803.0731  
    Iteration 5:   log likelihood = -2803.0697  
    Iteration 6:   log likelihood = -2803.0697  
    
    Mixed-effects logistic regression               Number of obs     =      9,484
    Group variable:   coursecollege                 Number of groups  =         19
    
                                                    Obs per group:
                                                                  min =          1
                                                                  avg =      499.2
                                                                  max =        965
    
    Integration method: mvaghermite                 Integration pts.  =          7
    
                                                    Wald chi2(1)      =       5.89
    Log likelihood = -2803.0697                     Prob > chi2       =     0.0153
    -------------------------------------------------------------------------------
          success | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
            fully |   1.415809   .2028878     2.43   0.015     1.069119    1.874922
            _cons |   11.14855    1.17918    22.80   0.000     9.061231    13.71671
    --------------+----------------------------------------------------------------
    coursecollege |
        var(_cons)|   .1528434   .0594787                      .0712858    .3277104
    -------------------------------------------------------------------------------
    LR test vs. logistic model: chibar2(01) = 122.80      Prob >= chibar2 = 0.0000



    • #3
      I don't know what's going on, but I have some thoughts:

      1. I suspect that the weights have been normalized to sum to the sample size. This would account for identical population and sample counts. You can investigate by running a non-survey total on the weight variables.

      2. Very similar weights might account for the uniform standard errors in your first post. How different are the weights? You might try corr and graph matrix to see.

      Maybe not relevant to your problem, but:

      3. Your svyset statements indicate that the study had a single-stage, non-clustered, non-stratified design; that is, each random number picked a single individual. However, you analyze a two-level model, with the higher-level unit named "coursecollege". If, in fact, the random numbers selected these units, then the svyset statement is incorrect.

      4. The singleunit() option to svyset is needed only for stratified designs.
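
      For points 1 and 2, the checks could look something like this (a sketch using the weight variable names from the original post, not output from the actual data):

      Code:
      * Point 1: do the weights sum to the sample size (normalized)
      * or to the population size?
      total newweight normweightsubdl newweightmlm

      * Point 2: how similar are the three weight variables?
      correlate newweight normweightsubdl newweightmlm
      graph matrix newweight normweightsubdl newweightmlm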
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2



      • #4
        Thanks so much for your reply, Steve--I have been pulling my hair out for the last three days trying to make sense of this and can't proceed with the analysis until I work it out, so I am very grateful for your help! To respond to your thoughts:

        1. The weights have not been normalized to sum to the sample size--they have been normalized to sum to the population size. When I run the identical analysis using logit instead of melogit, the population size reflected by the analysis is correct. It is only melogit that seems to report the sample size for the population size.

        2. How similar would the weights have to be in order to produce uniform standard errors? I don't think that my weights are similar enough to produce this result (but I'm not a statistician, so perhaps I am wrong?). I have pasted summary information about two different weight variables below, in case that helps. What seems suspicious to me is that it is only with melogit that all the output is identical regardless of weights. With logit, point estimates and standard errors are very different, depending upon the weights, and I can't think of a plausible reason why this would happen with logit but not melogit (but again, I'm not a statistician, so maybe I'm overlooking something).

        3. This data is analyzing course outcomes for students who took different courses. The multilevel model is not to control for survey structure, but rather to control for unobserved heterogeneity in outcomes by the specific course that the student took (since the outcomes of students who took the same course would be correlated). Students weren't sampled in a second stage from the courses, but rather from the whole population of students who took any of the courses. So I believe the way I have set things up is correct, but if not, please feel free to let me know!

        4. Thanks, I will correct this. I think I was using the menu option instead of typing the code in this instance, and I think Stata added this automatically? I will remove it and type code manually.



        weight variable 1:
        Code:
                         range:  [10.60849,50.345139]         units:  1.000e-07
                 unique values:  7,842                    missing .:  0/9,484
        
                          mean:   15.1975
                      std. dev:   2.13416
        
                   percentiles:        10%       25%       50%       75%       90%
                                   13.2182   14.1555   15.0918   15.9148   16.7836
        weight variable 2:
        Code:
                         range:  [1.0004365,798.32281]        units:  1.000e-07
                 unique values:  9,405                    missing .:  0/9,484
        
                          mean:   4.66952
                      std. dev:   21.7418
        
                   percentiles:        10%       25%       50%       75%       90%
                                   1.00326   1.00819   1.02387    1.0687   5.71433



        • #5
          I meant to add to my last post that I am not just concerned about the standard errors being identical across weight variables, but also the point estimates. Based on the differences in point estimates that I see when I run logit models, I expected to see at least some difference in the point estimates when I use melogit with versus without the weights (for some logit models, the direction of the relationship shifts dramatically when the weights are used).



          • #6
            Claire has found a bug in gsem and the me commands that
            support the svy prefix. The fix will be available in the next
            update to Stata 14.

            In Claire's example, the problem is that svy: melogit is failing
            to exit with an error indicating that the weights are not constant
            within the levels of the course variable. The reason for this is
            that

            Code:
            svyset _n [pweight=normweightsubdl], vce(linearized) singleunit(missing)
            specifies that the data were sampled without strata or clusters in a
            single level, yet the model

            Code:
            svy linearized: melogit success fully  || course:, or
            indicates that observations within the levels of the course
            variable are somehow clustered. svy linearized tries to line up
            the random effect levels with the sampling stages, but there is only
            one sampling stage, and the bug is preventing it from acknowledging the
            sampling weights. After the fix, svy: melogit will exit with an
            error indicating that normweightsubdl is not constant within the
            levels of the course variable.

            In this situation, svy: melogit requires the analyst to line up
            the model's random effects with the survey's sampling stages, so if
            Claire's sampling weights are truly observation-level weights and there
            are no sampling units that course is nested within, then the
            correct svyset syntax would be

            Code:
            svyset course || _n, weight(normweightsubdl) vce(linearized) singleunit(missing)



            • #7
              By the way, the bug I mention has to do with the user specifying weights
              using the weight expression clause, instead of using the weight() option
              in the associated sampling stages.
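
              To illustrate the distinction (a sketch using the weight variable from Claire's posts): the first form supplies a single final weight through the weight expression clause, which triggered the bug, while the second supplies the weight through the stage's weight() option, which is what the multilevel estimators expect.

              Code:
              * Final-weight [pw=exp] syntax that triggered the bug:
              svyset _n [pweight=normweightsubdl], vce(linearized)

              * Stage-level weight() syntax, with course as the
              * first-stage unit:
              svyset course || _n, weight(normweightsubdl) vce(linearized)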



              • #8
                Just realized the bug I'm referring to was fixed in the 22apr2015 update.

                Claire should verify that her Stata 14 is fully up-to-date.

                If this problem persists, even after verifying that Stata is fully up-to-date, I encourage
                Claire to write to tech-support with an example do-file and dataset that reproduces
                the problem.



                • #9
                  You apparently have found a bug which accounts for the aberrant behavior--good eye! My major questions are about other issues:

                  I still don't understand your design fully. In your second set of results there are 19 groups (courses). I interpret your post to mean that you took a random sample of students from a population of about 144,133 unique students who took any of 19 courses. Is this right? How do you account for a student who takes more than one course?

                  A hallmark of a weight variable is that the total is an estimate of the population size. The totals for your two weights (found by multiplying 9,484 x the mean) are 144,148, which is close to the total of 144,133 you quote earlier, and 44,285. What is the second weight?
                  Last edited by Steve Samuels; 02 Jun 2015, 17:34.



                  • #10
                    Thanks for the update, Jeff--this is very helpful. I am having convergence problems with the model when it is run with the weights right now, so I can't confirm that it works just yet, but I will report back after I've made some progress. I just installed Stata on this computer from scratch about a week ago, so I am pretty sure it is up to date, but I will check, and if Stata is in fact up to date and still has this problem, I will contact Stata technical support as you suggest.



                    • #11
                      Steve, thanks also for writing again. The number of groups is actually 1118 (I'm not sure why it showed up as 19 in that output, and I can't reproduce it--when I run that command, I get 1118 total groups in the dataset, so it seems to be correct now). Yes, the sample was selected from the 144,133 students who took any one of the 1118 courses. Students may have taken more than one of these courses, but they were randomly asked only about one of the courses in which they were enrolled, so that no single student is represented more than once in the dataset, and the number of courses that a student took that were on our course list did not impact their chance of participation.

                      You are right about the weight variables not summing correctly according to those summary statistics. I'm not sure why that is; I can only presume that I pasted summary data for the wrong variables here (I redid the weights a few times, to normalize them, but also because a few subjects had to be dropped from the sampling frame because of corrected information about course enrollments). I just rechecked the weights, and they both sum to 144,133 as they should, so that was probably my mistake in running the summary information. The first weight is the sampling weight, and the second is a propensity-score-adjusted weight (that incorporates the sampling weights as well).



                      • #12
                        This all makes sense. I'll just note a phenomenon that could bias analyses of course comparisons. Because you selected one course at random for each student, students who took higher numbers of courses were undersampled for any specific course C. Suppose a student takes k courses. The standard weighting solution in this situation is to multiply the original sampling weight by k. Another approach is to include k as a covariate, which might be interesting in its own right.
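
                        A sketch of the re-weighting idea, assuming a hypothetical variable k holding the number of listed courses each student took (the weight variable name is taken from the earlier posts):

                        Code:
                        * Standard fix: inflate the original sampling weight
                        * by k, so multi-course students are no longer
                        * undersampled for any given course:
                        generate adjweight = newweight * k

                        * Alternative: keep the original weight and include
                        * k as a covariate in the model:
                        melogit success fully k || course: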



                        • #13
                          Dear all,

                          I have a somewhat related problem. I was puzzled to see that the combination of mi data, svy weights, and melogit is possible. I tried to do the same with my data and got the following error message:
                          Code:
                          survey final weights not allowed with multilevel models;
                              a final weight variable was svyset using the [pw=exp] syntax, but multilevel models require that each stage-level weight variable is svyset using the stage's corresponding
                              weight() option
                          an error occurred when svy executed melogit
                          an error occurred when mi estimate executed svy:melogit on m=1
                          r(459);
                          I set my data with:
                          Code:
                          mi svyset aw [pweight=w1], strata(p2) fpc(g2)
                          The syntax I use is:
                          Code:
                          mi est, post cmdok: svy: melogit mm_25_s4 i.d_eth_s4 i.p_du_s4 d_pov_s14 i.p_bsc_s14 i.p_bnvq_s4  i.inv_s4 i.p_sy_s4  c.p_age_sample_s1##c.p_age_sample_s1 i.p_sex_s4  ||sserial: ||tserial:
                          I've previously tried it without the cmdok option, but Stata tells me to use it if I don't (svy:melogit is not officially supported by mi estimate; you can use option cmdok to allow estimation anyway).


                          I am working with Stata 15, in case that is relevant.

                          I'd greatly appreciate all help!



                          • #14
                            As the error message stated, you need a weight for each stage-level, including the first stage of sampling PSUs, your aw. See the section on Survey data in the Manual entry for meglm. At levels where there was no sampling (all units were selected), use weight variables equal to 1.
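
                            As a sketch of this (using the variable names from the previous post, and assuming sserial and tserial are the school- and class-level identifiers at which all units were selected, so their stage weights are 1):

                            Code:
                            * Stage weights of 1 where no sampling took place:
                            generate byte w_stage2 = 1
                            generate byte w_stage3 = 1

                            mi svyset aw, weight(w1) strata(p2) fpc(g2) || sserial, weight(w_stage2) || tserial, weight(w_stage3)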

                            By the way, I hope that before you did your imputation modeling, you saw this statement in the MI manual section "Introduction to multiple-imputation analysis":
                            In the survey context, all structural variables such as sampling weights, strata, and cluster identifiers (or at least main strata and main clusters) need to be included in the imputation model. In general, any predictors involved in the definition of the completed-data estimators and the sampling design should be included in the imputation model
                            Last edited by Steve Samuels; 24 Apr 2018, 14:39.



                            • #15
                              Dear Steve,


                              thank you so much for your reply!
                              I've just read the section in the ME manual, and it was very helpful, but I am still a bit unsure whether what I am doing is even conceptually feasible.

                              I have variables that account for the survey design, i.e. non-response and dropout, and I correct for these when svysetting my data. The multilevel model I am trying to estimate considers the nesting of students in classes in schools. I cannot quite work out how to include svy design weights for classes and schools, as I have no information on this. Could you maybe give me a clue?

                              Thank you for the reminder about the MI inclusion. I did include all the design variables in my imputation models. In a very first attempt I also considered additional information that drives these design weights (i.e., being a lone parent, unemployed, a renter, or young: the factors driving non-response), yet most of them were dropped. I then concluded that, as my weights are final weights, all these generating factors are already accounted for. Am I wrong here?

                              Cheers, Evelyn

