
  • svy melogit on mi data ignoring survey weights

    I am trying to do a multilevel logit analysis on survey data that is mi set. I have no trouble running regular logistic regressions on this data using several different weights, but Stata seems to be ignoring the weights when I use melogit instead of logit. I have pasted the code below, showing that the output of melogit is identical whether I run it on the data without svy, or on svyset data with three completely different weight variables.

    Am I doing something incorrect in the way that I run the multilevel logit models that could explain why melogit seems to be ignoring the survey weights? Or is there another approach that someone could recommend that might allow me to do this analysis?

    At the moment I am running Stata/MP 14.0 for Mac, in case that is relevant.


    Code:
     mi estimate, or: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Mixed-effects logistic regression               Number of obs     =      9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
    DF adjustment:   Large sample                   DF:     min       =          .
                                                            avg       =          .
                                                            max       =          .
    Model F test:       Equal FMI                   F(   1,      .)   =       3.12
    Within VCE type:          OIM                   Prob > F          =     0.0771
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333      .2025     1.77   0.077     .9708002    1.776723
                 _cons |   17.01076   1.407229    34.26   0.000     14.46463    20.00507
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1253634     6.38   0.000     .5535961    1.045012
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.
    
    
    mi svyset _n [pweight=newweight], vce(linearized) singleunit(missing)
    
          pweight: newweight
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    mi estimate, or: svy linearized: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Survey: Mixed-effects logistic regression       Number of obs     =      9,484
    
    Number of strata  =         1                   Population size   =      9,484
    Number of PSUs    =     9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
                                                    Complete DF       =       9483
    DF adjustment:   Small sample                   DF:     min       =   9,481.00
                                                            avg       =   9,481.00
                                                            max       =   9,481.00
    Model F test:       Equal FMI                   F(   1, 9481.0)   =       2.74
    Within VCE type:   Linearized                   Prob > F          =     0.0977
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333   .2161077     1.66   0.098     .9512453    1.813248
                 _cons |   17.01076   1.451358    33.21   0.000     14.39097    20.10748
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1299294     6.15   0.000     .5446145    1.053993
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.
    
    mi svyset _n [pweight=normweightsubdl], vce(linearized) singleunit(missing)
    
          pweight: normweightsubdl
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    mi estimate, or: svy linearized: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Survey: Mixed-effects logistic regression       Number of obs     =      9,484
    
    Number of strata  =         1                   Population size   =      9,484
    Number of PSUs    =     9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
                                                    Complete DF       =       9483
    DF adjustment:   Small sample                   DF:     min       =   9,481.00
                                                            avg       =   9,481.00
                                                            max       =   9,481.00
    Model F test:       Equal FMI                   F(   1, 9481.0)   =       2.74
    Within VCE type:   Linearized                   Prob > F          =     0.0977
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333   .2161077     1.66   0.098     .9512453    1.813248
                 _cons |   17.01076   1.451358    33.21   0.000     14.39097    20.10748
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1299294     6.15   0.000     .5446145    1.053993
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.
    
    
    mi svyset _n [pweight=newweightmlm], vce(linearized) singleunit(missing)
    
          pweight: newweightmlm
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    mi estimate, or: svy linearized: melogit success fully  || course:
    
    Multiple-imputation estimates                   Imputations       =         35
    Survey: Mixed-effects logistic regression       Number of obs     =      9,484
    
    Number of strata  =         1                   Population size   =      9,484
    Number of PSUs    =     9,484
                                                    Average RVI       =     0.0000
                                                    Largest FMI       =     0.0000
                                                    Complete DF       =       9483
    DF adjustment:   Small sample                   DF:     min       =   9,481.00
                                                            avg       =   9,481.00
                                                            max       =   9,481.00
    Model F test:       Equal FMI                   F(   1, 9481.0)   =       2.74
    Within VCE type:   Linearized                   Prob > F          =     0.0977
    
    ------------------------------------------------------------------------------------
               success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    success            |
                 fully |   1.313333   .2161077     1.66   0.098     .9512453    1.813248
                 _cons |   17.01076   1.451358    33.21   0.000     14.39097    20.10748
    -------------------+----------------------------------------------------------------
    var(_cons[course]) |
                 _cons |    .799304   .1299294     6.15   0.000     .5446145    1.053993
    ------------------------------------------------------------------------------------
    Note: Number of groups varies among imputations.
    Note: Number of observations per group varies among imputations.


  • #2
    In case this helps, here is some more information:

    I've tried these commands on the original data (before multiple imputation) to see if I run into the same issue. When I run svy: melogit on data that is not mi set, the standard errors do change in a way that seems to reflect the weights, but the point estimates are identical (see code and output below). When I run the same models with logit instead of melogit, the point estimates are not identical, so it seems unlikely to me that the point estimates should be completely identical with and without the weights when using melogit. (I have displayed the output with odds ratios below, but I get the same pattern when the or option is not specified.)

    Another issue I noticed is that melogit does not correctly report the population size--it lists the number of observations as the population size, whereas the correct population size (the one correctly displayed by logit) is 144,133. I wonder whether this is related, and also whether it reflects a problem with the way melogit calculates the standard errors (which seem large to me in the weighted example below).

    Code:
    svyset _n [pweight=normweightsubdl], vce(linearized) singleunit(missing)
    
          pweight: normweightsubdl
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    svy linearized: melogit success fully  || course:, or
    (running melogit on estimation sample)
    
    Survey: Mixed-effects logistic regression
    
    Number of strata   =         1                  Number of obs     =      9,484
    Number of PSUs     =     9,484                  Population size   =      9,484
                                                    Design df         =      9,483
                                                    F(   1,   9483)   =       2.84
                                                    Prob > F          =     0.0922
    
    -------------------------------------------------------------------------------
                  |             Linearized
          success | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
            fully |   1.415809   .2923007     1.68   0.092     .9445989     2.12208
            _cons |   11.14855   1.111488    24.19   0.000      9.16948    13.55478
    --------------+----------------------------------------------------------------
    coursecollege |
        var(_cons)|   .1528434    .059602                      .0711663    .3282609
    -------------------------------------------------------------------------------
    
    melogit success fully  || course:, or
    
    Fitting fixed-effects model:
    
    Iteration 0:   log likelihood = -2912.1919  
    Iteration 1:   log likelihood = -2864.7363  
    Iteration 2:   log likelihood = -2864.4718  
    Iteration 3:   log likelihood = -2864.4713  
    Iteration 4:   log likelihood = -2864.4713  
    
    Refining starting values:
    
    Grid node 0:   log likelihood = -2812.9432
    
    Fitting full model:
    
    Iteration 0:   log likelihood = -2812.9432  (not concave)
    Iteration 1:   log likelihood = -2808.5592  (not concave)
    Iteration 2:   log likelihood = -2804.8099  
    Iteration 3:   log likelihood =  -2803.315  
    Iteration 4:   log likelihood = -2803.0731  
    Iteration 5:   log likelihood = -2803.0697  
    Iteration 6:   log likelihood = -2803.0697  
    
    Mixed-effects logistic regression               Number of obs     =      9,484
    Group variable:   coursecollege                 Number of groups  =         19
    
                                                    Obs per group:
                                                                  min =          1
                                                                  avg =      499.2
                                                                  max =        965
    
    Integration method: mvaghermite                 Integration pts.  =          7
    
                                                    Wald chi2(1)      =       5.89
    Log likelihood = -2803.0697                     Prob > chi2       =     0.0153
    -------------------------------------------------------------------------------
          success | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
            fully |   1.415809   .2028878     2.43   0.015     1.069119    1.874922
            _cons |   11.14855    1.17918    22.80   0.000     9.061231    13.71671
    --------------+----------------------------------------------------------------
    coursecollege |
        var(_cons)|   .1528434   .0594787                      .0712858    .3277104
    -------------------------------------------------------------------------------
    LR test vs. logistic model: chibar2(01) = 122.80      Prob >= chibar2 = 0.0000



    • #3
      I don't know what's going on, but I have some thoughts:

      1. I suspect that the weights have been normalized to sum to the sample size. This would account for identical population and sample counts. You can investigate by running a non-survey total on the weight variables.

      2. Very similar weights might account for the uniform standard errors in your first post. How different are the weights? You might try corr and graph matrix to see.

      Maybe not relevant to your problem, but:

      3. Your svyset statements indicate that the study had a single-stage, non-clustered, non-stratified design; that is, each random number picked a single individual. However, you analyze a two-level model, with the higher-level unit named "coursecollege". If, in fact, the random numbers selected these units, then the svyset statement is incorrect.

      4. The singleunit() option to svyset is needed only for stratified designs.
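
      For points 1 and 2, the checks could look something like this (a sketch using the weight variable names from the original post, not output from the actual data):

      Code:
      * Point 1: do the weights sum to the sample size (normalized)
      * or to the population size?
      total newweight normweightsubdl newweightmlm

      * Point 2: how similar are the three weight variables?
      correlate newweight normweightsubdl newweightmlm
      graph matrix newweight normweightsubdl newweightmlm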
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2



      • #4
        Thanks so much for your reply, Steve--I have been pulling my hair out for the last three days trying to make sense of this and can't proceed with the analysis until I work it out, so I am very grateful for your help! To respond to your thoughts:

        1. The weights have not been normalized to sum to the sample size--they have been normalized to sum to the population size. When I run the identical analysis using logit instead of melogit, the population size reflected by the analysis is correct. It is only melogit that seems to report the sample size for the population size.

        2. How similar would the weights have to be in order to produce uniform standard errors? I don't think that my weights are similar enough to produce this result (but I'm not a statistician, so perhaps I am wrong?). I have pasted summary information about two different weight variables below, in case that helps. What seems suspicious to me is that it is only with melogit that all the output is identical regardless of weights. With logit, point estimates and standard errors are very different, depending upon the weights, and I can't think of a plausible reason why this would happen with logit but not melogit (but again, I'm not a statistician, so maybe I'm overlooking something).

        3. This data is analyzing course outcomes for students who took different courses. The multilevel model is not to control for survey structure, but rather to control for unobserved heterogeneity in outcomes by the specific course that the student took (since the outcomes of students who took the same course would be correlated). Students weren't sampled in a second stage from the courses, but rather from the whole population of students who took any of the courses. So I believe the way I have set things up is correct, but if not, please feel free to let me know!

        4. Thanks, I will correct this. I think I was using the menu option instead of typing the code in this instance, and I think Stata added this automatically? I will remove it and type code manually.



        weight variable 1:
        Code:
                         range:  [10.60849,50.345139]         units:  1.000e-07
                 unique values:  7,842                    missing .:  0/9,484
        
                          mean:   15.1975
                      std. dev:   2.13416
        
                   percentiles:        10%       25%       50%       75%       90%
                                   13.2182   14.1555   15.0918   15.9148   16.7836
        weight variable 2:
        Code:
                         range:  [1.0004365,798.32281]        units:  1.000e-07
                 unique values:  9,405                    missing .:  0/9,484
        
                          mean:   4.66952
                      std. dev:   21.7418
        
                   percentiles:        10%       25%       50%       75%       90%
                                   1.00326   1.00819   1.02387    1.0687   5.71433



        • #5
          I meant to add to my last post that I am not just concerned about the standard errors being identical across weight variables, but also the point estimates. Based on the differences in point estimates that I see when I run logit models, I expected to see at least some difference in the point estimates when I use melogit with versus without the weights (for some logit models, the direction of the relationship shifts dramatically when the weights are used).



          • #6
            Claire has found a bug in gsem and the me commands that
            support the svy prefix. The fix will be available in the next
            update to Stata 14.

            In Claire's example, the problem is that svy: melogit is failing
            to exit with an error indicating that the weights are not constant
            within the levels of the course variable. The reason for this is
            that

            Code:
            svyset _n [pweight=normweightsubdl], vce(linearized) singleunit(missing)
            specifies that the data were sampled without strata or clusters in a
            single level, yet the model

            Code:
            svy linearized: melogit success fully  || course:, or
            indicates that observations within the levels of the course
            variable are somehow clustered. svy linearized tries to line up
            the random effect levels with the sampling stages, but there is only
            one sampling stage, and the bug is preventing it from acknowledging the
            sampling weights. After the fix, svy: melogit will exit with an
            error indicating that normweightsubdl is not constant within the
            levels of the course variable.

            In this situation, svy: melogit requires the analyst to line up
            the model's random effects with the survey's sampling stages, so if
            Claire's sampling weights are truly observation-level weights and there
            are no sampling units that course is nested within, then the
            correct svyset syntax would be

            Code:
            svyset course || _n, weight(normweightsubdl) vce(linearized) singleunit(missing)



            • #7
              By the way, the bug I mention has to do with the user specifying weights
              using the weight expression clause, instead of using the weight() option
              in the associated sampling stages.
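
              To illustrate the distinction (a sketch using the weight variable from Claire's posts): the first form supplies a single final weight through the weight expression clause, which triggered the bug, while the second supplies the weight through the stage's weight() option, which is what the multilevel estimators expect.

              Code:
              * Final-weight [pw=exp] syntax that triggered the bug:
              svyset _n [pweight=normweightsubdl], vce(linearized)

              * Stage-level weight() syntax, with course as the
              * first-stage unit:
              svyset course || _n, weight(normweightsubdl) vce(linearized)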



              • #8
                Just realized the bug I'm referring to was fixed in the 22apr2015 update.

                Claire should verify that her Stata 14 is fully up-to-date.

                If this problem persists, even after verifying that Stata is fully up-to-date, I encourage
                Claire to write to tech-support with an example do-file and dataset that reproduces
                the problem.



                • #9
                  You apparently have found a bug which accounts for the aberrant behavior--good eye! My major questions are about other issues:

                  I still don't understand your design fully. In your second set of results there are 19 groups (courses). I interpret your post to mean that you took a random sample of students from a population of about 144,133 unique students who took any of 19 courses. Is this right? How do you account for a student who takes more than one course?

                  A hallmark of a weight variable is that the total is an estimate of the population size. The totals for your two weights (found by multiplying 9,484 x the mean) are 144,148, which is close to the total of 144,133 you quote earlier, and 44,285. What is the second weight?
                  Last edited by Steve Samuels; 02 Jun 2015, 17:34.



                  • #10
                    Thanks for the update, Jeff--this is very helpful. I am having convergence problems with the model when it is run with the weights right now, so I can't confirm that it works just yet, but I will report back after I've made some progress. I just installed Stata on this computer from scratch about a week ago, so I am pretty sure it is up to date, but I will check, and if Stata is in fact up to date and still has this problem, I will contact Stata technical support as you suggest.



                    • #11
                      Steve, thanks also for writing again. The number of groups is actually 1118 (I'm not sure why it showed up as 19 in that output, and I can't reproduce it--when I run that command, I get 1118 total groups in the dataset, so it seems to be correct now). Yes, the sample was selected from the 144,133 students who took any one of the 1118 courses. Students may have taken more than one of these courses, but they were randomly asked only about one of the courses in which they were enrolled, so that no single student is represented more than once in the dataset, and the number of courses that a student took that were on our course list did not impact their chance of participation.

                      You are right about the weight variables not summing correctly according to those summary statistics. I'm not sure why that is; I can only presume that I pasted summary data for the wrong variables here (I redid the weights a few times, to normalize them, but also because a few subjects had to be dropped from the sampling frame because of corrected information about course enrollments). I just rechecked the weights, and they both sum to 144,133 as they should, so that was probably my mistake in running the summary information. The first weight is the sampling weight, and the second is a propensity-score-adjusted weight (that incorporates the sampling weights as well).



                      • #12
                        This all makes sense. I'll just note a phenomenon that could bias analyses of course comparisons. Because you selected one course at random for each student, students who took higher numbers of courses were undersampled for any specific course C. Suppose a student takes k courses. The standard weighting solution in this situation is to multiply the original sampling weight by k. Another approach is to include k as a covariate, which might be interesting in its own right.
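
                        A sketch of the re-weighting idea, assuming a hypothetical variable k holding the number of listed courses each student took (the weight variable name is taken from the earlier posts):

                        Code:
                        * Standard fix: inflate the original sampling weight
                        * by k, so multi-course students are no longer
                        * undersampled for any given course:
                        generate adjweight = newweight * k

                        * Alternative: keep the original weight and include
                        * k as a covariate in the model:
                        melogit success fully k || course: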



                        • #13
                          Dear all,

                          I have a somewhat related problem. I was puzzled to see that the combination of mi data, svy weights, and melogit is possible. I tried to do the same with my data and got the following error message:
                          Code:
                          survey final weights not allowed with multilevel models;
                              a final weight variable was svyset using the [pw=exp] syntax, but multilevel models require that each stage-level weight variable is svyset using the stage's corresponding
                              weight() option
                          an error occurred when svy executed melogit
                          an error occurred when mi estimate executed svy:melogit on m=1
                          r(459);
                          I set my data with:
                          Code:
                          mi svyset aw [pweight=w1], strata(p2) fpc(g2)
                          The syntax I use is:
                          Code:
                          mi est, post cmdok: svy: melogit mm_25_s4 i.d_eth_s4 i.p_du_s4 d_pov_s14 i.p_bsc_s14 i.p_bnvq_s4  i.inv_s4 i.p_sy_s4  c.p_age_sample_s1##c.p_age_sample_s1 i.p_sex_s4  ||sserial: ||tserial:
                          I've previously tried it without the cmdok option, but Stata tells me to use it if I don't (svy:melogit is not officially supported by mi estimate; you can use option cmdok to allow estimation anyway).


                          I am working with Stata 15, in case that is relevant.

                          I'd greatly appreciate all help!



                          • #14
                            As the error message stated, you need a weight for each stage-level, including the first stage of sampling PSUs, your aw. See the section on Survey data in the Manual entry for meglm. At levels where there was no sampling (all units were selected), use weight variables equal to 1.
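
                            As a sketch of this (using the variable names from the previous post, and assuming sserial and tserial are the school- and class-level identifiers at which all units were selected, so their stage weights are 1):

                            Code:
                            * Stage weights of 1 where no sampling took place:
                            generate byte w_stage2 = 1
                            generate byte w_stage3 = 1

                            mi svyset aw, weight(w1) strata(p2) fpc(g2) || sserial, weight(w_stage2) || tserial, weight(w_stage3)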

                            By the way, I hope that before you did your imputation modeling, you saw this statement in the MI manual section "Introduction to multiple-imputation analysis":
                            In the survey context, all structural variables such as sampling weights, strata, and cluster identifiers (or at least main strata and main clusters) need to be included in the imputation model. In general, any predictors involved in the definition of the completed-data estimators and the sampling design should be included in the imputation model
                            Last edited by Steve Samuels; 24 Apr 2018, 14:39.



                            • #15
                              Dear Steve,


                              thank you so much for your reply!
                              I've just read the section in the ME manual, and it was very helpful, but I am still a bit unsure whether what I am doing is even conceptually feasible.

                              I have variables that account for the survey design, i.e. non-response and dropout, and I correct for these when svysetting my data. The multilevel model I am trying to estimate considers the nesting of students in classes in schools. I cannot quite work out how to include svy design weights for classes and schools, as I have no information on this. Could you maybe give me a clue?

                              Thank you for the reminder about the MI inclusion. I did include all the design variables in my imputation models. In a very first attempt I also considered additional information that drives these design weights (i.e., being a lone parent, unemployed, a renter, or young: the factors driving non-response), yet most of them were dropped. I then concluded that, as my weights are final weights, all these generating factors are already accounted for. Am I wrong here?

                              Cheers, Evelyn

