Svy: number of observation??

Menglu Peng

Join Date: Aug 2019

Posts: 8
#1


09 Aug 2019, 14:42

Dear Statalists,
I am using a large-scale national dataset for my analyses. Since it is complex survey data, I am using the svyset command to tell Stata the survey design characteristics. Below is my code:
. svyset psu [pweight=W3W2STUTR], strata(STRATA_ID) vce(linearized) singleunit(centered)

I have a variable named mtheff11, which has 5435 missing values out of a total of 25206 observations.

When I use the svy: prefix to calculate its weighted mean, the reported number of observations is 24607. But mtheff11 has 5435 missing values, so only 19771 observations have a value on this variable. Where does the number 24607 come from? Can anybody give me a hint on how Stata calculates the number of obs? I feel confident that I am using the right code to calculate the mean for the variable I am interested in. Thank you so much!!
Richard Williams

Join Date: Apr 2014

Posts: 4946
#2

09 Aug 2019, 16:52

What happens if you do not use singleunit(centered) or specify some other singleunit option?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Menglu Peng

Join Date: Aug 2019

Posts: 8
#3

13 Aug 2019, 09:59

Thank you, Dr. Williams, for your response!! I tried different singleunit options (missing, scaled, and certainty), but they all produce the same number of obs, which is 24607. I also tried leaving out the singleunit option entirely, but I still get 24607 observations. I noticed that if I change the pweight variable, the number of observations changes as well, so I am guessing this issue is related to the sampling weights. Does Stata use this number of observations to calculate the standard error of the estimate, even though only 19771 observations have a value on this variable? Looking forward to your reply! Thank you so much!!
Richard Williams

Join Date: Apr 2014

Posts: 4946
#4

13 Aug 2019, 10:37

I can't reproduce the problem. I will give my usual advice to make sure Stata is up to date.

You could try posting an extract with dataex, but make sure the svyset variables are included and that the example data are enough to replicate the problem.

Menglu Peng

Join Date: Aug 2019

Posts: 8
#5

13 Aug 2019, 11:14

Thank you, Dr. Williams! The thing is, I am using one of the NCES restricted-use datasets, the High School Longitudinal Study of 2009. I am not allowed to share the data in any form, so I am guessing that extracting some example data to replicate the problem is not allowed either. My only hope is to find information on how the number of observations is calculated with survey data in Stata, but I do not know whom I should reach out to. Do you have any other suggestions? Thank you so much!!
Richard Williams

Join Date: Apr 2014

Posts: 4946
#6

13 Aug 2019, 11:30

I would write to Stata tech support. Not being able to provide any data will complicate things, but perhaps they will recognize the issue.

Menglu Peng

Join Date: Aug 2019

Posts: 8
#7

13 Aug 2019, 11:42

Thank you, Dr. Williams!
Richard Williams

Join Date: Apr 2014

Posts: 4946
#8

13 Aug 2019, 11:49

Do you have any zero pweights, especially when the variable is missing? https://www.stata.com/support/faqs/s...-zero-weights/
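One way to check this is sketched below. It assumes the weight and analysis variables are named W3W2STUTR and mtheff11, as in the svyset line in post #1; zero_wt and miss_out are hypothetical helper variables introduced only for this illustration.

```stata
* Sketch only: variable names come from the svyset line in post #1.
* How many observations carry a zero sampling weight?
count if W3W2STUTR == 0

* How many of those are also missing on the analysis variable?
count if W3W2STUTR == 0 & missing(mtheff11)

* Cross-tabulate "zero weight" against "missing outcome" in a single table.
* zero_wt and miss_out are hypothetical names used only for this check.
generate byte zero_wt  = (W3W2STUTR == 0) if !missing(W3W2STUTR)
generate byte miss_out = missing(mtheff11)
tabulate zero_wt miss_out, missing
```

If the cell for zero weight and missing outcome is large, it is a likely candidate for the extra observations in the reported N.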


Richard Williams

Join Date: Apr 2014

Posts: 4946
#9

13 Aug 2019, 12:28

pweight = 0 can cause behavior like you are seeing. It doesn't seem to affect the estimates or SEs though:

Code:

. webuse nhanes2f, clear

. mean highlead

Mean estimation                   Number of obs   =      4,942

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    highlead |   .0592877   .0033597      .0527012    .0658743
--------------------------------------------------------------


. svy: mean highlead
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31       Number of obs   =       4,942
Number of PSUs   =      62       Population size =  56,343,166
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    highlead |   .0617646   .0056843      .0501714    .0733578
--------------------------------------------------------------

. replace finalwgt = 0 if missing(highlead)
(5,395 real changes made)

. svy: mean highlead
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31       Number of obs   =      10,337
Number of PSUs   =      62       Population size =  56,343,166
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    highlead |   .0617646   .0056843      .0501714    .0733578
--------------------------------------------------------------


Menglu Peng

Join Date: Aug 2019

Posts: 8
#10

13 Aug 2019, 13:35

I see!! In the example you provided using the nhanes2f data, 4942 observations have a value on highlead, and 5395 observations are missing on it and also have pweight=0 after the replace. So 4942+5395=10337. I tested this with my dataset, and the number of observations makes sense now!! The variable I am interested in (mtheff11) has 19771 valid values. Another 4836 observations are missing on this variable and have a pweight equal to zero. So 19771+4836=24607, and now I see where N=24607 came from. But why does Stata include these 4836 observations, which are missing on mtheff11 and have a zero pweight, when calculating the weighted mean of mtheff11? Do you have any ideas? THANK YOU so much!!
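The reconciliation described above can be scripted roughly as follows. This is a sketch that assumes the data are already svyset as in post #1 and that the variable names match those in the thread; the local macro names are made up for the illustration.

```stata
* Sketch only: assumes the data have already been svyset as in post #1.
svy: mean mtheff11
display "N reported by svy: " e(N)

* Observations with a valid value on the analysis variable
count if !missing(mtheff11)
local n_valid = r(N)

* Observations missing on the analysis variable that also have a zero weight
count if missing(mtheff11) & W3W2STUTR == 0
local n_zero_missing = r(N)

* If zero weights explain the discrepancy, this sum should match e(N)
local n_expected = `n_valid' + `n_zero_missing'
display "valid + zero-weight missing: " `n_expected'
```

With the figures reported in this thread, the two counts would be 19771 and 4836, which sum to 24607.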
Richard Williams

Join Date: Apr 2014

Posts: 4946
#11

13 Aug 2019, 21:12

Other than the reported number of observations, the point estimate and SE estimate were the same whether you had 0 weights or not. So, those missing values are not affecting any calculations. The FAQ I cited earlier has more of an explanation but I’ll admit I didn’t read it super carefully.

I’m more curious as to why you have nearly 5000 cases with zero pweights. There may be a good reason, but make sure there isn’t a mistake somewhere.

Menglu Peng

Join Date: Aug 2019

Posts: 8
#12

14 Aug 2019, 09:05

Out of the 25206 observations in this national dataset, 8668 students have a zero value on this weight variable. I do not know why. But thank you so much, Dr. Williams, for all your responses! They are very helpful.
Richard Williams

Join Date: Apr 2014

Posts: 4946
#13

14 Aug 2019, 09:44

The dataset documentation hopefully explains the zero weights. Presumably there is a reason for them. Furthermore, different weights may be available, and if so you should check when you should use each one.

