  • Svy: number of observations?

    Dear Statalists,
    I am using a large-scale national dataset for my analyses. Since it is complex survey data, I am using the svyset command to tell Stata the survey design characteristics. Below is my code:
    . svyset psu [pweight=W3W2STUTR], strata(STRATA_ID) vce(linearized) singleunit(centered)

    I have a variable named mtheff11, which has 5435 missing values out of a total of 25206 observations (shown below).
    [Image: 1.png]


    When I use the svy: prefix to calculate its weighted mean, the number of observations is 24607 (shown below). But mtheff11 has 5435 missing values, so only 19771 observations have a value on this variable. Where did the number 24607 come from? Can anybody give me a hint on how Stata calculated the number of obs? I feel confident that I am using the right code to calculate the mean for the variable I am interested in. Thank you so much!!
    [Image: 2.png]


  • #2
    What happens if you do not use singleunit(centered) or specify some other singleunit option?
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Thank you, Dr. Williams, for your response!! I tried different singleunit options, such as missing, scaled, and certainty, but they all produce the same number of obs, which is 24607. I also tried leaving out the singleunit option, but I still get 24607 observations. I noticed that if I change the pweight variable, the number of observations changes as well, so I am guessing this issue is related to the sampling weights. Does Stata use this number of observations to calculate the standard error of the estimate, even though only 19771 observations have a value on this variable? Looking forward to your reply! Thank you so much!!



      • #4
        I can't reproduce the problem. I will give my usual advice to make sure Stata is up to date.

        You could try posting an extract with dataex, but make sure the svyset variables are included and that the example data are enough to replicate the problem.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology


        • #5
          Thank you, Dr. Williams! The thing is, I am using one of the NCES restricted-use datasets, the High School Longitudinal Study of 2009. I am not allowed to share the data in any form, so I am guessing that extracting example data to replicate the problem is not allowed either. My only hope is to find information on how the number of observations is calculated with survey data in Stata, but I do not know whom I should reach out to. Do you have any other suggestions? Thank you so much!!



          • #6
            I would write to Stata tech support. Not being able to provide any data will complicate things, but perhaps they will recognize the issue.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology


            • #7
              Thank you, Dr. Williams!



              • #8
                Do you have any zero pweights, especially when the variable is missing? https://www.stata.com/support/faqs/s...-zero-weights/
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology


                • #9
                  pweight = 0 can cause behavior like you are seeing. It doesn't seem to affect the estimates or SEs though:

                  Code:
                  . webuse nhanes2f, clear
                  
                  . mean highlead
                  
                  Mean estimation                   Number of obs   =      4,942
                  
                  --------------------------------------------------------------
                               |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                      highlead |   .0592877   .0033597      .0527012    .0658743
                  --------------------------------------------------------------
                  
                  
                  . svy: mean highlead
                  (running mean on estimation sample)
                  
                  Survey: Mean estimation
                  
                  Number of strata =      31       Number of obs   =       4,942
                  Number of PSUs   =      62       Population size =  56,343,166
                                                   Design df       =          31
                  
                  --------------------------------------------------------------
                               |             Linearized
                               |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                      highlead |   .0617646   .0056843      .0501714    .0733578
                  --------------------------------------------------------------
                  
                  . replace finalwgt = 0 if missing(highlead)
                  (5,395 real changes made)
                  
                  . svy: mean highlead
                  (running mean on estimation sample)
                  
                  Survey: Mean estimation
                  
                  Number of strata =      31       Number of obs   =      10,337
                  Number of PSUs   =      62       Population size =  56,343,166
                                                   Design df       =          31
                  
                  --------------------------------------------------------------
                               |             Linearized
                               |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                      highlead |   .0617646   .0056843      .0501714    .0733578
                  --------------------------------------------------------------
                  -------------------------------------------
                  Richard Williams, Notre Dame Dept of Sociology


                  • #10
                    I see!! In the example you provided using the nhanes2f data, 4942 observations have a value on the variable highlead, and 5395 observations are missing on this variable and also have pweight=0. So 4942+5395=10337. I tested it with my dataset, and the number of observations makes sense now!! The variable I am interested in (mtheff11) has 19771 valid values. Another 4836 observations are missing on this variable and have a pweight equal to zero. So 19771+4836=24607. Now I see where N=24607 came from. But why is Stata including these 4836 observations, which are missing on mtheff11 and have a zero pweight, when calculating the weighted mean of mtheff11? Do you have any ideas? THANK YOU so much!!
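
                    A minimal sketch for checking that breakdown directly, assuming the data are already svyset as in #1 and that no observations are dropped for missing design variables (mtheff11 and W3W2STUTR are the variable names from this thread):

                    Code:
                    * N as reported by the survey estimator
                    svy: mean mtheff11
                    display e(N)
                    * observations with a valid value on the variable
                    count if !missing(mtheff11)
                    * observations missing on the variable that also have a zero pweight
                    count if missing(mtheff11) & W3W2STUTR == 0

                    If the zero-weight explanation holds, the two counts should add up to e(N).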



                    • #11
                      Other than the reported number of observations, the point estimate and the SE estimate were the same whether you had 0 weights or not. So those missing values are not affecting any calculations. The FAQ I cited earlier has more of an explanation, but I'll admit I didn't read it super carefully.

                      I'm more curious as to why you have nearly 5000 cases with zero pweights. There may be a good reason, but make sure there isn't a mistake somewhere.
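
                      One way to look into this, as a rough sketch: cross-tabulate a zero-weight flag against a missingness flag. W3W2STUTR and mtheff11 are the names from #1; zero_wt and miss_y are hypothetical helper variables introduced only for this illustration.

                      Code:
                      * flag zero weights and missing outcomes (helper names are made up)
                      generate byte zero_wt = (W3W2STUTR == 0)
                      generate byte miss_y  = missing(mtheff11)
                      * see how zero weights line up with missingness on the outcome
                      tabulate zero_wt miss_y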
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology


                      • #12
                        Out of the 25206 observations in this national dataset, 8668 students have a zero value on this weight variable. I do not know why. But thank you so much, Dr. Williams, for all your responses! They are very helpful.



                        • #13
                          The dataset documentation hopefully explains the zero weights. Presumably there is a reason for them. Furthermore, different weights may be available, and if so you should check when you should use each one.
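
                          As a starting point, a sketch for listing candidate weight variables and counting the zero weights. It assumes the weight variables have "weight" in their names or labels, which may not hold for this dataset.

                          Code:
                          * search variable names and labels for the string "weight"
                          lookfor weight
                          * how many observations have a zero value on the weight from #1
                          count if W3W2STUTR == 0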
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
