  • How to obtain the standard deviation for COUNT data while using survey data and the subpopulation option

    Hello all,

    I previously posted a question about obtaining the SD for a mean in a survey setting and got an answer from Statalist (thank you, members of Statalist; thank you, Steve).

    http://www.statalist.org/forums/foru...ulation-option

    I now need the standard deviation for count data and was hoping to get help from the gurus at Statalist. Can you please help me?

    Code:
     
    ********EXAMPLE***START*****
    
    webuse nhanes2, clear
      svy linear, subpop(if region==2): tab agegrp  ,count  se  format(%7.0f)
    
    ********EXAMPLE***END*****
    I need the SD for the age groups 20-29, 30-39, 40-49, 50-59, 60-69, etc.

    I am not well versed in using matrices. Can anyone help me, please?

    It looks like vecdiag( e(V_row) ) might have the answer, but I am not sure, and that is the extent of my matrix language skills.

    Thank you in advance for your help.
    sincerely,
    Ritu

  • #2
    Hi,

    I have tried many different ways. I am not sure if it is possible.
    Sorry,
    Lucas



    • #3
      Hi Ritu

      It would not be possible to obtain within-group means and standard deviations without knowing the ages of the individuals within the groups. Luckily, you have a third variable, "age".


      Running your code, you obtain


      . svy linear, subpop(if region==2): tab agegrp ,count se format(%7.0f)
      (running tabulate on estimation sample)

      Number of strata   =         8                  Number of obs      =      2774
      Number of PSUs     =        16                  Population size    =  29163797
                                                      Subpop. no. of obs =      2774
                                                      Subpop. size       =  29163797
                                                      Design df          =         8

      ----------------------------------
      Age       |
      groups    |
      1-6       |      count          se
      ----------+-----------------------
       age20-29 |    8543268      402356
       age30-39 |    6021114      451462
       age40-49 |    5352602      349041
       age50-59 |    4194627      252123
       age60-69 |    3712643      174287
        age 70+ |    1339543      137739
                |
          Total |   29163797
      ----------------------------------
        Key:  count  =  weighted counts
              se     =  linearized standard errors of weighted counts





      where the weighted counts simply sum to the population size. With aweights in place of pweights (for example, in a plain tabulate), the counts would instead sum to the number of observations (i.e. 2774). If I understand you correctly, you are interested in finding the means and standard deviations of age within the age groups. Proceed as follows:

      *1. generate dummies for the age groups


      . tab agegrp, gen(agr)

       Age groups |
              1-6 |      Freq.     Percent        Cum.
      ------------+-----------------------------------
         age20-29 |      2,320       22.41       22.41
         age30-39 |      1,622       15.67       38.08
         age40-49 |      1,272       12.29       50.37
         age50-59 |      1,291       12.47       62.84
         age60-69 |      2,860       27.63       90.47
          age 70+ |        986        9.53      100.00
      ------------+-----------------------------------
            Total |     10,351      100.00





      *2. Compute the means and standard deviations one by one


      . svy linear, subpop(if region==2): mean age if agr1==1
      (running mean on estimation sample)

      Survey: Mean estimation

      Number of strata =       8        Number of obs    =       684
      Number of PSUs   =      16        Population size  =   8543268
                                        Subpop. no. obs  =       684
                                        Subpop. size     =   8543268
                                        Design df        =         8

      --------------------------------------------------------------
                   |             Linearized
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
               age |   24.25182   .1013685      24.01806    24.48557
      --------------------------------------------------------------
      Note: 23 strata omitted because they contain no subpopulation
            members.

      . di sqrt(e(N) * el(e(V_srssub), 1, 1))
      2.8147608



      . svy linear, subpop(if region==2): mean age if agr2==1
      (running mean on estimation sample)

      Survey: Mean estimation

      Number of strata =       8        Number of obs    =       433
      Number of PSUs   =      16        Population size  =   6021114
                                        Subpop. no. obs  =       433
                                        Subpop. size     =   6021114
                                        Design df        =         8

      --------------------------------------------------------------
                   |             Linearized
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
               age |   34.21679   .1025952      33.98021    34.45338
      --------------------------------------------------------------
      Note: 23 strata omitted because they contain no subpopulation
            members.

      . di sqrt(e(N) * el(e(V_srssub), 1, 1))
      2.773369




      and so on. Therefore, for the first age group, you have 684 observations, representing 8543268/29163797 (approx. 29.3 percent) of the subpopulation size, with an average age of 24.25 and an SD of 2.81.
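
As a possible alternative to the `di sqrt(...)` line above, here is a minimal sketch that asks Stata for the subpopulation SD directly with `estat sd` after `svy: mean`. It assumes you are willing to move the age-group restriction from the `if` qualifier into `subpop()`, which keeps observations outside the subpopulation available for variance estimation; `agegrp==1` is taken to be the 20-29 group, as in the tabulation above. The `srssubpop` option is used here on the assumption that it corresponds to the `e(V_srssub)` calculation, so the SD it reports should be checked against the hand-computed value rather than taken for granted.

```stata
webuse nhanes2, clear

* Put the whole restriction inside subpop() rather than using -if-,
* so that the full design (all strata and PSUs) is retained
svy linear, subpop(if region==2 & agegrp==1): mean age

* Report the estimated subpopulation standard deviation of age
estat sd, srssubpop
```

The point estimate of the mean is unchanged by moving the restriction into `subpop()`; only the variance calculations can differ.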













      • #4
        Hi

        Thank you very much for your time, and sorry for any confusion.

        I want the SD for the counts (i.e. the number of subjects in a particular age group), not for age.

        I want the SD for the count 8543268 (age group 20-29), 6021114 (age group 30-39), 5352602 (age group 40-49), 4194627 (age group 50-59), etc.

        Can you please let me know the method to get the SD for the count?

        Thank you ,

        Ritu




        • #5
          Hi Ritu again

          Apologies for the misinterpretation. I do not know if there is a command that directly gives you the SDs from the aggregated counts and standard errors. However, I think that you can exploit the fact that count varies across age within the age groups and compute the SD. Run the same code with age in place of agegrp

          . svy, subpop(if region==2): tab age ,count format(%7.0f)
          (running tabulate on estimation sample)

          Number of strata = 8 Number of obs = 2774
          Number of PSUs = 16 Population size = 29163797
          Subpop. no. of obs = 2774
          Subpop. size = 29163797
          Design df = 8

          ----------------------
          age in |
          years | count
          ----------+-----------
          20 | 789283
          21 | 1115262
          22 | 877920
          23 | 963528
          24 | 820923
          25 | 941455
          26 | 937758
          27 | 614668
          28 | 699701
          29 | 782770
          30 | 686515
          31 | 669717
          32 | 607647
          33 | 562390
          34 | 591490
          35 | 730997
          36 | 661984
          37 | 631223
          38 | 506167
          39 | 372984
          40 | 599093
          41 | 587845
          42 | 629027
          43 | 562084
          44 | 399805
          45 | 608331
          46 | 435477
          47 | 410525
          48 | 565294
          49 | 555121
          50 | 664851
          51 | 382783
          52 | 470515
          53 | 438982
          54 | 213722
          55 | 442210
          56 | 399075
          57 | 500729
          58 | 357873
          59 | 323887
          60 | 428113
          61 | 412926
          62 | 428644
          63 | 367395
          64 | 402552
          65 | 357441
          66 | 370384
          67 | 254471
          68 | 375861
          69 | 314856
          70 | 342040
          71 | 297259
          72 | 284030
          73 | 235899
          74 | 180315
          |
          Total | 29163797
          ----------------------
          Key: count = weighted counts


          Note: 23 strata omitted because they contain no subpopulation members.





          The first 10 counts (ages 20-29) sum to 8543268 (for age group 20-29), the next 10 (ages 30-39) sum to 6021114 (for age group 30-39), etc. Using the raw counts, you should be able to compute the mean count and SD for each group.
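
The calculation described above (mean and SD of the single-year weighted counts within each age group) could be sketched as follows. This is only an illustration: it assumes that the weighted counts from `svy: tab` are sums of the sampling weight `finalwgt`, and `wcount` is a made-up variable name. Note that the resulting SD describes how the single-year counts vary within a group, not the sampling variability of the group total.

```stata
webuse nhanes2, clear

* Restrict to the subpopulation; acceptable here because only
* point estimates (sums of weights) are computed, not variances
keep if region == 2

* One row per single year of age, holding the weighted count
collapse (sum) wcount = finalwgt, by(agegrp age)

* Number of ages, mean count, SD of counts, and total per age group
tabstat wcount, by(agegrp) statistics(n mean sd sum) format(%12.0f)
```

The `sum` column should reproduce the group totals shown by `svy: tab agegrp, count`.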



          • #6
            Hello Steve Samuels,
            It looks like you have replied to this thread, but somehow I cannot see your recommendation.
            Can you please post it again?
            Thanks
            Ritu



            • #7
              Cross-posted http://stackoverflow.com/questions/2...ey-data-and-su

              Please see FAQ Advice for our policy on cross-posting, which is that you should tell us about it.



              • #8
                Hi Nick, I just posted it at the other forum yesterday, and hence I mentioned it at that forum. Thank you, Ritu



                • #9
                  Not the point at all! You should tell Statalist about postings to other forums. Indeed, you should also tell other forums about postings to Statalist.



                  • #10
                    Will do. Sorry, my mistake. Thank you, Ritu



                    • #11
                      Thanks! All this advice is intended to make communication more efficient and effective.



                      • #12

                        I think there is a misunderstanding here related to terminology and to theory. In your earlier post( http://www.statalist.org/forums/foru...ulation-option)

                        you asked how to compute "the standard deviation (for mean)". The standard deviation of the mean is what is universally called the standard error. However, it was clear that what you were requesting was the standard deviation of the individual values of log lead exposure (your example) in a subpopulation. Call those values \(X\).

                        The population mean of the \(X\)s is
                        \[
                        \overline{X} =
                        \sum X_i /N
                        \]
                        The standard deviation of a measurement \(X\) in a finite population is:
                        \[
                        S_x = \sqrt{\sum (X_i - \overline{X})^2
                        /N}\]
                        This is a fixed attribute of the population; it describes variation of \(X\) within the population.

                        The standard deviation for the sample mean, on the other hand, represents how variable the estimated mean is from sample to sample. It will depend on the sample design and sample size. To avoid confusion with the population standard deviation, it is referred to as the standard error.

                        Your current question:

                        Now you are asking for the "standard deviation" of "count data" in age categories. For "count data" there are two related population parameters for a category \(j\): one is the number of people in the category, \(N_j\), and the other is the proportion in the category, \(P_j=N_j/N \), where \(N\) is the total count in the population. The estimates for these are given by svy: tabulate. You are evidently interested in the count, as you use that option.

                        The wording of your question implies (by analogy with your previous post) that you are really interested in a population standard deviation associated with the count. Is this so? If this is the case, then it is easy to derive.

                        The population count for category j is \(T_j\), defined as:
                        \[
                        T_j =
                        \sum_{i=1}^N Y_i^j
                        \]
                        where
                        \[
                        Y_i^j =
                        \begin{cases}
                        1 & \text{element \(i\) is in category \(j\)} \\
                        0 & \text{element \(i\) is not in category \(j\)}
                        \end{cases}
                        \]
                        \(Y_i^j\) is the individual "count" variable. The population standard deviation of the \(Y_i^j\) is:
                        \[
                        S_j = P_j \times (1 - P_j)
                        \]
                        This is the same as the true SD for a theoretical binomial (single-trial) random variable with probability of success \( P_j\). In Stata, you can compute these values in many ways. Here is one:

                        Code:
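                        webuse nhanes2, clear
                        svy , subpop(if region==2): prop agegrp
                        mata:
                        // column vector of the estimated proportions P_j
                        m = st_matrix("e(b)")'
                        // element j is sqrt(P_j - P_j^2) = sqrt(P_j*(1 - P_j))
                        sd = diagonal(sqrt(diag(m)-diag(m*m')))
                        sd
                        end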
                        Last edited by Steve Samuels; 18 Jan 2015, 14:45.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2



                        • #13
                          Hello Steve Samuels,


                          Thank you for the detailed explanation.

                          I am trying to get the SD for the count in EACH group,

                          i.e. the SD for 8543268 (age group 20-29), the SD for 6021114 (age group 30-39), the SD for 5352602 (age group 40-49), the SD for 4194627 (age group 50-59), and so on.

                          I guess, based on your post, that I am getting the SD for the proportions.

                          But how would I get the SD for the actual count?



                          Code:
                          
                          
                          I ran the following code: 
                          
                          webuse nhanes2, clear
                          svy , subpop(if region==2): tabulate  agegrp, count  se format (%7.0f)
                          svy , subpop(if region==2): prop agegrp
                          mata:
                          m = st_matrix("e(b)")'
                          sd = diagonal(sqrt(diag(m)-diag(m*m')))
                          sd
                          end
                          
                          
                          **** THE RESULTS ARE AS FOLLOWS
                          
                          . svy , subpop(if region==2): tabulate  agegrp, count  se format (%7.0f)
                          (running tabulate on estimation sample)
                          
                          Number of strata   =         8                  Number of obs      =      2774
                          Number of PSUs     =        16                  Population size    =  29163797
                          Subpop. no. of obs =      2774
                          Subpop. size       =  29163797
                          Design df          =         8
                          
                          
                          Age Group       count          se
                          
                          20-29     8543268      402356
                          30-39     6021114      451462
                          40-49     5352602      349041
                          50-59     4194627      252123
                          60-69     3712643      174287
                          70+     1339543      137739
                                     
                          Total    29163797            
                          
                          Key:  count     =  weighted counts
                          se        =  linearized standard errors of weighted counts
                          
                          Note: 23 strata omitted because they contain no subpopulation members.
                          
                          . svy , subpop(if region==2): prop agegrp
                          (running proportion on estimation sample)
                          
                          Survey: Proportion estimation
                          
                          Number of strata =       8        Number of obs    =      2774
                          Number of PSUs   =      16        Population size  =  29163797
                          Subpop. no. obs  =      2774
                          Subpop. size     =  29163797
                          Design df        =         8
                          
                          _prop_1: agegrp = 20-29
                          _prop_2: agegrp = 30-39
                          _prop_3: agegrp = 40-49
                          _prop_4: agegrp = 50-59
                          _prop_5: agegrp = 60-69
                          _prop_6: agegrp = 70+
                          
                          
                          Linearized
                          Proportion   Std. Err.     [95% Conf. Interval]
                          
                          agegrp       
                          _prop_1    .2929409   .0130771      .2637176    .3239775
                          _prop_2    .2064585   .0121694      .1798013    .2359311
                          _prop_3    .1835358   .0111767      .1591498    .2107221
                          _prop_4    .1438299   .0095164       .123246    .1671961
                          _prop_5    .1273031   .0071124       .111784     .144626
                          _prop_6    .0459317   .0047341      .0361695    .0581697
                          
                          Note: 23 strata omitted because they contain no subpopulation
                          members.
                          
                          . mata:
                          mata (type end to exit) ----    --------
                          : m = st_matrix("e(b)")'
                          
                          : sd = diagonal(sqrt(diag(m)-diag(m*m')))
                          
                          : sd
                          1
                          +---------------+
                          1   .4551115421  
                          2    .404763378  
                          3      .3871052  
                          4   .3509172041  
                          5   .3333122444  
                          6   .2093370152  
                          +---------------+
                          
                          : end
                              
                          
                          . 
                          end of do-file



                          Thank you very much for your time.

                          Ritu



                          • #14
                            Correction: The last equation, for the SD of the \(Y_i^j\), omitted the square root. It should be:
                            \[
                            S_j = \sqrt{P_j \times (1 - P_j)}
                            \]
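
A short derivation sketch of why the square root belongs there, using only the definitions from the earlier post: the mean of the \(Y_i^j\) is \(P_j\), since \(N_j\) of the \(N\) values equal 1 and the remaining \(N - N_j\) equal 0. The population variance is then
\[
S_j^2 = \frac{1}{N}\sum_{i=1}^N (Y_i^j - P_j)^2
= \frac{N_j (1 - P_j)^2 + (N - N_j) P_j^2}{N}
= P_j (1 - P_j)^2 + (1 - P_j) P_j^2
= P_j (1 - P_j)
\]
and the standard deviation is its square root. As a check against the estimates posted in #13, the first age group has estimated proportion .2929409, giving
\[
\sqrt{.2929409 \times (1 - .2929409)} \approx .4551
\]
which agrees with the first element of the Mata output there.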
                            Last edited by Steve Samuels; 18 Jan 2015, 18:06.
                            Steve Samuels
                            Statistical Consulting
                            [email protected]

                            Stata 14.2



                            • #15
                              You have not read my post carefully. The point was that the only standard deviation of an estimate is its standard error. To avoid confusion between the standard deviation of a population of observations and that of a statistic like the mean or an estimated count, we always refer to the latter as the standard error. This is explained in every statistics text. The standard errors for your problem are contained in the results of your last reply.

                              Code:
                              svy linear, subpop(if region==2): tab agegrp ,count se format(%7.0f)
                              ...
                              
                              Age Group  count          se
                              
                              20-29     8543268      402356
                              30-39     6021114      451462
                              40-49     5352602      349041
                              50-59     4194627      252123
                              60-69     3712643      174287
                              70+       1339543      137739
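
If the standard errors are needed as numbers rather than read off the table, one possible route that avoids matrix manipulation is sketched below. It assumes that the weighted count for a group equals the estimated total of a 0/1 indicator for that group; the indicator names `agr1`-`agr6` come from the `generate(agr)` option and are otherwise arbitrary.

```stata
webuse nhanes2, clear

* Indicator (0/1) variables agr1-agr6, one per age group
quietly tabulate agegrp, generate(agr)

* The estimated total of an indicator is the weighted count for that group
svy linear, subpop(if region==2): total agr1-agr6

* Point estimate and standard error for the 20-29 group
display %12.0f _b[agr1] "  " %12.0f _se[agr1]
```

After estimation, `_b[]` and `_se[]` give the estimate and its standard error for each indicator, so the other groups can be displayed the same way.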
                              If, in your earlier post, you wanted the "standard deviation" for the mean, then my answer was not correct, and the proper quantity is also the standard error.

                              Code:
                              webuse nhanes2, clear
                              svy, subpop(if region==2): mean loglead
                              
                              Survey: Mean estimation
                              
                              Number of strata =       8        Number of obs    =      1319
                              Number of PSUs   =      16        Population size  =  13933777
                                                                Subpop. no. obs  =      1319
                                                                Subpop. size     =  13933777
                                                                Design df        =         8
                              
                              --------------------------------------------------------------
                                           |             Linearized
                                           |       Mean   Std. Err.     [95% Conf. Interval]
                              -------------+------------------------------------------------
                                   loglead |   2.610591   .0365134      2.526391    2.694791
                              Last edited by Steve Samuels; 18 Jan 2015, 18:42.
                              Steve Samuels
                              Statistical Consulting
                              [email protected]

                              Stata 14.2
