How to obtain standard deviation for COUNT data, while using Survey data and subpopulation option

ritupuri

Join Date: Apr 2014

Posts: 12
#1


15 Jan 2015, 00:20

hello all,

I previously posted a question about obtaining the SD for a mean in a survey setting and got an answer from Statalist (thank you, members of Statalist; thank you, Steve).

http://www.statalist.org/forums/foru...ulation-option

I now need the standard deviation for count data and was hoping to get help from the gurus at Statalist. Can you please help me?

Code:

********EXAMPLE***START*****
webuse nhanes2, clear
svy linear, subpop(if region==2): tab agegrp, count se format(%7.0f)
********EXAMPLE***END*****

I need the SD for the age groups 20-29, 30-39, 40-49, 50-59, 60-69, etc.

I am not well versed in using matrices. Can anyone help me, please?

It looks like vecdiag(e(V_row)) might have the answer, but I am not sure, and that is the extent of my matrix language skills.

Thank you in advance for your help.
sincerely,
Ritu
lucasferreira

Join Date: Apr 2014

Posts: 18
#2

15 Jan 2015, 10:28

Hi,

I have tried many different ways. I am not sure if it is possible.
Sorry,
Lucas
Andrew Musau

Join Date: Oct 2014

Posts: 10195
#3

15 Jan 2015, 14:35

Hi Ritu

It would not have been possible to obtain within-group means and standard deviations if you did not know the ages of individuals within the groups. Luckily, you have a third variable, "age".

Running your code, you obtain

. svy linear, subpop(if region==2): tab agegrp ,count se format(%7.0f)
(running tabulate on estimation sample)

Number of strata = 8 Number of obs = 2774
Number of PSUs = 16 Population size = 29163797
Subpop. no. of obs = 2774
Subpop. size = 29163797
Design df = 8

----------------------------------
Age |
groups |
1-6 | count se
----------+-----------------------
age20-29 | 8543268 402356
age30-39 | 6021114 451462
age40-49 | 5352602 349041
age50-59 | 4194627 252123
age60-69 | 3712643 174287
age 70+ | 1339543 137739
|
Total | 29163797
----------------------------------
Key: count = weighted counts
se = linearized standard errors of weighted counts

where the weighted counts simply sum to the population size. Using aweights in place of pweights, you would obtain the number of observations as the sum (i.e., 2774). If I understand you correctly, you are interested in finding the means and standard deviations of age within the age groups. Proceed as follows:

*1. generate dummies for the age groups

. tab agegrp, gen(agr)

Age groups |
1-6 | Freq. Percent Cum.
------------+-----------------------------------
age20-29 | 2,320 22.41 22.41
age30-39 | 1,622 15.67 38.08
age40-49 | 1,272 12.29 50.37
age50-59 | 1,291 12.47 62.84
age60-69 | 2,860 27.63 90.47
age 70+ | 986 9.53 100.00
------------+-----------------------------------
Total | 10,351 100.00

*2 Compute the means and standard deviations one by one

. svy linear, subpop(if region==2): mean age if agr1==1
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 8 Number of obs = 684
Number of PSUs = 16 Population size = 8543268
Subpop. no. obs = 684
Subpop. size = 8543268
Design df = 8

--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
age | 24.25182 .1013685 24.01806 24.48557
--------------------------------------------------------------
Note: 23 strata omitted because they contain no subpopulation
members.

. di sqrt(e(N) * el(e(V_srssub), 1, 1))
2.8147608

. svy linear, subpop(if region==2): mean age if agr2==1
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 8 Number of obs = 433
Number of PSUs = 16 Population size = 6021114
Subpop. no. obs = 433
Subpop. size = 6021114
Design df = 8

--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
age | 34.21679 .1025952 33.98021 34.45338
--------------------------------------------------------------
Note: 23 strata omitted because they contain no subpopulation
members.

. di sqrt(e(N) * el(e(V_srssub), 1, 1))
2.773369

and so on. Therefore, for the first age group, you have 684 observations, representing 8543268/29163797 (approx. 29.3 percent) of the subpopulation size, with an average age of 24.25 and an SD of 2.81.
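
The group-by-group steps above could also be run in a loop. This is only a sketch, and it rests on two assumptions not stated in the post: that agegrp is coded 1 through 6, and that putting the age-group condition inside subpop() (rather than in an if qualifier) is acceptable for your purpose. It uses estat sd, the postestimation command that reports subpopulation standard deviations after svy: mean, so the numbers may differ slightly from the sqrt(e(N) * V_srssub) calculation shown above.

```stata
webuse nhanes2, clear

* Assumption: agegrp takes the values 1..6 (one per age group)
forvalues g = 1/6 {
    * restrict the subpopulation to region 2 and the current age group
    quietly svy linear, subpop(if region==2 & agegrp==`g'): mean age
    display as text "Age group `g'"
    * report the estimated mean and standard deviation of age
    estat sd
}
```

estat sd also accepts the srssubpop option, which bases the SD on the within-subpopulation SRS variance.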
ritupuri

Join Date: Apr 2014

Posts: 12
#4

15 Jan 2015, 17:34

Hi

Thank you very much for your time, and sorry for any confusion.

I want the SD for the counts (i.e., the number of subjects in a particular age group), not for age.

I want the SD for the count 8543268 (age group 20-29), 6021114 (age group 30-39), 5352602 (age group 40-49), 4194627 (age group 50-59), etc.

Can you please let me know how to get the SD for the count?

Thank you ,

Ritu
Andrew Musau

Join Date: Oct 2014

Posts: 10195
#5

16 Jan 2015, 04:36

Hi Ritu again

Apologies for the misinterpretation. I do not know of a command that directly gives you SDs from the aggregated counts and standard errors. However, I think that you can exploit the fact that the count varies across single years of age within each age group and compute the SD from that. Run the same code with age in place of agegrp:

. svy, subpop(if region==2): tab age ,count format(%7.0f)
(running tabulate on estimation sample)

Number of strata = 8 Number of obs = 2774
Number of PSUs = 16 Population size = 29163797
Subpop. no. of obs = 2774
Subpop. size = 29163797
Design df = 8

----------------------
age in |
years | count
----------+-----------
20 | 789283
21 | 1115262
22 | 877920
23 | 963528
24 | 820923
25 | 941455
26 | 937758
27 | 614668
28 | 699701
29 | 782770
30 | 686515
31 | 669717
32 | 607647
33 | 562390
34 | 591490
35 | 730997
36 | 661984
37 | 631223
38 | 506167
39 | 372984
40 | 599093
41 | 587845
42 | 629027
43 | 562084
44 | 399805
45 | 608331
46 | 435477
47 | 410525
48 | 565294
49 | 555121
50 | 664851
51 | 382783
52 | 470515
53 | 438982
54 | 213722
55 | 442210
56 | 399075
57 | 500729
58 | 357873
59 | 323887
60 | 428113
61 | 412926
62 | 428644
63 | 367395
64 | 402552
65 | 357441
66 | 370384
67 | 254471
68 | 375861
69 | 314856
70 | 342040
71 | 297259
72 | 284030
73 | 235899
74 | 180315
|
Total | 29163797
----------------------
Key: count = weighted counts

Note: 23 strata omitted because they contain no subpopulation members.

The first 10 counts (ages 20-29) sum to 8543268, the next 10 (ages 30-39) sum to 6021114, etc. Using these single-year counts, you should be able to compute the mean count and SD for each group.
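
As a check on the grouping, the ten single-year weighted counts for ages 20 through 29 in the table above add up to the age-group count reported earlier, and dividing by ten gives the mean count per single year of age in that group:
\[
789283 + 1115262 + 877920 + 963528 + 820923 + 941455 + 937758 + 614668 + 699701 + 782770 = 8543268
\]
\[
\frac{8543268}{10} = 854326.8
\]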
ritupuri

Join Date: Apr 2014

Posts: 12
#6

16 Jan 2015, 17:36

Hello Steve Samuels,
It looks like you have replied to this thread, but somehow I cannot see your recommendation.
Can you please post it again?
Thanks,
Ritu
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

17 Jan 2015, 04:29

Cross-posted http://stackoverflow.com/questions/2...ey-data-and-su

Please see FAQ Advice for our policy on cross-posting, which is that you should tell us about it.
ritupuri

Join Date: Apr 2014

Posts: 12
#8

17 Jan 2015, 06:21

Hi Nick, I only posted it at the other forum yesterday, and so I mentioned it at that forum. Thank you, Ritu
Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

17 Jan 2015, 06:28

Not the point at all! You should tell Statalist about postings to other forums. Indeed, you should also tell other forums about postings to Statalist.
ritupuri

Join Date: Apr 2014

Posts: 12
#10

17 Jan 2015, 06:39

Will do. Sorry, my mistake. Thank you, Ritu
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

17 Jan 2015, 12:49

Thanks! All this advice is intended to make communication more efficient and effective.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#12

18 Jan 2015, 14:28

I think there is a misunderstanding here related to terminology and to theory. In your earlier post (http://www.statalist.org/forums/foru...ulation-option) you asked how to compute "the standard deviation (for mean)". The standard deviation of the mean is what is universally called the standard error. However, it was clear that what you were requesting was the standard deviation of the individual values of log lead exposure (your example) in a subpopulation. Call those values \(X\).

The population mean of the \(X\)s is
\[
\overline{X} =
\sum X_i /N
\]
The standard deviation of a measurement \(X\) in a finite population is:
\[
S_x = \sqrt{\sum (X_i - \overline{X})^2 / N}
\]
This is a fixed attribute of the population; it describes variation of \(X\) within the population.

The standard deviation for the sample mean, on the other hand, represents how variable the estimated mean is from sample to sample. It will depend on the sample design and sample size. To avoid confusion with the population standard deviation, it is referred to as the standard error.
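
To make the distinction concrete, here is a rough worked relation. It assumes simple random sampling and ignores the finite population correction; the symbol \(n\) for the sample size is my notation, not something defined above:
\[
\mathrm{SE}(\overline{x}) \approx \frac{S_x}{\sqrt{n}}
\quad\Longleftrightarrow\quad
S_x \approx \sqrt{n}\,\mathrm{SE}(\overline{x})
\]
For example, taking the SD of 2.81 and the 684 observations reported in #3 for ages 20-29, the standard error under simple random sampling would be roughly
\[
\frac{2.81}{\sqrt{684}} \approx 0.107
\]
which is of the same order as the linearized standard error of about 0.101 shown there. The two need not agree, because the linearized value reflects the actual survey design.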

Your current question:

Now you are asking for the "standard deviation" of "count data" in age categories. For "count data" there are two related population parameters for a category \(j\): one is the number of people in the category, \(N_j\), and the other is the proportion in the category, \(P_j=N_j/N \), where \(N\) is the total count in the population. The estimates for these are given by svy: tabulate. You obviously are interested in the "count", as you use that option.

The wording of your question implies (by analogy with your previous post) that you are really interested in a population standard deviation associated with the count. Is this so? If this is the case, then it is easy to derive.

The population count for category \(j\) is \(T_j\), defined as:
\[
T_j =
\sum_{i=1}^N Y_i^j
\]
where
\[
Y_i^j =
\begin{cases}
1 & \text{element \(i\) is in category \(j\)} \\
0 & \text{element \(i\) is not in category \(j\)}
\end{cases}
\]
The \(Y_i^j\) is the individual "count" variable. The population standard deviation of the \(Y_i^j\) is:
\[
S_j = P_j \times (1 - P_j)
\]
This is the same as the true SD for a theoretical binomial random variable with probability of success \( P_j\). In Stata , you can compute these values in many ways. Here is one:

Code:

webuse nhanes2, clear
svy , subpop(if region==2): prop agegrp
mata:
m = st_matrix("e(b)")'
sd = diagonal(sqrt(diag(m)-diag(m*m')))
sd
end
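
An equivalent way to write the Mata step, offered only as a sketch, is to work elementwise on the vector of estimated proportions instead of building diagonal matrices. The names p and sd are arbitrary; the calculation is sqrt(P_j * (1 - P_j)) for each category.

```stata
webuse nhanes2, clear
svy, subpop(if region==2): proportion agegrp

mata:
// column vector of the estimated proportions, one per age group
p = st_matrix("e(b)")'
// elementwise sqrt(P_j * (1 - P_j))
sd = sqrt(p :* (1 :- p))
sd
end
```

The colon operators (:* and :-) apply the arithmetic element by element, so no matrix products are needed.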

Last edited by Steve Samuels; 18 Jan 2015, 14:45.

Steve Samuels
Statistical Consulting

Stata 14.2

ritupuri

Join Date: Apr 2014
Posts: 12

#13

18 Jan 2015, 15:42

Hello Steve Samuels,

Thank you for the detailed explanation.

I am trying to get the SD for the count in EACH group,

i.e., the SD for 8543268 (age group 20-29), the SD for 6021114 (age group 30-39), the SD for 5352602 (age group 40-49), the SD for 4194627 (age group 50-59), and so on.

Based on your post, I think I am getting the SD for the proportions.

But how would I get the SD for the actual count?

I ran the following code:

Code:

webuse nhanes2, clear
svy , subpop(if region==2): tabulate  agegrp, count  se format (%7.0f)
svy , subpop(if region==2): prop agegrp
mata:
m = st_matrix("e(b)")'
sd = diagonal(sqrt(diag(m)-diag(m*m')))
sd
end


**** THE RESULTS ARE AS FOLLOWS

. svy , subpop(if region==2): tabulate  agegrp, count  se format (%7.0f)
(running tabulate on estimation sample)

Number of strata   =         8                  Number of obs      =      2774
Number of PSUs     =        16                  Population size    =  29163797
Subpop. no. of obs =      2774
Subpop. size       =  29163797
Design df          =         8


Age Group       count          se

20-29     8543268      402356
30-39     6021114      451462
40-49     5352602      349041
50-59     4194627      252123
60-69     3712643      174287
70+     1339543      137739
           
Total    29163797            

Key:  count     =  weighted counts
se        =  linearized standard errors of weighted counts

Note: 23 strata omitted because they contain no subpopulation members.

. svy , subpop(if region==2): prop agegrp
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =       8        Number of obs    =      2774
Number of PSUs   =      16        Population size  =  29163797
Subpop. no. obs  =      2774
Subpop. size     =  29163797
Design df        =         8

_prop_1: agegrp = 20-29
_prop_2: agegrp = 30-39
_prop_3: agegrp = 40-49
_prop_4: agegrp = 50-59
_prop_5: agegrp = 60-69
_prop_6: agegrp = 70+


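--------------------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
agegrp       |
     _prop_1 |   .2929409   .0130771      .2637176    .3239775
     _prop_2 |   .2064585   .0121694      .1798013    .2359311
     _prop_3 |   .1835358   .0111767      .1591498    .2107221
     _prop_4 |   .1438299   .0095164       .123246    .1671961
     _prop_5 |   .1273031   .0071124       .111784     .144626
     _prop_6 |   .0459317   .0047341      .0361695    .0581697
--------------------------------------------------------------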

Note: 23 strata omitted because they contain no subpopulation
members.

. mata:
------------------------------------------------- mata (type end to exit) -----
: m = st_matrix("e(b)")'

: sd = diagonal(sqrt(diag(m)-diag(m*m')))

: sd
                 1
    +---------------+
  1 |  .4551115421  |
  2 |   .404763378  |
  3 |     .3871052  |
  4 |  .3509172041  |
  5 |  .3333122444  |
  6 |  .2093370152  |
    +---------------+

: end
    

. 
end of do-file

Thank you very much for your time.

Ritu


Steve Samuels

Join Date: Mar 2014

Posts: 1786
#14

18 Jan 2015, 18:02

Correction: The last equation, for the SD of the \(Y_i^j\), omitted the square root. It should be:
\[
S_j = \sqrt{P_j \times (1 - P_j)}
\]
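
For readers who want the missing step, the corrected formula follows directly from the definition of the population SD, because each \(Y_i^j\) is 0 or 1, so \((Y_i^j)^2 = Y_i^j\) and the population mean of the \(Y_i^j\) is \(P_j\):
\[
S_j^2 = \frac{1}{N}\sum_{i=1}^N \left(Y_i^j - P_j\right)^2
= \frac{1}{N}\sum_{i=1}^N Y_i^j - P_j^2
= P_j - P_j^2
= P_j (1 - P_j)
\]
As a numeric check using the first estimated proportion reported in #13:
\[
\sqrt{0.2929409 \times (1 - 0.2929409)} \approx 0.4551
\]
which agrees with the first entry of the Mata output in that post.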

Last edited by Steve Samuels; 18 Jan 2015, 18:06.

Steve Samuels
Statistical Consulting

Stata 14.2

Steve Samuels

Join Date: Mar 2014
Posts: 1786

#15

18 Jan 2015, 18:07

You have not read my post carefully. The point was that the only standard deviation of an estimate is its standard error. To avoid confusion between the standard deviation of a population of observations and that of a statistic like the mean or an estimated count, we always refer to the latter as the standard error. This is explained in every statistics text. The standard errors for your problem are contained in the results of your last reply.

Code:

svy linear, subpop(if region==2): tab agegrp ,count se format(%7.0f)
...

Age Group  count          se

20-29     8543268      402356
30-39     6021114      451462
40-49     5352602      349041
50-59     4194627      252123
60-69     3712643      174287
70+       1339543      137739
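
If you want those standard errors in a matrix rather than reading them off the table, one possible route (a sketch, not the only way) is to estimate the counts as survey totals of 0/1 indicator variables. The indicators agr1-agr6 are created with tab, gen() as in #3; the estimated totals and their linearized standard errors should correspond to the count and se columns above, but please verify that on your own data.

```stata
webuse nhanes2, clear

* one 0/1 indicator per age group: agr1, ..., agr6
quietly tabulate agegrp, generate(agr)

* the total of an indicator is the estimated count in that category
svy linear, subpop(if region==2): total agr1-agr6

* estimated counts, then standard errors as square roots of diag(e(V))
matrix list e(b)
mata: sqrt(diagonal(st_matrix("e(V)")))
```

Because the results are left in e(b) and e(V), they can be picked up for further calculations without retyping numbers from the output.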

If, in your earlier post, you wanted the "standard deviation" for the mean, then my answer was not correct, and the proper quantity is also the standard error.

Code:

webuse nhanes2, clear
svy, subpop(if region==2): mean loglead

Survey: Mean estimation

Number of strata =       8        Number of obs    =      1319
Number of PSUs   =      16        Population size  =  13933777
                                  Subpop. no. obs  =      1319
                                  Subpop. size     =  13933777
                                  Design df        =         8

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     loglead |   2.610591   .0365134      2.526391    2.694791

Last edited by Steve Samuels; 18 Jan 2015, 18:42.

Steve Samuels
Statistical Consulting

Stata 14.2
