interpreting results of svy command in STATA

Ishana Balan

Join Date: Jan 2015

Posts: 28
#1

04 Oct 2024, 00:48

Hi all,

Maybe this is a very silly question. I am trying to generate population estimates with sample weights using the svy command in Stata.

I used the following code.

Code:

svyset id [pweight=weight]
svy: total id

and got the following output.

Number of strata =     1          Number of obs   =      8,245
Number of PSUs   = 8,245          Population size = 36,385,946
                                  Design df       =      8,244

--------------------------------------------------------------
             |             Linearized
             |      Total   std. err.     [95% conf. interval]
-------------+------------------------------------------------
          id |   3.64e+14    2.34e+12     3.60e+14    3.69e+14
--------------------------------------------------------------

The number in the table 3.64e+14 matches the number in population size (36,385,946), which is what one would expect. My problem is when I do this with a different round of data, I get the following output.

Number of strata =     1          Number of obs   =      7,859
Number of PSUs   = 7,859          Population size = 41,786,760
                                  Design df       =      7,858

--------------------------------------------------------------
             |             Linearized
             |      Total   std. err.     [95% conf. interval]
-------------+------------------------------------------------
          id |   7.11e+14    7.34e+12     6.97e+14    7.25e+14
--------------------------------------------------------------

Why do the number in the table (7.11e+14) and the population size (41,786,760) not match? For context, this is the NHATS data, and 41 million matches the report. Then what is the 71 million? Is this the correct command to generate a population estimate? If both numbers were 71 million, I would certainly think so. But 41 million is the correct population.

Appreciate any inputs. Thanks in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

04 Oct 2024, 09:26

You are misreading the output, and it also appears that you do not understand the output of the -total- command.

The -total- command calculates the sum of all values of the variable(s) listed in the command; under -svy- with pweights, each value is multiplied by its sampling weight before summing. Your variable id is, apparently, a numeric variable. So -total- is adding up all the (weighted) id numbers in the data. Unless the id variable is actually always = 1 in every observation (in which case, it seems like a very odd variable to call id), there is no reason that this number should match, or even be related in any clear way to, the population.

Next, you are misreading the output. Even in your first example, where you were satisfied that the total matched the population, it does not. The total reported is 3.64e+14. This is not even close to a match for the approximately 36 million population figure. 3.64e+14 means 3.64 x 10¹⁴, which is 364 trillion (in US reckoning of high powers of 10). It is approximately 10,000,000 times as large as the population figure.

The same is true in your second example. The only difference here is that the leading digits of the total and the population don't match the way they did (by coincidence) in the first example.
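
To make the point about -total- concrete, here is a minimal sketch on a made-up three-observation data set. The id values and weights are invented for illustration; only the variable names id and weight come from the original post. It compares the -svy: total- point estimate with a hand-computed sum of weight times id, and with the sum of the weights that the header reports as the population size.

```stata
* Toy data: these ids and weights are made up for illustration only
clear
input long id double weight
11 100
25 200
42 300
end

* Same design declaration as in the original post
svyset id [pweight=weight]

* Point estimate is the weighted sum of the id values
svy: total id

* The same number by hand: 100*11 + 200*25 + 300*42 = 18700
generate double w_id = weight * id
quietly summarize w_id
display %15.0f r(sum)

* Sum of the weights (600), which is what "Population size" reports
quietly summarize weight
display r(sum)
```

In this toy example the total of id (18700) and the population size (600) are unrelated, which is the general situation unless id happens to equal 1 everywhere.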
Ishana Balan

Join Date: Jan 2015

Posts: 28
#3

05 Oct 2024, 01:54

Thanks Clyde! That was very stupid of me. So, I guess to generate population estimates (say over gender), we generate a variable n=1 and then use the total command. However, what still confuses me is that the match between 36 million and 3.64e+14 (despite the difference of several zeros) does not seem to be a coincidence. When I generate subtotals by age group and gender, all the leading digits exactly match the report. I wonder why. But that is probably not a Stata question.
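
The approach described here (total a constant equal to 1) might look like the sketch below. The data are made up, and the names one and female are hypothetical stand-ins for whatever the real data set uses; the -over()- option of -total- produces the subgroup estimates.

```stata
* Toy data: ids, weights, and the female indicator are invented
clear
input long id double weight byte female
11 100 0
25 200 1
42 300 1
57 400 0
end

svyset id [pweight=weight]

* A constant equal to 1 in every observation
generate byte one = 1

* Estimated population size: the sum of the weights (1000 here)
svy: total one

* Estimated population size within each group (500 and 500 here)
svy: total one, over(female)
```

Because the variable being totaled is 1 everywhere, the weighted sum reduces to the sum of the weights, so the estimate from -svy: total one- should agree with the "Population size" line in the header.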
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

05 Oct 2024, 08:23

"When I generate subtotals by age group and gender, all the lead numbers exactly match the report. I wonder why."

Here, I think, is why. The command you are using provides the weighted sum of the id numbers, which equals the estimated population size multiplied by the weighted average id. If that average is approximately a power of 10, then the leading digits of the totals will match the leading digits of the population size. I suspect that is the case. Notice that things changed when you switched to a different data set. I suspect that the average value of the id variable in that other data set is different and not close to a power of 10.
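
As a rough check of this explanation, the implied average id can be backed out from the numbers posted in #1 by dividing each total by its population size. This is only approximate arithmetic, since the totals in the output are rounded to three significant digits.

```stata
* First round: total / population size, roughly 1.00e+07 (close to a power of 10)
display %12.0fc 3.64e14 / 36385946

* Second round: total / population size, roughly 1.70e+07 (not close to a power of 10)
display %12.0fc 7.11e14 / 41786760
```

On the actual data, -svy: mean id- should report this weighted average directly, which would confirm or refute the guess.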