Create a rate variable for population as a whole with sample weights

Julia Beach

Join Date: Apr 2019

Posts: 6
#1

23 Apr 2019, 03:50

Dear Stata users,

I am new to Stata and am working on a dataset of 20 million people over two years, with a zipcode for each person that identifies the municipality they live in.

I have the following question:
I want to create a variable that gives me the literacy rate for each municipality (zipcode).
I have a dummy variable that tells me if an individual is literate and I have the total number of people that live in each municipality; I also have a weight for sample expansion (since not the whole population is represented in my dataset).

I tried creating said variable with the following code, but if I summarize my literacy rate variable it does not give me a reasonable number (between 0 and 1).

egen literation = sum(literacy) if literacy == 1, by(zipcode)

gen literacy_rate = literation/popcount

My question is: how do I include the sample weights into the literacy_rate variable?

Thanks in advance for tips where I can read up on this.

Last edited by Julia Beach; 23 Apr 2019, 04:22.
Clyde Schechter

Join Date: Apr 2014

Posts: 29949
#2

23 Apr 2019, 10:54

"I also have a weight for sample expansion (since not the whole population is represented in my dataset)."

You need to explain how this weight was arrived at. The commonest situation is where the weight for an observation is the inverse of the probability that that observation would be included in the sample. In effect, if a person has a 1 in 500 chance of being included in the sample, then the weight is 500. One might interpret it as saying that that person "represents" 500 others. This is what Stata commands call a "pweight." If that is what you have you can do this:

Code:

by zipcode, sort: egen literate_count = total(weight_variable*literacy)
gen literacy_rate = literate_count/popcount

Note: I have made up a name for the variable containing the weight. Modify the code accordingly.

But the weight you have in your data might be something else. So you need to consult the survey documentation to see what they did to create it. If it's not a "pweight," post back with more information.
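
If the weights within a zipcode do not add up exactly to popcount, the ratio above can fall outside the 0 to 1 range. A minimal sketch of a self-normalizing alternative divides by the sum of the weights instead. It assumes literacy is coded 1 = literate and 0 = not literate. weight_variable is the same made-up name as above, and literate_wt, total_wt and literacy_rate_w are hypothetical names chosen for illustration.

```stata
* Sketch only: weighted literacy rate per zipcode, normalized by the sum of weights.
* ASSUMPTION: literacy is coded 1 = literate, 0 = not literate.
* ASSUMPTION: weight_variable is a placeholder for the real weight variable.

* Weighted count of literate people in each zipcode
bysort zipcode: egen double literate_wt = total(weight_variable * literacy)

* Sum of weights in each zipcode, skipping observations with missing literacy
bysort zipcode: egen double total_wt = total(weight_variable * !missing(literacy))

* Weighted share of literate people, which lies between 0 and 1 for nonnegative weights
gen double literacy_rate_w = literate_wt / total_wt
```

Whether this or division by popcount is the better choice depends on how the weights were constructed, so the survey documentation remains the place to check.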

Also, in the future, to assure that any code that is suggested will be consistent with all the aspects of your data, you should post an excerpt from your data set. And to assure that that excerpt is usable and has all the necessary information, you should use the -dataex- command to do that. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

Julia Beach

Join Date: Apr 2019
Posts: 6
#3

23 Apr 2019, 11:26

Code:

clear
input byte literacy long(popcount2000 zipcode) double sample_weight
1   216429 3131307  6.90218675
1   764879 3303500  11.0230544
1    80371 2107506  9.85310828
1 10499133 3550308  7.85093552
2    43449 1505064  9.32490795
2   342264 1100205 10.89297276
1  1437190 2611606  5.83785211
1   454871 4113700  9.67452298
2    26060 5218300 10.58925016
1   171734 4314100 10.23339199
2     9044 3113503  1.90888907
1    41391 4207502  11.9633476
1   548637 3118601  9.25763846
2    30376 4311007  9.95616683
2    10703 3109709  2.56684512
1    16301 3556800  9.41615027
1   817444 2704302  8.95315701
end

Thank you for helping me, Clyde.
The weight is a household weight that was included in the census data I am using. That's all the info I have - is it even possible to use this for individual level characteristics?

Clyde Schechter

Join Date: Apr 2014

Posts: 29949
#4

23 Apr 2019, 11:49

Applying a household weight to individual level data is problematic. Households differ in size and unless your sample is guaranteed to contain every member of any household that is included at all, you will get biased results. Do you also have a household size variable and a household identifier? If so, you can actually calculate an individual-level weight:

Code:

by hhid, sort: gen hh_fraction = _N/household_size
gen individual_weight = sample_weight/hh_fraction
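
To see what this adjustment does, here is a small made-up example. The three observations and their values are invented purely for illustration; only the variable names hhid, household_size and sample_weight follow the code above.

```stata
* Toy data, invented for illustration only
clear
input long hhid byte household_size double sample_weight
1 4 10
1 4 10
2 1  8
end

* Share of each household's members that appear in the sample
by hhid, sort: gen double hh_fraction = _N/household_size

* Scale the household weight up by the inverse of that share
gen double individual_weight = sample_weight/hh_fraction

* Household 1: 2 of 4 members sampled, so hh_fraction = 0.5 and the weight becomes 20
* Household 2: 1 of 1 members sampled, so hh_fraction = 1 and the weight stays 8
list, sepby(hhid)
```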

One other thing, in the response I gave in #2, I assumed that the literacy variable was coded 1 = Yes, 0 = No. But in your example I see that the codes are 1 and 2 (and I don't know which is which). So the code in #2 is not correct. In general, in Stata, the best coding of yes/no variables is 1 = yes and 0 = no. Unless you have a compelling reason to keep the original coding, I would recode all these variables to 1/0 before doing analyses. It will greatly simplify your life in Stata.
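
A minimal sketch of such a recode follows. Which code means "literate" is not known from the example, so the mapping below is an assumption that must be checked against the codebook. The new variable name literate is made up for illustration.

```stata
* ASSUMPTION: 1 = literate, 2 = not literate. Verify in the codebook first.
* Creates a 0/1 indicator and leaves missing values of literacy as missing.
gen byte literate = (literacy == 1) if !missing(literacy)

* Cross-check the new coding against the original
tabulate literacy literate, missing
```

Creating a new variable rather than overwriting literacy keeps the original coding available for checking.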
Julia Beach

Join Date: Apr 2019

Posts: 6
#5

23 Apr 2019, 11:54

Great, thank you very much, Clyde, appreciate your time!
I will try and implement the individual-level weight and then follow the code you provided.
Julia Beach

Join Date: Apr 2019

Posts: 6
#6

23 Apr 2019, 11:55

Actually, one more thing: Am I right in assuming the household weight is a pweight? So will the individual-level weight be a pweight as well?
Clyde Schechter

Join Date: Apr 2014

Posts: 29949
#7

23 Apr 2019, 12:01

Well, the only way to be sure is to confirm it in the documentation that should have accompanied your data set (or by contacting whoever provided it). The numbers in the example you show "look like" typical pweights, but that's not dispositive. If it is US Census data, then with very high probability they are pweights, as that is what the Census Bureau normally puts in their data sets when they provide weights. But if it is from the Census Bureau, there should be documentation you can check.