getting weighted count/sum of a collapsed data set

Mona Elsayed

Join Date: May 2020
Posts: 26

getting weighted count/sum of a collapsed data set

10 Aug 2022, 07:35

Hello everyone,

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str23 indid long four_digit byte usempstp float expan_indiv
"9801000101" 7212 1 2046.778
"9801000304" 7231 1 2046.778
"9801000308" 2330 1 2046.778
"9801000401" 5414 1 2046.778
"9801000501" 8121 1 2046.778
"9801000601" 7233 1 2046.778
"9801000701" 7114 1 2046.778
"9801000901" 9613 1 2046.778
"9801001003" 3119 1 2046.778
"9801001004" 2341 1 2046.778
end
label values four_digit four_digit
label def four_digit 2330 "Secondary Education Teachers", modify
label def four_digit 2341 "Primary School Teachers", modify
label def four_digit 3119 "Physical and Engineering Science Technicians Not Elsewhere Classified", modify
label def four_digit 5414 "Security Guards", modify
label def four_digit 7114 "Concrete Placers, Concrete Finishers and Related Workers", modify
label def four_digit 7212 "Welders and Flame Cutters", modify
label def four_digit 7231 "Motor Vehicle Mechanics and Repairers", modify
label def four_digit 7233 "Agricultural and Industrial Machinery Mechanics and Repairers", modify
label def four_digit 8121 "Metal Processing Plant Operators", modify
label def four_digit 9613 "Sweepers and Related Labourers", modify
label values usempstp empstat
label def empstat 1 "Waged employee", modify

Above is a sample of the data I have for a specific year. It is individual level data with individual Id "indid", his or her four-digit occupation "four_digit", his/her employment status "usempstp" and the expansion weight for the data "expan_indiv". I want to collapse the data by the "four_digit" variable to get the weighted total number of individuals working in each occupation (Note: all observations have a value of 1 for the "usempstp" variable) . I ran the following command:

Code:

collapse (sum) usempstp [aw=expan_indiv], by(four_digit)

However, the collapsed data set gives me unweighted number of individuals for each occupation and I can't figure out why

Tags: None

Joro Kolev

Join Date: Aug 2018

Posts: 3047
#2

10 Aug 2022, 09:18

Can you explain what outcome you are expecting here?
Comment

Mona Elsayed

Join Date: May 2020
Posts: 26

10 Aug 2022, 09:48

I am posting here another sample of the data set as the first one accidently contains observations having the same weight.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str23 indid long four_digit byte usempstp float expan_indiv
"18120100021_180201" 7131 1  7067.008
"18120100021_180301" 7115 1  7890.721
"1812010003101"      8331 1  6896.637
"1812010003103"      7512 1  6909.501
"1812010004101"      1120 1  6422.304
"1812010006101"      7114 1  6046.272
"1812010008103"      5242 1  6390.676
"1812010010101"      3113 1  6724.058
"1812010011101"      7111 1  6968.064
"1812010012101"      7411 1  1230.698
"1812010013101"      1420 1  2180.671
"1812010014101"      7212 1 2917.9604
"1812010015102"      8322 1 1725.8827
"1812010017101"      7115 1 2041.8577
"18120100171_180301" 2651 1  3415.242
"18120100171_180501" 8114 1   1295.58
"1812010024101"      8152 1 1729.2283
"18120100241_180702" 3434 1  866.2662
"1812010025101"      3122 1 3541.7195
"1812010026101"      7111 1  4133.024
end
label values four_digit four_digit
label def four_digit 1120 "Managing Directors and Chief Executives", modify
label def four_digit 1420 "Retail and Wholesale Trade Managers", modify
label def four_digit 2651 "Visual Artists", modify
label def four_digit 3113 "Electrical Engineering Technicians", modify
label def four_digit 3122 "Manufacturing Supervisors", modify
label def four_digit 3434 "Chefs", modify
label def four_digit 5242 "Sales Demonstrators", modify
label def four_digit 7111 "House Builders", modify
label def four_digit 7114 "Concrete Placers, Concrete Finishers and Related Workers", modify
label def four_digit 7115 "Carpenters and Joiners", modify
label def four_digit 7131 "Painters and Related Workers", modify
label def four_digit 7212 "Welders and Flame Cutters", modify
label def four_digit 7411 "Building and Related Electricians", modify
label def four_digit 7512 "Bakers, Pastry-cooks and Confectionery Makers", modify
label def four_digit 8114 "Cement, Stone and Other Mineral Products Machine Operators", modify
label def four_digit 8152 "Weaving and Knitting Machine Operators", modify
label def four_digit 8322 "Car, Taxi and Van Drivers", modify
label def four_digit 8331 "Bus and Tram Drivers", modify
label values usempstp empstat
label def empstat 1 "Waged employee", modify

Comment

Mona Elsayed

Join Date: May 2020

Posts: 26
#4

10 Aug 2022, 10:01

Thanks for your reply Joro!
I am expecting to get a weighted sum of the "usempsp" variable for each occupation. I found the new data is the same as the one I get when I collapse the data without using the weighting option, although my data have different values in the weighting variable.
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#5

10 Aug 2022, 13:47

When you say a weighted sum, do you mean the sum of useempsp*weight for each occupation?
If you read look at the help file for collapse the section on weights offers information on how different weights are handled for sum.

Code:

help collapse

sum:
unweighted: sum(x_j), the sum of x_j over observations in group i
aweight: sum(v_j*x_j) over observations in group i; v_j = weights normalized to sum to N_i
fweight, iweight, pweight: sum(w_j*x_j) over observations in group i

Based on this, if I am guessing correctly about what output you are expecting, specifying your weight as a pweight rather than an aweight should get you what you want.
Comment
Mona Elsayed

Join Date: May 2020

Posts: 26
#6

10 Aug 2022, 14:44

Yes Sarah, this is what I want to get. I actually tried pw and it gives me a weighted result. However, pw expands to the population size. The idea in aw is that it rescales the weighting variable to expand to the sample size and this this what I want. I want the sum of the "usempstp" after collapsing the data to equal my sample size not the population size.
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#7

10 Aug 2022, 15:24

The issue is that collapse is normalizing the weights for each by group to the N for that group, but you want to normalize the weight to sum to the total sample N.
It doesn't look like the options of collapse will get you what you want but you could create your own normalized weight and then use that with pweights (or multiply it out yourself before collapsing).

Code:

**manually create the normalized weight sum expan_indiv, meanonly scalar meanw=r(mean) gen normw=expan_indiv/meanw **double check that I've done this right and it sums to sample size egen check=total(normw) sum check count
1 like
Comment
Mona Elsayed

Join Date: May 2020

Posts: 26
#8

10 Aug 2022, 16:14

Many thanks for your time Sarah. It works perfectly.
Comment
Mona Elsayed

Join Date: May 2020

Posts: 26
#9

10 Aug 2022, 16:15

Many thanks for your time and help Sarah. Your code works perfectly.
Comment

Announcement

getting weighted count/sum of a collapsed data set

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment