Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • getting weighted count/sum of a collapsed data set

    Hello everyone,

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str23 indid long four_digit byte usempstp float expan_indiv
    "9801000101" 7212 1 2046.778
    "9801000304" 7231 1 2046.778
    "9801000308" 2330 1 2046.778
    "9801000401" 5414 1 2046.778
    "9801000501" 8121 1 2046.778
    "9801000601" 7233 1 2046.778
    "9801000701" 7114 1 2046.778
    "9801000901" 9613 1 2046.778
    "9801001003" 3119 1 2046.778
    "9801001004" 2341 1 2046.778
    end
    label values four_digit four_digit
    label def four_digit 2330 "Secondary Education Teachers", modify
    label def four_digit 2341 "Primary School Teachers", modify
    label def four_digit 3119 "Physical and Engineering Science Technicians Not Elsewhere Classified", modify
    label def four_digit 5414 "Security Guards", modify
    label def four_digit 7114 "Concrete Placers, Concrete Finishers and Related Workers", modify
    label def four_digit 7212 "Welders and Flame Cutters", modify
    label def four_digit 7231 "Motor Vehicle Mechanics and Repairers", modify
    label def four_digit 7233 "Agricultural and Industrial Machinery Mechanics and Repairers", modify
    label def four_digit 8121 "Metal Processing Plant Operators", modify
    label def four_digit 9613 "Sweepers and Related Labourers", modify
    label values usempstp empstat
    label def empstat 1 "Waged employee", modify

    Above is a sample of the data I have for a specific year. It is individual level data with individual Id "indid", his or her four-digit occupation "four_digit", his/her employment status "usempstp" and the expansion weight for the data "expan_indiv". I want to collapse the data by the "four_digit" variable to get the weighted total number of individuals working in each occupation (Note: all observations have a value of 1 for the "usempstp" variable) . I ran the following command:

    Code:
    collapse (sum) usempstp [aw=expan_indiv], by(four_digit)
    However, the collapsed data set gives me unweighted number of individuals for each occupation and I can't figure out why


  • #2
    Can you explain what outcome you are expecting here?

    Comment


    • #3
      I am posting here another sample of the data set as the first one accidently contains observations having the same weight.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str23 indid long four_digit byte usempstp float expan_indiv
      "18120100021_180201" 7131 1  7067.008
      "18120100021_180301" 7115 1  7890.721
      "1812010003101"      8331 1  6896.637
      "1812010003103"      7512 1  6909.501
      "1812010004101"      1120 1  6422.304
      "1812010006101"      7114 1  6046.272
      "1812010008103"      5242 1  6390.676
      "1812010010101"      3113 1  6724.058
      "1812010011101"      7111 1  6968.064
      "1812010012101"      7411 1  1230.698
      "1812010013101"      1420 1  2180.671
      "1812010014101"      7212 1 2917.9604
      "1812010015102"      8322 1 1725.8827
      "1812010017101"      7115 1 2041.8577
      "18120100171_180301" 2651 1  3415.242
      "18120100171_180501" 8114 1   1295.58
      "1812010024101"      8152 1 1729.2283
      "18120100241_180702" 3434 1  866.2662
      "1812010025101"      3122 1 3541.7195
      "1812010026101"      7111 1  4133.024
      end
      label values four_digit four_digit
      label def four_digit 1120 "Managing Directors and Chief Executives", modify
      label def four_digit 1420 "Retail and Wholesale Trade Managers", modify
      label def four_digit 2651 "Visual Artists", modify
      label def four_digit 3113 "Electrical Engineering Technicians", modify
      label def four_digit 3122 "Manufacturing Supervisors", modify
      label def four_digit 3434 "Chefs", modify
      label def four_digit 5242 "Sales Demonstrators", modify
      label def four_digit 7111 "House Builders", modify
      label def four_digit 7114 "Concrete Placers, Concrete Finishers and Related Workers", modify
      label def four_digit 7115 "Carpenters and Joiners", modify
      label def four_digit 7131 "Painters and Related Workers", modify
      label def four_digit 7212 "Welders and Flame Cutters", modify
      label def four_digit 7411 "Building and Related Electricians", modify
      label def four_digit 7512 "Bakers, Pastry-cooks and Confectionery Makers", modify
      label def four_digit 8114 "Cement, Stone and Other Mineral Products Machine Operators", modify
      label def four_digit 8152 "Weaving and Knitting Machine Operators", modify
      label def four_digit 8322 "Car, Taxi and Van Drivers", modify
      label def four_digit 8331 "Bus and Tram Drivers", modify
      label values usempstp empstat
      label def empstat 1 "Waged employee", modify

      Comment


      • #4
        Thanks for your reply Joro!
        I am expecting to get a weighted sum of the "usempsp" variable for each occupation. I found the new data is the same as the one I get when I collapse the data without using the weighting option, although my data have different values in the weighting variable.

        Comment


        • #5
          When you say a weighted sum, do you mean the sum of useempsp*weight for each occupation?
          If you read look at the help file for collapse the section on weights offers information on how different weights are handled for sum.

          Code:
          help collapse
          sum:
          unweighted: sum(x_j), the sum of x_j over observations in group i
          aweight: sum(v_j*x_j) over observations in group i; v_j = weights normalized to sum to N_i
          fweight, iweight, pweight: sum(w_j*x_j) over observations in group i
          Based on this, if I am guessing correctly about what output you are expecting, specifying your weight as a pweight rather than an aweight should get you what you want.

          Comment


          • #6
            Yes Sarah, this is what I want to get. I actually tried pw and it gives me a weighted result. However, pw expands to the population size. The idea in aw is that it rescales the weighting variable to expand to the sample size and this this what I want. I want the sum of the "usempstp" after collapsing the data to equal my sample size not the population size.

            Comment


            • #7
              The issue is that collapse is normalizing the weights for each by group to the N for that group, but you want to normalize the weight to sum to the total sample N.
              It doesn't look like the options of collapse will get you what you want but you could create your own normalized weight and then use that with pweights (or multiply it out yourself before collapsing).

              Code:
              **manually create the normalized weight
              sum expan_indiv, meanonly
              scalar meanw=r(mean)
              gen normw=expan_indiv/meanw
              
              **double check that I've done this right and it sums to sample size
              egen check=total(normw)
              sum check
              count

              Comment


              • #8
                Many thanks for your time Sarah. It works perfectly.

                Comment


                • #9
                  Many thanks for your time and help Sarah. Your code works perfectly.

                  Comment

                  Working...
                  X