Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Summing total students based on different age cateogry and education level

    Hello,

    the line column gives you full time and part time student numbers depending on the column lstudy which has three different values for undergraduate , first time professional and graduate. The vlaue of lstudy ==2 doesn't show up in this data sample because this unitid doesn't have lstuddy==2.

    how can i sum the number of students for different age category of different values of lstudy for each unique unitid ?

    label def label_line 1 "Full-time, under 18", modify: This line indicates that the label "Full-time, under 18" corresponds to the row representing the number of full-time students under 18 years old in that unique unitid.

    label def label_line 3 "Full-time, 20-21", modify: This label represents the row for the number of full-time students aged 20-21 years old.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long unitid byte(line lstudy xefage01) int efage01 byte xefage02 int efage02
    100654  1 1 10   36 10   62
    100654  2 1 10  814 10  886
    100654  3 1 10  617 10  763
    100654  4 1 10  366 10  348
    100654  5 1 10  139 10  103
    100654  6 1 10   31 10   51
    100654  7 1 10   11 10   22
    100654  8 1 10   15 10   37
    100654  9 1 10    2 10    5
    100654 10 1 10    2 10    0
    100654 11 1 10    3 10    1
    100654 12 1 10 2036 10 2278
    100654 13 1 10    2 10    0
    100654 14 1 10   13 10   10
    100654 15 1 10   26 10   25
    100654 16 1 10   38 10   40
    100654 17 1 10   31 10   32
    100654 18 1 10   16 10   17
    100654 19 1 10    8 10   25
    100654 20 1 10   19 10   25
    100654 21 1 10    6 10   10
    100654 23 1 10    5 10    9
    100654 24 1 10  164 10  193
    100654 25 1 10 2200 10 2471
    100654  3 3 10    0 10    1
    100654  4 3 10   47 10   92
    100654  5 3 10   46 10  115
    100654  6 3 10   30 10   56
    100654  7 3 10   15 10   25
    100654  8 3 10   24 10   52
    100654  9 3 10    7 10   12
    100654 10 3 10    0 10    1
    100654 11 3 10    1 10    7
    100654 12 3 10  170 10  361
    100654 15 3 10    1 10    0
    100654 16 3 10   18 10   47
    100654 17 3 10   56 10  112
    100654 18 3 10   41 10   84
    100654 19 3 10   32 10   62
    100654 20 3 10   36 10  113
    100654 21 3 10    7 10   36
    100654 23 3 10    2 10    0
    100654 24 3 10  193 10  454
    100654 25 3 10  363 10  815
    end
    label values line label_line
    label def label_line 1 "Full-time, under 18", modify
    label def label_line 2 "Full-time, 18-19", modify
    label def label_line 3 "Full-time, 20-21", modify
    label def label_line 4 "Full-time, 22-24", modify
    label def label_line 5 "Full-time, 25-29", modify
    label def label_line 6 "Full-time, 30-34", modify
    label def label_line 7 "Full-time, 35-39", modify
    label def label_line 8 "Full-time, 40-49", modify
    label def label_line 9 "Full-time, 50-64", modify
    label def label_line 10 "Full-time, 65 and over", modify
    label def label_line 11 "Full-time, age unknown", modify
    label def label_line 12 "Total full-time", modify
    label def label_line 13 "Part-time, under 18", modify
    label def label_line 14 "Part-time, 18-19", modify
    label def label_line 15 "Part-time, 20-21", modify
    label def label_line 16 "Part-time, 22-24", modify
    label def label_line 17 "Part-time, 25-29", modify
    label def label_line 18 "Part-time, 30-34", modify
    label def label_line 19 "Part-time, 35-39", modify
    label def label_line 20 "Part-time, 40-49", modify
    label def label_line 21 "Part-time, 50-64", modify
    label def label_line 23 "Part-time, age unknown", modify
    label def label_line 24 "Total part-time", modify
    label def label_line 25 "Grand total, all students", modify
    label values lstudy label_lstudy
    label def label_lstudy 1 "Undergraduate", modify
    label def label_lstudy 3 "Graduate", modify

  • #2
    I'm sorry, but I don't understand how what you want relates to the data you have. In the example data shown, the combination of unitid, line, and lstudy uniquely identifies observations. Then you have these other variables, xefage01, efage01, xefage02, and efage02, which you do not explain and whose names do not suggest to me what they might be (except that perhaps they have something to do with age.) At a guess, do those variables give number of students who are in the unitid, line, and lstudy grouping and have some additional attributes such that the definitions of these *age* variables make them mutually exclusive and exhaustive? If so, then I think all you need is:
    Code:
    egen wanted = rowtotal(*age*)
    If that doesn't do what you want, please post back with a clearer explanation of how you might derive what you are looking for from the data you have if you were doing it by hand.

    Comment


    • #3
      I'm sorry. Let me gather my thoughts and get back to you with clear explanation of what I'm trying to do. apologies this the data from 2002-2008 is much chearly defined and I dealth it with myself but this data is from 2001. Unfortunately this is all over the place

      Comment


      • #4

        efage01 stabnds for number of male students in that age category and efage02 stands for number of female students in that row or age category.
        xefage01 and xefage02 are not necessary for this calculation.


        Code:
        egen wanted = total(efage09), by(efbage unitid)
        
        tabdisp efbage unitid, c(wanted)
        
        -----------------------------------------
        |                unitid
        efbage | 100654 100663
        -------------------------+---------------
        All age categories total | 12152 33122
        Age under 25 total | 8772 19074
        Age under 18 | 1886 22
        Age 18-19 | 3398 4652
        Age 20-21 | 1778 6102
        Age 22-24 | 1710 8298
        Age 25 and over total | 2670 14048
        Age 25-29 | 1096 6954
        Age 30-34 | 520 2956
        Age 35-39 | 328 1578
        Age 40-49 | 524 899
        Age 50-64 | 198
        Age 65 and over | 4
        Age unknown | 710
        -----------------------------------------

        I want to get the total number of students ( male + female ) for specified age category. The trick part here the line column (besides the age category ) includes the part time and full time here so that makes it difficult. Otherwise i cold have used the following command

        Code:
        egen wanted = total(efage09), by(efbage unitid)
        Last edited by Tariq Abdullah; 06 Apr 2024, 23:06.

        Comment


        • #5
          This is my another post where my dataset way a little more clean. This particular data posted here is little all over the place.

          https://www.statalist.org/forums/for...t-age-category

          Comment


          • #6
            For the dataset in my another post this is what helped me to get the eventual form I wanted to make for my intended goal of all the ages for each unique unit id

            Code:
            drop if               efbage == 1
            drop if               efbage == 2 
            
            drop if               efbage == 7 
            
            
            * Sort the data by unitid and efbage
            
            sort unitid efbage
            
            *Create a new variable to store the sum for each unitid and age category
            
            egen sum_efage09 = total(efage09), by(unitid efbage)
            
            
            * Now, for each unique unitid, calculate the sum of efage09 for each age category
            
            bysort unitid: egen total_age_under_18 = total(efage09) if efbage == 3 | efbage == 4
            
            
            bysort unitid: egen total_age_20_to_24 = total(efage09) if efbage == 5 | efbage == 6
            
            
            bysort unitid: egen total_age_25_to_29 = total(efage09) if efbage == 8
            
            
            bysort unitid: egen total_age_30_to_34 = total(efage09) if efbage == 9
            
            
            bysort unitid: egen total_age_35_to_39 = total(efage09) if efbage == 10
            
            
            bysort unitid: egen total_age_40_to_49 = total(efage09) if efbage == 11
            
            
            bysort unitid: egen total_age_50_to_64 = total(efage09) if efbage == 12
            
            
            bysort unitid: egen total_age_65_and_over = total(efage09) if efbage == 13
            
            
            drop efbage            lstudy efage09
            drop                      sum_efage09                   
            *year
            
            
            collapse (sum) total_age_under_18 total_age_20_to_24 total_age_25_to_29 total_age_30_to_34 total_age_35_to_39 total_age_40_to_49 total_age_50_to_64 total_age_65_and_over, by(unitid)

            Comment


            • #7
              The main problem here is that you don't have a variable that is purely age group. You have it inextricably attached to full-time/part-time status. Separating those working only from the numeric codes in label_line would be feasible, but tedious and error-prone. Fortunately, the labels have a highly regular structure, with the age group offset from the full/part-time status part by a comma. So converting line to a string variable makes it simpler here.

              The observations where the value of line is a total or grand total are also in the way and have to be excluded from the calculations to avoid double counting. (More on this below.)


              Starting from the data example originally posted in #1:
              Code:
              decode line, gen(edit)
              replace edit = "" if strpos(lower(edit), "total")
              split edit, parse(",") gen(edits)
              drop edit edits1
              rename edits2 age_group
              by age_group, sort: egen wanted = total(efage01 + efage02)
              replace wanted = . if missing(age_group)
              sort unitid lstudy line
              Note: This Stata data set is organized like a spreadsheet, and that contributes to the difficulties here. It is also a trap, setting you up for mistakes when you try to calculate things. In particular, a Stata data set should never contain both raw data observations and summary observations like totals and subtotals, because then you have to exclude those totals and subtotals from all the calculations in order to avoid double-counting things. (That -replace edit = "" if ...- command is a workaround for this problem in this case.)

              Comment


              • #8
                Mr. Clyde,

                Thanks so much for helping me neavigate these very complicated step. I was simply at a loss how to proceed with these messy data and your coding really helped resolving the issue.

                Again, much appreciate the mentoring and kindness!

                Comment

                Working...
                X