  • Family composition data cleaning- any advice for writing fancier code?

    Hi, I have household data with household member characteristics.
    The variables look like this:

    hhid member1_sex member1_age member2_sex member2_age member3_sex member3_age...

    1000 1 15 0 25 1 35 ...
    1001 1 20 1 35 0 40 ....

    I want to create new variables that count the number of household members in each age range, separately by sex.
    For example:
    hhid male_under19 male_19-24 male_25-34 male_35-49.... female_under19 female_19-24 female_25-34...

    1000 1 0 1 2 ....
    1001 0 0 1 1 ....
    1002 1 0 2 1 ....

    Can you give me some advice on how to write short, clean code for this?
    I think I could write long and messy code to do it, but I am pretty sure there is a better way.

    Thanks in advance!
    Sunny Jaiwon Lee



  • #2
    I'm afraid I don't know of a nice, elegant solution here. The best I can think of is to use reshape to avoid nested for-loops. The code below appears to work for your short example data, but it is hard to test properly on such a limited case. Your expected result also doesn't appear to follow from your data example. You should really provide data examples generated with -dataex-, and if you want to show us what the result should look like (much appreciated most of the time), it should ideally match your data example.

    All that said, I think you are looking for something like this:

    Code:
    clear
    input int(hhid member1_sex member1_age member2_sex member2_age member3_sex member3_age)
    1000 1 15 0 25 1 35
    1001 1 20 1 35 0 40
    end
    
    rename (member*_age) (member_age*)
    rename (member*_sex) (member_sex*)
    
    reshape long member_age member_sex, i(hhid) j(memnum)
    
    bysort hhid: egen male_19_or_under = total(member_age < 20 & member_sex == 1)
    bysort hhid: egen female_19_or_under = total(member_age < 20 & member_sex == 0)
    
    bysort hhid: egen male_20_24 = total(member_age >= 20 ///
                      & member_age < 25 & member_sex == 1)
    bysort hhid: egen female_20_24 = total(member_age >= 20 ///
                      & member_age < 25 & member_sex == 0)
    
    local range = 10
    // stop at 75 so the last 10-year band is 75-84; ages 85+ are counted separately below
    forv lower = 25(`range')75 {
        local upper = `lower' + `range'
        local upper_m_1 = `upper' - 1
        bysort hhid: egen male_`lower'_`upper_m_1' = total(member_age >= `lower' ///
                      & member_age < `upper' & member_sex == 1)
        bysort hhid: egen female_`lower'_`upper_m_1' = total(member_age >= `lower' ///
                      & member_age < `upper' & member_sex == 0)
    }
    bysort hhid: egen male_85_or_older = total(member_age >= 85 & member_sex == 1)
    bysort hhid: egen female_85_or_older = total(member_age >= 85 & member_sex == 0)
    
    reshape wide member_age member_sex, i(hhid) j(memnum)
    
    list hhid male_19_or_under male_20_24 male_25_34 male_35_44
    But maybe someone knows of something more elegant. I'm a little surprised you have the age range 19-24 (should be 20-24?) for the first range, since every subsequent age range is 10 years. Seems to me that might make it a little more difficult to compare across categories.

    ... actually now that I take another look, it looks like you want a category male_35-49. 15 years? Is that a typo?
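
    If the bands end up irregular (as your 19-24 and 35-49 examples suggest), one possible variation is to put the cutpoints in a single irecode() call and loop over a list of name suffixes. This is only a sketch under assumptions: I am assuming sex == 1 means male, the cutpoints 19/24/34/44 and the open-ended 45_or_older band are shortened placeholders for illustration, and band is a helper variable I made up. Adjust the cutpoints and names to whatever ranges you actually need.

    ```stata
    clear
    input int(hhid member1_sex member1_age member2_sex member2_age member3_sex member3_age)
    1000 1 15 0 25 1 35
    1001 1 20 1 35 0 40
    end

    // @ marks where the member number sits, so no rename is needed first
    reshape long member@_sex member@_age, i(hhid) j(memnum)

    // band = 0 if age <= 19, 1 if 20-24, 2 if 25-34, 3 if 35-44, 4 if 45+
    // (assumed cutpoints; band is missing when age is missing)
    gen byte band = irecode(member_age, 19, 24, 34, 44)

    // one suffix per band, in the same order as the irecode() cutpoints
    local names 19_or_under 20_24 25_34 35_44 45_or_older
    local b = 0
    foreach nm of local names {
        // assumes member_sex == 1 is male and == 0 is female
        egen male_`nm'   = total(band == `b' & member_sex == 1), by(hhid)
        egen female_`nm' = total(band == `b' & member_sex == 0), by(hhid)
        local ++b
    }

    drop band
    reshape wide   // back to one row per household

    list hhid male_* female_*
    ```

    The only things to edit when the ranges change are the irecode() cutpoints and the names list, which have to stay in the same order.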
