Categorizing ages from a varlist

Taylor Vail

Join Date: Oct 2018

Posts: 17
#1

05 Nov 2018, 16:46

Hello!

Each row of my dataset represents a family, and within the family I have each family member's age listed under the variables age1 - age14. What I'd like to do is summarize how many people in the total dataset are under 5 years old, how many are 6 - 10 years, 11 - 15 etc. and compile it under one new variable called agegrp. Through past posts I was able to figure out the following command to successfully generate a new variable for all individuals in the sample that are under 6, but to generate a new variable for each age group seems like an inefficient way to go about this.

egen agegrp1 = anycount(age*), values(0/5)

I feel like this is likely a job for a loop of some sort but I can't seem to figure it out. Any guidance would be greatly appreciated!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

05 Nov 2018, 18:56

Perhaps someone else will be able to address your question. I'm not quite able to figure out what you want the variable agegrp to be. It seems to me you are thinking of your data as though it were a spreadsheet, and off to the right you want to add a column with, say, the first 20 rows having counts by age group. This is not in general a productive way to use data in Stata. It's hard to figure out what to do in this case, because you don't tell us how you intend to make use of these counts by age group.

But if you get no advice, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. It would be particularly helpful to post a small sample of your data, including the family ID and a few of the age variables (maybe just age1-age5). In particular, please read FAQ #12 and use dataex when posting sample data to Statalist.
Comment
Taylor Vail

Join Date: Oct 2018

Posts: 17
#3

06 Nov 2018, 08:02

Hi William, thank you for highlighting the room for improvement in my post. Hopefully the following clarifies what I'm trying to do.

In discussing the demographics of my sample, I would just like to present a table demonstrating how many children I have under 5, how many are between the ages of 6 - 10, 11 - 15, 16 - 20 etc.

Right now, my data looks something like this, where each _id represents a family with multiple family members of varying ages:

_id age1 age2 age3 age4 age5
1 50 23 10 5 2
2 34 100 22 18 15
3 67 18 59 21 11
4 45 24 23 14 16
5 78 57 28 24 3

Ideally, I'd like one variable called agegrp, where agegrp=1 is a count of the values in age1 - age5 that are less than or equal to 5, and agegrp=2 is a count of the values in age1 through age5 that are between 6 and 10, so that I could tab agegrp and get a count of how many individuals in my sample are children under 5, children 6 to 10 years old,..., up through adults greater than 70 years old.

Using:

egen agegrp1 = anycount(age*), values(0/5)

This successfully counted all children between 0 and 5, but following this methodology, I'd need to create a new variable for each age group of interest, which seems like an inefficient way to go about this. Furthermore, my attempts at using the count function inside a loop don't work, as the count function appears to work only with explicit variables and not with variable lists.
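For readers who want to stay in the wide layout, the loop idea mentioned above can be sketched as follows. This is a minimal sketch, not a recommended solution: the variable names n0_5, n6_10, and so on are made up for illustration, the data are the five example families from this post, and it assumes ages are integers so that the numlist range `lo'/`hi' inside values() covers every age in a group.

```stata
clear
input _id age1 age2 age3 age4 age5
1 50 23 10 5 2
2 34 100 22 18 15
3 67 18 59 21 11
4 45 24 23 14 16
5 78 57 28 24 3
end

* One count variable per age group: per family, how many of
* age1-age5 lie between lo and hi inclusive (hypothetical names n<lo>_<hi>)
local lo 0
foreach hi of numlist 5(5)25 {
    egen n`lo'_`hi' = anycount(age1-age5), values(`lo'/`hi')
    local lo = `hi' + 1
}

* Column totals give the number of individuals in each group
tabstat n0_5-n21_25, statistics(sum)
```

This still creates one variable per group, so it mainly shows that the loop is possible; a single categorical variable needs the data in long layout.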

Any thoughts or perhaps something I need to clarify further?

Thank you!
Comment

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

06 Nov 2018, 08:11

First, it would be great to present data within code delimiters, so I did it for you.

Second, I gather you need to reshape long beforehand. Please see below:

Code:

. input _id age1 age2 age3 age4 age5

           _id       age1       age2       age3       age4       age5
  1. 1 50 23 10 5 2
  2. 2 34 100 22 18 15
  3. 3 67 18 59 21 11
  4. 4 45 24 23 14 16
  5. 5 78 57 28 24 3
  6. end

. reshape long age, i(_id) j(subjects)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        5   ->      25
Number of variables                   6   ->       3
j variable (5 values)                     ->   subjects
xij variables:
                     age1 age2 ... age5   ->   age
-----------------------------------------------------------------------------

. list

     +----------------------+
     | _id   subjects   age |
     |----------------------|
  1. |   1          1    50 |
  2. |   1          2    23 |
  3. |   1          3    10 |
  4. |   1          4     5 |
  5. |   1          5     2 |
     |----------------------|
  6. |   2          1    34 |
  7. |   2          2   100 |
  8. |   2          3    22 |
  9. |   2          4    18 |
 10. |   2          5    15 |
     |----------------------|
 11. |   3          1    67 |
 12. |   3          2    18 |
 13. |   3          3    59 |
 14. |   3          4    21 |
 15. |   3          5    11 |
     |----------------------|
 16. |   4          1    45 |
 17. |   4          2    24 |
 18. |   4          3    23 |
 19. |   4          4    14 |
 20. |   4          5    16 |
     |----------------------|
 21. |   5          1    78 |
 22. |   5          2    57 |
 23. |   5          3    28 |
 24. |   5          4    24 |
 25. |   5          5     3 |
     +----------------------+

recode age 0/5 = 0 6/10 = 1 11/15=2 16/20=3 21/25=4 26/max=5 , gen(agegrp)
tab agegrp

  RECODE of |
        age |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          3       12.00       12.00
          1 |          1        4.00       16.00
          2 |          3       12.00       28.00
          3 |          3       12.00       40.00
          4 |          6       24.00       64.00
          5 |          9       36.00      100.00
------------+-----------------------------------
      Total |         25      100.00

Hopefully that helps

Last edited by Marcos Almeida; 06 Nov 2018, 08:39.

Best regards,

Marcos

Comment

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

06 Nov 2018, 08:48

You may also label it in the same process:

Code:

drop agegrp
recode age (0/5 = 0 up_to_five_years) (6/10 = 1 from_6_to_10) (11/15=2 from_11_to_15) (16/20=3 from_16_to_20) (21/25=4 from_21_to_25) (26/max=5 over_26_years) , gen(agegrp)
tab agegrp

   RECODE of age |      Freq.     Percent        Cum.
-----------------+-----------------------------------
up_to_five_years |          3       12.00       12.00
    from_6_to_10 |          1        4.00       16.00
   from_11_to_15 |          3       12.00       28.00
   from_16_to_20 |          3       12.00       40.00
   from_21_to_25 |          6       24.00       64.00
   over_26_years |          9       36.00      100.00
-----------------+-----------------------------------
           Total |         25      100.00

Best regards,

Marcos

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

06 Nov 2018, 09:09

I would just add that a variable of the form

Code:

gen age5 = 5 * ceil(age/5)

returns values that are 0, 5, 10, etc. without need for multiple operations, value labels, or whatever.
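
A quick check of what this rounding does at the group boundaries, as a minimal sketch on made-up ages. The name agebin is an assumption chosen here so that it cannot clash with an existing age5 variable in the wide layout.

```stata
clear
input age
0
2
5
6
10
11
100
end

* Each age is rounded up to the next multiple of 5, so the result
* is the upper bound of its group: 1-5 -> 5, 6-10 -> 10, 11-15 -> 15
gen agebin = 5 * ceil(age/5)
list
```

Note that an age of exactly 0 stays at 0 and therefore forms its own group under this rule.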
3 likes
Comment
Taylor Vail

Join Date: Oct 2018

Posts: 17
#7

06 Nov 2018, 09:15

Amazing, this was exactly what I was looking for! Sorry for the extra trouble you had to go through to help me. Many thanks, Marcos!
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#8

27 Feb 2019, 17:53

Hi. I was wondering if there was a more efficient way to code age into groups as I have done below. To avoid the dummy trap, I need to make one group a reference group (probably the first group). Also, the last group is for 65 and above. I'm not sure if there is a convention of number of years per group, so I did 10.

gen age1524 = 0
replace age1524 = 1 if age>14 & age<25
gen age2534 = 0
replace age2534 = 1 if age>24 & age<35
gen age3544 = 0
replace age3544 = 1 if age>34 & age<45
gen age4554 = 0
replace age4554 = 1 if age>44 & age<55
gen age5564 = 0
replace age5564 = 1 if age>54 & age<65
gen age65plus = 0
replace age65plus = 1 if age>64

Thank you in advance.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#9

27 Feb 2019, 19:08

Making your own 0/1 variables is worthwhile when you are learning what dummy (indicator) variables are and how they work, but once you're past that, the best practice is to let Stata do that for you, through the mechanism of so-called factor variables, notated as "i.MyXVariable". See -help fvvarlist-. By default, Stata will omit the lowest-numbered category of a categorical variable referenced with the i.MyXVariable notation, but you can choose other categories.

And, as a side note, there isn't any "trap" regarding indicator variables. Any modern statistical software (i.e., probably anything post-1980) will drop out one of your indicator variables if you include a redundant list of them while using your do-it-yourself approach to them. It might not drop out the one you want, but it's not a trap.

A Stata-ish way to do what you want is:

Code:

// See -help recode-
recode age (15/24= 1 "15-24") (25/34=2 "25-34") ...[you fill in the rest]... (65/max = .. "65 +" ), generate(agecat)
regress y i.agecat

An example:

Code:

sysuse auto
regress price i.rep78
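
If the lowest category is not the reference group you want, the factor-variable notation lets you name the base level explicitly. A minimal sketch using the same auto data, where the choice of level 3 as the base is arbitrary and only for illustration:

```stata
sysuse auto, clear

* ib3. makes rep78 == 3 the omitted (reference) category
* instead of the lowest-numbered one
regress price ib3.rep78
```

The other four levels of rep78 are then each compared with level 3.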
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#10

28 Feb 2019, 01:28

Mike Lacy gives excellent advice. Here's some more. Purely as a matter of technique the code in #8 could be re-written as say

Code:

local HI 24 34 44 54 64 200
forval lo = 15(10)65 {
    gettoken hi HI : HI
    gen age`lo'`hi' = inrange(age, `lo', `hi')
}
rename age65200 age65plus

At the same time, ages split by arbitrary breaks are just that. Perhaps there is a parameterisation using a polynomial or spline that matches the underlying process.
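
For readers new to loops, the loop in this post can be traced on a handful of made-up ages. This is only a sketch: the display line is added purely to show which lower and upper bound are paired on each pass, and the input values are invented.

```stata
clear
input age
15
24
25
64
65
90
end

local HI 24 34 44 54 64 200
forval lo = 15(10)65 {
    * gettoken removes the first remaining upper bound from HI
    * and stores it in the local macro hi
    gettoken hi HI : HI
    display "lo = `lo', hi = `hi'"
    * inrange() returns 1 if lo <= age <= hi, and 0 otherwise
    gen age`lo'`hi' = inrange(age, `lo', `hi')
}
rename age65200 age65plus
list
```

One difference from the replace-based code in #8: inrange() returns 0 for a missing age, whereas a condition such as age>64 is true for missing values, so that code would set age65plus to 1 for them.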
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#11

05 Mar 2019, 17:34

Thank you Mike Lacy, that was very helpful. I appreciate your time.
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#12

05 Mar 2019, 17:47

Thank you very much Nick Cox. I'm just learning how to do loops ... Is "lo" "hi" there to describe low age, high age? If I wanted to add labels to each age category, should I do that as a separate line afterwards (e.g. below) or is there a way to incorporate that in the loop code you provided?

"label define agelab 1 "[1] 1-24" 2 "[2] 25-39" 3 "[3] 40-54" 4 "[4] 55-69" 5 "[5] 70-plus", modify"
"label values age agelab"
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#13

06 Mar 2019, 00:16

Defining and attaching the labels through separate commands is fine.
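
A minimal sketch of the separate-commands route, using the label text from #12. The variable name agecat and the input ages are assumptions made for illustration; the point is that the value label is attached to a variable holding the category codes 1 to 5, not to the raw age variable.

```stata
clear
input age
18
30
47
60
82
end

* Build the category codes first, then define and attach the labels
recode age (1/24 = 1) (25/39 = 2) (40/54 = 3) (55/69 = 4) (70/max = 5), generate(agecat)
label define agelab 1 "[1] 1-24" 2 "[2] 25-39" 3 "[3] 40-54" 4 "[4] 55-69" 5 "[5] 70-plus", modify
label values agecat agelab
tab agecat
```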
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#14

07 Mar 2019, 17:53

Thank you Nick Cox
Comment
Bright Tree

Join Date: Mar 2020

Posts: 85
#15

15 Apr 2020, 19:53

Dear friends,

I want to create a group variable according to which range the age falls in. My code is below,

Code:

local HI 9 19 29 39 49 59 69 79
forval lo = 0 ( 10)80 {
gettoken hi HI : HI
gen gr = 1 if age= inrange ( age, `lo ' , `hi ')
replace gr= 2 if if age= inrange ( age, `lo'+1, `hi'+1)
}}

It is like,

gen gr= 0

replace gr= 1 if age>=0 & age<10

replace gr = 2 if age>=10 & age<20
......

Thank you!

Last edited by Bright Tree; 15 Apr 2020, 19:55.
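
One way the replace statements listed in this post could be generated by a loop, as a minimal sketch under stated assumptions: the ages are made up, the groups are the ten-year bands 0-9 through 70-79 implied by the list of upper bounds, and the counter g is a name introduced here for illustration.

```stata
clear
input age
0
9
10
19
45
79
end

gen gr = 0
local g = 0
forvalues lo = 0(10)70 {
    * g is the group number: 1 for 0-9, 2 for 10-19, ..., 8 for 70-79
    local ++g
    replace gr = `g' if age >= `lo' & age < `lo' + 10
}
list
```

Ages outside 0 to 79, and missing ages, keep the initial value 0.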
Comment
