Constructing new grouping variables

Duy To

Join Date: Sep 2020

Posts: 43
#1

13 Aug 2021, 18:42

Dear everyone,

I have repeated cross-sectional data with 5792 observations, 24 birthplaces and 8 survey years. How do I create 192 groups based on birthplace, defined such that each individual is a member of exactly one cohort, which is the same for all periods?

I have tried

Code:

egen group_ID = group(year birth_place)

However, this seems incorrect because each cohort is then available for only 1 year, which is not what I am looking for.

Thank you in advance, and stay safe.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

13 Aug 2021, 18:56

I'm a bit confused by your question. But here's my guess at what you mean. The term cohort is normally used in research to refer to a group of people (or other entities serving as units of analysis) who enter a study at the same time and have a common "exposure." I'll guess that for your purposes, the birthplace is the common exposure. You don't explain your variable year, but I'm going to guess that it represents the year of the survey: some people start in the study in different years, and then may or may not continue in subsequent years. So I'm thinking that you want to classify people according to their birthplace and their first year in the study, not the year variable.

I assume that your data also contains some variable that serves as a person ID that stays with the person throughout his/her participation in the study. Let's call that variable person_id.

Code:

by person_id (year), sort: gen int first_year_in_study = year[1]
egen int cohort = group(birth_place first_year_in_study)
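
To see what those two commands produce, here is a minimal sketch on made-up data. The values below are invented for illustration, and the names person_id, year and birth_place are the assumed names used above, not variables from the poster's data set.

```stata
* Toy data (hypothetical values): three persons, two birthplaces
clear
input long person_id int year int birth_place
1 2011 1
1 2012 1
2 2012 1
2 2013 1
3 2011 13
end

* First survey year in which each person appears
by person_id (year), sort: gen int first_year_in_study = year[1]

* One group per combination of birthplace and first year in the study
egen int cohort = group(birth_place first_year_in_study)

list, sepby(person_id)
```

Persons 1 and 2 share a birthplace but enter in different years, so they land in different cohorts, and each person keeps the same cohort in every year observed.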

If that's not it, try to give a clearer explanation of what these 192 groups are supposed to be, and also show example data, using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Duy To

Join Date: Sep 2020

Posts: 43
#3

13 Aug 2021, 19:20

Dear Clyde,

I am sorry for the inconvenience caused. You have guessed everything correctly. Here is my example data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int pernum double serial long sample int(year bpl)
1 13 201101 2011  1
1 37 201101 2011 13
1 39 201101 2011  1
1 46 201101 2011  1
1 65 201101 2011 27
1 72 201101 2011 29
2 72 201101 2011 29
1 74 201101 2011  1
1 82 201101 2011  1
1 87 201101 2011 13
end
label values sample sample_lbl
label def sample_lbl 201101 "2011 ACS", modify
label values year year_lbl
label def year_lbl 2011 "2011", modify
label values bpl bpl_lbl
label def bpl_lbl 1 "Alabama", modify
label def bpl_lbl 13 "Georgia", modify
label def bpl_lbl 27 "Minnesota", modify
label def bpl_lbl 29 "Missouri", modify

Yes, "Year" is the survey year, "bpl" is the birthplace. The cohort here means that I want to put anyone who was born in the same place into 1 group, and I want to do that for every survey year. So for each year, I have 24 cohorts. The survey years in the data are from 2011 to 2018; therefore, in total I have 192 cohorts.

The data do not have a unique ID for each individual. According to the instructions from the data source, each person can be uniquely identified by combining "pernum" with "sample" and "serial". So to apply your code, I would need to create those unique IDs, is that correct?

Since "serial" ranges from 2 to 1,410,974 and "pernum" ranges from 1 to 20, here is my attempt to create the unique IDs. Please correct me if I am wrong.

Code:

gen unique_ID=pernum*10^7 + serial + sample

If that code is correct, I will be able to follow your solution, is that correct?

Thank you for your reply.

Last edited by Duy To; 13 Aug 2021, 19:35.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

13 Aug 2021, 20:41

Your code for creating the person ID will not work. The problem is that the combination involves too many digits to store in a float. So you will end up with loss of precision, which means that sometimes different people will get the same ID.
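
As a small illustration of that precision loss, the sketch below uses two made-up records (the pernum and serial values are hypothetical, chosen so that the sums exceed 2^24 = 16,777,216, the point beyond which a float cannot represent every integer). It stores the same expression once as a float and once as a double.

```stata
* Two hypothetical persons whose combined values differ by exactly 1
clear
input byte pernum double serial long sample
4 15 201101
4 16 201101
end

* Stored as float (the default storage type for -generate-)
gen float unique_ID = pernum*10^7 + serial + sample

* Same expression stored as double, for comparison
gen double unique_ID_d = pernum*10^7 + serial + sample

format unique_ID unique_ID_d %12.0f
list
```

In the float version both rows end up with the same value, while the double version keeps them apart.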

The best bet is to make a string variable as the person ID or to use -egen, group()-. So, I see two good ways to go:

Code:

// CREATES A 12 CHARACTER STRING VARIABLE
egen person_id = concat(pernum serial sample), punct(-)
// OR
// CREATES A SEQUENTIAL IDENTIFIER STARTING AT 1
// AND COUNTING UP, BUT LABELED TO LOOK LIKE THE STRING IDENTIFIER
egen long person_id = group(pernum serial sample), label

I am generally partial to string variables as identifiers, as one usually does not want to do any arithmetic with them, and having them as string variables assures that Stata won't do something stupid if you make a mistake and use the person_id in a context where calculations are being done. BUT, sometimes you need to use a numeric variable for the purpose. For example, the -xtset- command will not allow you to use a string variable as a panel identifier; it must be numeric. One caution: if you have more than 65,536 distinct persons in the data set, you cannot use the -label- option with -egen, group()- because a value label cannot hold more than 65,536 distinct values. If you want or need a numeric identifier and have more than that, just leave out the -label- option from the command and you will get unlabeled integers counting up from 1.

So either approach is workable; which is better depends on what you will be doing going forward.
Duy To

Join Date: Sep 2020

Posts: 43
#5

13 Aug 2021, 21:28

Hi Clyde,

Thank you for your reply!

Does my description of the cohorts make sense to you? Is it compatible with your code?

Thanks again
Duy To

Join Date: Sep 2020

Posts: 43
#6

13 Aug 2021, 23:07

Hi Clyde,

One last question, I hope you do not mind. In the same context as above, how can I generate cohorts from intervals of birth years? For example, birth years in my data run from 1960 to 1990; anyone born in 1960-1962 goes into one cohort, anyone born in 1963-1965 goes into the next cohort, and so on.

Thank you

Nick Cox

Join Date: Mar 2014
Posts: 35698

14 Aug 2021, 01:29

#6

I note that 1960 is not divisible by 3, but 1959 is. The sum of the digits of 1959 is itself divisible by 3. A one-liner follows. For various tricks in this territory, see https://www.stata-journal.com/articl...article=dm0095 or https://journals.sagepub.com/doi/pdf...867X1801800311 If this is behind a pay wall as far as you are concerned, know that it will be publicly visible on the publication of Stata Journal 21(3) in about 6 weeks' time.

A general strategy for bins of width k is something like

Code:

a +  k *  floor((youhave - a) / k)

to get self-describing bins. Here, each 3-year interval is described by its first year. Other tricks make use of ceil(youhave / 3) or round(youhave, 3) possibly with extra additions or subtractions.

Code:

. clear

. set obs 15
Number of observations (_N) was 0, now 15.

. range youhave 1960 1974 

. gen wanted = 1 + 3 * floor((youhave - 1)/3)

. tab youhave wanted

           |                         wanted
   youhave |      1960       1963       1966       1969       1972 |     Total
-----------+-------------------------------------------------------+----------
      1960 |         1          0          0          0          0 |         1 
      1961 |         1          0          0          0          0 |         1 
      1962 |         1          0          0          0          0 |         1 
      1963 |         0          1          0          0          0 |         1 
      1964 |         0          1          0          0          0 |         1 
      1965 |         0          1          0          0          0 |         1 
      1966 |         0          0          1          0          0 |         1 
      1967 |         0          0          1          0          0 |         1 
      1968 |         0          0          1          0          0 |         1 
      1969 |         0          0          0          1          0 |         1 
      1970 |         0          0          0          1          0 |         1 
      1971 |         0          0          0          1          0 |         1 
      1972 |         0          0          0          0          1 |         1 
      1973 |         0          0          0          0          1 |         1 
      1974 |         0          0          0          0          1 |         1 
-----------+-------------------------------------------------------+----------
     Total |         3          3          3          3          3 |        15
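
The same recipe can be applied to the full range asked about (birth years 1960 to 1990) by writing it in the general form with a = 1960 and k = 3. This is only a sketch: the variable name birthyear is an assumption, and the data are generated for illustration.

```stata
* Generated data: one observation per birth year, 1960 to 1990
clear
set obs 31
gen int birthyear = 1959 + _n

* Bin start a and bin width k, as in the general recipe
local a = 1960
local k = 3
gen int cohort = `a' + `k' * floor((birthyear - `a') / `k')

tabulate cohort
```

Because 1960 to 1990 spans 31 years, the last bin (labelled 1990) contains only one year; whether that is acceptable depends on the analysis.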

Naturally, there is always a slow but sure way with the flavour

Code:

gen bin = 1 if inrange(youhave, 1960, 1962) 
replace bin = 2 if inrange(youhave, 1963, 1965)

but once you have worked out how to do it in one line you will never prefer that solution.


Duy To

Join Date: Sep 2020

Posts: 43
#8

14 Aug 2021, 02:49

Thank you for the reply, Nick. Just to clarify, the values of "a" and "k" would depend on the minimum value of the range, right? For example, if my min year is 1962, then a=2 and k=4? And if the min year is already divisible by 3, e.g. 1956, then a=0 and k=3, right?

Last edited by Duy To; 14 Aug 2021, 03:03.

Nick Cox

Join Date: Mar 2014
Posts: 35698

14 Aug 2021, 03:43

k is the bin width desired. a is whatever is needed to make the recipe work. If you are lucky a is zero and can be omitted. If your years had been 1959 on, for example, this gives a clean result.

Note how you can experiment in Mata with binning rules.

Code:

. mata
------------------------------------------------- mata (type end to exit) -----
:
: y = (1959..1967)

: wanted = 3 :* floor(y :/ 3)

: y \ wanted
          1      2      3      4      5      6      7      8      9
    +----------------------------------------------------------------+
  1 |  1959   1960   1961   1962   1963   1964   1965   1966   1967  |
  2 |  1959   1959   1959   1962   1962   1962   1965   1965   1965  |
    +----------------------------------------------------------------+
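
To make the roles of a and k concrete, here is a sketch in Stata for a hypothetical case in which the smallest year is 1961 and the desired width is k = 3. One simple choice, assumed here, is to set a to the minimum of the variable, so that the first bin starts at the first year observed. Any a with the same remainder on division by k (here 2) gives identical bins.

```stata
* Generated data: years 1961 to 1969
clear
set obs 9
gen int youhave = 1960 + _n

* k is the bin width; a is taken as the observed minimum
local k = 3
summarize youhave, meanonly
local a = r(min)
gen int wanted = `a' + `k' * floor((youhave - `a') / `k')

* Same bins using the remainder of a modulo k as the offset
local a2 = mod(`a', `k')
gen int wanted2 = `a2' + `k' * floor((youhave - `a2') / `k')

tabulate youhave wanted
```

So k never depends on the minimum year; only a does, and only through its remainder modulo k.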


Duy To

Join Date: Sep 2020

Posts: 43
#10

14 Aug 2021, 04:14

Thank you. I understand it thoroughly.

Thank you to both of you. And stay safe.