Personal identifier variable for survey data

Meabh Cairns

Join Date: Jan 2024

Posts: 4
#1

28 Jan 2024, 12:01

Hello users,

I wish to generate a variable that uniquely identifies each person. The UK Labour Force Survey that I am using has a variable called "caseid" that identifies the household, but there can be multiple observations under each caseid. There is also a variable "persn" that identifies which person in the household each observation is, with 1 being the head of the household.

I previously did the following:

gen unique_id = caseid * 1000 + persn
format unique_id %12.0f

But after appending the datasets, Stata said that unique_id did not uniquely identify individuals.

I also tried the following:
egen id = group(caseid)

This did not give the desired outcome. I need the id numbers to be consistent when I append the data, so that the same individual has the same id in each dataset.

Any help greatly appreciated.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

28 Jan 2024, 13:01

You do not show example data, so I am left to speculate. But I think the root of your problem is numeric precision.

Both -gen- and -egen-, by default, store their results as -float- data types. However, a -float- can hold integers exactly only up to 16,777,216 (2^24), which is roughly 7 decimal digits. If the IDs you are creating need more digits than that (as I suspect would be the case in a survey of this type), then inevitably some "distinct" values of your ID variable are going to get smashed together. You will need to store your IDs as either longs or doubles to get enough precision. Do read -help data_types- for more details about this.
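To make the precision problem concrete, here is a small sketch using made-up values (not the survey data). It builds two consecutive integers just at the limit of -float- precision and stores them three ways. The variable names exact, as_float, and as_long are invented for the illustration.

```stata
* Toy data: two consecutive integers, 16777216 and 16777217
clear
set obs 2
gen double exact = 16777215 + _n

* The same values stored as float and as long
gen float as_float = exact
gen long  as_long  = exact
format exact as_float as_long %12.0f
list

* The float copy cannot tell the two values apart, so -isid- complains;
* -capture noisily- shows the error without stopping the do-file
capture noisily isid as_float

* The long copy keeps both values distinct
isid as_long
```

The same thing happens, on a larger scale, when a long household number is multiplied by 1000 and the result is stored in the default -float- type.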

The easiest way to do this, however, is, whenever you create an identifier variable (one that will need a distinct value for every observation in your data set), to use `c(obs_t)' as your storage type. This is a built-in Stata system value (see -help creturn-) that gives the smallest storage type able to hold the current number of observations, so it will have enough room to accommodate as many values as you need. You don't even need to know at coding-time how big the data set will be, because it gets evaluated at run-time.

Code:

egen `c(obs_t)' id = group(caseid persn)
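If you would rather keep the arithmetic construction from #1, a minimal sketch is to request -double- storage explicitly. The toy values below are invented. The sketch assumes that caseid is a numeric integer, that persn is always below 1000, and that the combined number stays within the range a -double- holds exactly (up to 2^53).

```stata
* Toy data standing in for the survey file (values are made up)
clear
input double caseid byte persn
1234567890 1
1234567890 2
1234567891 1
end

* Same formula as in #1, but stored as double rather than the default float
gen double unique_id = caseid * 1000 + persn
format unique_id %15.0f

* Exits silently when unique_id uniquely identifies the observations
isid unique_id
list
```

Because this id is computed only from caseid and persn, it does not depend on which other observations happen to be in the data set.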
Meabh Cairns

Join Date: Jan 2024

Posts: 4
#3

31 Jan 2024, 03:52

Hi Clyde. Thank you for your reply!
I have done this and then tried to append two datasets from two different years. However, when I ask Stata whether this new id uniquely identifies the observations (via the -isid- command), it says it does not. Looking at the id, I can see inconsistencies. For example, there are two observations with id=1 (one for each year, which makes sense), but the id=1 for 1999 is male and the id=1 for 2000 is female. This is common throughout.

Any ideas?
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

31 Jan 2024, 08:59

The approach I showed in #2 cannot be used in the way you describe in #3. That's because when creating the id variable in one data set, Stata does not know anything about the other data set. When you append them you may well find, as you did, that the same id has been assigned to two different people in the two data sets.

In this situation you need to append the two data sets first, and then use the approach in #2 on the combined data. That will ensure that the same id is always assigned to the same person in both data sets. Of course, in the combined data, the id will no longer uniquely identify observations. But the combination of id and year, I presume, will.
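A minimal sketch of that order of operations, using two tiny made-up data sets in place of the real yearly files. The year variable and the tempfile name y1999 are assumptions for the illustration, and the sketch presumes that caseid and persn refer to the same household and person in both years.

```stata
* Toy stand-in for the 1999 file (values are made up)
clear
input long caseid byte persn int year
1001 1 1999
1001 2 1999
1002 1 1999
end
tempfile y1999
save `y1999'

* Toy stand-in for the 2000 file
clear
input long caseid byte persn int year
1001 1 2000
1002 1 2000
1003 1 2000
end

* Append first, then number the caseid-persn combinations once
append using `y1999'
egen `c(obs_t)' id = group(caseid persn)

* id repeats across years, but id and year together should be unique
isid id year
sort id year
list, sepby(id)
```

With the real data, the two -input- blocks would be replaced by -use- and -append using- on the actual files, with a year variable created in each file if one is not already present.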