Creating a Unique Identifier Variable.

George Kariuki

Join Date: Jul 2015

Posts: 93
#1

Creating a Unique Identifier Variable.

13 Aug 2015, 09:32

Hallo good people,
I have a data set from a survey. Sampling was done by clusters where each randomly sampled cluster had 10 households in it again randomly assigned. The problem is that whereas the clusters are uniquely identified/serialized from 1 to the last, the household are not. They are serialized 1-10 for each cluster such that in the dataset, there are as many households serialized 1 as there are clusters. (each cluster has a household serialized 1). It becomes a problem to uniquely identify households and even more difficult to merge different sections of the dataset. Can anyone give me a starter advice that I can build on to have a unique identifier of each hh?

Merci
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

13 Aug 2015, 09:55

It seems to me that much could be done by specifying two variables - cluster and household - as your identifiers, for example when merging (merge 1:1 cluster household using ... for example). All that is required is that the combination of cluster and household be unique. Is there is something you need to do that cannot be done that way? This is a very common situation and most commands accommodate having an identifier that consists of several variables.
Comment
George Kariuki

Join Date: Jul 2015

Posts: 93
#3

13 Aug 2015, 12:12

Sure all data subsets have this two variables. The household and the cluster and their combination is unique. Should I try to merge one to many since some data subsets have multiple responses that I may wish to carry along.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

13 Aug 2015, 19:00

Yes, you will either need one to many (1:M) if your master data (data in memory) is by cluster and your using data is by cluster and household. If you have it the other way around, you use many to one (M:1) if your master data is by cluster and household and your using data is by cluster.
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#5

14 Aug 2015, 07:44

While the suggestion in post#4 will help you merge data without problems, I would recommend creating a new household identifier that is unique within the merged dataset.
This is in case you want to generate variables that measures a within household quantity (e.g. number of children), which (if your dataset is 'long' format) can easily be done using 'egen....,by(hhid)' if your hhid is unique.

There are multiple ways of creating a unique hhid.

Code:

egen concat hhid_new=(clusterid hhid)

This however generates a string variable

What is probably nicer is

Code:

gen double hhid_new=clusterid*100+hhid

You may need to adjust the number of 0s (if there are clusters with more than 10 households you need to adjust the syntax. E.g. multiply by 1000 instead of 100).
I've added 'double' because if the clusterids are very long(large) you can run into issues unless you specify variable type as double.

I would recommend running this code for each cluster dataset before merging.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

14 Aug 2015, 08:59

I do not find the argument in post #5 for creating a new household identifier convincing. The by option on egen will accept more than one variable, and will generate results for each unique combination, as the example below demonstrates.

This is not to say that there are never reasons to create an identifier as a single variable, but rather, that the example presented doesn't support the recommendation. As I suggested in post #2, multi-level identifiers are common in survey data, and Stata tends to make it straightforward to accommodate them.

Code:

clear
input cluster hh child
101 1 0
101 1 0
101 1 1
101 1 1
101 2 0
101 2 1
102 1 0
102 1 0
102 2 0
102 2 0
102 2 1
end
egen nchild = sum(child), by(cluster hh)
list, sepby(hh)

Code:

     +-------------------------------+
     | cluster   hh   child   nchild |
     |-------------------------------|
  1. |     101    1       0        2 |
  2. |     101    1       0        2 |
  3. |     101    1       1        2 |
  4. |     101    1       1        2 |
     |-------------------------------|
  5. |     101    2       0        1 |
  6. |     101    2       1        1 |
     |-------------------------------|
  7. |     102    1       0        0 |
  8. |     102    1       0        0 |
     |-------------------------------|
  9. |     102    2       0        1 |
 10. |     102    2       0        1 |
 11. |     102    2       1        1 |
     +-------------------------------+

Comment

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#7

14 Aug 2015, 09:02

In addition to Evelyn's good suggestions, also consider:

Code:

egen long newid=group(clusterid hhid)

Note that the concat() as recommended will not work well in general, though may be fine for your specific case (household numbers are in 1-10 range).
Here is an example where it will not work well:

Code:

clear input cluster hh 1 1 1 2 1 3 1 10 1 11 2 1 2 2 10 1 11 1 end egen id1=concat(cluster hh) generate id2=cluster*100+hh egen id3=group(cluster hh) list isid id3 isid id2 isid id1

Best, Sergiy Radyakin
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#8

14 Aug 2015, 09:17

Note that (Evelyn #5)

Code:

egen concat hhid_new=(clusterid hhid)

is not legal syntax. You need

Code:

egen hhid_new = concat(clusterid hhid)

except that adding spaces is a good idea. For some discussion see (e.g.) http://www.stata-journal.com/sjpdf.h...iclenum=dm0034
Comment
George Kariuki

Join Date: Jul 2015

Posts: 93
#9

14 Aug 2015, 09:34

Evelyn, this seems a plausible way of having all the data subsets have the same unique Identifier. To ensure I get maximum from you which will be helpful in the analysis as you have pointed out, let me describe the dataset more succinctly. The dataset is a budget survey dataset. The questions were designed under "sections" to capture various categories of household attributes i.e. section E would be capturing "sanitation" which has a whole range of sanitation related variables such as access to drinking water, nature of human waste disposal issues bla bla.. Now that section is a sub dataset file of the entire Survey dataset. There are several of such files which once combined will make the complete dataset as one. The beauty of it is that all sub datasets have this two variable "id_clust" and "id_hh" which are unique serial numbers for clusters and households in clusters. There are a total of 1339 clusters where each cluster has 10 households in it numbered neatly as 1-10.

Following you advice, I though I would use such a creative algorithm code to create a unique identifier before doing the merging. I suppose this will make things easier for me because am not aiming to merge the entire dataset but just variables capturing household assets only.

The earlier suggestion worked out on a merge 1:m command but as you say, its hard for me to do an analysis of household level particulars. Am I visioning right on your advice?
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#10

15 Aug 2015, 09:57

Apologies for posting an illegal syntax earlier.

Yes I would create the unique household ID in each section of the dataset before merging.
Given that you write that there are never more than 10 households in a cluster, the syntax

Code:

gen hhid_new=clusterid*100+hhid should work.

(given that you have fewer than 10,000 clusters you don't need 'double' type).
So would of course the syntax suggested by Nick (if you include the decode option as explain in the Stata journal article he refers to) and Sergiy.
Though the potential downside of Sergiy's suggestion is that the hhid no longer includes the clusterid.

Do check whether this actually leads to unique identifiers

Code:

isid hhid_new

If this reveals there are problems I think you need to make a new variable to identify the problem cases

Code:

by clusterid hhid, sort: gen hhcounter= _N list clusterid hhid if hhcounter>1

or

Code:

sort clusterid hhid by clusterid, sort: gen hhcounter= _n list clusterid hhid if hhcounter!=hhid

The first code will return cases of multiple households in a cluster that have the same hhid
The second code will return households in clusters where household numbers are missing. E.g. they are numbered 1,3,4,6,7,8,9,10
Another source of problems could be clusters in which the hhid exceeds 10
You can check for these by adding "| hhid>10" to the list command above.
Comment
George Kariuki

Join Date: Jul 2015

Posts: 93
#11

15 Aug 2015, 13:59

I tried the "illegal" command but never worked as before. Probably for that earlier error. I have tried the diagnostics for possible repetition but I cant find repeated observations. To be more certain on this crucial supposition, I have curved out three variables + the one I have created as a unique Identifier in this dataset . The variables are

id_clust,
id_hh
hhsize.
hhid

The first two appear in all files. The third could be useful in aiding the diagnostics (Confirming). Kindly check if I can move on with the new variable hhid as my unique household identifier.

Attached Files

unique identifier.dta (310.8 KB, 1 view)
Comment
Joshua D Merfeld

Join Date: Jun 2015

Posts: 86
#12

15 Aug 2015, 19:33

If you create a single unique id using the -egen... group- command as suggested, the households and clusters must be in the same order in every single dataset otherwise the created variables will not be matched across datasets. However, I am with William; I do not think there is any need to create a single unique identifier. Merging on two variables is very common (and merging on more than two is also common; see the Living Standards Measurement Survey, for example) and most (all?) commands that allow -bys- should allow for more than one variable.
3 likes
Comment

Announcement