Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a Unique Identifier Variable.

    Hallo good people,
    I have a data set from a survey. Sampling was done by clusters where each randomly sampled cluster had 10 households in it again randomly assigned. The problem is that whereas the clusters are uniquely identified/serialized from 1 to the last, the household are not. They are serialized 1-10 for each cluster such that in the dataset, there are as many households serialized 1 as there are clusters. (each cluster has a household serialized 1). It becomes a problem to uniquely identify households and even more difficult to merge different sections of the dataset. Can anyone give me a starter advice that I can build on to have a unique identifier of each hh?

    Merci

  • #2
    It seems to me that much could be done by specifying two variables - cluster and household - as your identifiers, for example when merging (merge 1:1 cluster household using ... for example). All that is required is that the combination of cluster and household be unique. Is there is something you need to do that cannot be done that way? This is a very common situation and most commands accommodate having an identifier that consists of several variables.

    Comment


    • #3
      Sure all data subsets have this two variables. The household and the cluster and their combination is unique. Should I try to merge one to many since some data subsets have multiple responses that I may wish to carry along.

      Comment


      • #4
        Yes, you will either need one to many (1:M) if your master data (data in memory) is by cluster and your using data is by cluster and household. If you have it the other way around, you use many to one (M:1) if your master data is by cluster and household and your using data is by cluster.

        Comment


        • #5
          While the suggestion in post#4 will help you merge data without problems, I would recommend creating a new household identifier that is unique within the merged dataset.
          This is in case you want to generate variables that measures a within household quantity (e.g. number of children), which (if your dataset is 'long' format) can easily be done using 'egen....,by(hhid)' if your hhid is unique.

          There are multiple ways of creating a unique hhid.

          Code:
          egen concat hhid_new=(clusterid hhid)
          This however generates a string variable

          What is probably nicer is
          Code:
          gen double hhid_new=clusterid*100+hhid
          You may need to adjust the number of 0s (if there are clusters with more than 10 households you need to adjust the syntax. E.g. multiply by 1000 instead of 100).
          I've added 'double' because if the clusterids are very long(large) you can run into issues unless you specify variable type as double.

          I would recommend running this code for each cluster dataset before merging.

          Comment


          • #6
            I do not find the argument in post #5 for creating a new household identifier convincing. The by option on egen will accept more than one variable, and will generate results for each unique combination, as the example below demonstrates.

            This is not to say that there are never reasons to create an identifier as a single variable, but rather, that the example presented doesn't support the recommendation. As I suggested in post #2, multi-level identifiers are common in survey data, and Stata tends to make it straightforward to accommodate them.

            Code:
            clear
            input cluster hh child
            101 1 0
            101 1 0
            101 1 1
            101 1 1
            101 2 0
            101 2 1
            102 1 0
            102 1 0
            102 2 0
            102 2 0
            102 2 1
            end
            egen nchild = sum(child), by(cluster hh)
            list, sepby(hh)
            Code:
                 +-------------------------------+
                 | cluster   hh   child   nchild |
                 |-------------------------------|
              1. |     101    1       0        2 |
              2. |     101    1       0        2 |
              3. |     101    1       1        2 |
              4. |     101    1       1        2 |
                 |-------------------------------|
              5. |     101    2       0        1 |
              6. |     101    2       1        1 |
                 |-------------------------------|
              7. |     102    1       0        0 |
              8. |     102    1       0        0 |
                 |-------------------------------|
              9. |     102    2       0        1 |
             10. |     102    2       0        1 |
             11. |     102    2       1        1 |
                 +-------------------------------+

            Comment


            • #7
              In addition to Evelyn's good suggestions, also consider:
              Code:
              egen long newid=group(clusterid hhid)
              Note that the concat() as recommended will not work well in general, though may be fine for your specific case (household numbers are in 1-10 range).
              Here is an example where it will not work well:
              Code:
              clear
              input cluster hh
              1 1
              1 2
              1 3
              1 10
              1 11
              2 1
              2 2
              10 1
              11 1
              end
              
              egen id1=concat(cluster hh)
              generate id2=cluster*100+hh
              egen id3=group(cluster hh)
              
              list
              
              isid id3
              isid id2
              isid id1
              Best, Sergiy Radyakin

              Comment


              • #8
                Note that (Evelyn #5)

                Code:
                  
                 egen concat hhid_new=(clusterid hhid)
                is not legal syntax. You need

                Code:
                  
                 egen hhid_new = concat(clusterid hhid)
                except that adding spaces is a good idea. For some discussion see (e.g.) http://www.stata-journal.com/sjpdf.h...iclenum=dm0034

                Comment


                • #9
                  Evelyn, this seems a plausible way of having all the data subsets have the same unique Identifier. To ensure I get maximum from you which will be helpful in the analysis as you have pointed out, let me describe the dataset more succinctly. The dataset is a budget survey dataset. The questions were designed under "sections" to capture various categories of household attributes i.e. section E would be capturing "sanitation" which has a whole range of sanitation related variables such as access to drinking water, nature of human waste disposal issues bla bla.. Now that section is a sub dataset file of the entire Survey dataset. There are several of such files which once combined will make the complete dataset as one. The beauty of it is that all sub datasets have this two variable "id_clust" and "id_hh" which are unique serial numbers for clusters and households in clusters. There are a total of 1339 clusters where each cluster has 10 households in it numbered neatly as 1-10.

                  Following you advice, I though I would use such a creative algorithm code to create a unique identifier before doing the merging. I suppose this will make things easier for me because am not aiming to merge the entire dataset but just variables capturing household assets only.

                  The earlier suggestion worked out on a merge 1:m command but as you say, its hard for me to do an analysis of household level particulars. Am I visioning right on your advice?

                  Comment


                  • #10
                    Apologies for posting an illegal syntax earlier.

                    Yes I would create the unique household ID in each section of the dataset before merging.
                    Given that you write that there are never more than 10 households in a cluster, the syntax
                    Code:
                    gen hhid_new=clusterid*100+hhid should work.
                    (given that you have fewer than 10,000 clusters you don't need 'double' type).
                    So would of course the syntax suggested by Nick (if you include the decode option as explain in the Stata journal article he refers to) and Sergiy.
                    Though the potential downside of Sergiy's suggestion is that the hhid no longer includes the clusterid.

                    Do check whether this actually leads to unique identifiers
                    Code:
                    isid hhid_new
                    If this reveals there are problems I think you need to make a new variable to identify the problem cases
                    Code:
                    by clusterid hhid, sort: gen hhcounter= _N
                    list clusterid hhid if hhcounter>1
                    or
                    Code:
                    sort clusterid hhid
                    by clusterid, sort: gen hhcounter= _n
                    list clusterid hhid if hhcounter!=hhid
                    The first code will return cases of multiple households in a cluster that have the same hhid
                    The second code will return households in clusters where household numbers are missing. E.g. they are numbered 1,3,4,6,7,8,9,10
                    Another source of problems could be clusters in which the hhid exceeds 10
                    You can check for these by adding "| hhid>10" to the list command above.

                    Comment


                    • #11
                      I tried the "illegal" command but never worked as before. Probably for that earlier error. I have tried the diagnostics for possible repetition but I cant find repeated observations. To be more certain on this crucial supposition, I have curved out three variables + the one I have created as a unique Identifier in this dataset . The variables are

                      id_clust,
                      id_hh
                      hhsize.
                      hhid

                      The first two appear in all files. The third could be useful in aiding the diagnostics (Confirming). Kindly check if I can move on with the new variable hhid as my unique household identifier.
                      Attached Files

                      Comment


                      • #12
                        If you create a single unique id using the -egen... group- command as suggested, the households and clusters must be in the same order in every single dataset otherwise the created variables will not be matched across datasets. However, I am with William; I do not think there is any need to create a single unique identifier. Merging on two variables is very common (and merging on more than two is also common; see the Living Standards Measurement Survey, for example) and most (all?) commands that allow -bys- should allow for more than one variable.

                        Comment

                        Working...
                        X