  • Creating unique group ID variable

    Dear all

    I have a hopefully trivial question that I can't get my head around right now. My data set is clustered and consists of neighborhoods, households, and household members:

    Code:
    clear
    input neighID hhID hhmemberID
    1 1 1
    1 1 2
    1 2 1
    1 2 2
    1 2 3
    2 1 1
    2 1 2
    2 1 3
    2 2 1
    2 3 1
    end
    I.e. hhmemberID is only unique within hhID, hhID is only unique within neighID. How do I get hhID to be unique overall? I.e., like this:

    Code:
    clear
    input neighID hhID hhmemberID
    1 1 1
    1 1 2
    1 2 1
    1 2 2
    1 2 3
    2 3 1
    2 3 2
    2 3 3
    2 4 1
    2 5 1
    end
    One option would be to convert neighID and hhID to strings and to concatenate them (or whatever the correct term for banging them into one string variable is), but I wonder whether there is a less haphazard option?

    Cheers
    Go

  • #2
    Code:
    egen unique_hhID = group(neighID hhID), label
    egen unique_hhmemberID = group(neighID hhID hhmemberID), label


    • #3
      Amazing, Clyde, thanks. I did think of group() myself, but then thought that couldn't be the solution, because was I not trying to ungroup something? Thanks again, you helped me a lot.


      • #4
        Clyde, after waiting 95 minutes, egen group tells me:

        Code:
        too many values
        r(134);
        My data set has 1.2M hhmemberID, probably a few 100K hhID, and a few dozen neighID. Is there any way out of this? (Reading the error message, I also feel that my initial idea of concatenating two strings and then encoding them won't work.)


        • #5
          Code:
          by neighID hhID, sort: gen unique_hhid = 1 if _n == 1
          replace unique_hhid = sum(unique_hhid)
          The drawback to this approach is that unique_hhid will be a sequential number running from 1 to however many distinct households there are, and it will not be labeled to show the values of the original neighID and hhID variables. But given the large numbers of values involved here, I don't think there is any easy way around it.
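
          As a sketch, running the two commands on the example data from the first post should reproduce the numbering asked for there. The long storage type is an addition here, not part of the code above; it just keeps the counter stored as an integer.

          ```stata
          clear
          input neighID hhID hhmemberID
          1 1 1
          1 1 2
          1 2 1
          1 2 2
          1 2 3
          2 1 1
          2 1 2
          2 1 3
          2 2 1
          2 3 1
          end
          * flag the first observation of each neighID-hhID combination
          by neighID hhID, sort: gen long unique_hhid = 1 if _n == 1
          * running sum: sum() treats missing as zero, so the counter only steps up at each flag
          replace unique_hhid = sum(unique_hhid)
          list, sepby(neighID)
          ```

          The households in neighID 2 come out as 3, 4 and 5, which matches the target layout in the original question.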

          If having the unique hhid show its origins in neighID and hhID is crucial for you, then your concatenation idea will work:
          Code:
          egen unique_hhid = concat(neighID hhID), punct(#)
          But the drawback to this approach is that the result is a string variable, so you will not be able to use this unique_hhid in the many situations where a numeric variable is required.
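
          If both properties matter, one possible middle ground is an arithmetic key. This is only a sketch, and it rests on an assumption not stated anywhere in this thread: that neighID is an integer and hhID is a positive integer below 1,000,000 within every neighborhood. The variable name hhkey and the multiplier are hypothetical choices for illustration.

          ```stata
          clear
          input neighID hhID hhmemberID
          1 1 1
          1 1 2
          1 2 1
          1 2 2
          1 2 3
          2 1 1
          2 1 2
          2 1 3
          2 2 1
          2 3 1
          end
          * assumption: hhID is a positive integer below 1,000,000 (the assert stops the run otherwise)
          assert hhID > 0 & hhID < 1000000 & hhID == floor(hhID)
          * neighID occupies the leading digits, hhID the last six
          gen double hhkey = neighID * 1000000 + hhID
          format hhkey %12.0f
          list neighID hhID hhkey, sepby(neighID)
          ```

          On the example data, neighID 2 with hhID 3 becomes 2000003, so the key stays numeric and its origin can still be read off the digits.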
