
  • Unique Identifiers & xtset

    Hello All -

    I am a Stata novice working on my dissertation proposal. My chair left the university, and I do not have other committee members who work with Stata, so I am incredibly grateful in advance for any help provided.

    I am constructing a monthly panel dataset using four annual waves of SIPP (Survey of Income and Program Participation) data at the individual level. My subset is about 435,000 individuals, each with 48 monthly observations. The unique identifier is based on the main sample unit identifier (ssuid) and the person number within the household (pnum), requiring 15 characters. If I convert the string variables to numeric, the unique ID becomes something like "1.143e+11", and the rounding creates an absurd number of duplicates.

    How can I create a unique identifier that will work with xtset?

    [CODE]
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15 id str12 ssuid str3(shhadid pnum) byte(rrel1 esex trace tage eeduc eed_scrnr ecert monthcode) float new_month byte(eedcred thhldstatus tehc_metro) str2 tehc_st long tjb1_msum
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 1 1 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 2 2 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 3 3 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 4 4 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 5 5 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 6 6 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 7 7 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 8 8 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 9 9 . 2 1 "20" .
    "000114285070101" "000114285070" "011" "101" 99 2 1 68 44 2 2 10 10 . 2 1 "20" .

    I did look at the manuals and forums, but the unique IDs in the examples were shorter, or they used -gen id = _n-, which would not work for my data setup.

    Cheers,
    Jaime

  • #2
    Code:
    egen long new_id = group(id) 
    might help.
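
    For example, a minimal sketch, assuming new_month from your -dataex- excerpt is the 1-48 monthly index:

    Code:
    egen long new_id = group(id)   // one integer per distinct 15-character id
    xtset new_id new_month         // declare the monthly panel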



    • #3
      What do you mean when you say that -gen id = _n- would not work in your data? The only reason it would fail is that your data set has too many distinct IDs for a float to handle all those values without loss of precision. But as Nick suggests, if you force it into the -long- data type you will be fine. Actually, even safer would be -`c(obs_t)'- instead of -long-: that way, if there is a surprisingly large number of distinct IDs, Stata will store them as -double-s, which allows even more digits than -long- without loss of precision.
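
      For example, a sketch of that safer variant, using the variable names from your excerpt:

      Code:
      egen `c(obs_t)' new_id = group(id)   // expands to long or double, enough to count up to _N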

      If I convert the string variables to numeric, the unique ID becomes something like "1.143e+11", and the rounding creates an absurd number of duplicates.
      Actually, no. The "duplicates" you see are not, in fact, duplicates unless the original values of id and ssuid are themselves duplicated. They just look that way because Stata assumes you are not interested in seeing a zillion digits in a long number, so it displays it in scientific notation. But that's just what Stata is showing you; internally it retains all the detail of the original variables. (When -destring- encounters a situation where it can't retain all the detail of the originals, it halts with an error message and doesn't do the -destring-ing.) Always remember that in Stata, what you see is not always what you get: Stata does things to make the data visually cleaner, and in doing that, it sometimes hides things from you. But you can override Stata's instincts when you want to. To see that Stata really has not lost the detail when you convert these string IDs to numeric, do this:
      Code:
      destring *id pnum, replace   // *id also picks up ssuid and shhadid
      format *id %015.0f           // fixed format with leading zeros: every digit visible
      browse *id pnum
      In fact, one other generic fact about Stata's native commands is that they never discard information without warning you or forcing you to modify the command in some way to permit it. So, if it looks to you like information has been lost, it is always recoverable, usually just by changing display formats.

      Now, let me also say that not only can you change these string IDs to numeric, I strongly encourage you to do so, especially for id and ssuid. Those variables are 15- and 12-character strings, respectively. If you convert them to numeric they will become, respectively, -double- (8 bytes) and, assuming the values stay within -long- range as in your excerpt, -long- (4 bytes). So you will have reduced 27 bytes to 12, saving 15 bytes for each observation. Since your data set is going to have 435,000*48 = 20,880,000 observations, that's a net saving of 313,200,000 bytes (minus a negligible number of bytes in the data set to store the display format). While 313 MB is not likely to be the difference between having and not having enough memory to work in, by reducing the data set in this way, much of your data management and analysis will go appreciably faster: every -use- and -save- will be faster. And the time savings will be even more apparent with any command that -sort-s the data, of which you will probably encounter many (including, among others, -xtset- and many uses of -by-) in your data management. And your regression analyses will also run more quickly in a smaller data set.
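
      Putting the pieces in this thread together, one possible end-to-end sketch (assuming new_month is your 1-48 monthly index and that, as in your excerpt, the ssuid values fit in a -long-):

      Code:
      destring id ssuid pnum, replace      // numeric storage; the 15-digit id becomes a double
      format id %015.0f                    // display all 15 digits, with leading zeros
      egen `c(obs_t)' new_id = group(id)   // compact integer identifier for xtset
      xtset new_id new_month               // declare the monthly panel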
