  • Problem with gen id = _n

    Dear All,

    I have a dataset with 52,145,974 observations and 93 variables. I want to create a unique id, so I executed the following command.

    gen id = _n

    However, this did not work: the generated variable id did not identify the observations uniquely.

    Shoummo

  • #2
    Shoummo:
    without any example or excerpt of your dataset (which you can easily provide via -dataex-), it is difficult to give a useful answer.
    That said, the basic question that crosses my mind is: are you dealing with a cross-sectional or panel dataset?
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      If you are dealing with cross-sectional data, then try

      Code:
      gen long id = _n



      • #4
        Shoummo Sen Gupta, the problem is that Stata creates variables of type float by default, and a float has a precision of only about seven digits. You can see the storage type if you do
        Code:
        describe id

        Since your ID numbers run to eight digits, you need to tell Stata to create a long variable instead, exactly as suggested in #3. If you had more than 9 digits (up to 16), you would need a double.

        You might want to look at
        Code:
        help data types
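
        A minimal sketch of the rounding, assuming the default storage type has not been changed with -set type- (the variable names id_float and id_long are made up for illustration): a float stores integers exactly only up to 16,777,216 (2^24), so beyond that point neighbouring values of _n collapse onto the same stored number.
        Code:
        clear
        set obs 20000000
        * default storage type is float, so large values of _n get rounded
        generate id_float = _n
        * a long stores every integer up to 2,147,483,620 exactly
        generate long id_long = _n
        * 2^24 + 1 cannot be represented as a float
        assert float(16777217) != 16777217
        * reports the surplus copies among the float ids
        duplicates report id_float
        * runs silently because id_long identifies the observations uniquely
        isid id_long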



        • #5
          Originally posted by Fei Wang
          If you are dealing with cross-sectional data, then try

          Code:
          gen long id = _n

          Thank you. This solved the problem.



          • #6
            Originally posted by Hemanshu Kumar
            Shoummo Sen Gupta, the problem is that Stata creates variables of type float by default, and a float has a precision of only about seven digits. You can see the storage type if you do
            Code:
            describe id

            Since your ID numbers run to eight digits, you need to tell Stata to create a long variable instead, exactly as suggested in #3. If you had more than 9 digits (up to 16), you would need a double.

            You might want to look at
            Code:
            help data types

            Thank you



            • #7
              Originally posted by Fei Wang
              If you are dealing with cross-sectional data, then try

              Code:
              gen long id = _n

              This code works. It can be improved slightly by using the built-in c-class value c(obs_t), which automatically selects the correct storage type for the number of observations in your dataset.

              Code:
              gen `c(obs_t)' id = _n



              • #8
                Leonardo Guizzetti : How do you know this black magic?

                When I type

                Code:
                creturn list
                it does not seem to give me c(obs_t)...



                • #9
                  Here is what is in the help file (h creturn):
                  Code:
                  c(obs_t) returns a string equal to the optimal data type for storing _n.  This allows you to code

                      generate `c(obs_t)' index = _n

                  and know that index will go from 1 to _N without roundoff errors and without wasting any space.
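
                  To see what it returns in practice, one could try something like the sketch below. The observation counts are arbitrary; per -help data types-, a byte holds integers up to 100, an int up to 32,740, and a long up to 2,147,483,620, so I would expect the two -display- lines to show different types, but check on your own installation.
                  Code:
                  clear
                  set obs 100
                  * storage type that fits _n when _N is 100
                  display "`c(obs_t)'"
                  set obs 50000
                  * storage type that fits _n when _N is 50,000
                  display "`c(obs_t)'"
                  generate `c(obs_t)' index = _n
                  describe index
                  * runs silently if index identifies the observations uniquely
                  isid index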



                  • #10
                    What Rich Goldstein writes technically answers the question as I posed it...

                    But what I was really wondering was: where is c(obs_t) hiding, and why can I not see it when I type -creturn list-?



                    • #11
                      Originally posted by Joro Kolev
                      Leonardo Guizzetti : How do you know this black magic?

                      When I type

                      Code:
                      creturn list
                      it does not seem to give me c(obs_t)...

                      I think I found it perusing -help creturn- and started incorporating it. Of course, once you know it, you have to remember that it exists. :D Perhaps it was omitted from -creturn list- by accident.



                      • #12
                        Uhhh why not just use egen's group function?



                        • #13
                          Re #12. You could. But unless you specify the -autotype- option, you will get the same problem as O.P. had in #1: -egen, group()- generates a float by default. Also, which variables would you use as the varlist argument for -group()-? Absent advance knowledge of some small set of variables that uniquely identifies observations, you would have to use -egen obs_no = group(_all), autotype-. And, for the reasons given in the next paragraph, I think the performance of that would be pretty poor.

                          Moreover, take a look at the code for -_ggroup.ado-. There's a lot of overhead in there, and it also sorts the data. When you want an "identifier" that incorporates more than one variable, it's probably worth it, but when you just want to identify individual observations, you can't beat -gen appropriate_data_type id = _n- for efficiency. In a very long data set, I imagine the performance difference would be noticeable, though I've never tried it.
                          Last edited by Clyde Schechter; 01 Sep 2022, 15:57.



                          • #14
                            I do not think that -egen, group(_all)- will do the trick. There might be multiple observations which share the same values of the variables.
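
                            A toy illustration of that point, using made-up data in which the hypothetical variables x and y stand in for _all:
                            Code:
                            clear
                            input x y
                            1 2
                            1 2
                            3 4
                            end
                            * identical rows fall into the same group
                            egen g = group(x y)
                            * _n numbers every observation separately
                            generate long id = _n
                            list
                            assert g[1] == g[2]
                            assert id[1] != id[2]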

                            If we are looking for exotic solutions,

                            Code:
                            egen id = seq(), from(1)
                            will do the job. I looked through the code, and -egen, seq()- automatically employs the device Leonardo showed.



                            • #15
                              And of course -egen, seq()- is much slower than the native solution. Here:

                              Code:
                              . clear
                              
                              . set obs 52145974
                              Number of observations (_N) was 0, now 52,145,974.
                              
                              . gen n = rnormal()
                              
                              . timer clear
                              
                              . timeit 1: egen id1 = seq(), from(1)
                              
                              . timeit 2: gen long id2 = _n
                              
                              . timer list
                                 1:     11.22 /        1 =      11.2190
                                 2:      0.85 /        1 =       0.8540
                              
                              . assert id1 == id2
                              
                              .
