Problem with gen id = _n

Shoummo Sen Gupta

Join Date: Oct 2018

Posts: 32
#1

01 Sep 2022, 00:50

Dear All,

I have a dataset with 52,145,974 observations and 93 variables. I want to create a unique id, so I executed the following command.

gen id = _n

However, this did not work: the generated variable id did not uniquely identify the observations.

Shoummo
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

01 Sep 2022, 00:57

Shoummo:
without any example/excerpt of your dataset (that you can easily provide via -dataex-), it is difficult to reply positively.
That said, the basic question that crosses my mind is: are you dealing with a cross-sectional or panel dataset?

Kind regards,
Carlo
(Stata 19.0)
Fei Wang

Join Date: Oct 2021

Posts: 726
#3

01 Sep 2022, 01:03

If you are dealing with cross-sectional data, then try

Code:

gen long id = _n
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1396
#4

01 Sep 2022, 03:29

Shoummo Sen Gupta: the problem is that Stata would, by default, have created a variable of type float, which has a precision of only about seven digits. You can see the storage type if you run

Code:

describe id

.

Since your ID number has eight digits, you need to tell Stata to create a long variable instead, exactly as suggested in #3. If you had more than 9 digits (up to 16), you would have needed double.

You might want to look at

Code:

help data types
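
The precision limit described above can be checked directly. The following is only a minimal sketch, meant to be run from a do-file: the dataset size of 17,000,000 is an arbitrary assumption chosen to sit just above 2^24 = 16,777,216, the largest point up to which a float stores every integer exactly, and the variable names are hypothetical.

```stata
* A float holds 16,777,216 exactly, but not the next integer
display %12.0f float(16777216)
display %12.0f float(16777217)   // rounded to a neighbouring representable value

* Small-scale reproduction on made-up data
clear
set obs 17000000
generate float id_float = _n     // explicit float, mirroring the default type
generate long  id_long  = _n     // long stores these integers exactly

count if id_float != id_long     // observations whose float id was rounded
capture noisily isid id_float    // reports that id_float is not unique
isid id_long                     // silent: id_long identifies observations uniquely
```

If -set type double- has been issued, a plain -gen id = _n- would create a double instead, which is why the sketch requests float explicitly.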
Shoummo Sen Gupta

Join Date: Oct 2018

Posts: 32
#5

01 Sep 2022, 04:57

Originally posted by Fei Wang

If you are dealing with cross-sectional data, then try

Code:

gen long id = _n

Thank you. This solved the problem.
Shoummo Sen Gupta

Join Date: Oct 2018

Posts: 32
#6

01 Sep 2022, 04:58

Originally posted by Hemanshu Kumar

Shoummo Sen Gupta: the problem is that Stata would, by default, have created a variable of type float, which has a precision of only about seven digits. You can see the storage type if you run

Code:

describe id

.

Since your ID number has eight digits, you need to tell Stata to create a long variable instead, exactly as suggested in #3. If you had more than 9 digits (up to 16), you would have needed double.

You might want to look at

Code:

help data types

Thank you
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#7

01 Sep 2022, 06:50

Originally posted by Fei Wang

If you are dealing with cross-sectional data, then try

Code:

gen long id = _n

This code works. It can be improved slightly by using the built-in macro `c(obs_t)', which automatically selects the storage type appropriate for the number of observations in your dataset.

Code:

gen `c(obs_t)' id = _n
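
As a minimal sketch of how this might be used and checked (the dataset size of 250,000 is an arbitrary assumption, and the example is meant to be run from a do-file):

```stata
* Hypothetical example data: 250,000 empty observations
clear
set obs 250000

* `c(obs_t)' expands to a storage type that can hold _n without rounding
generate `c(obs_t)' id = _n

describe id          // shows the storage type that was chosen
isid id              // exits with an error if id is not unique
assert id == _n      // every id equals its observation number
```

Both -isid- and -assert- stay silent when the check passes, so no output from them is the desired result.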
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#8

01 Sep 2022, 14:12

Leonardo Guizzetti : How do you know this black magic?

When I type

Code:

creturn list

it does not seem to give me c(obs_t)...

Rich Goldstein

Join Date: Mar 2014
Posts: 4462
#9

01 Sep 2022, 14:20

Here is what is in the help file (h creturn):

Code:

c(obs_t) returns a string equal to the optimal data type for storing _n.  This allows you to code

    generate `c(obs_t)' index = _n

and know that index will go from 1 to _N without roundoff errors and without wasting any space.


Joro Kolev

Join Date: Aug 2018

Posts: 3050
#10

01 Sep 2022, 14:42

What Rich Goldstein writes technically answers the question as I posed it...

But I was wondering more about where this thing c(obs_t) is hiding, and why I cannot see it when I type -creturn list-.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#11

01 Sep 2022, 15:04

Originally posted by Joro Kolev

Leonardo Guizzetti : How do you know this black magic?

When I type

Code:

creturn list

it does not seem to give me c(obs_t)...

I think I found it perusing -help creturn- and started incorporating it. Of course, once you know it, you have to remember that it exists. :D Perhaps it was omitted from -creturn list- by accident.
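
Whether or not it shows up in the output of -creturn list-, the value can be inspected directly with -display-. A small sketch, run from a do-file, with arbitrary observation counts chosen as assumptions to fall on either side of the byte and int limits given in -help data types-:

```stata
* The sizes below are arbitrary illustrations
clear
set obs 50
display "`c(obs_t)'"    // storage type suggested for _N = 50

set obs 5000
display "`c(obs_t)'"    // storage type suggested for _N = 5,000

set obs 5000000
display "`c(obs_t)'"    // storage type suggested for _N = 5,000,000
```

The displayed type should change as _N grows, which is what lets -gen `c(obs_t)' id = _n- adapt to the dataset at hand.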
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#12

01 Sep 2022, 15:32

Uhhh why not just use egen's group function?
Clyde Schechter

Join Date: Apr 2014

Posts: 30099
#13

01 Sep 2022, 15:54

Re #12. You could. But unless you specify the -autotype- option, you will get the same problem as O.P. had in #1: -egen, group()- will, by default, generate a float. Also, which variables would you use as the varlist argument for -group()-? Absent advance knowledge of some small set of variables that uniquely identifies observations, you would have to use -egen obs_no = group(_all), autotype-. And I think, for the reasons given in the next paragraph, the performance of that would be pretty poor.

Moreover, take a look at the code for -_ggroup.ado-. There's a lot of overhead in there, and it also sorts the data. When you want an "identifier" that incorporates more than one variable, it's probably worth it, but when you just want to identify individual observations, you can't beat -gen appropriate_data_type id = _n- for efficiency. In a very long data set, I imagine the performance difference would be noticeable, though I've never tried it.

Last edited by Clyde Schechter; 01 Sep 2022, 15:57.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#14

02 Sep 2022, 02:30

I do not think that -egen, group(_all)- will do the trick. There might be multiple observations which share the same values of the variables.
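
A toy illustration of that point, as a sketch only: the three observations and the variable names x, y, g, and id are made up, and the first two observations are deliberately identical.

```stata
* Hypothetical toy data with two identical observations
clear
input x y
1 1
1 1
2 3
end

egen g = group(_all)           // one group per distinct combination of x and y
generate `c(obs_t)' id = _n    // one value per observation

list
capture noisily isid g         // fails: observations 1 and 2 share a group
isid id                        // silent: id is unique
```

Here g takes the same value for the two duplicated observations, so it cannot serve as an observation identifier, whereas id can.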

If we are looking for exotic solutions

Code:

egen id = seq(), from(1)

will do the job. And I looked through the code: -egen, seq()- automatically employs the device Leonardo showed.

Joro Kolev

Join Date: Aug 2018
Posts: 3050

#15

02 Sep 2022, 02:41

And of course -egen, seq()- is much slower than the native solution. Here:

Code:

. clear

. set obs 52145974
Number of observations (_N) was 0, now 52,145,974.

. gen n = rnormal()

. timer clear

. timeit 1: egen id1 = seq(), from(1)

. timeit 2: gen long id2 = _n

. timer list
   1:     11.22 /        1 =      11.2190
   2:      0.85 /        1 =       0.8540

. assert id1 == id2

.
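
Note that -timeit- in the log above does not appear to be an official Stata command (assumption: it is community-contributed and has to be installed separately). A rough equivalent that relies only on the built-in -timer- command might look like the sketch below; the dataset size of 1,000,000 is an arbitrary assumption, and timings will differ from machine to machine.

```stata
clear
set obs 1000000

timer clear

timer on 1
egen id1 = seq(), from(1)        // egen-based sequence
timer off 1

timer on 2
generate `c(obs_t)' id2 = _n     // native generate with an adaptive storage type
timer off 2

timer list                       // elapsed seconds for timers 1 and 2
assert id1 == id2                // both approaches give the same identifier
```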
