Dataset with multiple observations per ID, how to count how many individual IDs?

Vilma Antonov

Join Date: Aug 2022

Posts: 47
#1


24 Aug 2022, 09:20

Hi!

I'm fairly new to Stata and have tried my best to search the forum on this topic, but without any luck, so I thought I'd try writing a post. I'm sorry if this is too basic.

I have a large dataset with about 400,000 observations. Each patient is recorded in multiple observations (due to clinical follow-ups). I would like to calculate how many unique IDs (i.e., patients) there are in the dataset. Any ideas?

Last edited by Vilma Antonov; 24 Aug 2022, 09:24.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17673
#2

24 Aug 2022, 09:29

Vilma:
welcome to this forum.
You may want to consider something along the following lines:

Code:

use "https://www.stata-press.com/data/r17/nlswork.dta"
. tab idcode if idcode<=5

     NLS ID |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         12       19.67       19.67
          2 |         12       19.67       39.34
          3 |         15       24.59       63.93
          4 |         11       18.03       81.97
          5 |         11       18.03      100.00
------------+-----------------------------------
      Total |         61      100.00

.

Another option might be -collapse-.

Kind regards,
Carlo
(Stata 19.0)
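A minimal sketch of the -collapse- route mentioned in #2, on made-up toy data (the variable names id and visit are assumptions for illustration, not taken from the original poster's dataset). The idea is to reduce the data to one row per id and then count the rows that remain:

```stata
* Toy data: three hypothetical patients with repeated visits
clear
input long id byte visit
1 1
1 2
2 1
3 1
3 2
3 3
end

* Work on a temporary copy so the original data are restored afterwards
preserve
* One row per id; n_visits holds the number of records for that id
collapse (count) n_visits = visit, by(id)
* The number of remaining observations is the number of distinct ids
count
local n_ids = r(N)
restore

display "distinct ids: `n_ids'"
```

On a real dataset the -input- block would be dropped and id replaced by the actual patient identifier. -contract id- followed by -count- should work along the same lines.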


Nick Cox

Join Date: Mar 2014

Posts: 35436
#3

24 Aug 2022, 09:35

Nobuya Fukugawa started a thread with essentially the same question earlier today.
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#4

24 Aug 2022, 09:36

In addition to the possibilities suggested in #2, there is -distinct-, by Gary Longton & Nick Cox, available from SSC. And there is -egen, nvals()- in the -egenmore- package, by Nick Cox, also available from SSC.
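A short sketch of how the two SSC commands named above might be used, on made-up toy data with an assumed variable name id. Both are community-contributed, so they need to be installed first (the install lines are left commented out here):

```stata
* ssc install distinct
* ssc install egenmore

* Toy data: three hypothetical ids, some repeated
clear
input long id
1
1
2
3
3
end

* -distinct- reports the number of distinct values and leaves it in r(ndistinct)
distinct id
local nd = r(ndistinct)
display "distinct ids: `nd'"

* nvals() from -egenmore- stores the same number in every observation
egen long n_ids = nvals(id)
display n_ids[1]
```

-distinct- is the lighter option when only the number is needed; -egen, nvals()- is handy when the count has to be kept as a variable, for instance within groups via its by() option.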

Another approach is:
Code:

by id, sort: gen long counter = sum(id != id[_n-1])

which assigns consecutive integers starting from 1 to each distinct id. Then counter[_N] will be the number of distinct id's in the data set.
Vilma Antonov

Join Date: Aug 2022

Posts: 47
#5

24 Aug 2022, 15:31

Thank you all so much for your suggestions! I didn't manage to make Carlo's suggestion work (I think the user -me- is to blame), however, -distinct- worked out!

I tried what Clyde suggested in #4, however, the new variable came out to 1 for each distinct ID... I don't really know how come, since they are not the same individual IDs.
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#6

24 Aug 2022, 15:35

My error in #4. It should be:

Code:

sort id
gen long counter = sum(id != id[_n-1])

Sorry about that.
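For anyone wondering why the version in #4 produced 1 everywhere: under the -by id:- prefix, _n and subscripts such as id[_n-1] are evaluated within each by-group, and sum() restarts in each group. The first observation of every group is compared with a missing value, which gives 1, and the remaining observations add 0. Without the prefix, the running sum carries across the whole dataset. A minimal sketch on made-up toy data (the variable name id is an assumption):

```stata
* Toy data: three hypothetical ids in arbitrary order
clear
input long id
3
1
2
1
3
3
end

* Running count of distinct ids across the whole sorted dataset
sort id
gen long counter = sum(id != id[_n-1])

* The last observation holds the number of distinct ids
display counter[_N]
```

Here the sorted ids are 1 1 2 3 3 3, so counter runs 1 1 2 3 3 3 and counter[_N] is 3.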
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

24 Aug 2022, 15:37

It depends on the exact context, but if you just want the number,

Code:

u "http://fmwww.bc.edu/repec/bocode/s/scul_Reunification.dta", clear
cls
qui xtset
qui insp `r(panelvar)'
di r(N_unique)
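The same idea can be sketched on made-up toy data without downloading anything (the variable names id and year are assumptions). One caveat worth checking before relying on it: the -inspect- help documents r(N_unique) as missing when there are more than 99 unique values, so this route may not suit a large patient file.

```stata
* Toy panel: three hypothetical ids observed in one or two years
clear
input long id int year
1 2000
1 2001
2 2000
3 2000
3 2001
end

* Declare the panel; xtset leaves the panel variable name in r(panelvar)
xtset id year
* inspect leaves the number of unique values in r(N_unique)
quietly inspect `r(panelvar)'
local n_ids = r(N_unique)
display "distinct ids: `n_ids'"
```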
Andressa Freire

Join Date: Aug 2022

Posts: 33
#8

24 Aug 2022, 19:13

Hi!

I'm new to Stata! I have a dataset like the one shown below (hypothetical data). When running a Prais-Winsten regression, I would like to store the p-values and confidence intervals in new variables.

The command “statsby, by(id): prais log10_prevame year” stores only the beta values in a new variable. How do I store the p-value and CI as well?

I also tested the regsave command (below), but it only stores the values from the last Prais-Winsten regression and not from the whole set of regressions by id (my dataset has approximately 2,000 ids).

by(id): prais log10_prevame year
regsave, tstat pval ci

Thank you all for your help.
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#9

24 Aug 2022, 19:28

#8 is wildly off-topic for this thread. Posts in this Forum are not simply dialogs between a questioner and responder(s). Other people come to this Forum from time to time and search for already-existing answers to their questions. Also, those who respond to questions use the thread titles to decide which posts to read. So when threads go off topic, many people's time gets wasted.

Please repost in a new thread. When you do that, also please heed the advice in the Forum FAQ (especially #12) pointing out that screenshots of data are not helpful, and recommending, instead, the use of the -dataex- command for showing example data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#10

24 Aug 2022, 23:36

I think the easiest way to do what OP wants is

Code:

egen tag = tag(id)
count if tag
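A minimal worked example of the tagging approach on made-up toy data (the variable names id and first are assumptions). tag() marks exactly one observation per distinct value with 1 and all others with 0, so counting the tagged observations gives the number of distinct ids:

```stata
* Toy data: three hypothetical ids, some repeated
clear
input long id
1
1
2
3
3
3
end

* Flag one observation per distinct id
egen byte first = tag(id)
* Count the flagged observations; the result is left in r(N)
count if first
local n_ids = r(N)
display "distinct ids: `n_ids'"
```

The tag variable is also useful afterwards, for example to summarize patient-level characteristics with -if first- so that each patient is counted once.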
Vilma Antonov

Join Date: Aug 2022

Posts: 47
#11

20 Oct 2022, 03:14

I am so sorry for missing #6, #7 and #10, all great suggestions! Thanks a lot!