Count distinct values of variable 'Year' for each different id in variable 'Id'

Francesco Pinto

Join Date: Mar 2020

Posts: 16
#1

Count distinct values of variable 'Year' for each different id in variable 'Id'

29 Mar 2020, 12:14

Dear Statalist users,

this is my first post in the forum and I am a Stata beginner, so please forgive me if my post does not comply with the niveau inhere.
After searching through manuals and the Forum I am still struggling to solve the following problem:

I have a panel dataset with two variables, ID and YearSC.

ID contains ids of individuals. Due to the panel nature of the dataset the same id can occur multiple times.
YearSC contains the date(type:int) when the individual last entered his current job. The same Id can have different values of YearSC.

I would like to count, for each different id, how many different values for YearSC it has and create the corrisponding CNT variable.

This is the result I would like to obtain.

ID YearSC CNT

111 1980 1

222 1985 2

333 1990 3

222 1985 2

222 2005 2

333 1990 3

333 1995 3

333 2000 3

111 1980 1

I hope I could be clear enough and sorry if this is too basic. I saw other similar posts but I couldn't work it out for my problem.

Thank you for your help.

Francesco
Tags: None
Justin Niakamal

Join Date: Aug 2017

Posts: 757
#2

29 Mar 2020, 12:39

Code:

bys ID (YearSC): gen CNT = _N

Should do what you want if I'm interpreting your question correctly.
1 like
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10083
#3

29 Mar 2020, 12:40

Code:

isid ID YearSC bys ID: gen CNT= _N

If you have duplicates of ID and YearSC, the above code will not give you a count of distinct values of YearSC.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

29 Mar 2020, 12:42

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int(id yearsc) 111 1980 222 1985 333 1990 222 1985 222 2005 333 1990 333 1995 333 2000 111 1980 end by id yearsc, sort: gen cnt = (_n == 1) by id (yearsc): replace cnt = sum(cnt) by id (yearsc): replace cnt = cnt[_N]

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
2 likes
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

29 Mar 2020, 12:44

Welcome to Statalist.

The trick is to sort your data by ID and by YearSC, and then count the number of times YearSC is not the same as the previous value.

Code:

. by id (yearsc), sort:  generate newyear = yearsc!=yearsc[_n-1]

. by id: egen my_cnt = total(newyear)

. list, sepby(id)

     +---------------------------------------+
     |  id   yearsc   cnt   newyear   my_cnt |
     |---------------------------------------|
  1. | 111     1980     1         1        1 |
  2. | 111     1980     1         0        1 |
     |---------------------------------------|
  3. | 222     1985     2         1        2 |
  4. | 222     1985     2         0        2 |
  5. | 222     2005     2         1        2 |
     |---------------------------------------|
  6. | 333     1990     3         1        3 |
  7. | 333     1990     3         0        3 |
  8. | 333     1995     3         1        3 |
  9. | 333     2000     3         1        3 |
     +---------------------------------------+

Added in edit: Clyde beat me by 2 minutes, with an approach I hadn't thought of.

Last edited by William Lisowski; 29 Mar 2020, 12:46.

Comment

Francesco Pinto

Join Date: Mar 2020

Posts: 16
#6

29 Mar 2020, 15:16

Thank you very much (everyone) for your answers.
Clyde Schechter : Thank you very much, your suggestion helped me a lot (I was unprecise about the data and had to modify the first line, since CJobStart contains missing values).
William Lisowski : Thanks to you too. Your suggestion,was also helpful. The problem for the first three lines is also there though (see below).

But there is still an error in the results. It gives me the wrong count (=2 instead of 1), if a pid has always the same value in CJobStart.
In the example below the cnt variable gives the correct number except for pid: 10007857 (lines 1 to 3)

this is the code I used:

by pid CJobStart, sort: gen cnt = (_n == 1) if CJobStart <=. & CJobStart >=0
by pid (CJobStart): replace cnt = sum(cnt)
by pid (CJobStart): replace cnt = cnt[_N]

Here is an extract of the data I have (sorry if it is so long)

It would be great if you could help me (again). Thank you!

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int CJobStart long pid 1990 10007857 1990 10007857 1990 10007857 1985 10014608 1992 10014608 1984 10014608 1984 10014608 1984 10014608 1984 10014608 . 10016813 1998 10016813 . 10016813 . 10016813 . 10016813 . 10016848 1990 10016848 . 10016848 1994 10016848 . 10016872 1998 10016872 . 10016872 1998 10016872 1998 10016872 1998 10016872 . 10016872 . 10016872 . 10016872 . 10016872 . 10016872 1996 10017933 1997 10017933 1987 10017968 1990 10017968 1989 10017968 1977 10017968 . 10017968 2002 10017992 . 10017992 2004 10017992 2003 10017992 2006 10017992 2006 10017992 end label values CJobStart rjbbgy4

Last edited by Francesco Pinto; 29 Mar 2020, 15:29.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10083

29 Mar 2020, 15:52

With missing years, you need an -if- condition to exclude these observations.

EDIT: Subscripting within egen produces unstable results, as is documented. The code is therefore revised.

Code:

gen tag= !missing(CJobStart)
bys pid (CJobStart tag): gen sum = sum(CJobStart!= CJobStart[_n+1]) if tag
bys pid: egen wanted=max(sum)

Res.:

Code:

. list, sepby(pid)

     +-------------------------------+
     |      pid   CJobSt~t   wanted |
     |-------------------------------|
  1. | 10007857       1990         1 |
  2. | 10007857       1990         1 |
  3. | 10007857       1990         1 |
     |-------------------------------|
  4. | 10014608       1984         3 |
  5. | 10014608       1984         3 |
  6. | 10014608       1984         3 |
  7. | 10014608       1984         3 |
  8. | 10014608       1985         3 |
  9. | 10014608       1992         3 |
     |-------------------------------|
 10. | 10016813       1998         1 |
 11. | 10016813          .         1 |
 12. | 10016813          .         1 |
 13. | 10016813          .         1 |
 14. | 10016813          .         1 |
     |-------------------------------|
 15. | 10016848       1990         2 |
 16. | 10016848       1994         2 |
 17. | 10016848          .         2 |
 18. | 10016848          .         2 |
     |-------------------------------|
 19. | 10016872       1998         1 |
 20. | 10016872       1998         1 |
 21. | 10016872       1998         1 |
 22. | 10016872       1998         1 |
 23. | 10016872          .         1 |
 24. | 10016872          .         1 |
 25. | 10016872          .         1 |
 26. | 10016872          .         1 |
 27. | 10016872          .         1 |
 28. | 10016872          .         1 |
 29. | 10016872          .         1 |
     |-------------------------------|
 30. | 10017933       1996         2 |
 31. | 10017933       1997         2 |
     |-------------------------------|
 32. | 10017968       1977         4 |
 33. | 10017968       1987         4 |
 34. | 10017968       1989         4 |
 35. | 10017968       1990         4 |
 36. | 10017968          .         4 |
     |-------------------------------|
 37. | 10017992       2002         4 |
 38. | 10017992       2003         4 |
 39. | 10017992       2004         4 |
 40. | 10017992       2006         4 |
 41. | 10017992       2006         4 |
 42. | 10017992          .         4 |
     +-------------------------------+

.

Last edited by Andrew Musau; 29 Mar 2020, 16:24.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

29 Mar 2020, 15:54

So, I assume you do not want to count missing value is a distinct value of CJobStart. The following code gives the correct count of distinct non-missing values of CJobStart per person:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int CJobStart long pid
1990 10007857
1990 10007857
1990 10007857
1985 10014608
1992 10014608
1984 10014608
1984 10014608
1984 10014608
1984 10014608
   . 10016813
1998 10016813
   . 10016813
   . 10016813
   . 10016813
   . 10016848
1990 10016848
   . 10016848
1994 10016848
   . 10016872
1998 10016872
   . 10016872
1998 10016872
1998 10016872
1998 10016872
   . 10016872
   . 10016872
   . 10016872
   . 10016872
   . 10016872
1996 10017933
1997 10017933
1987 10017968
1990 10017968
1989 10017968
1977 10017968
   . 10017968
2002 10017992
   . 10017992
2004 10017992
2003 10017992
2006 10017992
2006 10017992
end
label values CJobStart rjbbgy4

gen byte mv = missing(CJobStart)
by pid mv CJobStart, sort: gen cnt = (_n == 1)  & !mv
by pid (mv CJobStart): replace cnt = sum(cnt)
by pid (mv CJobStart): replace cnt = cnt[_N]

Lesson to those reading along: in many, perhaps most, data management problems, the presence of missing values in a variable requires some kind of explicit attention in the code. When posting example data, if your real data contains missing values of variables, your example data should, too.

Comment

Francesco Pinto

Join Date: Mar 2020

Posts: 16
#9

30 Mar 2020, 04:24

Thank you very much!
Comment

Announcement