Count distinct values by groups

Duy To

Join Date: Sep 2020

Posts: 43
#1

Count distinct values by groups

29 Nov 2021, 15:34

Hello everyone,

I have one question related to counting distinct values by groups. Here is an example of the data with ID, year, and the job code (i.e., job_code). What I am looking for is to create a variable (or collapse data) that shows how many jobs they have throughout the year from 1994 to 1996. For example, ID 1 had 3 jobs, ID 2 had 1 and ID 3 had 3.

Code:

input ID year job_code 1 1994 50 1 1995 53 1 1996 60 2 1994 35 3 1994 68 3 1995 60 3 1996 53 end

I am aware of the command -distinct- :

Code:

bysort ID: distinct job_code

but of course, it only lists out the number of distinct values for each ID. I do not know how to incorporate the -distinct- command with other commands to create a new variable that can show me how many jobs each ID has.

Could anyone help me with this question, please? Thank you in advance, and stay safe.
Tags: None

Roman Mostazir

Join Date: Apr 2014
Posts: 870

29 Nov 2021, 16:31

Is that something you are after? (See below). I created two new variables of which, one shows the total number of distinct jobs one had (distinct2) repeated in each row, the other is showing the sum of unique job they had chronologically:

Code:

clear

input id year job_code
1 1994 50
1 1995 53
1 1996 60
2 1994 35
3 1994 68
3 1995 60
3 1996 53
4 1994 51
4 1995 51
4 1996 52
end

egen distinct = tag(id job_code)
byso id: replace distinct = sum(distinct)
byso id: egen distinct2 = max(distinct)

  id   year   job_code   distinct   distin~2  
     1   1994         50          1          3  
     1   1995         53          2          3  
     1   1996         60          3          3  
     2   1994         35          1          1  
     3   1994         68          1          3  
     3   1995         60          2          3  
     3   1996         53          3          3  
     4   1994         51          1          2  
     4   1995         51          1          2  
     4   1996         52          2          2

Roman

Comment

Duy To

Join Date: Sep 2020

Posts: 43
#3

29 Nov 2021, 17:02

Thanks, Roman, this is what I am looking for <3 Thanks again, and stay safe.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#4

29 Nov 2021, 18:18

This method, akin to that of Roman Mostazir but simpler, is discussed on p.563 of https://www.stata-journal.com/articl...article=dm0042 -- the paper that introducedm the distinct command.

Code:

clear input id year job_code 1 1994 50 1 1995 53 1 1996 60 2 1994 35 3 1994 68 3 1995 60 3 1996 53 4 1994 51 4 1995 51 4 1996 52 end egen tag = tag(id job_code) egen distinct = total(tag), by(id)
1 like
Comment
Duy To

Join Date: Sep 2020

Posts: 43
#5

29 Nov 2021, 18:27

Thank you for your help, Nick!!!
Comment

Announcement

Count distinct values by groups

Comment

Comment

Comment

Comment