Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count distinct values by groups

    Hello everyone,

    I have one question related to counting distinct values by groups. Here is an example of the data with ID, year, and the job code (i.e., job_code). What I am looking for is to create a variable (or collapse data) that shows how many jobs they have throughout the year from 1994 to 1996. For example, ID 1 had 3 jobs, ID 2 had 1 and ID 3 had 3.

    Code:
    input ID year job_code
    1 1994 50
    1 1995 53
    1 1996 60
    2 1994 35
    3 1994 68
    3 1995 60
    3 1996 53
    end
    I am aware of the command -distinct- :
    Code:
    bysort ID: distinct job_code
    but of course, it only lists out the number of distinct values for each ID. I do not know how to incorporate the -distinct- command with other commands to create a new variable that can show me how many jobs each ID has.

    Could anyone help me with this question, please? Thank you in advance, and stay safe.

  • #2
    Is that something you are after? (See below). I created two new variables of which, one shows the total number of distinct jobs one had (distinct2) repeated in each row, the other is showing the sum of unique job they had chronologically:

    Code:
    clear
    
    input id year job_code
    1 1994 50
    1 1995 53
    1 1996 60
    2 1994 35
    3 1994 68
    3 1995 60
    3 1996 53
    4 1994 51
    4 1995 51
    4 1996 52
    end
    
    egen distinct = tag(id job_code)
    byso id: replace distinct = sum(distinct)
    byso id: egen distinct2 = max(distinct)
    
      id   year   job_code   distinct   distin~2  
         1   1994         50          1          3  
         1   1995         53          2          3  
         1   1996         60          3          3  
         2   1994         35          1          1  
         3   1994         68          1          3  
         3   1995         60          2          3  
         3   1996         53          3          3  
         4   1994         51          1          2  
         4   1995         51          1          2  
         4   1996         52          2          2
    Roman

    Comment


    • #3
      Thanks, Roman, this is what I am looking for <3 Thanks again, and stay safe.

      Comment


      • #4
        This method, akin to that of Roman Mostazir but simpler, is discussed on p.563 of https://www.stata-journal.com/articl...article=dm0042 -- the paper that introducedm the distinct command.

        Code:
         
         clear  input id year job_code 1 1994 50 1 1995 53 1 1996 60 2 1994 35 3 1994 68 3 1995 60 3 1996 53 4 1994 51 4 1995 51 4 1996 52 end     
         egen tag = tag(id job_code) egen distinct = total(tag), by(id)

        Comment


        • #5
          Thank you for your help, Nick!!!

          Comment

          Working...
          X