Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count distinct values of variable 'Year' for each different id in variable 'Id'

    Dear Statalist users,

    this is my first post in the forum and I am a Stata beginner, so please forgive me if my post does not comply with the niveau inhere.
    After searching through manuals and the Forum I am still struggling to solve the following problem:

    I have a panel dataset with two variables, ID and YearSC.

    ID contains ids of individuals. Due to the panel nature of the dataset the same id can occur multiple times.
    YearSC contains the date(type:int) when the individual last entered his current job. The same Id can have different values of YearSC.

    I would like to count, for each different id, how many different values for YearSC it has and create the corrisponding CNT variable.

    This is the result I would like to obtain.

    ID YearSC CNT
    111 1980 1
    222 1985 2
    333 1990 3
    222 1985 2
    222 2005 2
    333 1990 3
    333 1995 3
    333 2000 3
    111 1980 1

    I hope I could be clear enough and sorry if this is too basic. I saw other similar posts but I couldn't work it out for my problem.

    Thank you for your help.

    Francesco


  • #2
    Code:
    bys ID (YearSC): gen CNT = _N
    Should do what you want if I'm interpreting your question correctly.

    Comment


    • #3
      Code:
      isid ID YearSC
      bys ID: gen CNT= _N
      If you have duplicates of ID and YearSC, the above code will not give you a count of distinct values of YearSC.

      Comment


      • #4
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int(id yearsc)
        111 1980
        222 1985
        333 1990
        222 1985
        222 2005
        333 1990
        333 1995
        333 2000
        111 1980
        end
        
        by id yearsc, sort: gen cnt = (_n == 1)
        by id (yearsc): replace cnt = sum(cnt)
        by id (yearsc): replace cnt = cnt[_N]

        In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        When asking for help with code, always show example data. When showing example data, always use -dataex-.

        Comment


        • #5
          Welcome to Statalist.

          The trick is to sort your data by ID and by YearSC, and then count the number of times YearSC is not the same as the previous value.
          Code:
          . by id (yearsc), sort:  generate newyear = yearsc!=yearsc[_n-1]
          
          . by id: egen my_cnt = total(newyear)
          
          . list, sepby(id)
          
               +---------------------------------------+
               |  id   yearsc   cnt   newyear   my_cnt |
               |---------------------------------------|
            1. | 111     1980     1         1        1 |
            2. | 111     1980     1         0        1 |
               |---------------------------------------|
            3. | 222     1985     2         1        2 |
            4. | 222     1985     2         0        2 |
            5. | 222     2005     2         1        2 |
               |---------------------------------------|
            6. | 333     1990     3         1        3 |
            7. | 333     1990     3         0        3 |
            8. | 333     1995     3         1        3 |
            9. | 333     2000     3         1        3 |
               +---------------------------------------+
          Added in edit: Clyde beat me by 2 minutes, with an approach I hadn't thought of.
          Last edited by William Lisowski; 29 Mar 2020, 12:46.

          Comment


          • #6
            Thank you very much (everyone) for your answers.
            Clyde Schechter : Thank you very much, your suggestion helped me a lot (I was unprecise about the data and had to modify the first line, since CJobStart contains missing values).
            William Lisowski : Thanks to you too. Your suggestion,was also helpful. The problem for the first three lines is also there though (see below).

            But there is still an error in the results. It gives me the wrong count (=2 instead of 1), if a pid has always the same value in CJobStart.
            In the example below the cnt variable gives the correct number except for pid: 10007857 (lines 1 to 3)

            this is the code I used:

            by pid CJobStart, sort: gen cnt = (_n == 1) if CJobStart <=. & CJobStart >=0
            by pid (CJobStart): replace cnt = sum(cnt)
            by pid (CJobStart): replace cnt = cnt[_N]

            Here is an extract of the data I have (sorry if it is so long)

            It would be great if you could help me (again). Thank you!

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input int CJobStart long pid
            1990 10007857
            1990 10007857
            1990 10007857
            1985 10014608
            1992 10014608
            1984 10014608
            1984 10014608
            1984 10014608
            1984 10014608
               . 10016813
            1998 10016813
               . 10016813
               . 10016813
               . 10016813
               . 10016848
            1990 10016848
               . 10016848
            1994 10016848
               . 10016872
            1998 10016872
               . 10016872
            1998 10016872
            1998 10016872
            1998 10016872
               . 10016872
               . 10016872
               . 10016872
               . 10016872
               . 10016872
            1996 10017933
            1997 10017933
            1987 10017968
            1990 10017968
            1989 10017968
            1977 10017968
               . 10017968
            2002 10017992
               . 10017992
            2004 10017992
            2003 10017992
            2006 10017992
            2006 10017992
            end
            label values CJobStart rjbbgy4
            Last edited by Francesco Pinto; 29 Mar 2020, 15:29.

            Comment


            • #7
              With missing years, you need an -if- condition to exclude these observations.

              EDIT: Subscripting within egen produces unstable results, as is documented. The code is therefore revised.

              Code:
              gen tag= !missing(CJobStart)
              bys pid (CJobStart tag): gen sum = sum(CJobStart!= CJobStart[_n+1]) if tag
              bys pid: egen wanted=max(sum)
              Res.:

              Code:
              . list, sepby(pid)
              
                   +-------------------------------+
                   |      pid   CJobSt~t   wanted |
                   |-------------------------------|
                1. | 10007857       1990         1 |
                2. | 10007857       1990         1 |
                3. | 10007857       1990         1 |
                   |-------------------------------|
                4. | 10014608       1984         3 |
                5. | 10014608       1984         3 |
                6. | 10014608       1984         3 |
                7. | 10014608       1984         3 |
                8. | 10014608       1985         3 |
                9. | 10014608       1992         3 |
                   |-------------------------------|
               10. | 10016813       1998         1 |
               11. | 10016813          .         1 |
               12. | 10016813          .         1 |
               13. | 10016813          .         1 |
               14. | 10016813          .         1 |
                   |-------------------------------|
               15. | 10016848       1990         2 |
               16. | 10016848       1994         2 |
               17. | 10016848          .         2 |
               18. | 10016848          .         2 |
                   |-------------------------------|
               19. | 10016872       1998         1 |
               20. | 10016872       1998         1 |
               21. | 10016872       1998         1 |
               22. | 10016872       1998         1 |
               23. | 10016872          .         1 |
               24. | 10016872          .         1 |
               25. | 10016872          .         1 |
               26. | 10016872          .         1 |
               27. | 10016872          .         1 |
               28. | 10016872          .         1 |
               29. | 10016872          .         1 |
                   |-------------------------------|
               30. | 10017933       1996         2 |
               31. | 10017933       1997         2 |
                   |-------------------------------|
               32. | 10017968       1977         4 |
               33. | 10017968       1987         4 |
               34. | 10017968       1989         4 |
               35. | 10017968       1990         4 |
               36. | 10017968          .         4 |
                   |-------------------------------|
               37. | 10017992       2002         4 |
               38. | 10017992       2003         4 |
               39. | 10017992       2004         4 |
               40. | 10017992       2006         4 |
               41. | 10017992       2006         4 |
               42. | 10017992          .         4 |
                   +-------------------------------+
              
              .
              Last edited by Andrew Musau; 29 Mar 2020, 16:24.

              Comment


              • #8
                So, I assume you do not want to count missing value is a distinct value of CJobStart. The following code gives the correct count of distinct non-missing values of CJobStart per person:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input int CJobStart long pid
                1990 10007857
                1990 10007857
                1990 10007857
                1985 10014608
                1992 10014608
                1984 10014608
                1984 10014608
                1984 10014608
                1984 10014608
                   . 10016813
                1998 10016813
                   . 10016813
                   . 10016813
                   . 10016813
                   . 10016848
                1990 10016848
                   . 10016848
                1994 10016848
                   . 10016872
                1998 10016872
                   . 10016872
                1998 10016872
                1998 10016872
                1998 10016872
                   . 10016872
                   . 10016872
                   . 10016872
                   . 10016872
                   . 10016872
                1996 10017933
                1997 10017933
                1987 10017968
                1990 10017968
                1989 10017968
                1977 10017968
                   . 10017968
                2002 10017992
                   . 10017992
                2004 10017992
                2003 10017992
                2006 10017992
                2006 10017992
                end
                label values CJobStart rjbbgy4
                
                gen byte mv = missing(CJobStart)
                by pid mv CJobStart, sort: gen cnt = (_n == 1)  & !mv
                by pid (mv CJobStart): replace cnt = sum(cnt)
                by pid (mv CJobStart): replace cnt = cnt[_N]


                Lesson to those reading along: in many, perhaps most, data management problems, the presence of missing values in a variable requires some kind of explicit attention in the code. When posting example data, if your real data contains missing values of variables, your example data should, too.

                Comment


                • #9
                  Thank you very much!

                  Comment

                  Working...
                  X