Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count number of distinct cases by group

    Hi Stata Users,
    I am using Stata version 15 to calculate the number of distinct cases (firm) by a group of two variables (entity and year). For instance, for entity 1, year 2010 I would have 2 (The aim is to avoid double counting!). Below is example of dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte entity int year str1 firm float value
    1 2010 "A" 15
    1 2010 "A"  8
    1 2010 "B" 12
    1 2011 "B" 25
    1 2012 "B"  8
    2 2010 "A"  7
    2 2011 "A"  5
    2 2011 "A" 12
    2 2011 "C" 13
    2 2012 "A" 19
    2 2012 "B" 25
    2 2011 "B" 14
    2 2012 "C" 18
    2 2012 "D" 16
    end
    Thanks in advance!

  • #2
    Hello Stephen,

    You might try
    Code:
    bysort entity year firm : count

    Comment


    • #3
      Thanks for the data example!

      This question arises often here. The paper in SJ 8-4 (2008) accessible at https://www.stata-journal.com/articl...article=dm0042 includes what is possibly the most useful trick, as below.

      SJ-15-3 dm0042_2 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q3/15 SJ 15(3):899
      improved table format and display of large numbers of
      observations

      SJ-12-2 dm0042_1 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q2/12 SJ 12(2):352
      options added to restrict output to variables with a minimum
      or maximum of distinct values


      SJ-8-4 dm0042 . . . . . . . . . . . . Speaking Stata: Distinct observations
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q4/08 SJ 8(4):557--568
      shows how to answer questions about distinct observations
      from first principles; provides a convenience command


      dm0042 is thus revealed as an otherwise unpredictable search term for other discussions on Statalist.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte entity int year str1 firm float value
      1 2010 "A" 15
      1 2010 "A"  8
      1 2010 "B" 12
      1 2011 "B" 25
      1 2012 "B"  8
      2 2010 "A"  7
      2 2011 "A"  5
      2 2011 "A" 12
      2 2011 "C" 13
      2 2012 "A" 19
      2 2012 "B" 25
      2 2011 "B" 14
      2 2012 "C" 18
      2 2012 "D" 16
      end
      
      * two lines of white magic
      egen tag = tag(firm entity year)
      egen ndistinct = total(tag), by(entity year)
      
      sort entity year
      list, sepby(entity year)
      
           +-----------------------------------------------+
           | entity   year   firm   value   tag   ndisti~t |
           |-----------------------------------------------|
        1. |      1   2010      B      12     1          2 |
        2. |      1   2010      A      15     1          2 |
        3. |      1   2010      A       8     0          2 |
           |-----------------------------------------------|
        4. |      1   2011      B      25     1          1 |
           |-----------------------------------------------|
        5. |      1   2012      B       8     1          1 |
           |-----------------------------------------------|
        6. |      2   2010      A       7     1          1 |
           |-----------------------------------------------|
        7. |      2   2011      B      14     1          3 |
        8. |      2   2011      C      13     1          3 |
        9. |      2   2011      A      12     0          3 |
       10. |      2   2011      A       5     1          3 |
           |-----------------------------------------------|
       11. |      2   2012      C      18     1          4 |
       12. |      2   2012      B      25     1          4 |
       13. |      2   2012      A      19     1          4 |
       14. |      2   2012      D      16     1          4 |
           +-----------------------------------------------+
      egenmore (SSC) contains code for an nvals() function for egen from 2000. It took me some years to see instead that the calculation is an easy two-step with existing official functions.

      Comment


      • #4
        Thanks Daniel Schaefer for your proposal. However, it doesn't generate a new variable which is my goal.

        Comment


        • #5
          Nick Cox thanks for your constant brilliance Your solutions are perfect! I appreciate the paper you've shared that extensively tackles the matter.

          Comment


          • #6
            How can I count the frequencies for each value in "ndistinct" over all the groups that match "entity" and "year"?

            The desired table would look like this:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte ndistinct freq
            1 3
            2 1
            3 1
            4 1

            end
            sort ndistinct freq
            list, sepby(ndistinct freq)
            Last edited by rezza cav; 07 Jul 2022, 15:59.

            Comment


            • #7
              I have a very similar problem where I have not found the solution yet.

              In my data (example from other post) I also have multiple entries for the same population ("entity") but also for different years ("year").

              1. With "egen tag" and "egen ndistinct" I can find out how many different "firms" each combination of "entity" and "year" (group?) has.
              ( I will have to do this not only for "firm" but also for "value" and other variables)

              2. Now I would like to know the frequencies for each value in "ndistinct" over all the combinations of "entity" and "year".
              (To get an idea of the variation within different variables and think about how to merge them in the most sensible way).

              3. Last I need to make sure there is only one observation per combinations of "entity" and "year" and therefore combine (merge, collapse, append...?) the values of other variables. If there is a most frequent value, choose that one, if there is not, choose the highest or first, or in some other cases the mean or sum (to be defined for each variable).

              I really struggle to find a way how to do that. I hope my example is clear.

              Original data
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte entity int year str1 firm float value
              1 2010 "A" 15
              1 2010 "A"  8
              1 2010 "B" 12
              1 2011 "B" 25
              1 2012 "B"  8
              2 2010 "A"  7
              2 2011 "A"  5
              2 2011 "A" 12
              2 2011 "C" 13
              2 2012 "A" 19
              2 2012 "B" 25
              2 2011 "B" 14
              2 2012 "C" 18
              2 2012 "D" 16
              end
              
              sort entity year
              list, sepby(entity year)
              1. Question
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte entity int year str1 firm float value
              1 2010 "A" 15
              1 2010 "A"  8
              1 2010 "B" 12
              1 2011 "B" 25
              1 2012 "B"  8
              2 2010 "A"  7
              2 2011 "A"  5
              2 2011 "A" 12
              2 2011 "C" 13
              2 2012 "A" 19
              2 2012 "B" 25
              2 2011 "B" 14
              2 2012 "C" 18
              2 2012 "D" 16
              end
              
              egen tag = tag(firm entity year)
              egen ndistinct = total(tag), by(entity year)
              
              sort entity year
              list, sepby(entity year)
              2. Question -> This is the table I would like to obtain
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte ndistinct freq
              1 3
              2 1
              3 1
              4 1
              end
              sort ndistinct freq
              list, sepby(ndistinct freq)
              3. Question?

              See also:
              https://www.statalist.org/forums/for...iven-condition
              https://www.statalist.org/forums/for...r-each-subject
              https://www.statalist.org/forums/for...f-similar-type

              Comment


              • #8
                Rezza,

                Please don't make identical posts across several threads like this. This behavior may be considered "spamming" the forum. You should incentivize people to respond to a single thread so that any interested readers can see all of the responses to your post.
                Last edited by Daniel Schaefer; 08 Jul 2022, 10:05.

                Comment


                • #9
                  Dear Daniel
                  Thank you very much for your reply. I did not mean to spam the forum. The individual threads all have a similar questions to mine, but only provide part of the solution I am looking for. That's why I thought it would fit in several threads. How do you think I should proceed to find an answer?

                  Comment


                  • #10
                    Rezza, I see that you have a parallel conversation about this same issue here: https://www.statalist.org/forums/for...r-each-subject

                    Please just post in the other thread going forward. If you have a parallel conversation happening somewhere else, please let us know and provide a link.
                    Last edited by Daniel Schaefer; 11 Jul 2022, 11:26.

                    Comment

                    Working...
                    X