Count number of distinct cases by group

Stephen Okiya

Join Date: Mar 2025

Posts: 280
#1

Count number of distinct cases by group

03 Jun 2020, 08:15

Hi Stata Users,
I am using Stata version 15 to calculate the number of distinct cases (firm) by a group of two variables (entity and year). For instance, for entity 1, year 2010 I would have 2 (The aim is to avoid double counting!). Below is example of dataset

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end

Thanks in advance!
Tags: None
Daniel Schaefer

Join Date: Mar 2020

Posts: 807
#2

03 Jun 2020, 08:35

Hello Stephen,

You might try

Code:

bysort entity year firm : count
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35256

03 Jun 2020, 08:52

Thanks for the data example!

This question arises often here. The paper in SJ 8-4 (2008) accessible at https://www.stata-journal.com/articl...article=dm0042 includes what is possibly the most useful trick, as below.

SJ-15-3 dm0042_2 . . . . . . . . . . . . . . . . Software update for distinct
(help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
Q3/15 SJ 15(3):899
improved table format and display of large numbers of
observations

SJ-12-2 dm0042_1 . . . . . . . . . . . . . . . . Software update for distinct
(help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
Q2/12 SJ 12(2):352
options added to restrict output to variables with a minimum
or maximum of distinct values

SJ-8-4 dm0042 . . . . . . . . . . . . Speaking Stata: Distinct observations
(help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
Q4/08 SJ 8(4):557--568
shows how to answer questions about distinct observations
from first principles; provides a convenience command

dm0042 is thus revealed as an otherwise unpredictable search term for other discussions on Statalist.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte entity int year str1 firm float value
1 2010 "A" 15
1 2010 "A"  8
1 2010 "B" 12
1 2011 "B" 25
1 2012 "B"  8
2 2010 "A"  7
2 2011 "A"  5
2 2011 "A" 12
2 2011 "C" 13
2 2012 "A" 19
2 2012 "B" 25
2 2011 "B" 14
2 2012 "C" 18
2 2012 "D" 16
end

* two lines of white magic
egen tag = tag(firm entity year)
egen ndistinct = total(tag), by(entity year)

sort entity year
list, sepby(entity year)

     +-----------------------------------------------+
     | entity   year   firm   value   tag   ndisti~t |
     |-----------------------------------------------|
  1. |      1   2010      B      12     1          2 |
  2. |      1   2010      A      15     1          2 |
  3. |      1   2010      A       8     0          2 |
     |-----------------------------------------------|
  4. |      1   2011      B      25     1          1 |
     |-----------------------------------------------|
  5. |      1   2012      B       8     1          1 |
     |-----------------------------------------------|
  6. |      2   2010      A       7     1          1 |
     |-----------------------------------------------|
  7. |      2   2011      B      14     1          3 |
  8. |      2   2011      C      13     1          3 |
  9. |      2   2011      A      12     0          3 |
 10. |      2   2011      A       5     1          3 |
     |-----------------------------------------------|
 11. |      2   2012      C      18     1          4 |
 12. |      2   2012      B      25     1          4 |
 13. |      2   2012      A      19     1          4 |
 14. |      2   2012      D      16     1          4 |
     +-----------------------------------------------+

egenmore (SSC) contains code for an nvals() function for egen from 2000. It took me some years to see instead that the calculation is an easy two-step with existing official functions.

Comment

Stephen Okiya

Join Date: Mar 2025

Posts: 280
#4

03 Jun 2020, 23:54

Thanks Daniel Schaefer for your proposal. However, it doesn't generate a new variable which is my goal.
Comment
Stephen Okiya

Join Date: Mar 2025

Posts: 280
#5

03 Jun 2020, 23:57

Nick Cox thanks for your constant brilliance Your solutions are perfect! I appreciate the paper you've shared that extensively tackles the matter.
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#6

07 Jul 2022, 15:50

How can I count the frequencies for each value in "ndistinct" over all the groups that match "entity" and "year"?

The desired table would look like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ndistinct freq
1 3
2 1
3 1
4 1

end
sort ndistinct freq
list, sepby(ndistinct freq)

Last edited by rezza cav; 07 Jul 2022, 15:59.
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#7

08 Jul 2022, 05:56

I have a very similar problem where I have not found the solution yet.

In my data (example from other post) I also have multiple entries for the same population ("entity") but also for different years ("year").

1. With "egen tag" and "egen ndistinct" I can find out how many different "firms" each combination of "entity" and "year" (group?) has.
( I will have to do this not only for "firm" but also for "value" and other variables)

2. Now I would like to know the frequencies for each value in "ndistinct" over all the combinations of "entity" and "year".
(To get an idea of the variation within different variables and think about how to merge them in the most sensible way).

3. Last I need to make sure there is only one observation per combinations of "entity" and "year" and therefore combine (merge, collapse, append...?) the values of other variables. If there is a most frequent value, choose that one, if there is not, choose the highest or first, or in some other cases the mean or sum (to be defined for each variable).

I really struggle to find a way how to do that. I hope my example is clear.

Original data

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end sort entity year list, sepby(entity year)

1. Question

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end egen tag = tag(firm entity year) egen ndistinct = total(tag), by(entity year) sort entity year list, sepby(entity year)

2. Question -> This is the table I would like to obtain

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte ndistinct freq 1 3 2 1 3 1 4 1 end sort ndistinct freq list, sepby(ndistinct freq)

3. Question?

See also:
https://www.statalist.org/forums/for...iven-condition
https://www.statalist.org/forums/for...r-each-subject
https://www.statalist.org/forums/for...f-similar-type
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 807
#8

08 Jul 2022, 10:00

Rezza,

Please don't make identical posts across several threads like this. This behavior may be considered "spamming" the forum. You should incentivize people to respond to a single thread so that any interested readers can see all of the responses to your post.

Last edited by Daniel Schaefer; 08 Jul 2022, 10:05.
1 like
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#9

10 Jul 2022, 23:35

Dear Daniel
Thank you very much for your reply. I did not mean to spam the forum. The individual threads all have a similar questions to mine, but only provide part of the solution I am looking for. That's why I thought it would fit in several threads. How do you think I should proceed to find an answer?
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 807
#10

11 Jul 2022, 10:53

Rezza, I see that you have a parallel conversation about this same issue here: https://www.statalist.org/forums/for...r-each-subject

Please just post in the other thread going forward. If you have a parallel conversation happening somewhere else, please let us know and provide a link.

Last edited by Daniel Schaefer; 11 Jul 2022, 11:26.
Comment

Announcement