How to count the number of distinct values (of a variable) that satisfy a given condition

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#1

How to count the number of distinct values (of a variable) that satisfy a given condition

28 Jan 2021, 15:31

Dear all,

I am using a dataset that is such that for a given value of the identifying variable ("var_id") we observe several values for another variable, let's say "var_of_interest".

In particular, the dataset is like

var_id var_of_interest

11 PI

11 PA

11 CE

11 BA

12 SP

12 SP

12 BA

12 RN

I would like to create a variable ("new_var") that counts the number of distinct values of "var_of_interest" that are different from "BA".

In particular, I would like the dataset to be like the following

var_id var_of_interest new_var

11 PI 3

11 PA 3

11 CE 3

11 BA 3

12 SP 2

12 SP 2

12 BA 2

12 RN 2

where "new_var" is the variable that retrieves the quantity of interest for each different id (from the variable "var_id").

Can you help me with that?

Thank you very much.

Any help is greatly appreciated.

Obs: here is the code to import the fictitious data:

clear
input var_id str2 var_of_interest
11 "PI"
11 "PA"
11 "CE"
11 "BA"
12 "SP"
12 "SP"
12 "BA"
12 "RN"
end
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10056

28 Jan 2021, 16:11

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float var_id str2 var_of_interest
11 "PI"
11 "PA"
11 "CE"
11 "BA"
12 "SP"
12 "SP"
12 "BA"
12 "RN"
end

bys var_id: egen wanted= total(var_of_interest!="BA")

Res.:

Code:

. l, sepby(var_id)

     +----------------------------+
     | var_id   var_of~t   wanted |
     |----------------------------|
  1. |     11         PI        3 |
  2. |     11         PA        3 |
  3. |     11         CE        3 |
  4. |     11         BA        3 |
     |----------------------------|
  5. |     12         SP        3 |
  6. |     12         SP        3 |
  7. |     12         BA        3 |
  8. |     12         RN        3 |
     +----------------------------+

Comment

Ali Atia

Join Date: May 2020

Posts: 737
#3

28 Jan 2021, 16:11

Code:

bys var_id var_of_interest: gen nvals = _n == 1 if var_of_interest!="BA" bys var_id: egen wanted = total(nvals) drop nvals

See also:
https://www.stata.com/support/faqs/d...-observations/
Comment
Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#4

28 Jan 2021, 16:19

Thank you very much Ali Atia and Andrew Musau !!

I am curious about why the solution of Andrew did return a wrong value for var_id == 12 (the correct value for 'var_of_interest' in this case is 2 and not 3).

The solution from Ali works perfectly!

Thanks again!
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3047

28 Jan 2021, 16:22

Try something like this:

Code:

. gen  new_var = . 
(8 missing values generated)

. forvalues i=11/12 {
  2. levelsof var_of_interest if var_id==`i' & var_of_interest!="BA"
  3. replace  new_var = r(r) if var_id==`i'
  4. }
`"CE"' `"PA"' `"PI"'
(4 real changes made)
`"RN"' `"SP"'
(4 real changes made)

. 
. list, sepby(var_id)

     +-----------------------------+
     | var_id   var_of~t   new_var |
     |-----------------------------|
  1. |     11         PI         3 |
  2. |     11         PA         3 |
  3. |     11         CE         3 |
  4. |     11         BA         3 |
     |-----------------------------|
  5. |     12         SP         2 |
  6. |     12         SP         2 |
  7. |     12         BA         2 |
  8. |     12         RN         2 |
     +-----------------------------+

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10056

28 Jan 2021, 16:30

I did not notice that you have duplicates. -egen(total)- will just count how many observations are within a group. You can tag distinct observations (or first occurrences).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float var_id str2 var_of_interest
11 "PI"
11 "PA"
11 "CE"
11 "BA"
12 "SP"
12 "SP"
12 "BA"
12 "RN"
end

egen fob= tag(var_id var_of_interest)
bys var_id: egen wanted=total(cond(fob & var_of_interest!="BA", 1, 0))

Res.:

Code:

. l, sepby(var_id)

     +----------------------------------+
     | var_id   var_of~t   fob   wanted |
     |----------------------------------|
  1. |     11         PI     1        3 |
  2. |     11         PA     1        3 |
  3. |     11         CE     1        3 |
  4. |     11         BA     1        3 |
     |----------------------------------|
  5. |     12         SP     1        2 |
  6. |     12         SP     0        2 |
  7. |     12         BA     1        2 |
  8. |     12         RN     1        2 |
     +----------------------------------+

Last edited by Andrew Musau; 28 Jan 2021, 16:49.

Comment

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#7

28 Jan 2021, 20:39

Thank you very much Andrew Musau and Joro Kolev , your solutions worked like a charm!!

Now I understood Andrew the point of using "egen ... total" in the presence of duplicates, many thanks!
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3047

29 Jan 2021, 03:44

Both Ali's solution in #3 and Andrew's solution in #6 are preferable because they do not rely on looping over observations, and looping over observations is relatively slow. They both rely on the same idea of tagging an observation per group. Most probably Ali's solution would be the fastest over a large dataset.

Here is a variation of an egen solution like Andrew's, but more similar to Ali's way of tagging. The point is that -egen, tag- also accepts if/in conditions:

Code:

. egen tagnoba = tag( var_id var_of_interest) if var_of_interest!="BA"

. egen new_var = total(tagnoba), by( var_id )

. list, sepby(var_id)

     +---------------------------------------+
     | var_id   var_of~t   tagnoba   new_var |
     |---------------------------------------|
  1. |     11         PI         1         3 |
  2. |     11         PA         1         3 |
  3. |     11         CE         1         3 |
  4. |     11         BA         0         3 |
     |---------------------------------------|
  5. |     12         SP         1         2 |
  6. |     12         SP         0         2 |
  7. |     12         BA         0         2 |
  8. |     12         RN         1         2 |
     +---------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35405
#9

29 Jan 2021, 04:25

The tag() function of egen goes back a long way in Stata (1999) and just encapsulates a longer-standing trick that I guess is in the manuals somewhere and/or was discussed on Statalist in the 1990s (all the old posts have disappeared).

In essence the nearest simple equivalent is

Code:

bysort varlist : gen tag = _n == 1

where varlist means a variable list, not something you type directly.

But -- as documented -- the egen function tag() returns 0 for observations excluded by an if or in condition, which people are understandably surprised by if they expect the call to return missings for excluded observations.

This was designed in as what seemed more useful behaviour. for most problems. I suppose there should be an option for those who want to insist on missings in that circumstance, but I don't recall much grumbling about it;
1 like
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#10

07 Jul 2022, 14:40

So what is the correct syntax if I want the "new_var" to count the number of distinct values of "var_of_interest" in general?

And how can I find out the frequency of each specific value of "var_of_interest" within the same "var_id"?
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#11

08 Jul 2022, 05:55

I have a very similar problem where I have not found the solution yet.

In my data (example from other post) I also have multiple entries for the same population ("entity") but also for different years ("year").

1. With "egen tag" and "egen ndistinct" I can find out how many different "firms" each combination of "entity" and "year" (group?) has.
( I will have to do this not only for "firm" but also for "value" and other variables)

2. Now I would like to know the frequencies for each value in "ndistinct" over all the combinations of "entity" and "year".
(To get an idea of the variation within different variables and think about how to merge them in the most sensible way).

3. Last I need to make sure there is only one observation per combinations of "entity" and "year" and therefore combine (merge, collapse, append...?) the values of other variables. If there is a most frequent value, choose that one, if there is not, choose the highest or first, or in some other cases the mean or sum (to be defined for each variable).

I really struggle to find a way how to do that. I hope my example is clear.

Original data

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end sort entity year list, sepby(entity year)

1. Question

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end egen tag = tag(firm entity year) egen ndistinct = total(tag), by(entity year) sort entity year list, sepby(entity year)

2. Question -> This is the table I would like to obtain

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte ndistinct freq 1 3 2 1 3 1 4 1 end sort ndistinct freq list, sepby(ndistinct freq)

3. Question?

See also:
https://www.statalist.org/forums/for...cases-by-group
https://www.statalist.org/forums/for...r-each-subject
https://www.statalist.org/forums/for...f-similar-type
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29942
#12

08 Jul 2022, 10:03

This is one of 6 threads on which this question is posted. Answered at https://www.statalist.org/forums/for...r-each-subject.
1 like
Comment

Announcement