Counting distinct number of identifiers in a panel dataset

Nicole Zander

Join Date: Jan 2016

Posts: 38
#1

Counting distinct number of identifiers in a panel dataset

13 Jul 2016, 08:03

Dear reader,

I have a panel dataset containing executive id's, years and firm id's (cusips)

For every firm, over the time period 2002 to 2009, I want to know how many executives were once in power.

Example of the dataset:

fyear cusip execid
2002 001055 00013
2003 001055 00013
2004 001055 00013
2005 001055 00013
2006 001055 00013
2007 001055 00013
2008 001055 00013
2009 001055 00013
2004 004239 28561
2005 004239 28561
2006 004239 28561
2007 004239 28561
2008 004239 28561
2009 004239 28561
2002 00817Y 20970
2003 00817Y 20970
2004 00817Y 20970
2005 00817Y 20970
2006 00817Y 14660
2007 00817Y 14660
2008 00817Y 14660
2009 00817Y 14660

I want to create a variable that counts the number of execid per cusip over the years 2002 to 2009.
This would mean that for cusip 001055 this number will be 1
for cusip 004239 this number will also be 1
for cusip 00817y however this number will be 2 because there are two different execid id's.

I have tried the following:

Code:

by execid cusip, sort: gen nvals = _n == 1 if inrange(fyear,2002,2009) count if nvals sort Cusip by Cusip: replace nvals = sum(nvals) by Cusip: replace nvals = nvals[_N]

But this is incorrect because Stata 12 counts from random years in the 2002 to 2009 range.
An example of a mistake is that is counts 2 execid's for companies that only have existed from 2007 onwards that only had 1 execid.

Does anyone know the answer to this problem?

Regards,
Nicole
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

13 Jul 2016, 08:50

You can do this "from first principles" as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int fyear str6 cusip int execid
2002 "001055"    13
2003 "001055"    13
2004 "001055"    13
2005 "001055"    13
2006 "001055"    13
2007 "001055"    13
2008 "001055"    13
2009 "001055"    13
2004 "004239" 28561
2005 "004239" 28561
2006 "004239" 28561
2007 "004239" 28561
2008 "004239" 28561
2009 "004239" 28561
2002 "00817Y" 20970
2003 "00817Y" 20970
2004 "00817Y" 20970
2005 "00817Y" 20970
2006 "00817Y" 14660
2007 "00817Y" 14660
2008 "00817Y" 14660
2009 "00817Y" 14660
end

by cusip execid (fyear), sort: gen byte exec_count = 1 if _n == 1 & inrange(fyear, 2002, 2009)
by cusip (execid fyear): replace exec_count = sum(exec_count)
by cusip (exec_count), sort: replace exec_count = exec_count[_N]

list, noobs clean

Or, -findit distinct- will lead you to a link to Nick Cox and G.M. Longton's -distinct.ado-, which you can install and use according to the instructions in its helpfile.

In the future, please use -dataex- to show example data (-ssc install dataex-, -help dataex-). It took much longer to re-create your example the way it was posted than it took to solve your problem. Please be considerate of those who want to help you.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35433
#3

13 Jul 2016, 08:53

What Clyde said, plus http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

Also, a search of this forum for dm0042 -- now that incantation has been revealed will show related threads.
Comment
Daniel Bela

Join Date: Apr 2014

Posts: 246
#4

13 Jul 2016, 08:54

Hi Nicole,

I'm not sure I correctly understood your question, but you could try something like this:

Code:

clear input fyear str20(cusip execid) 2002 "001055" "00013" 2003 "001055" "00013" 2004 "001055" "00013" 2005 "001055" "00013" 2006 "001055" "00013" 2007 "001055" "00013" 2008 "001055" "00013" 2009 "001055" "00013" 2004 "004239" "28561" 2005 "004239" "28561" 2006 "004239" "28561" 2007 "004239" "28561" 2008 "004239" "28561" 2009 "004239" "28561" 2002 "00817Y" "20970" 2003 "00817Y" "20970" 2004 "00817Y" "20970" 2005 "00817Y" "20970" 2006 "00817Y" "14660" 2007 "00817Y" "14660" 2008 "00817Y" "14660" 2009 "00817Y" "14660" end compress egen new_execid=tag(cusip execid) generate Cusip=sum(new_execid)

Essentially, we first create a tag marking each event when a new executive is in charge, and afterwards calculate the cumulative sum of these events as your target variable.

Hope this helps. Regards
Bela

PS: Please consider installing -dataex- from SSC and posting data examples using -dataex-'s output for future questions. This really helps to quickly get into the topic and find a solution without manually inserting quotes and so into example data.
Comment
Daniel Bela

Join Date: Apr 2014

Posts: 246
#5

13 Jul 2016, 08:55

Hm, Clyde and Nick were, as usually, quicker with their replies. And their answers are, also as usual, perfectly correct.

Last edited by Daniel Bela; 13 Jul 2016, 08:59. Reason: typo
Comment
Nicole Zander

Join Date: Jan 2016

Posts: 38
#6

28 Jul 2016, 03:44

Dear all,

Thank you all for your help.

I have tried Clydes method and that worked for the question at hand.

In the same light as the other question: I now need unique executive id's over the time period 2002 to 2009 for EVERY year.
This is so I can count the number of board memebers.

With Clyde's command I get the count of unique executvie id's (execid) over the time period 2002 to 2009.
I however need this for every year. Should I use the same command but only change the inrange to 2002,2002 ; 2003,2003 ; 2004,2004 etc.

Regards,
Nicole

Last edited by Nicole Zander; 28 Jul 2016, 03:52. Reason: Alteration of question
Comment

Announcement