Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting distinct number of identifiers in a panel dataset

    Dear reader,

    I have a panel dataset containing executive id's, years and firm id's (cusips)

    For every firm, over the time period 2002 to 2009, I want to know how many executives were once in power.

    Example of the dataset:

    fyear cusip execid
    2002 001055 00013
    2003 001055 00013
    2004 001055 00013
    2005 001055 00013
    2006 001055 00013
    2007 001055 00013
    2008 001055 00013
    2009 001055 00013
    2004 004239 28561
    2005 004239 28561
    2006 004239 28561
    2007 004239 28561
    2008 004239 28561
    2009 004239 28561
    2002 00817Y 20970
    2003 00817Y 20970
    2004 00817Y 20970
    2005 00817Y 20970
    2006 00817Y 14660
    2007 00817Y 14660
    2008 00817Y 14660
    2009 00817Y 14660

    I want to create a variable that counts the number of execid per cusip over the years 2002 to 2009.
    This would mean that for cusip 001055 this number will be 1
    for cusip 004239 this number will also be 1
    for cusip 00817y however this number will be 2 because there are two different execid id's.

    I have tried the following:
    Code:
    by execid cusip, sort: gen nvals = _n == 1 if inrange(fyear,2002,2009)
    count if nvals
    sort Cusip
    by Cusip: replace nvals = sum(nvals)
    by Cusip: replace nvals = nvals[_N]
    But this is incorrect because Stata 12 counts from random years in the 2002 to 2009 range.
    An example of a mistake is that is counts 2 execid's for companies that only have existed from 2007 onwards that only had 1 execid.

    Does anyone know the answer to this problem?

    Regards,
    Nicole


  • #2
    You can do this "from first principles" as follows:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int fyear str6 cusip int execid
    2002 "001055"    13
    2003 "001055"    13
    2004 "001055"    13
    2005 "001055"    13
    2006 "001055"    13
    2007 "001055"    13
    2008 "001055"    13
    2009 "001055"    13
    2004 "004239" 28561
    2005 "004239" 28561
    2006 "004239" 28561
    2007 "004239" 28561
    2008 "004239" 28561
    2009 "004239" 28561
    2002 "00817Y" 20970
    2003 "00817Y" 20970
    2004 "00817Y" 20970
    2005 "00817Y" 20970
    2006 "00817Y" 14660
    2007 "00817Y" 14660
    2008 "00817Y" 14660
    2009 "00817Y" 14660
    end
    
    by cusip execid (fyear), sort: gen byte exec_count = 1 if _n == 1 & inrange(fyear, 2002, 2009)
    by cusip (execid fyear): replace exec_count = sum(exec_count)
    by cusip (exec_count), sort: replace exec_count = exec_count[_N]
    
    list, noobs clean
    Or, -findit distinct- will lead you to a link to Nick Cox and G.M. Longton's -distinct.ado-, which you can install and use according to the instructions in its helpfile.

    In the future, please use -dataex- to show example data (-ssc install dataex-, -help dataex-). It took much longer to re-create your example the way it was posted than it took to solve your problem. Please be considerate of those who want to help you.

    Comment


    • #3
      What Clyde said, plus http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

      Also, a search of this forum for dm0042 -- now that incantation has been revealed will show related threads.

      Comment


      • #4
        Hi Nicole,

        I'm not sure I correctly understood your question, but you could try something like this:
        Code:
        clear
        input fyear str20(cusip execid)
        2002 "001055" "00013"
        2003 "001055" "00013"
        2004 "001055" "00013"
        2005 "001055" "00013"
        2006 "001055" "00013"
        2007 "001055" "00013"
        2008 "001055" "00013"
        2009 "001055" "00013"
        2004 "004239" "28561"
        2005 "004239" "28561"
        2006 "004239" "28561"
        2007 "004239" "28561"
        2008 "004239" "28561"
        2009 "004239" "28561"
        2002 "00817Y" "20970"
        2003 "00817Y" "20970"
        2004 "00817Y" "20970"
        2005 "00817Y" "20970"
        2006 "00817Y" "14660"
        2007 "00817Y" "14660"
        2008 "00817Y" "14660"
        2009 "00817Y" "14660"
        end
        compress
        
        egen new_execid=tag(cusip execid)
        generate Cusip=sum(new_execid)
        Essentially, we first create a tag marking each event when a new executive is in charge, and afterwards calculate the cumulative sum of these events as your target variable.

        Hope this helps. Regards
        Bela

        PS: Please consider installing -dataex- from SSC and posting data examples using -dataex-'s output for future questions. This really helps to quickly get into the topic and find a solution without manually inserting quotes and so into example data.

        Comment


        • #5
          Hm, Clyde and Nick were, as usually, quicker with their replies. And their answers are, also as usual, perfectly correct.
          Last edited by Daniel Bela; 13 Jul 2016, 08:59. Reason: typo

          Comment


          • #6
            Dear all,

            Thank you all for your help.

            I have tried Clydes method and that worked for the question at hand.

            In the same light as the other question: I now need unique executive id's over the time period 2002 to 2009 for EVERY year.
            This is so I can count the number of board memebers.

            With Clyde's command I get the count of unique executvie id's (execid) over the time period 2002 to 2009.
            I however need this for every year. Should I use the same command but only change the inrange to 2002,2002 ; 2003,2003 ; 2004,2004 etc.

            Regards,
            Nicole
            Last edited by Nicole Zander; 28 Jul 2016, 03:52. Reason: Alteration of question

            Comment

            Working...
            X