Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count of distinct units in a panel with "by" option

    Hi Beloved Stata Members
    I would like to get the count of distinct units (co_code in my case) based on say each year (how many distinct units are there in each year). I tried user written command distinct, but it doesn't work with option-by-

    Here is my sample data
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long co_code int year float(ncfo eqissue)
        11 2000       .      .
        11 2001       .      .
        11 2002       .      .
        11 2003       .      .
        11 2004       .      .
        11 2005       .      .
        11 2006       .      .
        11 2007       .      .
        11 2008       .      .
        11 2009      .6   60.3
        11 2010    21.1   -6.9
        11 2011     2.3     .3
        11 2012   127.4   51.5
        11 2013   445.8   83.5
        11 2014   367.5     70
        11 2015     303      .
        11 2016   254.2     49
        11 2017   444.7      .
        11 2018   605.7   -2.1
        11 2019   533.6   -1.1
        11 2020   447.4      .
        11 2021   520.4    -.2
     96387 2000       .      .
     96387 2001       .      .
     96387 2002       .      .
     96387 2003       .      .
     96387 2004    39.7     .1
     96387 2005   285.4    1.8
     96387 2006   263.8 1554.7
     96387 2007  1167.6   10.3
     96387 2008  2910.6 -123.1
     96387 2009  5906.6     12
     96387 2010  3422.3 3097.8
     96387 2011  2991.6 1664.9
     96387 2012 -1544.6      .
     96387 2013   494.8 -164.2
     96387 2014   497.6      .
     96387 2015  -513.2      .
     96387 2016  1990.6      .
     96387 2017    1294      .
     96387 2018       .      .
     96387 2019  1370.3      .
     96387 2020  1685.8      .
     96387 2021       .      .
     36277 2000       .      .
     36277 2001       .      .
     36277 2002       .      .
     36277 2003       .      .
     36277 2004       .      .
     36277 2005       .      .
     36277 2006       .      .
     36277 2007       .      .
     36277 2008       .      .
     36277 2009       .      .
     36277 2010       .      .
     36277 2011       .      .
     36277 2012       .      .
     36277 2013       .      .
     36277 2014       .      .
     36277 2015       .      .
     36277 2016       .      .
     36277 2017       .      .
     36277 2018   361.9      .
     36277 2019  2529.9      .
     36277 2020  2450.5      .
     36277 2021  3210.7      .
    491859 2000       .      .
    491859 2001       .      .
    491859 2002       .      .
    491859 2003       .      .
    491859 2004       .      .
    491859 2005       .      .
    491859 2006       .      .
    491859 2007       .      .
    491859 2008       .      .
    491859 2009       .      .
    491859 2010       .      .
    491859 2011       .      .
    491859 2012       .      .
    491859 2013       .      .
    491859 2014       .      .
    491859 2015       .      .
    491859 2016       .      .
    491859 2017       .      .
    491859 2018       .      .
    491859 2019       .      .
    491859 2020       .      .
    491859 2021       .      .
     73119 2000       .      .
     73119 2001       .      .
     73119 2002       .      .
     73119 2003       .      .
     73119 2004   158.7  231.1
     73119 2005  1098.4  763.1
     73119 2006  3122.2 2221.2
     73119 2007  1258.8  -34.8
     73119 2008 -2962.1 4558.2
     73119 2009 -1883.8     50
     73119 2010  -413.4  191.3
     73119 2011 -4878.9      .
    end
    [/CODE]


    . distinct co_code, by (year)
    option by() not allowed


    How to get distinct count with one or more criteria?

  • #2
    If each line represents a unique pair of id and year, then the code would be simple as below.

    Code:
    bys year: gen dist_id = _N

    Comment


    • #3
      distinct is from the Stata Journal, as you are asked to explain (FAQ Advice #12).

      The 2008 paper remains pertinent, but if you want to use the command, go for the latest update:

      SJ-20-4 dm0042_3 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q4/20 SJ 20(4):1028--1030
      sort() option has been added

      SJ-15-3 dm0042_2 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q3/15 SJ 15(3):899
      improved table format and display of large numbers of
      observations

      SJ-12-2 dm0042_1 . . . . . . . . . . . . . . . . Software update for distinct
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q2/12 SJ 12(2):352
      options added to restrict output to variables with a minimum
      or maximum of distinct values

      SJ-8-4 dm0042 . . . . . . . . . . . . Speaking Stata: Distinct observations
      (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
      Q4/08 SJ 8(4):557--568
      shows how to answer questions about distinct observations
      from first principles; provides a convenience command


      As documented in its help, the by prefix is allowed, as here


      Code:
      . sysuse auto, clear
      (1978 automobile data)
      
      . by foreign : distinct rep78
      
      ------------------------------------------------------------------------------------------------------------------
      -> foreign = Domestic
      
      ------------------------------
             |     total   distinct
      -------+----------------------
       rep78 |        48          5
      ------------------------------
      
      ------------------------------------------------------------------------------------------------------------------
      -> foreign = Foreign
      
      ------------------------------
             |     total   distinct
      -------+----------------------
       rep78 |        21          3
      ------------------------------
      For a large number of panels, this output may not be most convenient. I tend to use a device explained in the 2008 paper (and which hinges on official commands only):


      Code:
      . egen tag = tag(foreign rep78)
      
      . egen ndistinct = total(tag), by(foreign)
      
      . tabdisp foreign, c(ndistinct)
      
      ----------------------
      Car       |
      origin    |  ndistinct
      ----------+-----------
       Domestic |          5
        Foreign |          3
      ----------------------
      Here's a link to the paper. https://www.stata-journal.com/articl...article=dm0042 See especially p.563.

      Comment


      • #4
        Thanks a lot Nick Cox and Fei Wang from saving me from a huge manual work using Excel.

        Comment

        Working...
        X