Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sum variables taking into account missing data

    Hello,

    I have the following dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long gid int(country_code year) long actor_id
    62356   . 1997    .
    62357   . 1997    .
    79600 710 2012 2664
    79601 710 2015 2664
    79601 710 2015 2664
    79601 710 1999 2794
    79601 710 2013 2664
    79601 710 2013 2664
    79601 710 1999 2794
    79601 710 2013 2664
    80317 710 2015 2794
    80317 710 2012 2664
    80317 710 2017 2664
    80317 710 2015 2664
    80317 710 2015 2664
    80317 710 2012 2535
    80317 710 2002 2794
    80317 710 2008 2664
    80317 710 2015 2664
    80317 710 2013 2794
    80317 710 2009 2794
    80317 710 2014 2794
    80317 710 2009 2794
    80317 710 2008 2664
    80317 710 2017 2794
    80317 710 2015 2794
    80317 710 2015 2794
    80317 710 2015 2794
    80317 710 2008 2794
    80317 710 2012 2664
    80317 710 2015 2774
    80317 710 2010 2794
    80317 710 2017 2664
    80317 710 2015 2664
    80317 710 2014 2794
    80317 710 2016 2664
    80317 710 2016 2664
    80317 710 2015 2794
    80317 710 2014 2664
    80317 710 2015 2794
    80317 710 2016 2664
    80317 710 2009 2794
    80317 710 2012 2664
    80317 710 2017 2794
    80317 710 1999 2664
    80317 710 2013 2794
    80317 710 2009 2664
    80317 710 2015 2794
    80317 710 2012 2664
    80317 710 2017 2664
    80317 710 2017 2794
    80317 710 2016 3181
    80317 710 2016 2664
    80317 710 2009 2794
    80317 710 2012 2535
    80317 710 2012 2794
    80317 710 2010 2664
    80317 710 2012 2664
    80317 710 2015 2794
    80317 710 2010 2664
    80317 710 2015 2664
    80317 710 2012 2794
    80317 710 2017 2794
    80317 710 2015 2664
    80317 710 2012 2535
    80317 710 2017 2794
    80317 710 2012 2794
    80317 710 2014 2794
    80317 710 2015 2794
    80317 710 2017 2794
    80317 710 2014 2794
    80317 710 2016 2664
    80317 710 2015 2794
    80317 710 2008 2794
    80317 710 2017 2664
    80317 710 2015 2794
    80317 710 2015 2794
    80317 710 2006 2664
    80317 710 2012 2664
    80317 710 2017 2664
    80317 710 2012 2664
    80317 710 2017 2664
    80317 710 2009 2794
    80317 710 2017 2794
    80317 710 2001 2794
    80317 710 2015 2794
    80317 710 2017 2794
    80317 710 2012 2664
    80317 710 2017 2664
    80317 710 2017 2794
    80317 710 2017 2664
    80317 710 2009 2794
    80317 710 2013 2188
    80317 710 2014 2664
    80317 710 2009 2794
    80317 710 2015 2794
    80317 710 2015 2664
    80317 710 2017 2664
    80317 710 2013 2664
    80317 710 2012 2664
    end
    I would like to sum the number of actors (variable actor_id) by year and gid. My final outcome should be a dataset which include a column with the total number of actors by gid and year. In total, my final dataset should be four columns, gid, year, total number of actors, country_code (it does not change by gid and year, it is time invariant).

    I am using the following code:

    Code:
    bysort gid year actor_id: gen nactors = _n
    keep if nactors == 1
    egen nactors2 = count(nactors), by(year gid)
    It is works for me if I do not have missing value. Unfortunately, I have it. For example, in the sample I attach, for gid 62356 and year 1997, there is a missing. I would like to have that the total number of actors by gid 62356 and year 1997 be 0. I am not able to do that.

    In my dataset, if one pair year-gid has an actor, there is no any missing for that pair of gid-year pair.

    I hope you understand my problem. If not, please let me know. Any suggestion/help/comment to solve it, it is more than welcome.

  • #2
    Diego, if your original data only have the four variables (gid, country_code, year, and actor_id), then the following code may work.

    Code:
    duplicates drop
    collapse (count) nactors = actor_id, by(country_code gid year)

    Comment


    • #3
      Dear Fei Wang, thank you for your answer. Yes, I think it works. My dataset has more variable, but all of them like country, that is, they are time-invariant. Should I include it in the by?

      Diego.

      Comment


      • #4
        I have checked the command carefully and it's not working. Probably it's my fault because I do not specify well. I would like to have a column that sum the number of different actors by gid and year, not the total number of actors. I am sorry for this misunderstanding.

        Comment


        • #5
          With your data example (thanks!), I tried this

          Code:
           . egen tag = tag(actor gid year)
          
          . egen wanted = total(tag), by(gid year)
          
          .
          . tabdisp year gid, c(wanted)
          
          ---------------------------------------------
                    |                gid              
               year | 62356  62357  79600  79601  80317
          ----------+----------------------------------
               1997 |     0      0                    
               1999 |                          1      1
               2001 |                                 1
               2002 |                                 1
               2006 |                                 1
               2008 |                                 2
               2009 |                                 2
               2010 |                                 2
               2012 |                   1             3
               2013 |                          1      3
               2014 |                                 2
               2015 |                          1      3
               2016 |                                 2
               2017 |                                 2
          ---------------------------------------------
          For full context see https://www.stata-journal.com/articl...article=dm0042 especially p.563.

          Comment


          • #6
            Diego, the first line of code duplicates drop drops redundant observations. For example, in the data example, lines 4 and 5 are identical, and the command only keeps one line. After that, within each country_id-gid-year group, a particular actor id will only appear once -- then, counting the number of observations within a country_id-gid-year group is the same as counting the number of different actor ids within the group -- which is what you want.

            Comment


            • #7
              Thank you Nick Cox for your help. It seems that it works. Thank you also for the referee, it would be helpful.

              Fei Wang you are right. I see your point. I did not get why I should include the duplicates drop at the begining but now I see it. Thank you.

              Comment

              Working...
              X