Sum variables taking into account missing data

Diego Malo

Join Date: Jun 2019
Posts: 137

Sum variables taking into account missing data

29 Oct 2021, 04:30

Hello,

I have the following dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long gid int(country_code year) long actor_id
62356   . 1997    .
62357   . 1997    .
79600 710 2012 2664
79601 710 2015 2664
79601 710 2015 2664
79601 710 1999 2794
79601 710 2013 2664
79601 710 2013 2664
79601 710 1999 2794
79601 710 2013 2664
80317 710 2015 2794
80317 710 2012 2664
80317 710 2017 2664
80317 710 2015 2664
80317 710 2015 2664
80317 710 2012 2535
80317 710 2002 2794
80317 710 2008 2664
80317 710 2015 2664
80317 710 2013 2794
80317 710 2009 2794
80317 710 2014 2794
80317 710 2009 2794
80317 710 2008 2664
80317 710 2017 2794
80317 710 2015 2794
80317 710 2015 2794
80317 710 2015 2794
80317 710 2008 2794
80317 710 2012 2664
80317 710 2015 2774
80317 710 2010 2794
80317 710 2017 2664
80317 710 2015 2664
80317 710 2014 2794
80317 710 2016 2664
80317 710 2016 2664
80317 710 2015 2794
80317 710 2014 2664
80317 710 2015 2794
80317 710 2016 2664
80317 710 2009 2794
80317 710 2012 2664
80317 710 2017 2794
80317 710 1999 2664
80317 710 2013 2794
80317 710 2009 2664
80317 710 2015 2794
80317 710 2012 2664
80317 710 2017 2664
80317 710 2017 2794
80317 710 2016 3181
80317 710 2016 2664
80317 710 2009 2794
80317 710 2012 2535
80317 710 2012 2794
80317 710 2010 2664
80317 710 2012 2664
80317 710 2015 2794
80317 710 2010 2664
80317 710 2015 2664
80317 710 2012 2794
80317 710 2017 2794
80317 710 2015 2664
80317 710 2012 2535
80317 710 2017 2794
80317 710 2012 2794
80317 710 2014 2794
80317 710 2015 2794
80317 710 2017 2794
80317 710 2014 2794
80317 710 2016 2664
80317 710 2015 2794
80317 710 2008 2794
80317 710 2017 2664
80317 710 2015 2794
80317 710 2015 2794
80317 710 2006 2664
80317 710 2012 2664
80317 710 2017 2664
80317 710 2012 2664
80317 710 2017 2664
80317 710 2009 2794
80317 710 2017 2794
80317 710 2001 2794
80317 710 2015 2794
80317 710 2017 2794
80317 710 2012 2664
80317 710 2017 2664
80317 710 2017 2794
80317 710 2017 2664
80317 710 2009 2794
80317 710 2013 2188
80317 710 2014 2664
80317 710 2009 2794
80317 710 2015 2794
80317 710 2015 2664
80317 710 2017 2664
80317 710 2013 2664
80317 710 2012 2664
end

I would like to sum the number of actors (variable actor_id) by year and gid. My final outcome should be a dataset which include a column with the total number of actors by gid and year. In total, my final dataset should be four columns, gid, year, total number of actors, country_code (it does not change by gid and year, it is time invariant).

I am using the following code:

Code:

bysort gid year actor_id: gen nactors = _n
keep if nactors == 1
egen nactors2 = count(nactors), by(year gid)

It is works for me if I do not have missing value. Unfortunately, I have it. For example, in the sample I attach, for gid 62356 and year 1997, there is a missing. I would like to have that the total number of actors by gid 62356 and year 1997 be 0. I am not able to do that.

In my dataset, if one pair year-gid has an actor, there is no any missing for that pair of gid-year pair.

I hope you understand my problem. If not, please let me know. Any suggestion/help/comment to solve it, it is more than welcome.

Tags: missing data, sum

Fei Wang

Join Date: Oct 2021

Posts: 726
#2

29 Oct 2021, 04:48

Diego, if your original data only have the four variables (gid, country_code, year, and actor_id), then the following code may work.

Code:

duplicates drop collapse (count) nactors = actor_id, by(country_code gid year)
Comment
Diego Malo

Join Date: Jun 2019

Posts: 137
#3

29 Oct 2021, 06:32

Dear Fei Wang, thank you for your answer. Yes, I think it works. My dataset has more variable, but all of them like country, that is, they are time-invariant. Should I include it in the by?

Diego.
Comment
Diego Malo

Join Date: Jun 2019

Posts: 137
#4

29 Oct 2021, 07:10

I have checked the command carefully and it's not working. Probably it's my fault because I do not specify well. I would like to have a column that sum the number of different actors by gid and year, not the total number of actors. I am sorry for this misunderstanding.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35444

29 Oct 2021, 07:18

With your data example (thanks!), I tried this

Code:

 . egen tag = tag(actor gid year)

. egen wanted = total(tag), by(gid year)

.
. tabdisp year gid, c(wanted)

---------------------------------------------
          |                gid              
     year | 62356  62357  79600  79601  80317
----------+----------------------------------
     1997 |     0      0                    
     1999 |                          1      1
     2001 |                                 1
     2002 |                                 1
     2006 |                                 1
     2008 |                                 2
     2009 |                                 2
     2010 |                                 2
     2012 |                   1             3
     2013 |                          1      3
     2014 |                                 2
     2015 |                          1      3
     2016 |                                 2
     2017 |                                 2
---------------------------------------------

For full context see https://www.stata-journal.com/articl...article=dm0042 especially p.563.

Comment

Fei Wang

Join Date: Oct 2021

Posts: 726
#6

29 Oct 2021, 07:28

Diego, the first line of code duplicates drop drops redundant observations. For example, in the data example, lines 4 and 5 are identical, and the command only keeps one line. After that, within each country_id-gid-year group, a particular actor id will only appear once -- then, counting the number of observations within a country_id-gid-year group is the same as counting the number of different actor ids within the group -- which is what you want.
1 like
Comment
Diego Malo

Join Date: Jun 2019

Posts: 137
#7

29 Oct 2021, 08:07

Thank you Nick Cox for your help. It seems that it works. Thank you also for the referee, it would be helpful.

Fei Wang you are right. I see your point. I did not get why I should include the duplicates drop at the begining but now I see it. Thank you.
Comment

Announcement

Sum variables taking into account missing data

Comment

Comment

Comment

Comment

Comment

Comment