Sum by group the distinct values of two different columns into a new one.

Diego Malo

Join Date: Jun 2019
Posts: 140

Sum by group the distinct values of two different columns into a new one.

03 Nov 2021, 11:21

Hello,

I have the following database:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long gid int(year ccode_GID) long(actor_id actor_id2)
62356 1997 216    .    .
62357 1997 216    .    .
79599 1997 216    .    .
79600 1997 216    .    .
79601 1997 216    .    .
80317 2012 216 2082  393
80317 2014 216 2082  393
80317 2015 216 2515  393
80317 2013 216 1795  393
80317 2012 216 2082  393
80317 2017 216 2515  393
80317 2016 216 2515  393
80317 2012 216 2082  393
80317 2016 216 2515  393
80317 2012 216 2082  393
80317 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2009 216 2082  393
80318 2015 216 2515  393
80318 2012 216 2082  393
80318 2017 216 2515  393
80318 2012 216 2082 1911
80318 2002 216 2515  393
80318 2015 216 1258  393
80318 2012 216 2082 1911
80318 2014 216 2515  393
80318 1997 216 2515  393
80318 2012 216 2082  393
80318 2003 216 2515  393
80318 1998 216 2515  393
80318 2004 216 2515  392
80318 2015 216 2082 1911
80318 2012 216   32   21
80318 2013 216 2622  335
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2007 216 2082 1822
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 1998 216 2515  393
80318 2014 216 2515  393
80318 2012 216 2082 1911
80318 2014 216 2515  393
80318 2017 216 2515  393
80318 2012 216 2082 1911
80318 2013 216 2515  393
80318 1997 216 2515  393
80318 2003 216 2515  393
80318 2013 216 1795  393
80318 2015 216 2082 1822
80318 2012 216 2515  393
80318 2013 216 2515  393
80319 2012 216 2082 1911
80319 2012 216 2082 1911
80320 1997 216    .    .
80321 1997 216    .    .
80322 1997 216    .    .
80323 1997 216    .    .
80324 1997 216    .    .
80325 1997 216    .    .
80326 1997 216    .    .
80327 2012 216 2082 1911
80327 2012 216 2082 1911
80327 2012 216 2082 1911
80328 1997 216    .    .
80329 1997 216    .    .
80330 2004 216 2515  393
80331 1997 216    .    .
80332 1997 216    .    .
81037 2012 216 2082  408
81037 2003 216 2515 1727
81037 2000 216 2515  393
81037 2014 216 2515  393
81037 1998 216 1909  393
81037 2004 216 2515 1727
81037 2002 216 2515  393
81037 2012 216 2082 1911
81037 1999 216 2515  393
81037 2015 216 2082 1822
81037 1999 216   33 2095
81037 2002 216 2082  393
81037 2013 216 2082  361
81037 2015 216 2082 1911
81037 2014 216 2515  393
81037 1998 216 1909  393
81037 2000 216 2515  393
81037 2016 216   33  393
81037 2000 216 2515  393
81037 1999 216 2515  361
81037 2016 216 2082  381
81037 2000 216 2515  393
81037 2014 216 2515  393
81037 1998 216 1909  393
81037 2017 216 2548  392
81037 2000 216 2515  393
81037 2001 216 2515  393
81037 2014 216 2515  393
81037 1998 216 2515  393
81037 2012 216 2082 1822
81037 2014 216 2515  393
end

I would like to sum the number of different actors of the two columns (actor_id and actor_id2) by gid and year. In other words, I would like to have a column with the sum of the distincts values of the variables actor_id and actor_id2 by gid and year. When there are some missing values, I would like taking into account them as a 0. Variable ccode_GID does not change by gid (it is time invariant)

Do you think it is feasible to do it? I do not find the way. Any suggestion is more than welcome.

Best,

Diego.

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

03 Nov 2021, 11:35

Code:

mvencode actor_id*, mv(0)

by gid year (actor_id), sort: gen distinct_actor_id = sum(actor_id != actor_id[_n-1])
by gid year (actor_id2), sort: gen distinct_actor_id2 = sum(actor_id2 != actor_id2[_n-1])
by gid year: gen wanted = distinct_actor_id[_N] + distinct_actor_id2[_N]

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#3

03 Nov 2021, 11:36

Diego:
do you mean:

Code:

. bysort gid (year): gen wanted= actor_id+ actor_id2 . replace wanted=0 if wanted==.

Kind regards,
Carlo
(Stata 19.0)
Comment
George Ford

Join Date: Aug 2014

Posts: 3152
#4

03 Nov 2021, 15:41

If you want the non-missing values included in the sum, then replace missing values with 0 first. Or, perhaps better, create a new variable and then replace the missing with 0 on it, thus saving the original variable.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#5

04 Nov 2021, 01:51

Diego:
as a second thought, I would propose:

Code:

egen wanted=group( actor_id actor_id2 ) bysort gid year: egen count=count(wanted)

Kind regards,
Carlo
(Stata 19.0)
Comment
Diego Malo

Join Date: Jun 2019

Posts: 140
#6

04 Nov 2021, 01:52

Thank you for all your answers.

Clyde Schechter Thank you for your answer. However, I find some problems with the code you proposes. For example, for cell 80318 and year 2012 (it is in the example I gave in the post), it should be 6 different actors and I obtained 5 with your code. It is not finishing to count well all the number of different actors. Another problem, it is that the missing are counting as a different actor, I want to count them as a 0, that is, the missing is not a different actor.

Carlo Lazzaro Thank you for your code Lazaro. I think your code is summing the total numbers and what I would like to have is the total number of different actors by grid and year, without counting the missing values as a distinct actors.

George Ford Thank you for your question. However, If I do that, I am considering them as a different actor and I do not want to consider them as a different actor.

It's a bit difficult to get what I want, I do not finish to get it.

Last edited by Diego Malo; 04 Nov 2021, 01:54.
Comment
Diego Malo

Join Date: Jun 2019

Posts: 140
#7

04 Nov 2021, 02:02

Thank you Carlo Lazzaro for your second thought. I appreciate it. However, I do not understand well why but it is not counting well the number of different actors. For example, for gid 80317 and year 2012, it should be 3 different actors, with your code I obtained 5.
Comment

Fei Wang

Join Date: Oct 2021
Posts: 726

04 Nov 2021, 02:13

Diego Malo , based on your description in #6, is the following code what you need?

Code:

ren actor_id actor_id1
gen id = _n
reshape long actor_id, i(id) j(j)

bys gid year (actor_id): gen num = actor_id != actor_id[_n-1] & actor_id != .
bys gid year: egen wanted = total(num)
drop num

reshape wide actor_id, i(id) j(j)
drop id
ren actor_id1 actor_id

Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17707

04 Nov 2021, 02:30

Diego:

Code:

. egen wanted=group( actor_id actor_id2 )

. bysort gid year (wanted): gen flag=1 if wanted[_n]!=wanted[_n+1]

. bysort gid year: egen Num_actor=total(flag)

Kind regards,
Carlo
(Stata 19.0)

Comment

Diego Malo

Join Date: Jun 2019

Posts: 140
#10

04 Nov 2021, 02:41

Fei Wang it seems that it works!! Thank you so much!! Can you explain to me a bit what your code does? I would like to learn it.

Carlo Lazzaro Thank you Carlo for your code. However, it is still not counting well. Grid 80317 and year 2012 should have 3 distinc actors, with your code I obtain 2. Another example, grid 80318 and year 2012 should be 6 actors, with your code I obtain 4.
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#11

04 Nov 2021, 02:53

You're welcome, Diego. As you are counting the number of different actor IDs in "actor_id" and "actor_id2" together, it would be convenient to combine the two actor ID variables to one single column, and count. I used -reshape long- to combine the two columns, and after counting, used -reshape wide- to restore your original data form -- That's the algorithm of my code. BTW, I think concrete examples in the OP, like what you gave in #6 and #7, would help us better understand your question from the very beginning -- I didn't get your point either until I read #6.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#12

04 Nov 2021, 02:56

Diego:
you're correct, as my last code counts the distinct actors in -actor_id2- only.

Kind regards,
Carlo
(Stata 19.0)
Comment
Diego Malo

Join Date: Jun 2019

Posts: 140
#13

04 Nov 2021, 03:48

Fei Wang Thank you again!. I will try to give more examples when I explain things. On the other hand, if I am not wrong, ren is rename, so you are rename my column actor_id as actor_id1, but what happens with actor_id2? Sorry for my ignorance but I just would like to learn a bit of Stata.

Carlo Lazzaro it is ok, thank you for your help!.

Anyway, do you know if it is any possibility to do it with command egen and tag, as Nick Cox proposed me here Sum variables taking into account missing data - Statalist ?

Thank you,

Diego.

Last edited by Diego Malo; 04 Nov 2021, 03:51.
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#14

04 Nov 2021, 04:22

Diego, yes, -ren- is -rename-, it's to rename your "actor_id" to "actor_id1" for the purpose of using -reshape long-. I don't need to do this for "actor_id2" because it has a suffix "2".

To my understanding, Nick's wise solution using -tag()- is consistent with the use of -group()- in this thread. But I think it may not fit your case. For example, for "actor_id" and "actor_id2", if the first line records "1" and "2", and the second line is "1" and "3", then there are three different actors while -tag()- will only mark two distinct groups. After combining the two IDs into one, -tag()- can be used as in the other thread.
Comment
Diego Malo

Join Date: Jun 2019

Posts: 140
#15

04 Nov 2021, 04:52

I understood, thank you again Fei Wang !
Comment

Announcement

Sum by group the distinct values of two different columns into a new one.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment