Sum by group the distinct values of two different columns into a new one.

Nick Cox

Join Date: Mar 2014
Posts: 35698

#16

04 Nov 2021, 06:57

Coming late to this party....

An early reaction was to use tag() but that won't work because you are counting distinct values in the union of two variables.

Here is a first principles approach. without using reshape

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long gid int(year ccode_GID) long(actor_id actor_id2)
62356 1997 216    .    .
62357 1997 216    .    .
79599 1997 216    .    .
79600 1997 216    .    .
79601 1997 216    .    .
80317 2012 216 2082  393
80317 2014 216 2082  393
80317 2015 216 2515  393
80317 2013 216 1795  393
80317 2012 216 2082  393
80317 2017 216 2515  393
80317 2016 216 2515  393
80317 2012 216 2082  393
80317 2016 216 2515  393
80317 2012 216 2082  393
80317 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2009 216 2082  393
80318 2015 216 2515  393
80318 2012 216 2082  393
80318 2017 216 2515  393
80318 2012 216 2082 1911
80318 2002 216 2515  393
80318 2015 216 1258  393
80318 2012 216 2082 1911
80318 2014 216 2515  393
80318 1997 216 2515  393
80318 2012 216 2082  393
80318 2003 216 2515  393
80318 1998 216 2515  393
80318 2004 216 2515  392
80318 2015 216 2082 1911
80318 2012 216   32   21
80318 2013 216 2622  335
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2007 216 2082 1822
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 1998 216 2515  393
80318 2014 216 2515  393
80318 2012 216 2082 1911
80318 2014 216 2515  393
80318 2017 216 2515  393
80318 2012 216 2082 1911
80318 2013 216 2515  393
80318 1997 216 2515  393
80318 2003 216 2515  393
80318 2013 216 1795  393
80318 2015 216 2082 1822
80318 2012 216 2515  393
80318 2013 216 2515  393
80319 2012 216 2082 1911
80319 2012 216 2082 1911
80320 1997 216    .    .
80321 1997 216    .    .
80322 1997 216    .    .
80323 1997 216    .    .
80324 1997 216    .    .
80325 1997 216    .    .
80326 1997 216    .    .
80327 2012 216 2082 1911
80327 2012 216 2082 1911
80327 2012 216 2082 1911
80328 1997 216    .    .
80329 1997 216    .    .
80330 2004 216 2515  393
80331 1997 216    .    .
80332 1997 216    .    .
81037 2012 216 2082  408
81037 2003 216 2515 1727
81037 2000 216 2515  393
81037 2014 216 2515  393
81037 1998 216 1909  393
81037 2004 216 2515 1727
81037 2002 216 2515  393
81037 2012 216 2082 1911
81037 1999 216 2515  393
81037 2015 216 2082 1822
81037 1999 216   33 2095
81037 2002 216 2082  393
81037 2013 216 2082  361
81037 2015 216 2082 1911
81037 2014 216 2515  393
81037 1998 216 1909  393
81037 2000 216 2515  393
81037 2016 216   33  393
81037 2000 216 2515  393
81037 1999 216 2515  361
81037 2016 216 2082  381
81037 2000 216 2515  393
81037 2014 216 2515  393
81037 1998 216 1909  393
81037 2017 216 2548  392
81037 2000 216 2515  393
81037 2001 216 2515  393
81037 2014 216 2515  393
81037 1998 216 2515  393
81037 2012 216 2082 1822
81037 2014 216 2515  393
end



gen wanted = 0
gen this = 0
egen group = group(gid year) , label

su group, meanonly

forval g = 1/`r(max)' {
     quietly replace this = group == `g'
     mata : data = st_data(., ("actor_id actor_id2"), "this")
     mata : obs = rows(data)
     mata : data = data[,1] \ data[,2]
     mata : data = select(data, (data :< .))
     mata : count = J(obs, 1, rows(uniqrows(data)))
     mata : st_store(., "wanted", "this", count)
}

tabdisp group, c(wanted)

-----------------------
group(gid  |
year)      |     wanted
-----------+-----------
62356 1997 |          0
62357 1997 |          0
79599 1997 |          0
79600 1997 |          0
79601 1997 |          0
80317 2012 |          3
80317 2013 |          2
80317 2014 |          2
80317 2015 |          2
80317 2016 |          2
80317 2017 |          2
80318 1997 |          2
80318 1998 |          2
80318 2002 |          2
80318 2003 |          2
80318 2004 |          2
80318 2007 |          2
80318 2009 |          2
80318 2012 |          6
80318 2013 |          5
80318 2014 |          2
80318 2015 |          6
80318 2017 |          2
80319 2012 |          2
80320 1997 |          0
80321 1997 |          0
80322 1997 |          0
80323 1997 |          0
80324 1997 |          0
80325 1997 |          0
80326 1997 |          0
80327 2012 |          2
80328 1997 |          0
80329 1997 |          0
80330 2004 |          2
80331 1997 |          0
80332 1997 |          0
81037 1998 |          3
81037 1999 |          5
81037 2000 |          2
81037 2001 |          2
81037 2002 |          3
81037 2003 |          2
81037 2004 |          2
81037 2012 |          4
81037 2013 |          2
81037 2014 |          2
81037 2015 |          3
81037 2016 |          4
81037 2017 |          2
-----------------------

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#17

04 Nov 2021, 10:05

You have a solution to your problem, courtesy of Fei Wang. Nevertheless, just for reference, here is a correction of my code:

Code:

mvencode actor_id*, mv(0) by gid year (actor_id), sort: gen distinct_actor_id = sum(actor_id != actor_id[_n-1]) by gid year (actor_id): replace distinct_actor_id = distinct_actor_id[_N] by gid year (actor_id2), sort: gen distinct_actor_id2 = sum(actor_id2 != actor_id2[_n-1]) by gid year (actor_id2): replace distinct_actor_id2 = distinct_actor_id2[_N] gen wanted = distinct_actor_id + distinct_actor_id2

If your data set is large, this will probably run faster than Fei Wang's approach because it avoids a time-consuming -reshape-. On the other hand, depending on what else you will be doing with this data, you may well be better off reshaping the data anyway (and leaving it in long layout).
Comment

George Ford

Join Date: Aug 2014
Posts: 3152

#18

04 Nov 2021, 14:12

Code:

g Nactors = 0
g Nactors2 = 0
levelsof gid, local(gids)
levelsof year, local(years)
foreach i of local gids {
    foreach y of local years {
    di `i'
    di `y'
    qui unique actor_id if gid==`i' & year==`y'
    replace Nactors = `r(sum)'  if gid==`i' & year==`y'
    qui unique actor_id2 if gid==`i' & year==`y'
    replace Nactors2 = `r(sum)'  if gid==`i' & year==`y'
    }
}    
g Nall = Nactors+Nactors2

Comment

George Ford

Join Date: Aug 2014
Posts: 3152

#19

04 Nov 2021, 15:28

This is owed to William Lisowski.

Code:

ssc install egenmore
egen Nactor = nvals(actor_id), by(gid year)
egen Nactor2 = nvals(actor_id2), by(gid year)
g Nall = Nactor+Nactor2

Comment

Diego Malo

Join Date: Jun 2019

Posts: 140
#20

08 Nov 2021, 08:06

Thank all of you for your comments and time.

However, I still find some problems in the codes you are providing. I am sorry but I do not really how to solve the problem.

Nick Cox Thank you for your code Nick. I have checked and I think your code is not counting well.For example, if actor X is in actor_id and actor_id2 in year 2008 and gid 3, it is counting two times but I want that this is counting just one. The distinct number of actors between actor_id and actor_id2 by gid and year. If one actor is in both in the same grid and year, I just want to count it once. Please, let me know if you do not see my point.

Clyde Schechter Thank you for your code Clyde. However, I see in your code the same problem than the one in Nick: "if actor X is in actor_id and actor_id2 in year 2008 and gid 3, it is counting two times but I want that this is counting just one. The distinct number of actors between actor_id and actor_id2 by gid and year. If one actor is in both in the same grid and year, I just want to count it once. ". Please let me know if you do not see my point.

Fei Wang Fei, I have found an error in your code. It is counting very well the number of actors as I want but the number of actors is different. For example, the number of actors before doing your code for gid 80318 and year 2021 is different than the one after run your code. Your code is changing the number of actors by grid and year. I do not know if I explain well myself. Please let me know if you do not see my point.

George Ford Thank you George for your code. When I run it, it gives me the following error: "command unique is unrecognized"

Sorry but I am a bit desesperate with that problem.

Diego.

Last edited by Diego Malo; 08 Nov 2021, 08:09.
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#21

08 Nov 2021, 08:38

Diego Malo , in your original example data, the number of different actors for gid 80318 and year 2012 is 6. After running the code, the counting result is also 6. So I didn't get your point that they are different.
1 like
Comment
Diego Malo

Join Date: Jun 2019

Posts: 140
#22

08 Nov 2021, 08:45

Thank you Fei Wang for your comment. It is true that the number of distinct actors is the same in both, but I think it is a bit by change.

Before running the reshape command I have the following number of actors in actor_id:

When I apply your code, I obtain the following (second picture). It is changing the actors, they sum 6 but they are not the same actors. Do you see it?
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#23

08 Nov 2021, 08:58

Diego Malo , your original example data do not contain value labels, so I'm not able to judge on that. But we do have actor ids. Before running the code, the actor ids for gid 80318 and year 2012 are 21, 32, 393, 1911, 2515, 2082. After running the code, the actor ids within this group are still the six numbers, unchanged. As long as the relationship between values and value labels remains the same, there should be no such issues. Please check if the underlying values (not the name of actors) of actor id remain identical before and after running the code.
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

#24

08 Nov 2021, 09:18

#20 says about my code.

Thank you for your code Nick. I have checked and I think your code is not counting well.For example, if actor X is in actor_id and actor_id2 in year 2008 and gid 3, it is counting two times but I want that this is counting just one. The distinct number of actors between actor_id and actor_id2 by gid and year. If one actor is in both in the same grid and year, I just want to count it once. Please, let me know if you do not see my point.

gid 3 is in no example you've given us, but I believe that my code does exactly what you require.

Here is a re-run on one subset with extra lines inserted to show each step, which is precisely what you could have done to drill deeper.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long gid int(year ccode_GID) long(actor_id actor_id2)
80318 2012 216 2082 1911
80318 2012 216 2082  393
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082  393
80318 2012 216   32   21
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2515  393
end

gen wanted = 0
gen this = 0
egen group = group(gid year) , label

su group, meanonly

forval g = 1/`r(max)' {
     quietly replace this = group == `g'
     mata : data = st_data(., ("actor_id actor_id2"), "this")
     mata : data 
     mata : obs = rows(data)
     mata : obs 
     mata : data = data[,1] \ data[,2]
     mata : data 
     mata : data = select(data, (data :< .))
     mata : data 
     mata : uniqrows(data)
     mata : rows(uniqrows(data))
     mata : count = J(obs, 1, rows(uniqrows(data)))
     mata : count 
     mata : st_store(., "wanted", "this", count)
}

tabdisp group, c(wanted)

Here is the Mata output:

First, the data for actors echoed and a count of the number of rows.

Code:

           1      2
     +---------------+
   1 |  2082   1911  |
   2 |  2082    393  |
   3 |  2082   1911  |
   4 |  2082   1911  |
   5 |  2082    393  |
   6 |    32     21  |
   7 |  2082   1911  |
   8 |  2082   1911  |
   9 |  2082   1911  |
  10 |  2082   1911  |
  11 |  2082   1911  |
  12 |  2082   1911  |
  13 |  2515    393  |
     +---------------+
  13

Next the two columns are stacked on top of each other

Code:

             1
     +--------+
   1 |  2082  |
   2 |  2082  |
   3 |  2082  |
   4 |  2082  |
   5 |  2082  |
   6 |    32  |
   7 |  2082  |
   8 |  2082  |
   9 |  2082  |
  10 |  2082  |
  11 |  2082  |
  12 |  2082  |
  13 |  2515  |
  14 |  1911  |
  15 |   393  |
  16 |  1911  |
  17 |  1911  |
  18 |   393  |
  19 |    21  |
  20 |  1911  |
  21 |  1911  |
  22 |  1911  |
  23 |  1911  |
  24 |  1911  |
  25 |  1911  |
  26 |   393  |
     +--------+

and missing values are dropped (which here makes no difference)

Code:

           1
     +--------+
   1 |  2082  |
   2 |  2082  |
   3 |  2082  |
   4 |  2082  |
   5 |  2082  |
   6 |    32  |
   7 |  2082  |
   8 |  2082  |
   9 |  2082  |
  10 |  2082  |
  11 |  2082  |
  12 |  2082  |
  13 |  2515  |
  14 |  1911  |
  15 |   393  |
  16 |  1911  |
  17 |  1911  |
  18 |   393  |
  19 |    21  |
  20 |  1911  |
  21 |  1911  |
  22 |  1911  |
  23 |  1911  |
  24 |  1911  |
  25 |  1911  |
  26 |   393  |
     +--------+

Now we look for distinct actors using uniqrows() and thus count each just once as required

Code:

      1
    +--------+
  1 |    21  |
  2 |    32  |
  3 |   393  |
  4 |  1911  |
  5 |  2082  |
  6 |  2515  |
    +--------+
  6

Then we want that expanded to put back in the original variable:

Code:

So, where's the problem?

Comment

Diego Malo

Join Date: Jun 2019

Posts: 140
#25

08 Nov 2021, 09:30

Fei Wang Thank you. You are right, the underlying values are the same. How is it possible that after the label change? It makes me a bit mess. On the other hand, I found another problem but it is mine. My original actor_id and actor_id2 are string variables. I transform them to numeric using encode command. When I do it, if actor A is in actor_id, it is received a different number than when actor A is in actor_id2. Hence, your code is counting two times actor A but it is because the way in which I am using encode. Do you understand what I mean? Is it any possibility to solve it?

Nick Cox You are completely right. The problem is mine. My original actor_id and actor_id2 are string variables. I transform them to numeric using encode command. When I do it, if actor A is in actor_id, it is received a different number than when actor A is in actor_id2. Hence, your code is counting two times actor A but it is because the way in which I am using encode. This is why I have been confused. Do you understand what I mean? Is it any possibility to solve it?
Comment

Fei Wang

Join Date: Oct 2021
Posts: 726

#26

08 Nov 2021, 09:46

Diego Malo , there are ways of solving the problem (you may encode actor_id right after -reshape long- in my original code), but you may rethink if it's necessary to encode the actor names. Counting different names does not require actor ids to be numeric. An example as below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long gid int(year ccode_GID) str4(actor_id actor_id2)
62356 1997 216 ""     ""    
62357 1997 216 ""     ""    
79599 1997 216 ""     ""    
79600 1997 216 ""     ""    
79601 1997 216 ""     ""    
80317 2012 216 "2082" "393"
80317 2014 216 "2082" "393"
80317 2015 216 "2515" "393"
80317 2013 216 "1795" "393"
80317 2012 216 "2082" "393"
80317 2017 216 "2515" "393"
80317 2016 216 "2515" "393"
80317 2012 216 "2082" "393"
80317 2016 216 "2515" "393"
80317 2012 216 "2082" "393"
80317 2012 216 "2082" "1911"
80318 2012 216 "2082" "1911"
80318 2009 216 "2082" "393"
80318 2015 216 "2515" "393"
80318 2012 216 "2082" "393"
80318 2017 216 "2515" "393"
80318 2012 216 "2082" "1911"
80318 2002 216 "2515" "393"
80318 2015 216 "1258" "393"
80318 2012 216 "2082" "1911"
80318 2014 216 "2515" "393"
80318 1997 216 "2515" "393"
80318 2012 216 "2082" "393"
80318 2003 216 "2515" "393"
80318 1998 216 "2515" "393"
80318 2004 216 "2515" "392"
80318 2015 216 "2082" "1911"
80318 2012 216 "32"   "21"  
80318 2013 216 "2622" "335"
80318 2012 216 "2082" "1911"
80318 2012 216 "2082" "1911"
80318 2007 216 "2082" "1822"
80318 2012 216 "2082" "1911"
80318 2012 216 "2082" "1911"
80318 1998 216 "2515" "393"
80318 2014 216 "2515" "393"
80318 2012 216 "2082" "1911"
80318 2014 216 "2515" "393"
80318 2017 216 "2515" "393"
80318 2012 216 "2082" "1911"
80318 2013 216 "2515" "393"
80318 1997 216 "2515" "393"
80318 2003 216 "2515" "393"
80318 2013 216 "1795" "393"
80318 2015 216 "2082" "1822"
80318 2012 216 "2515" "393"
80318 2013 216 "2515" "393"
80319 2012 216 "2082" "1911"
80319 2012 216 "2082" "1911"
80320 1997 216 ""     ""    
80321 1997 216 ""     ""    
80322 1997 216 ""     ""    
80323 1997 216 ""     ""    
80324 1997 216 ""     ""    
80325 1997 216 ""     ""    
80326 1997 216 ""     ""    
80327 2012 216 "2082" "1911"
80327 2012 216 "2082" "1911"
80327 2012 216 "2082" "1911"
80328 1997 216 ""     ""    
80329 1997 216 ""     ""    
80330 2004 216 "2515" "393"
80331 1997 216 ""     ""    
80332 1997 216 ""     ""    
81037 2012 216 "2082" "408"
81037 2003 216 "2515" "1727"
81037 2000 216 "2515" "393"
81037 2014 216 "2515" "393"
81037 1998 216 "1909" "393"
81037 2004 216 "2515" "1727"
81037 2002 216 "2515" "393"
81037 2012 216 "2082" "1911"
81037 1999 216 "2515" "393"
81037 2015 216 "2082" "1822"
81037 1999 216 "33"   "2095"
81037 2002 216 "2082" "393"
81037 2013 216 "2082" "361"
81037 2015 216 "2082" "1911"
81037 2014 216 "2515" "393"
81037 1998 216 "1909" "393"
81037 2000 216 "2515" "393"
81037 2016 216 "33"   "393"
81037 2000 216 "2515" "393"
81037 1999 216 "2515" "361"
81037 2016 216 "2082" "381"
81037 2000 216 "2515" "393"
81037 2014 216 "2515" "393"
81037 1998 216 "1909" "393"
81037 2017 216 "2548" "392"
81037 2000 216 "2515" "393"
81037 2001 216 "2515" "393"
81037 2014 216 "2515" "393"
81037 1998 216 "2515" "393"
81037 2012 216 "2082" "1822"
81037 2014 216 "2515" "393"
end

ren actor_id actor_id1
gen id = _n
reshape long actor_id, i(id) j(j)

bys gid year (actor_id): gen num = actor_id != actor_id[_n-1] & actor_id != ""
bys gid year: egen wanted = total(num)
drop num

reshape wide actor_id, i(id) j(j)
drop id
ren actor_id1 actor_id

Last edited by Fei Wang; 08 Nov 2021, 09:49.

Comment

Diego Malo

Join Date: Jun 2019

Posts: 140
#27

08 Nov 2021, 09:57

Fei Wang Pfff... Now, it's perfect. I eliminate the encode step. Really, thank you so much Fei. It works very well.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

#28

08 Nov 2021, 10:07

Thanks for #25. My code could be adapted for string variables, as below.

Otherwise the answer is surely to use destring not encode.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long gid int(year ccode_GID) long(actor_id actor_id2)
80318 2012 216 2082 1911
80318 2012 216 2082  393
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082  393
80318 2012 216   32   21
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2082 1911
80318 2012 216 2515  393
end

gen wanted = 0
gen this = 0
egen group = group(gid year) , label

tostring actor_*, replace 

su group, meanonly

forval g = 1/`r(max)' {
     quietly replace this = group == `g'
     mata : data = st_sdata(., ("actor_id actor_id2"), "this")
     mata : data 
     mata : obs = rows(data)
     mata : obs 
     mata : data = data[,1] \ data[,2]
     mata : data 
     mata : data = select(data, !(data :== ""))
     mata : data 
     mata : uniqrows(data)
     mata : rows(uniqrows(data))
     mata : count = J(obs, 1, rows(uniqrows(data)))
     mata : count 
     mata : st_store(., "wanted", "this", count)
}

tabdisp group, c(wanted)

Comment

Diego Malo

Join Date: Jun 2019

Posts: 140
#29

09 Nov 2021, 00:36

Thank you Nick Cox, your code works very well. To be honest, I do not have the enough level in Stata to understand it, but it works perfect. Thank you again.

Diego.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#30

09 Nov 2021, 02:57

Glad you solved your problem. Note that back in #1 you told us your actor variables were long, so working on your own with string variables was something we couldn't know and was only too likely to confuse you.

Also, the procedure in #20 was spelled out step by step in #24 and here is some pseudo code.

for each gid and year
copy actor data (2 variables) into Mata
stack 2 columns into one
lose missing values
reduce to unique (distinct) rows
count rows
copy result back into Stata variable

Separate use of encode for several string variables that have related values is such a really bad idea that it deserves a stronger warning against it in the manuals.

Back in #20: unique is a community-contributed command from SSC. and

Code:

search unique

does tell you that (among many other things). George Ford forgot to tell you that.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment