Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Coming late to this party....

    An early reaction was to use tag() but that won't work because you are counting distinct values in the union of two variables.

    Here is a first principles approach. without using reshape

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long gid int(year ccode_GID) long(actor_id actor_id2)
    62356 1997 216    .    .
    62357 1997 216    .    .
    79599 1997 216    .    .
    79600 1997 216    .    .
    79601 1997 216    .    .
    80317 2012 216 2082  393
    80317 2014 216 2082  393
    80317 2015 216 2515  393
    80317 2013 216 1795  393
    80317 2012 216 2082  393
    80317 2017 216 2515  393
    80317 2016 216 2515  393
    80317 2012 216 2082  393
    80317 2016 216 2515  393
    80317 2012 216 2082  393
    80317 2012 216 2082 1911
    80318 2012 216 2082 1911
    80318 2009 216 2082  393
    80318 2015 216 2515  393
    80318 2012 216 2082  393
    80318 2017 216 2515  393
    80318 2012 216 2082 1911
    80318 2002 216 2515  393
    80318 2015 216 1258  393
    80318 2012 216 2082 1911
    80318 2014 216 2515  393
    80318 1997 216 2515  393
    80318 2012 216 2082  393
    80318 2003 216 2515  393
    80318 1998 216 2515  393
    80318 2004 216 2515  392
    80318 2015 216 2082 1911
    80318 2012 216   32   21
    80318 2013 216 2622  335
    80318 2012 216 2082 1911
    80318 2012 216 2082 1911
    80318 2007 216 2082 1822
    80318 2012 216 2082 1911
    80318 2012 216 2082 1911
    80318 1998 216 2515  393
    80318 2014 216 2515  393
    80318 2012 216 2082 1911
    80318 2014 216 2515  393
    80318 2017 216 2515  393
    80318 2012 216 2082 1911
    80318 2013 216 2515  393
    80318 1997 216 2515  393
    80318 2003 216 2515  393
    80318 2013 216 1795  393
    80318 2015 216 2082 1822
    80318 2012 216 2515  393
    80318 2013 216 2515  393
    80319 2012 216 2082 1911
    80319 2012 216 2082 1911
    80320 1997 216    .    .
    80321 1997 216    .    .
    80322 1997 216    .    .
    80323 1997 216    .    .
    80324 1997 216    .    .
    80325 1997 216    .    .
    80326 1997 216    .    .
    80327 2012 216 2082 1911
    80327 2012 216 2082 1911
    80327 2012 216 2082 1911
    80328 1997 216    .    .
    80329 1997 216    .    .
    80330 2004 216 2515  393
    80331 1997 216    .    .
    80332 1997 216    .    .
    81037 2012 216 2082  408
    81037 2003 216 2515 1727
    81037 2000 216 2515  393
    81037 2014 216 2515  393
    81037 1998 216 1909  393
    81037 2004 216 2515 1727
    81037 2002 216 2515  393
    81037 2012 216 2082 1911
    81037 1999 216 2515  393
    81037 2015 216 2082 1822
    81037 1999 216   33 2095
    81037 2002 216 2082  393
    81037 2013 216 2082  361
    81037 2015 216 2082 1911
    81037 2014 216 2515  393
    81037 1998 216 1909  393
    81037 2000 216 2515  393
    81037 2016 216   33  393
    81037 2000 216 2515  393
    81037 1999 216 2515  361
    81037 2016 216 2082  381
    81037 2000 216 2515  393
    81037 2014 216 2515  393
    81037 1998 216 1909  393
    81037 2017 216 2548  392
    81037 2000 216 2515  393
    81037 2001 216 2515  393
    81037 2014 216 2515  393
    81037 1998 216 2515  393
    81037 2012 216 2082 1822
    81037 2014 216 2515  393
    end
    
    
    
    gen wanted = 0
    gen this = 0
    egen group = group(gid year) , label
    
    su group, meanonly
    
    forval g = 1/`r(max)' {
         quietly replace this = group == `g'
         mata : data = st_data(., ("actor_id actor_id2"), "this")
         mata : obs = rows(data)
         mata : data = data[,1] \ data[,2]
         mata : data = select(data, (data :< .))
         mata : count = J(obs, 1, rows(uniqrows(data)))
         mata : st_store(., "wanted", "this", count)
    }
    
    tabdisp group, c(wanted)
    
    -----------------------
    group(gid  |
    year)      |     wanted
    -----------+-----------
    62356 1997 |          0
    62357 1997 |          0
    79599 1997 |          0
    79600 1997 |          0
    79601 1997 |          0
    80317 2012 |          3
    80317 2013 |          2
    80317 2014 |          2
    80317 2015 |          2
    80317 2016 |          2
    80317 2017 |          2
    80318 1997 |          2
    80318 1998 |          2
    80318 2002 |          2
    80318 2003 |          2
    80318 2004 |          2
    80318 2007 |          2
    80318 2009 |          2
    80318 2012 |          6
    80318 2013 |          5
    80318 2014 |          2
    80318 2015 |          6
    80318 2017 |          2
    80319 2012 |          2
    80320 1997 |          0
    80321 1997 |          0
    80322 1997 |          0
    80323 1997 |          0
    80324 1997 |          0
    80325 1997 |          0
    80326 1997 |          0
    80327 2012 |          2
    80328 1997 |          0
    80329 1997 |          0
    80330 2004 |          2
    80331 1997 |          0
    80332 1997 |          0
    81037 1998 |          3
    81037 1999 |          5
    81037 2000 |          2
    81037 2001 |          2
    81037 2002 |          3
    81037 2003 |          2
    81037 2004 |          2
    81037 2012 |          4
    81037 2013 |          2
    81037 2014 |          2
    81037 2015 |          3
    81037 2016 |          4
    81037 2017 |          2
    -----------------------

    Comment


    • #17
      You have a solution to your problem, courtesy of Fei Wang. Nevertheless, just for reference, here is a correction of my code:

      Code:
      mvencode actor_id*, mv(0)
      
      by gid year (actor_id), sort: gen distinct_actor_id = sum(actor_id != actor_id[_n-1])
      by gid year (actor_id): replace distinct_actor_id = distinct_actor_id[_N]
      by gid year (actor_id2), sort: gen distinct_actor_id2 = sum(actor_id2 != actor_id2[_n-1])
      by gid year (actor_id2): replace distinct_actor_id2 = distinct_actor_id2[_N]
      gen wanted = distinct_actor_id + distinct_actor_id2
      If your data set is large, this will probably run faster than Fei Wang's approach because it avoids a time-consuming -reshape-. On the other hand, depending on what else you will be doing with this data, you may well be better off reshaping the data anyway (and leaving it in long layout).

      Comment


      • #18
        Code:
        g Nactors = 0
        g Nactors2 = 0
        levelsof gid, local(gids)
        levelsof year, local(years)
        foreach i of local gids {
            foreach y of local years {
            di `i'
            di `y'
            qui unique actor_id if gid==`i' & year==`y'
            replace Nactors = `r(sum)'  if gid==`i' & year==`y'
            qui unique actor_id2 if gid==`i' & year==`y'
            replace Nactors2 = `r(sum)'  if gid==`i' & year==`y'
            }
        }    
        g Nall = Nactors+Nactors2

        Comment


        • #19
          This is owed to William Lisowski.
          Code:
          ssc install egenmore
          egen Nactor = nvals(actor_id), by(gid year)
          egen Nactor2 = nvals(actor_id2), by(gid year)
          g Nall = Nactor+Nactor2

          Comment


          • #20
            Thank all of you for your comments and time.

            However, I still find some problems in the codes you are providing. I am sorry but I do not really how to solve the problem.

            Nick Cox Thank you for your code Nick. I have checked and I think your code is not counting well.For example, if actor X is in actor_id and actor_id2 in year 2008 and gid 3, it is counting two times but I want that this is counting just one. The distinct number of actors between actor_id and actor_id2 by gid and year. If one actor is in both in the same grid and year, I just want to count it once. Please, let me know if you do not see my point.

            Clyde Schechter Thank you for your code Clyde. However, I see in your code the same problem than the one in Nick: "if actor X is in actor_id and actor_id2 in year 2008 and gid 3, it is counting two times but I want that this is counting just one. The distinct number of actors between actor_id and actor_id2 by gid and year. If one actor is in both in the same grid and year, I just want to count it once. ". Please let me know if you do not see my point.

            Fei Wang Fei, I have found an error in your code. It is counting very well the number of actors as I want but the number of actors is different. For example, the number of actors before doing your code for gid 80318 and year 2021 is different than the one after run your code. Your code is changing the number of actors by grid and year. I do not know if I explain well myself. Please let me know if you do not see my point.

            George Ford Thank you George for your code. When I run it, it gives me the following error: "command unique is unrecognized"

            Sorry but I am a bit desesperate with that problem.

            Diego.
            Last edited by Diego Malo; 08 Nov 2021, 08:09.

            Comment


            • #21
              Diego Malo , in your original example data, the number of different actors for gid 80318 and year 2012 is 6. After running the code, the counting result is also 6. So I didn't get your point that they are different.

              Comment


              • #22
                Thank you Fei Wang for your comment. It is true that the number of distinct actors is the same in both, but I think it is a bit by change.

                Before running the reshape command I have the following number of actors in actor_id:


                Click image for larger version

Name:	Proof1.JPG
Views:	1
Size:	130.7 KB
ID:	1635433


                When I apply your code, I obtain the following (second picture). It is changing the actors, they sum 6 but they are not the same actors. Do you see it?


                Click image for larger version

Name:	Proof2.JPG
Views:	1
Size:	166.4 KB
ID:	1635434

                Comment


                • #23
                  Diego Malo , your original example data do not contain value labels, so I'm not able to judge on that. But we do have actor ids. Before running the code, the actor ids for gid 80318 and year 2012 are 21, 32, 393, 1911, 2515, 2082. After running the code, the actor ids within this group are still the six numbers, unchanged. As long as the relationship between values and value labels remains the same, there should be no such issues. Please check if the underlying values (not the name of actors) of actor id remain identical before and after running the code.

                  Comment


                  • #24
                    #20 says about my code.


                    Thank you for your code Nick. I have checked and I think your code is not counting well.For example, if actor X is in actor_id and actor_id2 in year 2008 and gid 3, it is counting two times but I want that this is counting just one. The distinct number of actors between actor_id and actor_id2 by gid and year. If one actor is in both in the same grid and year, I just want to count it once. Please, let me know if you do not see my point.
                    gid 3 is in no example you've given us, but I believe that my code does exactly what you require.

                    Here is a re-run on one subset with extra lines inserted to show each step, which is precisely what you could have done to drill deeper.


                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input long gid int(year ccode_GID) long(actor_id actor_id2)
                    80318 2012 216 2082 1911
                    80318 2012 216 2082  393
                    80318 2012 216 2082 1911
                    80318 2012 216 2082 1911
                    80318 2012 216 2082  393
                    80318 2012 216   32   21
                    80318 2012 216 2082 1911
                    80318 2012 216 2082 1911
                    80318 2012 216 2082 1911
                    80318 2012 216 2082 1911
                    80318 2012 216 2082 1911
                    80318 2012 216 2082 1911
                    80318 2012 216 2515  393
                    end
                    
                    gen wanted = 0
                    gen this = 0
                    egen group = group(gid year) , label
                    
                    su group, meanonly
                    
                    forval g = 1/`r(max)' {
                         quietly replace this = group == `g'
                         mata : data = st_data(., ("actor_id actor_id2"), "this")
                         mata : data 
                         mata : obs = rows(data)
                         mata : obs 
                         mata : data = data[,1] \ data[,2]
                         mata : data 
                         mata : data = select(data, (data :< .))
                         mata : data 
                         mata : uniqrows(data)
                         mata : rows(uniqrows(data))
                         mata : count = J(obs, 1, rows(uniqrows(data)))
                         mata : count 
                         mata : st_store(., "wanted", "this", count)
                    }
                    
                    tabdisp group, c(wanted)

                    Here is the Mata output:

                    First, the data for actors echoed and a count of the number of rows.

                    Code:
                               1      2
                         +---------------+
                       1 |  2082   1911  |
                       2 |  2082    393  |
                       3 |  2082   1911  |
                       4 |  2082   1911  |
                       5 |  2082    393  |
                       6 |    32     21  |
                       7 |  2082   1911  |
                       8 |  2082   1911  |
                       9 |  2082   1911  |
                      10 |  2082   1911  |
                      11 |  2082   1911  |
                      12 |  2082   1911  |
                      13 |  2515    393  |
                         +---------------+
                      13
                    Next the two columns are stacked on top of each other

                    Code:
                                 1
                         +--------+
                       1 |  2082  |
                       2 |  2082  |
                       3 |  2082  |
                       4 |  2082  |
                       5 |  2082  |
                       6 |    32  |
                       7 |  2082  |
                       8 |  2082  |
                       9 |  2082  |
                      10 |  2082  |
                      11 |  2082  |
                      12 |  2082  |
                      13 |  2515  |
                      14 |  1911  |
                      15 |   393  |
                      16 |  1911  |
                      17 |  1911  |
                      18 |   393  |
                      19 |    21  |
                      20 |  1911  |
                      21 |  1911  |
                      22 |  1911  |
                      23 |  1911  |
                      24 |  1911  |
                      25 |  1911  |
                      26 |   393  |
                         +--------+
                    and missing values are dropped (which here makes no difference)

                    Code:
                               1
                         +--------+
                       1 |  2082  |
                       2 |  2082  |
                       3 |  2082  |
                       4 |  2082  |
                       5 |  2082  |
                       6 |    32  |
                       7 |  2082  |
                       8 |  2082  |
                       9 |  2082  |
                      10 |  2082  |
                      11 |  2082  |
                      12 |  2082  |
                      13 |  2515  |
                      14 |  1911  |
                      15 |   393  |
                      16 |  1911  |
                      17 |  1911  |
                      18 |   393  |
                      19 |    21  |
                      20 |  1911  |
                      21 |  1911  |
                      22 |  1911  |
                      23 |  1911  |
                      24 |  1911  |
                      25 |  1911  |
                      26 |   393  |
                         +--------+
                    Now we look for distinct actors using uniqrows() and thus count each just once as required

                    Code:
                          1
                        +--------+
                      1 |    21  |
                      2 |    32  |
                      3 |   393  |
                      4 |  1911  |
                      5 |  2082  |
                      6 |  2515  |
                        +--------+
                      6
                    Then we want that expanded to put back in the original variable:

                    Code:
                          1
                         +-----+
                       1 |  6  |
                       2 |  6  |
                       3 |  6  |
                       4 |  6  |
                       5 |  6  |
                       6 |  6  |
                       7 |  6  |
                       8 |  6  |
                       9 |  6  |
                      10 |  6  |
                      11 |  6  |
                      12 |  6  |
                      13 |  6  |
                         +-----+
                    So, where's the problem?

                    Comment


                    • #25
                      Fei Wang Thank you. You are right, the underlying values are the same. How is it possible that after the label change? It makes me a bit mess. On the other hand, I found another problem but it is mine. My original actor_id and actor_id2 are string variables. I transform them to numeric using encode command. When I do it, if actor A is in actor_id, it is received a different number than when actor A is in actor_id2. Hence, your code is counting two times actor A but it is because the way in which I am using encode. Do you understand what I mean? Is it any possibility to solve it?

                      Nick Cox You are completely right. The problem is mine. My original actor_id and actor_id2 are string variables. I transform them to numeric using encode command. When I do it, if actor A is in actor_id, it is received a different number than when actor A is in actor_id2. Hence, your code is counting two times actor A but it is because the way in which I am using encode. This is why I have been confused. Do you understand what I mean? Is it any possibility to solve it?

                      Comment


                      • #26
                        Diego Malo , there are ways of solving the problem (you may encode actor_id right after -reshape long- in my original code), but you may rethink if it's necessary to encode the actor names. Counting different names does not require actor ids to be numeric. An example as below:

                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input long gid int(year ccode_GID) str4(actor_id actor_id2)
                        62356 1997 216 ""     ""    
                        62357 1997 216 ""     ""    
                        79599 1997 216 ""     ""    
                        79600 1997 216 ""     ""    
                        79601 1997 216 ""     ""    
                        80317 2012 216 "2082" "393"
                        80317 2014 216 "2082" "393"
                        80317 2015 216 "2515" "393"
                        80317 2013 216 "1795" "393"
                        80317 2012 216 "2082" "393"
                        80317 2017 216 "2515" "393"
                        80317 2016 216 "2515" "393"
                        80317 2012 216 "2082" "393"
                        80317 2016 216 "2515" "393"
                        80317 2012 216 "2082" "393"
                        80317 2012 216 "2082" "1911"
                        80318 2012 216 "2082" "1911"
                        80318 2009 216 "2082" "393"
                        80318 2015 216 "2515" "393"
                        80318 2012 216 "2082" "393"
                        80318 2017 216 "2515" "393"
                        80318 2012 216 "2082" "1911"
                        80318 2002 216 "2515" "393"
                        80318 2015 216 "1258" "393"
                        80318 2012 216 "2082" "1911"
                        80318 2014 216 "2515" "393"
                        80318 1997 216 "2515" "393"
                        80318 2012 216 "2082" "393"
                        80318 2003 216 "2515" "393"
                        80318 1998 216 "2515" "393"
                        80318 2004 216 "2515" "392"
                        80318 2015 216 "2082" "1911"
                        80318 2012 216 "32"   "21"  
                        80318 2013 216 "2622" "335"
                        80318 2012 216 "2082" "1911"
                        80318 2012 216 "2082" "1911"
                        80318 2007 216 "2082" "1822"
                        80318 2012 216 "2082" "1911"
                        80318 2012 216 "2082" "1911"
                        80318 1998 216 "2515" "393"
                        80318 2014 216 "2515" "393"
                        80318 2012 216 "2082" "1911"
                        80318 2014 216 "2515" "393"
                        80318 2017 216 "2515" "393"
                        80318 2012 216 "2082" "1911"
                        80318 2013 216 "2515" "393"
                        80318 1997 216 "2515" "393"
                        80318 2003 216 "2515" "393"
                        80318 2013 216 "1795" "393"
                        80318 2015 216 "2082" "1822"
                        80318 2012 216 "2515" "393"
                        80318 2013 216 "2515" "393"
                        80319 2012 216 "2082" "1911"
                        80319 2012 216 "2082" "1911"
                        80320 1997 216 ""     ""    
                        80321 1997 216 ""     ""    
                        80322 1997 216 ""     ""    
                        80323 1997 216 ""     ""    
                        80324 1997 216 ""     ""    
                        80325 1997 216 ""     ""    
                        80326 1997 216 ""     ""    
                        80327 2012 216 "2082" "1911"
                        80327 2012 216 "2082" "1911"
                        80327 2012 216 "2082" "1911"
                        80328 1997 216 ""     ""    
                        80329 1997 216 ""     ""    
                        80330 2004 216 "2515" "393"
                        80331 1997 216 ""     ""    
                        80332 1997 216 ""     ""    
                        81037 2012 216 "2082" "408"
                        81037 2003 216 "2515" "1727"
                        81037 2000 216 "2515" "393"
                        81037 2014 216 "2515" "393"
                        81037 1998 216 "1909" "393"
                        81037 2004 216 "2515" "1727"
                        81037 2002 216 "2515" "393"
                        81037 2012 216 "2082" "1911"
                        81037 1999 216 "2515" "393"
                        81037 2015 216 "2082" "1822"
                        81037 1999 216 "33"   "2095"
                        81037 2002 216 "2082" "393"
                        81037 2013 216 "2082" "361"
                        81037 2015 216 "2082" "1911"
                        81037 2014 216 "2515" "393"
                        81037 1998 216 "1909" "393"
                        81037 2000 216 "2515" "393"
                        81037 2016 216 "33"   "393"
                        81037 2000 216 "2515" "393"
                        81037 1999 216 "2515" "361"
                        81037 2016 216 "2082" "381"
                        81037 2000 216 "2515" "393"
                        81037 2014 216 "2515" "393"
                        81037 1998 216 "1909" "393"
                        81037 2017 216 "2548" "392"
                        81037 2000 216 "2515" "393"
                        81037 2001 216 "2515" "393"
                        81037 2014 216 "2515" "393"
                        81037 1998 216 "2515" "393"
                        81037 2012 216 "2082" "1822"
                        81037 2014 216 "2515" "393"
                        end
                        
                        ren actor_id actor_id1
                        gen id = _n
                        reshape long actor_id, i(id) j(j)
                        
                        bys gid year (actor_id): gen num = actor_id != actor_id[_n-1] & actor_id != ""
                        bys gid year: egen wanted = total(num)
                        drop num
                        
                        reshape wide actor_id, i(id) j(j)
                        drop id
                        ren actor_id1 actor_id
                        Last edited by Fei Wang; 08 Nov 2021, 09:49.

                        Comment


                        • #27
                          Fei Wang Pfff... Now, it's perfect. I eliminate the encode step. Really, thank you so much Fei. It works very well.

                          Comment


                          • #28
                            Thanks for #25. My code could be adapted for string variables, as below.


                            Otherwise the answer is surely to use destring not encode.

                            Code:
                            * Example generated by -dataex-. For more info, type help dataex
                            clear
                            input long gid int(year ccode_GID) long(actor_id actor_id2)
                            80318 2012 216 2082 1911
                            80318 2012 216 2082  393
                            80318 2012 216 2082 1911
                            80318 2012 216 2082 1911
                            80318 2012 216 2082  393
                            80318 2012 216   32   21
                            80318 2012 216 2082 1911
                            80318 2012 216 2082 1911
                            80318 2012 216 2082 1911
                            80318 2012 216 2082 1911
                            80318 2012 216 2082 1911
                            80318 2012 216 2082 1911
                            80318 2012 216 2515  393
                            end
                            
                            gen wanted = 0
                            gen this = 0
                            egen group = group(gid year) , label
                            
                            tostring actor_*, replace 
                            
                            su group, meanonly
                            
                            forval g = 1/`r(max)' {
                                 quietly replace this = group == `g'
                                 mata : data = st_sdata(., ("actor_id actor_id2"), "this")
                                 mata : data 
                                 mata : obs = rows(data)
                                 mata : obs 
                                 mata : data = data[,1] \ data[,2]
                                 mata : data 
                                 mata : data = select(data, !(data :== ""))
                                 mata : data 
                                 mata : uniqrows(data)
                                 mata : rows(uniqrows(data))
                                 mata : count = J(obs, 1, rows(uniqrows(data)))
                                 mata : count 
                                 mata : st_store(., "wanted", "this", count)
                            }
                            
                            tabdisp group, c(wanted)

                            Comment


                            • #29
                              Thank you Nick Cox, your code works very well. To be honest, I do not have the enough level in Stata to understand it, but it works perfect. Thank you again.

                              Diego.

                              Comment


                              • #30
                                Glad you solved your problem. Note that back in #1 you told us your actor variables were long, so working on your own with string variables was something we couldn't know and was only too likely to confuse you.

                                Also, the procedure in #20 was spelled out step by step in #24 and here is some pseudo code.

                                for each gid and year
                                copy actor data (2 variables) into Mata
                                stack 2 columns into one
                                lose missing values
                                reduce to unique (distinct) rows
                                count rows
                                copy result back into Stata variable

                                Separate use of encode for several string variables that have related values is such a really bad idea that it deserves a stronger warning against it in the manuals.

                                Back in #20: unique is a community-contributed command from SSC. and

                                Code:
                                search unique
                                does tell you that (among many other things). George Ford forgot to tell you that.

                                Comment

                                Working...
                                X