Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sum by group the distinct values of two different columns into a new one.

    Hello,

    I have the following database:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long gid int(year ccode_GID) long(actor_id actor_id2)
    62356 1997 216    .    .
    62357 1997 216    .    .
    79599 1997 216    .    .
    79600 1997 216    .    .
    79601 1997 216    .    .
    80317 2012 216 2082  393
    80317 2014 216 2082  393
    80317 2015 216 2515  393
    80317 2013 216 1795  393
    80317 2012 216 2082  393
    80317 2017 216 2515  393
    80317 2016 216 2515  393
    80317 2012 216 2082  393
    80317 2016 216 2515  393
    80317 2012 216 2082  393
    80317 2012 216 2082 1911
    80318 2012 216 2082 1911
    80318 2009 216 2082  393
    80318 2015 216 2515  393
    80318 2012 216 2082  393
    80318 2017 216 2515  393
    80318 2012 216 2082 1911
    80318 2002 216 2515  393
    80318 2015 216 1258  393
    80318 2012 216 2082 1911
    80318 2014 216 2515  393
    80318 1997 216 2515  393
    80318 2012 216 2082  393
    80318 2003 216 2515  393
    80318 1998 216 2515  393
    80318 2004 216 2515  392
    80318 2015 216 2082 1911
    80318 2012 216   32   21
    80318 2013 216 2622  335
    80318 2012 216 2082 1911
    80318 2012 216 2082 1911
    80318 2007 216 2082 1822
    80318 2012 216 2082 1911
    80318 2012 216 2082 1911
    80318 1998 216 2515  393
    80318 2014 216 2515  393
    80318 2012 216 2082 1911
    80318 2014 216 2515  393
    80318 2017 216 2515  393
    80318 2012 216 2082 1911
    80318 2013 216 2515  393
    80318 1997 216 2515  393
    80318 2003 216 2515  393
    80318 2013 216 1795  393
    80318 2015 216 2082 1822
    80318 2012 216 2515  393
    80318 2013 216 2515  393
    80319 2012 216 2082 1911
    80319 2012 216 2082 1911
    80320 1997 216    .    .
    80321 1997 216    .    .
    80322 1997 216    .    .
    80323 1997 216    .    .
    80324 1997 216    .    .
    80325 1997 216    .    .
    80326 1997 216    .    .
    80327 2012 216 2082 1911
    80327 2012 216 2082 1911
    80327 2012 216 2082 1911
    80328 1997 216    .    .
    80329 1997 216    .    .
    80330 2004 216 2515  393
    80331 1997 216    .    .
    80332 1997 216    .    .
    81037 2012 216 2082  408
    81037 2003 216 2515 1727
    81037 2000 216 2515  393
    81037 2014 216 2515  393
    81037 1998 216 1909  393
    81037 2004 216 2515 1727
    81037 2002 216 2515  393
    81037 2012 216 2082 1911
    81037 1999 216 2515  393
    81037 2015 216 2082 1822
    81037 1999 216   33 2095
    81037 2002 216 2082  393
    81037 2013 216 2082  361
    81037 2015 216 2082 1911
    81037 2014 216 2515  393
    81037 1998 216 1909  393
    81037 2000 216 2515  393
    81037 2016 216   33  393
    81037 2000 216 2515  393
    81037 1999 216 2515  361
    81037 2016 216 2082  381
    81037 2000 216 2515  393
    81037 2014 216 2515  393
    81037 1998 216 1909  393
    81037 2017 216 2548  392
    81037 2000 216 2515  393
    81037 2001 216 2515  393
    81037 2014 216 2515  393
    81037 1998 216 2515  393
    81037 2012 216 2082 1822
    81037 2014 216 2515  393
    end
    I would like to sum the number of different actors of the two columns (actor_id and actor_id2) by gid and year. In other words, I would like to have a column with the sum of the distincts values of the variables actor_id and actor_id2 by gid and year. When there are some missing values, I would like taking into account them as a 0. Variable ccode_GID does not change by gid (it is time invariant)

    Do you think it is feasible to do it? I do not find the way. Any suggestion is more than welcome.

    Best,

    Diego.

  • #2
    Code:
    mvencode actor_id*, mv(0)
    
    by gid year (actor_id), sort: gen distinct_actor_id = sum(actor_id != actor_id[_n-1])
    by gid year (actor_id2), sort: gen distinct_actor_id2 = sum(actor_id2 != actor_id2[_n-1])
    by gid year: gen wanted = distinct_actor_id[_N] + distinct_actor_id2[_N]

    Comment


    • #3
      Diego:
      do you mean:
      Code:
      . bysort gid (year): gen wanted= actor_id+ actor_id2
      
      
      . replace wanted=0 if wanted==.
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4

        If you want the non-missing values included in the sum, then replace missing values with 0 first. Or, perhaps better, create a new variable and then replace the missing with 0 on it, thus saving the original variable.

        Comment


        • #5
          Diego:
          as a second thought, I would propose:
          Code:
          egen wanted=group( actor_id actor_id2 )
          bysort gid year: egen count=count(wanted)
          Kind regards,
          Carlo
          (Stata 19.0)

          Comment


          • #6
            Thank you for all your answers.

            Clyde Schechter Thank you for your answer. However, I find some problems with the code you proposes. For example, for cell 80318 and year 2012 (it is in the example I gave in the post), it should be 6 different actors and I obtained 5 with your code. It is not finishing to count well all the number of different actors. Another problem, it is that the missing are counting as a different actor, I want to count them as a 0, that is, the missing is not a different actor.

            Carlo Lazzaro Thank you for your code Lazaro. I think your code is summing the total numbers and what I would like to have is the total number of different actors by grid and year, without counting the missing values as a distinct actors.

            George Ford Thank you for your question. However, If I do that, I am considering them as a different actor and I do not want to consider them as a different actor.

            It's a bit difficult to get what I want, I do not finish to get it.
            Last edited by Diego Malo; 04 Nov 2021, 01:54.

            Comment


            • #7
              Thank you Carlo Lazzaro for your second thought. I appreciate it. However, I do not understand well why but it is not counting well the number of different actors. For example, for gid 80317 and year 2012, it should be 3 different actors, with your code I obtained 5.

              Comment


              • #8
                Diego Malo , based on your description in #6, is the following code what you need?

                Code:
                ren actor_id actor_id1
                gen id = _n
                reshape long actor_id, i(id) j(j)
                
                bys gid year (actor_id): gen num = actor_id != actor_id[_n-1] & actor_id != .
                bys gid year: egen wanted = total(num)
                drop num
                
                reshape wide actor_id, i(id) j(j)
                drop id
                ren actor_id1 actor_id

                Comment


                • #9
                  Diego:
                  Code:
                  . egen wanted=group( actor_id actor_id2 )
                  
                  . bysort gid year (wanted): gen flag=1 if wanted[_n]!=wanted[_n+1]
                  
                  . bysort gid year: egen Num_actor=total(flag)
                  Kind regards,
                  Carlo
                  (Stata 19.0)

                  Comment


                  • #10
                    Fei Wang it seems that it works!! Thank you so much!! Can you explain to me a bit what your code does? I would like to learn it.

                    Carlo Lazzaro Thank you Carlo for your code. However, it is still not counting well. Grid 80317 and year 2012 should have 3 distinc actors, with your code I obtain 2. Another example, grid 80318 and year 2012 should be 6 actors, with your code I obtain 4.

                    Comment


                    • #11
                      You're welcome, Diego. As you are counting the number of different actor IDs in "actor_id" and "actor_id2" together, it would be convenient to combine the two actor ID variables to one single column, and count. I used -reshape long- to combine the two columns, and after counting, used -reshape wide- to restore your original data form -- That's the algorithm of my code. BTW, I think concrete examples in the OP, like what you gave in #6 and #7, would help us better understand your question from the very beginning -- I didn't get your point either until I read #6.

                      Comment


                      • #12
                        Diego:
                        you're correct, as my last code counts the distinct actors in -actor_id2- only.
                        Kind regards,
                        Carlo
                        (Stata 19.0)

                        Comment


                        • #13
                          Fei Wang Thank you again!. I will try to give more examples when I explain things. On the other hand, if I am not wrong, ren is rename, so you are rename my column actor_id as actor_id1, but what happens with actor_id2? Sorry for my ignorance but I just would like to learn a bit of Stata.

                          Carlo Lazzaro it is ok, thank you for your help!.

                          Anyway, do you know if it is any possibility to do it with command egen and tag, as Nick Cox proposed me here Sum variables taking into account missing data - Statalist ?

                          Thank you,

                          Diego.
                          Last edited by Diego Malo; 04 Nov 2021, 03:51.

                          Comment


                          • #14
                            Diego, yes, -ren- is -rename-, it's to rename your "actor_id" to "actor_id1" for the purpose of using -reshape long-. I don't need to do this for "actor_id2" because it has a suffix "2".

                            To my understanding, Nick's wise solution using -tag()- is consistent with the use of -group()- in this thread. But I think it may not fit your case. For example, for "actor_id" and "actor_id2", if the first line records "1" and "2", and the second line is "1" and "3", then there are three different actors while -tag()- will only mark two distinct groups. After combining the two IDs into one, -tag()- can be used as in the other thread.

                            Comment


                            • #15
                              I understood, thank you again Fei Wang !

                              Comment

                              Working...
                              X