Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a total using only distinct observations within a group

    Hi all,

    I've got a variable (total_patents_91_05) that lists the total number of patents per PERSON_ID for a set period. However, I'd like to create another total based on the PSN_ID that would be the sum the distinct totals listed in PERSON_ID. So, as can be seen below, I would need to sum 310 + 1796 + 2369 + 0 + 455 ... etc. I started by marking the distinct observations:

    Code:
    by PERSON_ID sort: generate nvals = _n ==1
    But I'm not sure where to go from here. Any suggestions?


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long PERSON_ID str4 IPC_CLASS int total_patents_91_05 long PSN_ID
    5212052 "A61M"  310 55098
    5701047 "A44B" 1796 55098
    5701047 "C07H" 1796 55098
    5701047 "A47G" 1796 55098
    5212052 "G08G"  310 55098
    5231021 "D06L" 2369 55098
    5212052 "H01G"  310 55098
    5212052 "H01J"  310 55098
    5212052 "B43M"    0 55098
    5212052 "H01L"  310 55098
    5212052 "H01M"  310 55098
    5212052 "H04N"  310 55098
    5212052 "B66B"    0 55098
    5212052 "H01S"  310 55098
    5212056 "A47B"  455 55098
    5701047 "B21F" 1796 55098
    5212056 "B05D"  455 55098
    5212056 "C40B"    0 55098
    5212056 "B23P"  455 55098
    5212056 "B32B"  455 55098
    5212052 "F41G"    0 55098
    5212056 "B41F"  455 55098
    5231021 "G01L" 2369 55098
    5216166 "A21C"    0 55098
    5216166 "B29B"    0 55098
    5212056 "B66F"  455 55098
    5212056 "B67B"  455 55098
    5231021 "D01F" 2369 55098
    5212056 "C07H"  455 55098
    5212056 "C08G"  455 55098
    5231021 "H01B" 2369 55098
    5216166 "B62J"    0 55098
    5212056 "C09J"  455 55098
    5212056 "C10M"  455 55098
    5212056 "C12M"  455 55098
    5212056 "C12P"  455 55098
    5212056 "C22C"  455 55098
    5212056 "D06C"  455 55098
    5701047 "D04B" 1796 55098
    5231021 "B65D" 2369 55098
    5212056 "D21H"  455 55098
    5216166 "G04G"    0 55098
    5216166 "F23C"    0 55098
    5212056 "F21K"  455 55098
    5231021 "F16J" 2369 55098
    5212056 "F28F"  455 55098
    5212052 "G01V"    0 55098
    5212056 "F27B"  455 55098
    5212056 "G02B"  455 55098
    5212056 "G03C"  455 55098
    5701047 "C08K" 1796 55098
    5216166 "G11B"    0 55098
    5701047 "F16F" 1796 55098
    5231021 "F27D" 2369 55098
    5212056 "H01S"  455 55098
    5212056 "H02G"  455 55098
    5216166 "H01M"   16 55098
    5212056 "H05K"  455 55098
    5212056 "F24D"    0 55098
    5212052 "B66C"    0 55098
    5212056 "G01P"    0 55098
    5216166 "H02G"    0 55098
    5216166 "C09D"   16 55098
    5212056 "H01M"  455 55098
    5212052 "F24D"    0 55098
    5701047 "G06G" 1796 55098
    5216166 "H01T"    0 55098
    5701047 "F01D" 1796 55098
    5701047 "F02B" 1796 55098
    5701047 "A47B" 1796 55098
    5231021 "H01H" 2369 55098
    5701047 "H01G" 1796 55098
    5212056 "H04W"    0 55098
    5212056 "D04D"    0 55098
    5701047 "G09F" 1796 55098
    5212056 "A21C"    0 55098
    5212052 "B32B"  310 55098
    5701047 "C07D" 1796 55098
    5216166 "G08B"    0 55098
    5231021 "A61K" 2369 55098
    5212052 "D01F"  310 55098
    5701047 "B43L" 1796 55098
    5231021 "G06Q" 2369 55098
    5231021 "B28B" 2369 55098
    5216166 "G06K"    0 55098
    5212056 "C09C"    0 55098
    5216166 "B24D"   16 55098
    5212056 "E03D"    0 55098
    5701047 "H01Q" 1796 55098
    5216166 "B23D"    0 55098
    5212056 "B65H"  455 55098
    5212056 "F16K"  455 55098
    5216166 "A44B"    0 55098
    5231021 "D01D" 2369 55098
    5701047 "F25B" 1796 55098
    5212052 "G01B"  310 55098
    5216166 "A61P"    0 55098
    5231021 "A63C" 2369 55098
    5212056 "F28C"  455 55098
    5216166 "B41C"    0 55098
    end

  • #2
    Code:
    bys PSN_ID: egen wanted=total(cond(nvals, total_patents_91_05, .))
    For an elaboration of this technique, see https://www.stata-journal.com/articl...article=dm0055.

    I started by marking the distinct observations:
    Note that the variable "total_patents_91_05" does not appear to be constant within PERSON_ID, so your code tags an arbitrary observation (the one sorted first). I do not know specifically what you mean by "distinct observations" to suggest how to tag the observations if the above is not what you wanted.
    Last edited by Andrew Musau; 17 Jan 2024, 12:20.

    Comment


    • #3
      Ah, I see what you mean.

      I'm looking to tag the max value of "total_patents_91_05" for each PERSON_ID and then sum across PSN_ID.

      Comment


      • #4
        That's a scalar (single number) if I understand correctly. Here it is done two ways, using summarize and done over the top by using more variables.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long PERSON_ID str4 IPC_CLASS int total_patents_91_05 long PSN_ID
        5212052 "A61M"  310 55098
        5701047 "A44B" 1796 55098
        5701047 "C07H" 1796 55098
        5701047 "A47G" 1796 55098
        5212052 "G08G"  310 55098
        5231021 "D06L" 2369 55098
        5212052 "H01G"  310 55098
        5212052 "H01J"  310 55098
        5212052 "B43M"    0 55098
        5212052 "H01L"  310 55098
        5212052 "H01M"  310 55098
        5212052 "H04N"  310 55098
        5212052 "B66B"    0 55098
        5212052 "H01S"  310 55098
        5212056 "A47B"  455 55098
        5701047 "B21F" 1796 55098
        5212056 "B05D"  455 55098
        5212056 "C40B"    0 55098
        5212056 "B23P"  455 55098
        5212056 "B32B"  455 55098
        5212052 "F41G"    0 55098
        5212056 "B41F"  455 55098
        5231021 "G01L" 2369 55098
        5216166 "A21C"    0 55098
        5216166 "B29B"    0 55098
        5212056 "B66F"  455 55098
        5212056 "B67B"  455 55098
        5231021 "D01F" 2369 55098
        5212056 "C07H"  455 55098
        5212056 "C08G"  455 55098
        5231021 "H01B" 2369 55098
        5216166 "B62J"    0 55098
        5212056 "C09J"  455 55098
        5212056 "C10M"  455 55098
        5212056 "C12M"  455 55098
        5212056 "C12P"  455 55098
        5212056 "C22C"  455 55098
        5212056 "D06C"  455 55098
        5701047 "D04B" 1796 55098
        5231021 "B65D" 2369 55098
        5212056 "D21H"  455 55098
        5216166 "G04G"    0 55098
        5216166 "F23C"    0 55098
        5212056 "F21K"  455 55098
        5231021 "F16J" 2369 55098
        5212056 "F28F"  455 55098
        5212052 "G01V"    0 55098
        5212056 "F27B"  455 55098
        5212056 "G02B"  455 55098
        5212056 "G03C"  455 55098
        5701047 "C08K" 1796 55098
        5216166 "G11B"    0 55098
        5701047 "F16F" 1796 55098
        5231021 "F27D" 2369 55098
        5212056 "H01S"  455 55098
        5212056 "H02G"  455 55098
        5216166 "H01M"   16 55098
        5212056 "H05K"  455 55098
        5212056 "F24D"    0 55098
        5212052 "B66C"    0 55098
        5212056 "G01P"    0 55098
        5216166 "H02G"    0 55098
        5216166 "C09D"   16 55098
        5212056 "H01M"  455 55098
        5212052 "F24D"    0 55098
        5701047 "G06G" 1796 55098
        5216166 "H01T"    0 55098
        5701047 "F01D" 1796 55098
        5701047 "F02B" 1796 55098
        5701047 "A47B" 1796 55098
        5231021 "H01H" 2369 55098
        5701047 "H01G" 1796 55098
        5212056 "H04W"    0 55098
        5212056 "D04D"    0 55098
        5701047 "G09F" 1796 55098
        5212056 "A21C"    0 55098
        5212052 "B32B"  310 55098
        5701047 "C07D" 1796 55098
        5216166 "G08B"    0 55098
        5231021 "A61K" 2369 55098
        5212052 "D01F"  310 55098
        5701047 "B43L" 1796 55098
        5231021 "G06Q" 2369 55098
        5231021 "B28B" 2369 55098
        5216166 "G06K"    0 55098
        5212056 "C09C"    0 55098
        5216166 "B24D"   16 55098
        5212056 "E03D"    0 55098
        5701047 "H01Q" 1796 55098
        5216166 "B23D"    0 55098
        5212056 "B65H"  455 55098
        5212056 "F16K"  455 55098
        5216166 "A44B"    0 55098
        5231021 "D01D" 2369 55098
        5701047 "F25B" 1796 55098
        5212052 "G01B"  310 55098
        5216166 "A61P"    0 55098
        5231021 "A63C" 2369 55098
        5212056 "F28C"  455 55098
        5216166 "B41C"    0 55098
        end
        
        bysort PERSON_ID (total_patents_91_05) : gen ismax = _n == _N 
        su total_patents_91_05 if ismax 
        
        di r(sum)
        
        egen max = max(total_patents_91_05), by(PERSON_ID)
        egen tag = tag(PERSON_ID)
        egen double wanted = total(tag * max)
        
        di wanted[1]
        I get 4946 both ways with your data example.

        Warning: The first method will fall over with any missing values, but that can be fixed.

        Note that you had implicitly

        Code:
        . di 310 + 1796 + 2369 + 0 + 455
        4930
        but you missed a value of 16.

        Comment

        Working...
        X