  • Creating variable showing number of duplicates

    Hello

    I'm analyzing patent and inventor data. A patent can have several inventors, and there is one observation for each patent-inventor combination. A small snapshot of the data:

    patent_id invt_id
    05915552 05915552-2
    05915552 05915552-1
    05921324 04815261-1
    05921607 04777746-1
    05921607 05921607-1
    05922435 05345825-2
    05922435 04352292-2
    05922435 05922435-4
    05922435 05922435-1
    05922435 05922435-5
    05922498 05434657-2
    05922498 04921769-1
    05922498 04050935-3

    I'd like to create a variable showing the 'team size' of a patent on each observation, e.g. patent 05915552 has 2 inventors, patent 05922435 has 5 inventors, etc.
    The command 'duplicates report patent_id' gives a 'copies' column; I want a variable on every observation showing how many copies there are of its patent_id value.

    Any suggestion on this?

    Thanks
    Ludo

  • #2
    Welcome to Statalist.

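    The duplicates tag command is what you need, with a small fix afterwards: it counts the surplus copies of each observation (0 for a patent_id that appears only once), so add 1 to get the team size.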
    Code:
    . * Example generated by -dataex-. To install: ssc install dataex
    . clear
    
    . input str8 patent_id str10 invt_id
    
         patent_id     invt_id
      1. "05915552" "05915552-2"
      2. "05915552" "05915552-1"
      3. "05921324" "04815261-1"
      4. "05921607" "04777746-1"
      5. "05921607" "05921607-1"
      6. "05922435" "05345825-2"
      7. "05922435" "04352292-2"
      8. "05922435" "05922435-4"
      9. "05922435" "05922435-1"
     10. "05922435" "05922435-5"
     11. "05922498" "05434657-2"
     12. "05922498" "04921769-1"
     13. "05922498" "04050935-3"
     14. end
    
    . 
    . duplicates tag patent_id, generate(copies)
    
    Duplicates in terms of patent_id
    
    . replace copies = copies+1
    (13 real changes made)
    
    . list, noobs sepby(patent_id)
    
      +--------------------------------+
      | patent~d      invt_id   copies |
      |--------------------------------|
      | 05915552   05915552-2        2 |
      | 05915552   05915552-1        2 |
      |--------------------------------|
      | 05921324   04815261-1        1 |
      |--------------------------------|
      | 05921607   04777746-1        2 |
      | 05921607   05921607-1        2 |
      |--------------------------------|
      | 05922435   05345825-2        5 |
      | 05922435   04352292-2        5 |
      | 05922435   05922435-4        5 |
      | 05922435   05922435-1        5 |
      | 05922435   05922435-5        5 |
      |--------------------------------|
      | 05922498   05434657-2        3 |
      | 05922498   04921769-1        3 |
      | 05922498   04050935-3        3 |
      +--------------------------------+


    • #3
      Also

      Code:
      bysort patent_id : gen copies = _N
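
      The approaches in #2 and #3 should produce the same variable. A minimal sketch that checks this on the first five rows of the example data; the names copies_tag and copies_n are made up for illustration, and the isid line rests on the assumption that each patent-inventor pair appears only once (otherwise the row count would overstate the team size):

```stata
* Sketch only: build the team-size variable two ways and confirm they agree.
clear
input str8 patent_id str10 invt_id
"05915552" "05915552-2"
"05915552" "05915552-1"
"05921324" "04815261-1"
"05921607" "04777746-1"
"05921607" "05921607-1"
end

* Assumption: one row per patent-inventor pair; -isid- stops with an error otherwise.
isid patent_id invt_id

* Approach from #2: -duplicates tag- counts surplus copies, so add 1.
duplicates tag patent_id, generate(copies_tag)
replace copies_tag = copies_tag + 1

* Approach from #3: within each by-group, _N is the number of observations.
bysort patent_id: generate copies_n = _N

* Stops with an error if the two variables differ on any observation.
assert copies_tag == copies_n

list, noobs sepby(patent_id)
```

      If the final assert passes silently, the two variables are identical, and either can serve as the team size.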


      • #4
        Worked fine for me, thanks!
