Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Duplicates without Deleting

    Hello there. I am a graduate student and relatively new to STATA, but I have found posts on this forum very helpful in the past. I am hoping you can help with my current problem.

    I am working with a very "cleaned" (for sake of clarification, a very limited) set of patent data. Patents are given a uniquely-identifying grant number (var: GrantNumber) and are classified in CPC codes (var: CPC). However, patents may be (and are often) filed in multiple CPC codes. In the dataset with which I am working, there are 60,172 unique grant numbers (GrantNumber), only 3 CPC codes (CPC), but 195,582 observations.

    There are many unique grant numbers (GrantNumber) that are duplicated, either two or three times, depending on how many CPC codes in which that patent is classified. For an example, a portion of my data looks like this:

    Grant Number CPC
    6171609 A61K0009
    6171609 A61K0047
    6171609 B82Y0005
    6171611 A61K0009
    6171612 A61K0047
    6171613 A61K0009
    6171614 A61K0009
    6171614 A61K0047












    I want my data to look like this:
    GrantNumber CPC FirstCPC SecondCPC ThirdCPC
    6171609 A61K0009 A61K0009 A61K0047 B82Y0005
    6171609 A61K0047 . . .
    6171609 B82Y0005 . . .
    6171611 A61K0009 A61K0009 . .
    6171612 A61K0047 A61K0047 . .
    6171613 A61K0009 A61K0009 . .
    6171614 A61K0009 A61K0009 A61K0047 .
    6171614 A61K0047 . . .












    Then, I could simply delete the duplicate GrantNumbers, leaving a final result of:
    GrantNumber CPC FirstCPC SecondCPC ThirdCPC
    6171609 A61K0009 A61K0009 A61K0047 B82Y0005
    6171611 A61K0009 A61K0009 . .
    6171612 A61K0047 A61K0047 . .
    6171613 A61K0009 A61K0009 . .
    6171614 A61K0009 A61K0009 A61K0047 .








    I've already performed the following commands, but none of them accomplish what I am trying to accomplish:
    • egen GrantNumber_count = count(GrantNumber), by(CPC) --- Generates a variable that shows the count of GrantNumber for each of the three CPC codes. Helpful, but not what I am looking for.
    • egen CPC_count = count(CPC), by(GrantNumber) --- Generates a variable that shows the count of CPC codes for each GrantNumber. Again, helpful and closer to what I want, but not what I am looking for.
    • The following group of commands:
      • sort GrantNumber --- To prepare for generating a duplicate variable.
      • quietly by GrantNumber: gen dup = cond(_N==1, 0, _n) --- Generates a variable that indicates which observations are duplicates (dup==2 and dup==3), so that I can eventually drop these duplicates like I mentioned above.
    Simply knowing that some GrantNumbers are in multiple CPC codes is not enough. I need to how many GrantNumbers appear in the the following categories:​​​​​​
    • Only Two CPCs: A61K0009 and A61K0047
    • Only Two CPCs: A61K0009 and B82Y0005
    • Only Two CPCs: A61K0047 and B82Y0005
    • All Three CPCs: A61K0009 and A61K0047 and B82Y0005
    Any help would be greatly appreciated. Thank you in advance!

    Per the rules of the Statalist Forum, my data looks like this:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    ` dataex GrantNumber CPC
    clear
    input long GrantNumber str8 CPC
    6167880 "A61K0009"
    6167887 "A61K0009"
    6168778 "A61K0047"
    6168796 "A61K0047"
    6168801 "A61K0009"
    6168804 "A61K0009"
    6168805 "A61K0009"
    6168806 "A61K0009"
    6169068 "A61K0009"
    6169078 "A61K0047"
    6169083 "A61K0009"
    6169083 "A61K0047"
    6169084 "A61K0009"
    6169084 "A61K0047"
    6169099 "A61K0009"
    6169100 "A61K0009"
    6169117 "A61K0009"
    6169130 "A61K0009"
    6169194 "B82Y0005"
    6170717 "A61K0009"
    6171577 "A61K0047"
    6171580 "B82Y0005"
    6171591 "A61K0047"
    6171599 "A61K0009"
    6171600 "A61K0009"
    6171606 "A61K0009"
    6171609 "A61K0009"
    6171609 "A61K0047"
    6171609 "B82Y0005"
    6171611 "A61K0009"
    6171612 "A61K0047"
    6171613 "A61K0009"
    6171614 "A61K0009"
    6171614 "A61K0047"
    6171615 "A61K0009"
    6171616 "A61K0009"
    6171617 "A61K0009"
    6171618 "A61K0009"
    6171619 "A61K0009"
    6171859 "A61K0047"
    6172040 "A61K0047"
    6172045 "A61K0047"
    6172045 "B82Y0005"
    6172048 "A61K0009"
    6172048 "A61K0047"
    6172049 "A61K0009"
    6172049 "A61K0047"
    6172089 "A61K0009"
    6172090 "A61K0009"
    6172107 "A61K0009"
    6172202 "A61K0047"
    6172208 "A61K0047"
    6172219 "A61K0009"
    6172219 "A61K0047"
    6172261 "A61K0047"
    6174517 "B82Y0005"
    6174518 "A61K0009"
    6174524 "A61K0009"
    6174529 "A61K0009"
    6174529 "A61K0047"
    6174530 "A61K0047"
    6174533 "B82Y0005"
    6174539 "A61K0047"
    6174540 "A61K0009"
    6174540 "A61K0047"
    6174543 "A61K0009"
    6174543 "A61K0047"
    6174545 "A61K0009"
    6174546 "A61K0009"
    6174547 "A61K0009"
    6174548 "A61K0009"
    6174687 "A61K0047"
    6174723 "A61K0047"
    6174856 "A61K0009"
    6174856 "A61K0047"
    6174858 "A61K0047"
    6174866 "A61K0009"
    6174873 "A61K0009"
    6174873 "A61K0047"
    6174902 "A61K0009"
    6174904 "A61K0009"
    6174999 "A61K0009"
    6174999 "A61K0047"
    6175054 "A61K0009"
    6177059 "A61K0009"
    6177059 "A61K0047"
    6177062 "B82Y0005"
    6177068 "A61K0009"
    6177074 "A61K0047"
    6177076 "A61K0009"
    6177087 "A61K0047"
    6177095 "A61K0009"
    6177096 "A61K0009"
    6177098 "A61K0009"
    6177099 "A61K0009"
    6177100 "B82Y0005"
    6177101 "A61K0009"
    6177102 "A61K0009"
    6177103 "A61K0009"
    6177104 "A61K0009"
    end
    Last edited by Daley Gay; 16 Dec 2021, 00:16.

  • #2
    Reshape wide is what you need.

    Code:
    bys GrantNumber (CPC): gen j = _n
    reshape wide CPC, i(GrantNumber) j(j)

    Comment


    • #3
      While I agree that the reshape command is the tool you need, I think that you can improve upon the objective that you describe in post #1, since you only have three possible values of CPC.

      Consider the following example, where I have used the data you actually presented in your explanation to create a useful dataex sample to demonstrate the results. (You could have run the command
      Code:
      dataex if GrantNumber>=6171609 & GrantNumber<=6171614
      to create this sample.)
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long GrantNumber str8 CPC
      6171609 "A61K0009"
      6171609 "A61K0047"
      6171609 "B82Y0005"
      6171611 "A61K0009"
      6171612 "A61K0047"
      6171613 "A61K0009"
      6171614 "A61K0009"
      6171614 "A61K0047"
      end
      generate has = 1
      reshape wide has, i(GrantNumber) j(CPC) string
      mvencode has*, mv(0)
      rename (has*) (*)
      list, abbreviate(16)
      The result is one indicator variable for each CPC that is 1 if the specific CPC was present and 0 otherwise.
      Code:
      . generate has = 1
      
      . reshape wide has, i(GrantNumber) j(CPC) string
      (j = A61K0009 A61K0047 B82Y0005)
      
      Data                               Long   ->   Wide
      -----------------------------------------------------------------------------
      Number of observations                8   ->   5           
      Number of variables                   3   ->   4           
      j variable (3 values)               CPC   ->   (dropped)
      xij variables:
                                          has   ->   hasA61K0009 hasA61K0047 hasB82Y0005
      -----------------------------------------------------------------------------
      
      . mvencode has*, mv(0)
       hasA61K0009: 1 missing value recoded
       hasA61K0047: 2 missing values recoded
       hasB82Y0005: 4 missing values recoded
      
      . rename (has*) (*)
      
      . list, abbreviate(16)
      
           +----------------------------------------------+
           | GrantNumber   A61K0009   A61K0047   B82Y0005 |
           |----------------------------------------------|
        1. |     6171609          1          1          1 |
        2. |     6171611          1          0          0 |
        3. |     6171612          0          1          0 |
        4. |     6171613          1          0          0 |
        5. |     6171614          1          1          0 |
           +----------------------------------------------+
      
      .

      Comment


      • #4
        Fei Wang and William Lisowski have given you good advice how to achieve what you asked. So I'll spoil the party and point out that you may regret getting what you wished for. Your original data are in long layout and you will have transformed it to wide. While this makes it easier for humans to read, and is the conventional organization of such data in spreadsheets, the wide layout is generally less useful in Stata. While some commands work well with wide data in Stata, they are a minority. Most Stata analysis or data management commands work best, or only, with the long layout you started with in the first place. It will not surprise me if your next Statalist post is a question about how to do something with this data, the answer to which begins with going back to the original long layout.

        Comment


        • #5
          While I normally agree with Clyde's recommendation of the long layout, my advice was based on the following points in post #1:
          • Only a single variable with three possible values was identified as differing across the observations
          • The poster is new to Stata, and working with a long layout requires some skill at thinking "across observations"
          • For this task, the bulleted categories requested in post #1 can be coded in a straightforward fashion in the wide layout
          Code:
          generate byte code =  0
          replace code = 1 if  A61K0009 &  A61K0047 & !B82Y0005
          replace code = 2 if  A61K0009 & !A61K0047 &  B82Y0005
          replace code = 3 if !A61K0009 &  A61K0047 &  B82Y0005
          replace code = 4 if  A61K0009 &  A61K0047 &  B82Y0005
          label define CODE 0 "1 CPC"               ///
                            1 "A61K0009 & A61K0047" ///
                            2 "A61K0009 & B82Y0005" ///
                            3 "A61K0047 & B82Y0005" ///
                            4 "3 CPCs"
          label values code CODE
          Code:
          . list, abbreviate(16)
          
               +--------------------------------------------------------------------+
               | GrantNumber   A61K0009   A61K0047   B82Y0005                  code |
               |--------------------------------------------------------------------|
            1. |     6171609          1          1          1                3 CPCs |
            2. |     6171611          1          0          0                 1 CPC |
            3. |     6171612          0          1          0                 1 CPC |
            4. |     6171613          1          0          0                 1 CPC |
            5. |     6171614          1          1          0   A61K0009 & A61K0047 |
               +--------------------------------------------------------------------+
          
          . list, abbreviate(16) nolabel
          
               +-----------------------------------------------------+
               | GrantNumber   A61K0009   A61K0047   B82Y0005   code |
               |-----------------------------------------------------|
            1. |     6171609          1          1          1      4 |
            2. |     6171611          1          0          0      0 |
            3. |     6171612          0          1          0      0 |
            4. |     6171613          1          0          0      0 |
            5. |     6171614          1          1          0      1 |
               +-----------------------------------------------------+
          With that said, let me add the following advice that I routinely give to posters who identify as being new to Stata.

          I'm sympathetic to you as a new user of Stata - there is quite a lot to absorb.

          When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.

          The objective in doing the reading was not so much to master Stata - I'm still far from that goal - as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

          Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.

          Stata also supples YouTube videos, if that's your thing.

          Comment

          Working...
          X