Merging Duplicates without Deleting

Daley Gay

Join Date: Dec 2021
Posts: 1

Merging Duplicates without Deleting

15 Dec 2021, 22:42

Hello there. I am a graduate student and relatively new to STATA, but I have found posts on this forum very helpful in the past. I am hoping you can help with my current problem.

I am working with a very "cleaned" (for sake of clarification, a very limited) set of patent data. Patents are given a uniquely-identifying grant number (var: GrantNumber) and are classified in CPC codes (var: CPC). However, patents may be (and are often) filed in multiple CPC codes. In the dataset with which I am working, there are 60,172 unique grant numbers (GrantNumber), only 3 CPC codes (CPC), but 195,582 observations.

There are many unique grant numbers (GrantNumber) that are duplicated, either two or three times, depending on how many CPC codes in which that patent is classified. For an example, a portion of my data looks like this:

Grant Number	CPC
6171609	A61K0009
6171609	A61K0047
6171609	B82Y0005
6171611	A61K0009
6171612	A61K0047
6171613	A61K0009
6171614	A61K0009
6171614	A61K0047

I want my data to look like this:

GrantNumber	CPC	FirstCPC	SecondCPC	ThirdCPC
6171609	A61K0009	A61K0009	A61K0047	B82Y0005
6171609	A61K0047	.	.	.
6171609	B82Y0005	.	.	.
6171611	A61K0009	A61K0009	.	.
6171612	A61K0047	A61K0047	.	.
6171613	A61K0009	A61K0009	.	.
6171614	A61K0009	A61K0009	A61K0047	.
6171614	A61K0047	.	.	.

Then, I could simply delete the duplicate GrantNumbers, leaving a final result of:

GrantNumber	CPC	FirstCPC	SecondCPC	ThirdCPC
6171609	A61K0009	A61K0009	A61K0047	B82Y0005
6171611	A61K0009	A61K0009	.	.
6171612	A61K0047	A61K0047	.	.
6171613	A61K0009	A61K0009	.	.
6171614	A61K0009	A61K0009	A61K0047	.

I've already performed the following commands, but none of them accomplish what I am trying to accomplish:

egen GrantNumber_count = count(GrantNumber), by(CPC) --- Generates a variable that shows the count of GrantNumber for each of the three CPC codes. Helpful, but not what I am looking for.
egen CPC_count = count(CPC), by(GrantNumber) --- Generates a variable that shows the count of CPC codes for each GrantNumber. Again, helpful and closer to what I want, but not what I am looking for.
The following group of commands:
- sort GrantNumber --- To prepare for generating a duplicate variable.
- quietly by GrantNumber: gen dup = cond(_N==1, 0, _n) --- Generates a variable that indicates which observations are duplicates (dup==2 and dup==3), so that I can eventually drop these duplicates like I mentioned above.

Simply knowing that some GrantNumbers are in multiple CPC codes is not enough. I need to how many GrantNumbers appear in the the following categories:

Only Two CPCs: A61K0009 and A61K0047
Only Two CPCs: A61K0009 and B82Y0005
Only Two CPCs: A61K0047 and B82Y0005
All Three CPCs: A61K0009 and A61K0047 and B82Y0005

Any help would be greatly appreciated. Thank you in advance!

Per the rules of the Statalist Forum, my data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
` dataex GrantNumber CPC
clear
input long GrantNumber str8 CPC
6167880 "A61K0009"
6167887 "A61K0009"
6168778 "A61K0047"
6168796 "A61K0047"
6168801 "A61K0009"
6168804 "A61K0009"
6168805 "A61K0009"
6168806 "A61K0009"
6169068 "A61K0009"
6169078 "A61K0047"
6169083 "A61K0009"
6169083 "A61K0047"
6169084 "A61K0009"
6169084 "A61K0047"
6169099 "A61K0009"
6169100 "A61K0009"
6169117 "A61K0009"
6169130 "A61K0009"
6169194 "B82Y0005"
6170717 "A61K0009"
6171577 "A61K0047"
6171580 "B82Y0005"
6171591 "A61K0047"
6171599 "A61K0009"
6171600 "A61K0009"
6171606 "A61K0009"
6171609 "A61K0009"
6171609 "A61K0047"
6171609 "B82Y0005"
6171611 "A61K0009"
6171612 "A61K0047"
6171613 "A61K0009"
6171614 "A61K0009"
6171614 "A61K0047"
6171615 "A61K0009"
6171616 "A61K0009"
6171617 "A61K0009"
6171618 "A61K0009"
6171619 "A61K0009"
6171859 "A61K0047"
6172040 "A61K0047"
6172045 "A61K0047"
6172045 "B82Y0005"
6172048 "A61K0009"
6172048 "A61K0047"
6172049 "A61K0009"
6172049 "A61K0047"
6172089 "A61K0009"
6172090 "A61K0009"
6172107 "A61K0009"
6172202 "A61K0047"
6172208 "A61K0047"
6172219 "A61K0009"
6172219 "A61K0047"
6172261 "A61K0047"
6174517 "B82Y0005"
6174518 "A61K0009"
6174524 "A61K0009"
6174529 "A61K0009"
6174529 "A61K0047"
6174530 "A61K0047"
6174533 "B82Y0005"
6174539 "A61K0047"
6174540 "A61K0009"
6174540 "A61K0047"
6174543 "A61K0009"
6174543 "A61K0047"
6174545 "A61K0009"
6174546 "A61K0009"
6174547 "A61K0009"
6174548 "A61K0009"
6174687 "A61K0047"
6174723 "A61K0047"
6174856 "A61K0009"
6174856 "A61K0047"
6174858 "A61K0047"
6174866 "A61K0009"
6174873 "A61K0009"
6174873 "A61K0047"
6174902 "A61K0009"
6174904 "A61K0009"
6174999 "A61K0009"
6174999 "A61K0047"
6175054 "A61K0009"
6177059 "A61K0009"
6177059 "A61K0047"
6177062 "B82Y0005"
6177068 "A61K0009"
6177074 "A61K0047"
6177076 "A61K0009"
6177087 "A61K0047"
6177095 "A61K0009"
6177096 "A61K0009"
6177098 "A61K0009"
6177099 "A61K0009"
6177100 "B82Y0005"
6177101 "A61K0009"
6177102 "A61K0009"
6177103 "A61K0009"
6177104 "A61K0009"
end

Last edited by Daley Gay; 15 Dec 2021, 23:16.

Tags: duplicate, duplicates, egen, Generate, sort

Fei Wang

Join Date: Oct 2021

Posts: 726
#2

15 Dec 2021, 22:56

Reshape wide is what you need.

Code:

bys GrantNumber (CPC): gen j = _n reshape wide CPC, i(GrantNumber) j(j)
1 like
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

16 Dec 2021, 08:55

While I agree that the reshape command is the tool you need, I think that you can improve upon the objective that you describe in post #1, since you only have three possible values of CPC.

Consider the following example, where I have used the data you actually presented in your explanation to create a useful dataex sample to demonstrate the results. (You could have run the command

Code:

dataex if GrantNumber>=6171609 & GrantNumber<=6171614

to create this sample.)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long GrantNumber str8 CPC
6171609 "A61K0009"
6171609 "A61K0047"
6171609 "B82Y0005"
6171611 "A61K0009"
6171612 "A61K0047"
6171613 "A61K0009"
6171614 "A61K0009"
6171614 "A61K0047"
end
generate has = 1
reshape wide has, i(GrantNumber) j(CPC) string
mvencode has*, mv(0)
rename (has*) (*)
list, abbreviate(16)

The result is one indicator variable for each CPC that is 1 if the specific CPC was present and 0 otherwise.

Code:

. generate has = 1

. reshape wide has, i(GrantNumber) j(CPC) string
(j = A61K0009 A61K0047 B82Y0005)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations                8   ->   5           
Number of variables                   3   ->   4           
j variable (3 values)               CPC   ->   (dropped)
xij variables:
                                    has   ->   hasA61K0009 hasA61K0047 hasB82Y0005
-----------------------------------------------------------------------------

. mvencode has*, mv(0)
 hasA61K0009: 1 missing value recoded
 hasA61K0047: 2 missing values recoded
 hasB82Y0005: 4 missing values recoded

. rename (has*) (*)

. list, abbreviate(16)

     +----------------------------------------------+
     | GrantNumber   A61K0009   A61K0047   B82Y0005 |
     |----------------------------------------------|
  1. |     6171609          1          1          1 |
  2. |     6171611          1          0          0 |
  3. |     6171612          0          1          0 |
  4. |     6171613          1          0          0 |
  5. |     6171614          1          1          0 |
     +----------------------------------------------+

.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

16 Dec 2021, 10:58

Fei Wang and William Lisowski have given you good advice how to achieve what you asked. So I'll spoil the party and point out that you may regret getting what you wished for. Your original data are in long layout and you will have transformed it to wide. While this makes it easier for humans to read, and is the conventional organization of such data in spreadsheets, the wide layout is generally less useful in Stata. While some commands work well with wide data in Stata, they are a minority. Most Stata analysis or data management commands work best, or only, with the long layout you started with in the first place. It will not surprise me if your next Statalist post is a question about how to do something with this data, the answer to which begins with going back to the original long layout.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

16 Dec 2021, 14:04

While I normally agree with Clyde's recommendation of the long layout, my advice was based on the following points in post #1:
Only a single variable with three possible values was identified as differing across the observations

The poster is new to Stata, and working with a long layout requires some skill at thinking "across observations"

For this task, the bulleted categories requested in post #1 can be coded in a straightforward fashion in the wide layout

Code:

generate byte code = 0 replace code = 1 if A61K0009 & A61K0047 & !B82Y0005 replace code = 2 if A61K0009 & !A61K0047 & B82Y0005 replace code = 3 if !A61K0009 & A61K0047 & B82Y0005 replace code = 4 if A61K0009 & A61K0047 & B82Y0005 label define CODE 0 "1 CPC" /// 1 "A61K0009 & A61K0047" /// 2 "A61K0009 & B82Y0005" /// 3 "A61K0047 & B82Y0005" /// 4 "3 CPCs" label values code CODE

Code:

. list, abbreviate(16) +--------------------------------------------------------------------+ | GrantNumber A61K0009 A61K0047 B82Y0005 code | |--------------------------------------------------------------------| 1. | 6171609 1 1 1 3 CPCs | 2. | 6171611 1 0 0 1 CPC | 3. | 6171612 0 1 0 1 CPC | 4. | 6171613 1 0 0 1 CPC | 5. | 6171614 1 1 0 A61K0009 & A61K0047 | +--------------------------------------------------------------------+ . list, abbreviate(16) nolabel +-----------------------------------------------------+ | GrantNumber A61K0009 A61K0047 B82Y0005 code | |-----------------------------------------------------| 1. | 6171609 1 1 1 4 | 2. | 6171611 1 0 0 0 | 3. | 6171612 0 1 0 0 | 4. | 6171613 1 0 0 0 | 5. | 6171614 1 1 0 1 | +-----------------------------------------------------+

With that said, let me add the following advice that I routinely give to posters who identify as being new to Stata.

I'm sympathetic to you as a new user of Stata - there is quite a lot to absorb.

When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.

The objective in doing the reading was not so much to master Stata - I'm still far from that goal - as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.

Stata also supples YouTube videos, if that's your thing.
Comment

Announcement

Merging Duplicates without Deleting

Comment

Comment

Comment

Comment