Hello there. I am a graduate student and relatively new to STATA, but I have found posts on this forum very helpful in the past. I am hoping you can help with my current problem.
I am working with a very "cleaned" (for sake of clarification, a very limited) set of patent data. Patents are given a uniquely-identifying grant number (var: GrantNumber) and are classified in CPC codes (var: CPC). However, patents may be (and are often) filed in multiple CPC codes. In the dataset with which I am working, there are 60,172 unique grant numbers (GrantNumber), only 3 CPC codes (CPC), but 195,582 observations.
There are many unique grant numbers (GrantNumber) that are duplicated, either two or three times, depending on how many CPC codes in which that patent is classified. For an example, a portion of my data looks like this:
I want my data to look like this:
Then, I could simply delete the duplicate GrantNumbers, leaving a final result of:
I've already performed the following commands, but none of them accomplish what I am trying to accomplish:
Per the rules of the Statalist Forum, my data looks like this:
I am working with a very "cleaned" (for sake of clarification, a very limited) set of patent data. Patents are given a uniquely-identifying grant number (var: GrantNumber) and are classified in CPC codes (var: CPC). However, patents may be (and are often) filed in multiple CPC codes. In the dataset with which I am working, there are 60,172 unique grant numbers (GrantNumber), only 3 CPC codes (CPC), but 195,582 observations.
There are many unique grant numbers (GrantNumber) that are duplicated, either two or three times, depending on how many CPC codes in which that patent is classified. For an example, a portion of my data looks like this:
Grant Number | CPC |
6171609 | A61K0009 |
6171609 | A61K0047 |
6171609 | B82Y0005 |
6171611 | A61K0009 |
6171612 | A61K0047 |
6171613 | A61K0009 |
6171614 | A61K0009 |
6171614 | A61K0047 |
I want my data to look like this:
GrantNumber | CPC | FirstCPC | SecondCPC | ThirdCPC |
6171609 | A61K0009 | A61K0009 | A61K0047 | B82Y0005 |
6171609 | A61K0047 | . | . | . |
6171609 | B82Y0005 | . | . | . |
6171611 | A61K0009 | A61K0009 | . | . |
6171612 | A61K0047 | A61K0047 | . | . |
6171613 | A61K0009 | A61K0009 | . | . |
6171614 | A61K0009 | A61K0009 | A61K0047 | . |
6171614 | A61K0047 | . | . | . |
Then, I could simply delete the duplicate GrantNumbers, leaving a final result of:
GrantNumber | CPC | FirstCPC | SecondCPC | ThirdCPC |
6171609 | A61K0009 | A61K0009 | A61K0047 | B82Y0005 |
6171611 | A61K0009 | A61K0009 | . | . |
6171612 | A61K0047 | A61K0047 | . | . |
6171613 | A61K0009 | A61K0009 | . | . |
6171614 | A61K0009 | A61K0009 | A61K0047 | . |
I've already performed the following commands, but none of them accomplish what I am trying to accomplish:
- egen GrantNumber_count = count(GrantNumber), by(CPC) --- Generates a variable that shows the count of GrantNumber for each of the three CPC codes. Helpful, but not what I am looking for.
- egen CPC_count = count(CPC), by(GrantNumber) --- Generates a variable that shows the count of CPC codes for each GrantNumber. Again, helpful and closer to what I want, but not what I am looking for.
- The following group of commands:
- sort GrantNumber --- To prepare for generating a duplicate variable.
- quietly by GrantNumber: gen dup = cond(_N==1, 0, _n) --- Generates a variable that indicates which observations are duplicates (dup==2 and dup==3), so that I can eventually drop these duplicates like I mentioned above.
- Only Two CPCs: A61K0009 and A61K0047
- Only Two CPCs: A61K0009 and B82Y0005
- Only Two CPCs: A61K0047 and B82Y0005
- All Three CPCs: A61K0009 and A61K0047 and B82Y0005
Per the rules of the Statalist Forum, my data looks like this:
Code:
* Example generated by -dataex-. To install: ssc install dataex
` dataex GrantNumber CPC
clear
input long GrantNumber str8 CPC
6167880 "A61K0009"
6167887 "A61K0009"
6168778 "A61K0047"
6168796 "A61K0047"
6168801 "A61K0009"
6168804 "A61K0009"
6168805 "A61K0009"
6168806 "A61K0009"
6169068 "A61K0009"
6169078 "A61K0047"
6169083 "A61K0009"
6169083 "A61K0047"
6169084 "A61K0009"
6169084 "A61K0047"
6169099 "A61K0009"
6169100 "A61K0009"
6169117 "A61K0009"
6169130 "A61K0009"
6169194 "B82Y0005"
6170717 "A61K0009"
6171577 "A61K0047"
6171580 "B82Y0005"
6171591 "A61K0047"
6171599 "A61K0009"
6171600 "A61K0009"
6171606 "A61K0009"
6171609 "A61K0009"
6171609 "A61K0047"
6171609 "B82Y0005"
6171611 "A61K0009"
6171612 "A61K0047"
6171613 "A61K0009"
6171614 "A61K0009"
6171614 "A61K0047"
6171615 "A61K0009"
6171616 "A61K0009"
6171617 "A61K0009"
6171618 "A61K0009"
6171619 "A61K0009"
6171859 "A61K0047"
6172040 "A61K0047"
6172045 "A61K0047"
6172045 "B82Y0005"
6172048 "A61K0009"
6172048 "A61K0047"
6172049 "A61K0009"
6172049 "A61K0047"
6172089 "A61K0009"
6172090 "A61K0009"
6172107 "A61K0009"
6172202 "A61K0047"
6172208 "A61K0047"
6172219 "A61K0009"
6172219 "A61K0047"
6172261 "A61K0047"
6174517 "B82Y0005"
6174518 "A61K0009"
6174524 "A61K0009"
6174529 "A61K0009"
6174529 "A61K0047"
6174530 "A61K0047"
6174533 "B82Y0005"
6174539 "A61K0047"
6174540 "A61K0009"
6174540 "A61K0047"
6174543 "A61K0009"
6174543 "A61K0047"
6174545 "A61K0009"
6174546 "A61K0009"
6174547 "A61K0009"
6174548 "A61K0009"
6174687 "A61K0047"
6174723 "A61K0047"
6174856 "A61K0009"
6174856 "A61K0047"
6174858 "A61K0047"
6174866 "A61K0009"
6174873 "A61K0009"
6174873 "A61K0047"
6174902 "A61K0009"
6174904 "A61K0009"
6174999 "A61K0009"
6174999 "A61K0047"
6175054 "A61K0009"
6177059 "A61K0009"
6177059 "A61K0047"
6177062 "B82Y0005"
6177068 "A61K0009"
6177074 "A61K0047"
6177076 "A61K0009"
6177087 "A61K0047"
6177095 "A61K0009"
6177096 "A61K0009"
6177098 "A61K0009"
6177099 "A61K0009"
6177100 "B82Y0005"
6177101 "A61K0009"
6177102 "A61K0009"
6177103 "A61K0009"
6177104 "A61K0009"
end
Comment