Hi there,
I have a dataset on patents and their respective subclasses. Over a large number of inventors and years, my sample totals nearly 4 million patents. Each of them can have various subclasses.
I want to identify with a dummy variable when a patent subclass first appears in an inventor's portfolio.
Here is an example with two inventors. Each row corresponds to a different patent. The data has been sorted by date (which is in Stata format), showing the first 3 subclasses for each patent.
The last three variables are the ones I am trying to create.
For example, in the last observation (i.e. most recent patent for the second inventor) the three dummy variables take the value of 0 because:
- for dummy wanted_1: subclass 166.007 showed up in the inventor's previous patent under class_subclass1
- for dummy wanted_2: subclass 119.0399 showed up in the inventor's third patent (4203167) under class_subclass1
- for dummy wanted_3: subclass 448.0214 showed up in the inventor's second patent as its only subclass.
The complexity here is due to the fact that the code needs to search over the three variables in all of the inventor's previous patents, then assign 0 or 1 in a specific cell.
I have tried running all sorts of loops and using countmatch/rangestat, to no avail.
Thanks for your help
I have a dataset on patents and their respective subclasses. Over a large number of inventors and years, my sample totals nearly 4 million patents. Each of them can have various subclasses.
I want to identify with a dummy variable when a patent subclass first appears in an inventor's portfolio.
Here is an example with two inventors. Each row corresponds to a different patent. The data has been sorted by date (which is in Stata format), showing the first 3 subclasses for each patent.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str11 inventor_id str8 patent_id float(date_stata class_subclass1 class_subclass2 class_subclass3 wanted_1 wanted_2 wanted_3) "3930300-1" "4255207" 7038 132.0609 132.1129 132.0469 1 1 1 "3930300-1" "4272753" 7230 193.0214 . . 1 . . "3930300-1" "4255209" 7294 132.0541 132.0609 132.0529 1 0 1 "3930300-1" "4670769" 7602 132.0469 132.1401 132.047 0 1 1 "3930300-1" "4900689" 10569 132.0541 132.0609 132.1393 0 0 1 "3930309-1" "4211003" 6781 166.0079 104.0072 443.0042 1 1 1 "3930309-1" "D257612" 6797 448.0214 . . 1 . . "3930309-1" "4203167" 6909 119.0399 81.0251 81.0435 1 1 1 "3930309-1" "4265017" 7189 166.0079 . . 0 . . "3930309-1" "4442559" 7853 166.007 385.0057 . 1 1 . "3930309-1" "4389775" 7873 166.007 119.0399 448.0214 0 0 0 end
The last three variables are the ones I am trying to create.
For example, in the last observation (i.e. most recent patent for the second inventor) the three dummy variables take the value of 0 because:
- for dummy wanted_1: subclass 166.007 showed up in the inventor's previous patent under class_subclass1
- for dummy wanted_2: subclass 119.0399 showed up in the inventor's third patent (4203167) under class_subclass1
- for dummy wanted_3: subclass 448.0214 showed up in the inventor's second patent as its only subclass.
The complexity here is due to the fact that the code needs to search over the three variables in all of the inventor's previous patents, then assign 0 or 1 in a specific cell.
I have tried running all sorts of loops and using countmatch/rangestat, to no avail.
Thanks for your help
Comment