Hello,
The aim of my code below is to extract patent information contained in the 'tls201_part01.csv' 'tls201_part02.csv' 'tls201_part03.csv' files for a list of patent applications collected in 'temp_209_PVblock.dta'. I am performing the matching based on the application identifier 'appln_id', which uniquely identifies observations in both 'temp_209_PVblock.dta' and the 'tls201' files. Every 'appln_id' that is contained in 'temp_209_PVblock.dta' should be contained in 'tls201' (I cannot directly check this easily, but I am fairly confident that it is true).
Normally, I would just import the csv files and match these with 'temp_209_PVblock.dta' - keeping only the matched elements - however, the csvs are very large (with around 50 million observations each), which prompted me to carry out the matching process in bits with the rowrange option.
The problem that I am experiencing is once I run the full code below, it seems that 'temp_209_PVblock.dta' contains application identifiers for which no matches were created. Do you have any ideas how could this happen?
Thank you,
Gabor
Here you can see the values taken by 'appln_id':
The aim of my code below is to extract patent information contained in the 'tls201_part01.csv' 'tls201_part02.csv' 'tls201_part03.csv' files for a list of patent applications collected in 'temp_209_PVblock.dta'. I am performing the matching based on the application identifier 'appln_id', which uniquely identifies observations in both 'temp_209_PVblock.dta' and the 'tls201' files. Every 'appln_id' that is contained in 'temp_209_PVblock.dta' should be contained in 'tls201' (I cannot directly check this easily, but I am fairly confident that it is true).
Normally, I would just import the csv files and match these with 'temp_209_PVblock.dta' - keeping only the matched elements - however, the csvs are very large (with around 50 million observations each), which prompted me to carry out the matching process in bits with the rowrange option.
The problem that I am experiencing is once I run the full code below, it seems that 'temp_209_PVblock.dta' contains application identifiers for which no matches were created. Do you have any ideas how could this happen?
Thank you,
Gabor
Code:
local N=1 local i=1 quietly { while `N'>0 { local k=`i'*1000000 local b=`k'-999999 import delimited using "raw/tls201_part01.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k') local N=_N merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate save "temp/temp_201_PVblock`i'_full.dta", replace local i=`i'+1 global end_section=`i' } } local N=1 local i=1 quietly { while `N'>0 { local k=`i'*1000000 local b=`k'-999999 import delimited using "raw/tls201_part02.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k') local N=_N merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate local j=`i'+$end_section save "temp/temp_201_PVblock`j'_full.dta", replace local i=`i'+1 global end=`j' } } *52 global end_section=$end local N=1 local i=1 quietly { while `N'>0 { local k=`i'*1000000 local b=`k'-999999 import delimited using "raw/tls201_part03.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k') local N=_N merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate local j=`i'+$end_section save "temp/temp_201_PVblock`j'_full.dta", replace local i=`i'+1 } } use "temp/temp_201_PVblock1_full.dta", clear forval i=2/118 { append using "temp/temp_201_PVblock`i'_full.dta" } save "temp/temp_201_USPVblock_prefamily.dta", replace
Code:
. su appln_id, det appln_id ------------------------------------------------------------- Percentiles Smallest 1% 2680387 240 5% 1.40e+07 376 10% 2.61e+07 636 Obs 794,245 25% 3.95e+07 647 Sum of wgt. 794,245 50% 3.77e+08 Mean 2.95e+08 Largest Std. dev. 2.06e+08 75% 4.81e+08 5.45e+08 90% 5.19e+08 5.45e+08 Variance 4.23e+16 95% 5.32e+08 5.45e+08 Skewness -.356266 99% 5.42e+08 5.45e+08 Kurtosis 1.363318
Comment