Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Disappearing ids while matching with rowrange

    Hello,

    The aim of my code below is to extract patent information contained in the 'tls201_part01.csv' 'tls201_part02.csv' 'tls201_part03.csv' files for a list of patent applications collected in 'temp_209_PVblock.dta'. I am performing the matching based on the application identifier 'appln_id', which uniquely identifies observations in both 'temp_209_PVblock.dta' and the 'tls201' files. Every 'appln_id' that is contained in 'temp_209_PVblock.dta' should be contained in 'tls201' (I cannot directly check this easily, but I am fairly confident that it is true).
    Normally, I would just import the csv files and match these with 'temp_209_PVblock.dta' - keeping only the matched elements - however, the csvs are very large (with around 50 million observations each), which prompted me to carry out the matching process in bits with the rowrange option.
    The problem that I am experiencing is once I run the full code below, it seems that 'temp_209_PVblock.dta' contains application identifiers for which no matches were created. Do you have any ideas how could this happen?

    Thank you,
    Gabor

    Code:
    local N=1
    local i=1
    quietly {
    while `N'>0 {
    local k=`i'*1000000
    local b=`k'-999999
    import delimited using "raw/tls201_part01.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
    local N=_N
    merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
    save "temp/temp_201_PVblock`i'_full.dta", replace
    local i=`i'+1
    global end_section=`i'
    }
    }
    
    local N=1
    local i=1
    
    quietly {
    while `N'>0 {
    local k=`i'*1000000
    local b=`k'-999999
    import delimited using "raw/tls201_part02.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
    local N=_N
    merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
    local j=`i'+$end_section
    save "temp/temp_201_PVblock`j'_full.dta", replace
    local i=`i'+1
    global end=`j'
    }
    }
    
    *52
    global end_section=$end
    
    local N=1
    local i=1
    
    quietly {
    while `N'>0 {
    local k=`i'*1000000
    local b=`k'-999999
    import delimited using "raw/tls201_part03.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
    local N=_N
    merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
    local j=`i'+$end_section
    save "temp/temp_201_PVblock`j'_full.dta", replace
    local i=`i'+1
    }
    }
    
    use "temp/temp_201_PVblock1_full.dta", clear
    forval i=2/118 {
    append using "temp/temp_201_PVblock`i'_full.dta"
    }
    save "temp/temp_201_USPVblock_prefamily.dta", replace
    Here you can see the values taken by 'appln_id':

    Code:
    . su appln_id, det
    
                              appln_id
    -------------------------------------------------------------
          Percentiles      Smallest
     1%      2680387            240
     5%     1.40e+07            376
    10%     2.61e+07            636       Obs             794,245
    25%     3.95e+07            647       Sum of wgt.     794,245
    
    50%     3.77e+08                      Mean           2.95e+08
                            Largest       Std. dev.      2.06e+08
    75%     4.81e+08       5.45e+08
    90%     5.19e+08       5.45e+08       Variance       4.23e+16
    95%     5.32e+08       5.45e+08       Skewness       -.356266
    99%     5.42e+08       5.45e+08       Kurtosis       1.363318

  • #2
    This is now resolved. The problem is unrelated to the matching, sorry.

    Comment

    Working...
    X