Hi Statalist,
I have to merge two datasets, based on different characteristics. But first, I would like to have the same string variables between both datasets.
Those are datasets about i. Car registrations in Spain between 2015-2019, ii. new car prices between the same period.
My dataset 1 (i. above) has car models' names cleaned:
My second dataset (ii. above) needs some cleaning:
I'd like the string variable -model- in dataset 2 to be written in the same way as the string variable -model- in dataset 1.
This case is fairly simple, but other cases are more complex.
I started by doing this (SSC -strgroup- by Julian Reif):
But this will take far too much time which I can't afford to waste. Any suggestions please?
Thanks in advance!
Edit: My peers don't want to use fuzzy merge. So ideally these variables should have the same name. The final idea is to produce a merge between (i) car registrations database and (ii) car prices, based on several characteristics.
I may have noticed something useful: it's often the first two or three "words" (so a series of strings, separated by a space, another series of strings, separated by another space). It's actually a single space, because I've cleaned it up beforehand using -chartab-.
I have to merge two datasets, based on different characteristics. But first, I would like to have the same string variables between both datasets.
Those are datasets about i. Car registrations in Spain between 2015-2019, ii. new car prices between the same period.
My dataset 1 (i. above) has car models' names cleaned:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str23 brand str39 model str1 fuel long engine_cap double fiscal_power str22 model_dup_original "DACIA" "SANDERO" "0" 999 7.82 "SANDERO" "DACIA" "SANDERO" "0" 898 7.34 "SANDERO" "DACIA" "SANDERO" "0" 1149 9.54 "SANDERO" "DACIA" "SANDERO" "0" 898 7.34 "SANDERO" "DACIA" "SANDERO" "1" 1461 11.02 "SANDERO" end
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str13 brand str147 model int(period_unique engine_cap) str4 fuel str11 pkw double fiscal_power long price "DACIA" "SANDERO 0.9 TCE GLP Ambiance 90" 2016 898 "S" "66" 7.34 8000 "DACIA" "SANDERO 0.9 TCE GLP Laureate 90" 2016 898 "S" "66" 7.34 8500 "DACIA" "SANDERO 0.9 TCE GLP Stepway 90" 2018 898 "S" "66" 7.34 7900 "DACIA" "SANDERO 0.9 TCE Laureate 90" 2016 898 "G" "66" 7.34 7900 "DACIA" "SANDERO 0.9 TCE Laureate Easy-R 90" 2016 898 "G" "66" 7.34 8300 end
I'd like the string variable -model- in dataset 2 to be written in the same way as the string variable -model- in dataset 1.
This case is fairly simple, but other cases are more complex.
I started by doing this (SSC -strgroup- by Julian Reif):
Code:
strgroup model, generate(similar_model) threshold(0.15) first normalize(shorter) force
But this will take far too much time which I can't afford to waste. Any suggestions please?
Thanks in advance!
Edit: My peers don't want to use fuzzy merge. So ideally these variables should have the same name. The final idea is to produce a merge between (i) car registrations database and (ii) car prices, based on several characteristics.
I may have noticed something useful: it's often the first two or three "words" (so a series of strings, separated by a space, another series of strings, separated by another space). It's actually a single space, because I've cleaned it up beforehand using -chartab-.
Comment