Hello,
I would like to know whether it is possible to retrieve the changes made after a -collapse- of the initial dataset.
This is a special case that uses the SSC command -strgroup-, which relies on the Levenshtein distance. Here is my code:
Code:
// ---
//
// T ---
cd "${path}/stata/demand"
use "cardesc15-19_fordemand_cleaned_withenginecap.dta", clear
keep if description == "T"
replace model = itrim(trim(model))
gen num = 1
collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
tab model [aw=num]
sort model
strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
sort similar_model1
replace model = "4 RUNNER" if inrange(similar_model1, 2, 3)
replace model = "GT86" if similar_model1 == 4 ///
    | inrange(similar_model1, 75, 76) ///
    | similar_model1 == 148
replace model = "AURIS" if inrange(similar_model1, 5, 29) ///
    | similar_model1 == 138 ///
    | similar_model1 == 202
...
I do the same for every car model that I have. For example:
Code:
// P ---
cd "${path}/stata/demand"
use "cardesc15-19_fordemand_cleaned_withenginecap.dta", clear
keep if description == "P"
replace model = itrim(trim(model))
gen num = 1
collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
tab model [aw=num]
sort model
strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
sort similar_model1
split model, parse(" ") g(split_)
replace model = split_1 if inrange(similar_model1, 1, 2)
replace model = split_1 if inrange(similar_model1, 8, 71)
replace model = split_1 if inrange(similar_model1, 78, 661)
replace model = split_2 if inrange(similar_model1, 1116, 1123)
replace model = "108" if inrange(similar_model1, 1, 661) & ustrpos(model, "108")>0 ///
    | inrange(similar_model1, 761, 1026) & ustrpos(model, "108")>0
...
Is there a way to apply those changes back to my basic dataset?
Here is my basic dataset, without any cleaning. Let's say for the T car model:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long(weight_max muni_code)
"T" "AURIS" "0" 1798 12.49 1815 28005
"T" "T COROLLA" "0" 1987 13.27 1910 11020
"T" "T YARIS" "0" 1329 10.42 1490 38006
"T" "T PRIUS PLUS" "0" 1798 12.49 2115 8019
"T" "T RAV4" "1" 1998 13.31 2135 4052
end
And here is the changed dataset, collapsed and cleaned:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long weight_max double num long similar_model1
"T" "4 RUNNER" "0" 3956 24 2858 1 2
"T" "4 RUNNER" "0" 3956 23.9 2575 3 2
"T" "4 RUNNER" "1" 4000 24 2798 1 2
"T" "4 RUNNER" "0" 2366 14.9 2100 1 2
"T" "4 RUNNER" "0" 3956 23.9 2575 5 2
"T" "4 RUNNER" "0" 4000 24 2743 1 2
"T" "4 RUNNER" "0" 3400 21.8 2308 1 2
"T" "4 RUNNER" "0" 3955 24 2440 3 2
"T" "4 RUNNER" "0" 3956 23.9 2766 2 2
"T" "4 RUNNER" "0" 3955 23.59 2600 1 3
end
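To make the goal concrete, here is a minimal sketch of one way the mapping might be kept, assuming the cleaning do-file can simply be re-run rather than rewritten. The idea is to copy the untouched model string right after -collapse-, let the existing -replace- rules run, save the result as a crosswalk, and -merge m:1- it back onto the row-level data. The toy data, the two stand-in -replace- rules, and the names model_orig, model_clean, raw and xwalk are hypothetical and only for illustration; the by-list is shortened here and would need to match the one in the real -collapse-.

```stata
* Sketch only: toy data; model_orig, model_clean, raw, xwalk are hypothetical names.
clear
input str1 description str31 model str1 COD_PROPULSION long cilindrada
"T" "T AURIS"       "0" 1798
"T" "AURIS  HYBRID" "0" 1798
"T" "T YARIS"       "0" 1329
"T" "T AURIS"       "0" 1798
end
replace model = itrim(trim(model))   // same normalisation as the cleaning script
tempfile raw
save `raw'

* Collapse as in the cleaning script, then keep a copy of the untouched string.
gen num = 1
collapse (sum) num, by(description model COD_PROPULSION cilindrada)
gen model_orig = model

* The -strgroup- call and the existing -replace model = ...- rules would go here.
replace model = "AURIS" if strpos(model, "AURIS") > 0   // stand-in rule
replace model = "YARIS" if strpos(model, "YARIS") > 0   // stand-in rule

* Crosswalk: one row per original key combination, carrying the cleaned name.
rename model model_clean
rename model_orig model
keep description model COD_PROPULSION cilindrada model_clean
isid description model COD_PROPULSION cilindrada
tempfile xwalk
save `xwalk'

* Attach the cleaned names to the row-level data.
use `raw', clear
merge m:1 description model COD_PROPULSION cilindrada using `xwalk', ///
    assert(match) nogenerate
list model model_clean
```

Because the crosswalk records the outcome of the rules rather than the rules themselves, it should not matter that the -replace- statements refer to -strgroup- group numbers, as long as the script is run on the same input. The -merge- only matches if the row-level data receive the same itrim(trim()) treatment before merging.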
Could anyone help me with this, please?
I would really rather not have to redo all that cleaning work. Thanks in advance.
Best,
Michael