-collapse-: is there a way to come back with data cleaned with collapse to the whole disaggregated data?

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 492

-collapse-: is there a way to come back with data cleaned with collapse to the whole disaggregated data?

20 Feb 2024, 11:45

Hi everyone,

Hello,

I would like to know if it is possible to retrieve the changes made during a collapse of the initial database.

This is a special case using the SSC -strgroup()- command. It uses the Levenstein distance. Here is my code:

Code:

// --- //
// T ---
cd "${path}/stata/demand"
use "cardesc15-19_fordemand_cleaned_withenginecap.dta", clear

keep if description == "T"
replace model = itrim(trim(model))
gen num = 1
collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
tab model [aw=num]
sort model


strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force

sort similar_model1


replace model = "4 RUNNER" if inrange(similar_model1, 2, 3)

replace model = "GT86" if similar_model1 == 4 ///
                            | inrange(similar_model1, 75, 76) ///
                            | similar_model1 == 148

replace model = "AURIS" if inrange(similar_model1, 5, 29) ///
                            | similar_model1 == 138 ///
                            | similar_model1 == 202

...

and I do that for all car models that I have. For example:

Code:

// P ---
cd "${path}/stata/demand"
use "cardesc15-19_fordemand_cleaned_withenginecap.dta", clear

keep if description == "P"
replace model = itrim(trim(model))
gen num = 1
collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
tab model [aw=num]
sort model

strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force

sort similar_model1


split model, parse(" ") g(split_)

replace model = split_1 if inrange(similar_model1, 1, 2)
replace model = split_1 if inrange(similar_model1, 8, 71)

replace model = split_1 if inrange(similar_model1, 78, 661)
replace model = split_2 if inrange(similar_model1, 1116, 1123)

replace model = "108" if inrange(similar_model1, 1, 661) & ustrpos(model, "108")>0 ///
                    | inrange(similar_model1, 761, 1026) & ustrpos(model, "108")>0 

...

Is there a way to retrieve those changes applied to my basic dataset?

Here is my basic dataset, w/o cleaning anything. Let say for T car model:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long(weight_max muni_code)
"T" "AURIS"             "0" 1798 12.49 1815 28005
"T" "T COROLLA"    "0" 1987 13.27 1910 11020
"T" "T YARIS"      "0" 1329 10.42 1490 38006
"T" "T PRIUS PLUS" "0" 1798 12.49 2115  8019
"T" "T RAV4"       "1" 1998 13.31 2135  4052
end

And here the changed dataset with collapse and cleaned:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long weight_max double num long similar_model1
"T" "4 RUNNER" "0" 3956    24 2858 1 2
"T" "4 RUNNER" "0" 3956  23.9 2575 3 2
"T" "4 RUNNER" "1" 4000    24 2798 1 2
"T" "4 RUNNER" "0" 2366  14.9 2100 1 2
"T" "4 RUNNER" "0" 3956  23.9 2575 5 2
"T" "4 RUNNER" "0" 4000    24 2743 1 2
"T" "4 RUNNER" "0" 3400  21.8 2308 1 2
"T" "4 RUNNER" "0" 3955    24 2440 3 2
"T" "4 RUNNER" "0" 3956  23.9 2766 2 2
"T" "4 RUNNER" "0" 3955 23.59 2600 1 3
end

Could anyone provide me some help, please?
I really wouldn't want to have to do all that cleaning work again. Thanks in advance.

Best,

Michael

Tags: None

George Ford

Join Date: Aug 2014

Posts: 3043
#2

20 Feb 2024, 12:53

Save the collapsed results and then joinby the results on the relevant variables?
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#3

21 Feb 2024, 07:40

Hi George Ford,

Thank you for your help! I have two questions, please:
What did you mean by joinby, please?

When you wrote "save", did you mean using -tempfile-, for example?

Thank you in advance for your answer.
Lovely day!

Michael
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3407
#4

21 Feb 2024, 08:20

joinby means the joinby command
save means the save command. You can use a tempfile if you want, but that is not necessary

If you don't know that a command like joinby does, then you type in Stata help joinby . Don't forget that the pdf manual contains more detailed information than the helpfile, so if you don't understand the help-file, look at the pdf-manual entry for that command. A link to that entry is at the top of the help-file.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#5

21 Feb 2024, 09:20

Hi Maarten Buis,

Thank you very much for your help!
I will have a look into it.

Lovely day!

Michael
Comment
George Ford

Join Date: Aug 2014

Posts: 3043
#6

21 Feb 2024, 10:05

What Maarten said--sorry for the vague reply.

Collapse the data and save the results to collapsed.dta.

Then

joinby id1 id2 (or as many as you need) using collapsed.dta

I prefer joinby to merge for nearly all purposes.
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#7

23 Feb 2024, 08:45

Hi George Ford,

No worries! Thank you for taking the time to answer me.
Thank you also for your suggestion. I'll take a look at it soon, and get back to you if I hit a bone.

Lovely day (and weekend!).
Best,

Michael
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#8

23 Feb 2024, 09:36

I have been looking into -joinby-:

I was not aware of this command. Thanks for sharing this Maarten and George!

As a first attempt, I used the following code:

Code:

clear* use "cleaned_models.dta", clear joinby description COD_PROPULSION cilindrada potenciafiscal weight_max using "C:/Users/miduarte/Desktop/Ongoing_Projects/LEZ_Project/Data/stata/demand/cardesc15-19_fordemand_uncleaned.dta"

Here are a few explanations to help you understand what I want an do:
I have an uncleaned, unaggregated database with several car models registered between 2015 and 2019. Several models are the same but simply misspelt.

I then use -collapse- on my database to get the number of models according to how they are written, according to several key characteristics: cubic capacity, weight, tax weight and fuel. The number of models with a given model's name and those characteristics are expressed by the variable -num- in my post #1. To clean, I use the SSC -strgroup- command, as many models are the same, but misspelt.

I then clean this data and obtain a second dataset with harmonised model names. However, I would like to be able to go back to my using dataset, but with the clean models.
After applying the code above, I end up with a total number of observations of 8,129,491, whereas this should only have 4,004,455 observations.
Does anyone have an explanation? It could be that several models have (i) the same displacement, (ii) tax weight, (iii) weight (iv) fuel, in case that might play a part.

Finally, if it helps, here are two examples of my data sets. Let's label dataset master "cleaned_models.dta", and using "cardesc15-19_fordemand_uncleaned.dta", as above:

cardesc15-19_fordemand_uncleaned.dta:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str23 description str31 model long cilindrada double potenciafiscal long weight_max str1 COD_PROPULSION "DACIA" "SANDERO" 898 7.34 1550 "0" "DACIA" "LODGY" 1598 11.64 1842 "0" "DACIA" "SANDERO" 1461 11.02 1590 "1" "DACIA" "DUSTER" 1461 11.02 1785 "1" "DACIA" "LODGY" 1197 9.78 1810 "0" "DACIA" "LODGY" 1461 11.02 1958 "1" end

and cleaned_models.dta:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str23 description str39 model long cilindrada double potenciafiscal long weight_max str1 COD_PROPULSION double num "DACIA" "DUSTER" 1461 11.02 1771 "1" 1739 "DACIA" "DUSTER" 1461 11.02 1775 "1" 4286 "DACIA" "DUSTER" 1461 11.02 1780 "1" 10 "DACIA" "DUSTER" 1461 11.02 1785 "1" 3655 "DACIA" "DUSTER" 1461 11.02 1795 "1" 10945 end

Thank you in advance for your help!
Best,

Michael

Edit: I also used the option -unmatched(none)-, but it gives me the same number of observations, in other terms: 8,129,491.

Last edited by Michael Duarte Goncalves; 23 Feb 2024, 09:41.
Comment
George Ford

Join Date: Aug 2014

Posts: 3043
#9

23 Feb 2024, 09:45

joinby will match everything that matches, so your join variables are not unique to an observation. Some of your rows are matching multiple rows in the using file.

I'd check for duplicates in the using file to search for why that's happening. There may be a simple solution, or not.

Also see

HTML Code:

https://libguides.princeton.edu/fuzzy-merge
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#10

23 Feb 2024, 09:49

Hi George,

Thank you for your prompt reply!
Thank you very much for the link. However, my PIs are not very keen on -matchit- and -reclink-, preferring me to clean up the data as much as possible, and then do the merge afterwards.

Thanks again for your help. I'll see what I can do with what I've got.

Best,

Michael
Comment
George Ford

Join Date: Aug 2014

Posts: 3043
#11

23 Feb 2024, 09:53

The other option is to try to clean up the mis-spellings, but I've worked with car registration data before and expect that you'll have a lot to clean up.

If there are not too many models, you may be able to search for a unique part of the name that will identify cleanly. How many ways can Camry be mis-spelled?
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#12

23 Feb 2024, 10:00

Thanks for the suggestion, George! Seems a good idea indeed!
Yes, there is a lot of cleaning! And many models, too...

I'll look into it and keep you posted.
Thanks a lot!

Michael
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#13

23 Feb 2024, 10:54

Hi George Ford,

A small update about what you suggested me in #11:
I looked for a unique part of the name that identified cleanly models... And everything works perfectly!

Thank you very much again and again for your help. I didn't know this command -joinby-. And it is very useful indeed. I will use it more often from now on.

All the best,

Michael
Comment
George Ford

Join Date: Aug 2014

Posts: 3043
#14

24 Feb 2024, 16:19

Once you use joinby, you'll never go back.

Glad it worked.
1 like
Comment

Announcement

-collapse-: is there a way to come back with data cleaned with collapse to the whole disaggregated data?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment