String variables: Is there a way of transforming a string variable -model- in dataset 2 so that it takes the "values" -model- from dataset 1?

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#1

String variables: Is there a way of transforming a string variable -model- in dataset 2 so that it takes the "values" -model- from dataset 1?

18 Apr 2024, 04:57

Hi Statalist,

I have to merge two datasets, based on different characteristics. But first, I would like to have the same string variables between both datasets.

Those are datasets about i. Car registrations in Spain between 2015-2019, ii. new car prices between the same period.

My dataset 1 (i. above) has car models' names cleaned:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str23 brand str39 model str1 fuel long engine_cap double fiscal_power str22 model_dup_original "DACIA" "SANDERO" "0" 999 7.82 "SANDERO" "DACIA" "SANDERO" "0" 898 7.34 "SANDERO" "DACIA" "SANDERO" "0" 1149 9.54 "SANDERO" "DACIA" "SANDERO" "0" 898 7.34 "SANDERO" "DACIA" "SANDERO" "1" 1461 11.02 "SANDERO" end

My second dataset (ii. above) needs some cleaning:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str13 brand str147 model int(period_unique engine_cap) str4 fuel str11 pkw double fiscal_power long price "DACIA" "SANDERO 0.9 TCE GLP Ambiance 90" 2016 898 "S" "66" 7.34 8000 "DACIA" "SANDERO 0.9 TCE GLP Laureate 90" 2016 898 "S" "66" 7.34 8500 "DACIA" "SANDERO 0.9 TCE GLP Stepway 90" 2018 898 "S" "66" 7.34 7900 "DACIA" "SANDERO 0.9 TCE Laureate 90" 2016 898 "G" "66" 7.34 7900 "DACIA" "SANDERO 0.9 TCE Laureate Easy-R 90" 2016 898 "G" "66" 7.34 8300 end

I'd like the string variable -model- in dataset 2 to be written in the same way as the string variable -model- in dataset 1.

This case is fairly simple, but other cases are more complex.

I started by doing this (SSC -strgroup- by Julian Reif):

Code:

strgroup model, generate(similar_model) threshold(0.15) first normalize(shorter) force

But this will take far too much time which I can't afford to waste. Any suggestions please?

Thanks in advance!

Edit: My peers don't want to use fuzzy merge. So ideally these variables should have the same name. The final idea is to produce a merge between (i) car registrations database and (ii) car prices, based on several characteristics.

I may have noticed something useful: it's often the first two or three "words" (so a series of strings, separated by a space, another series of strings, separated by another space). It's actually a single space, because I've cleaned it up beforehand using -chartab-.

Last edited by Michael Duarte Goncalves; 18 Apr 2024, 05:13.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

18 Apr 2024, 11:38

I'm interpreting your "something useful" as meaning that the cleaned model name is often a prefix of the uncleaned name. That is, the uncleaned name consists of the cleaned name followed by other material. To fix up the situations that satisfy this condition you could do this:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str23 brand str39 model str1 fuel long engine_cap double fiscal_power str22 model_dup_original
"DACIA" "SANDERO" "0"  999  7.82 "SANDERO"
"DACIA" "SANDERO" "0"  898  7.34 "SANDERO"
"DACIA" "SANDERO" "0" 1149  9.54 "SANDERO"
"DACIA" "SANDERO" "0"  898  7.34 "SANDERO"
"DACIA" "SANDERO" "1" 1461 11.02 "SANDERO"
end

tempfile clean
save `clean'

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 brand str147 model int(period_unique engine_cap) str4 fuel str11 pkw double fiscal_power long price
"DACIA" "SANDERO 0.9 TCE GLP Ambiance 90"    2016 898 "S" "66" 7.34 8000
"DACIA" "SANDERO 0.9 TCE GLP Laureate 90"    2016 898 "S" "66" 7.34 8500
"DACIA" "SANDERO 0.9 TCE GLP Stepway 90"     2018 898 "S" "66" 7.34 7900
"DACIA" "SANDERO 0.9 TCE Laureate 90"        2016 898 "G" "66" 7.34 7900
"DACIA" "SANDERO 0.9 TCE Laureate Easy-R 90" 2016 898 "G" "66" 7.34 8300
end

tempfile unclean
save `unclean'

use `unclean', clear
keep brand model
duplicates drop
tempfile holding
save `holding'

//    CREATE A CROSSWALK
use brand model using `clean', clear
rename model model1
duplicates drop
joinby brand using `holding', unmatched(master)
drop _merge
keep if strpos(model, model1) == 1

//    NOW MERGE THIS WITH THE UNCLEAN DATA SET
merge 1:m brand model using `unclean', keep(match using) nogenerate
replace model = model1 if !missing(model1)
drop model1

I have put all of the original data and intermediate files into tempfiles. But you can use permanent data files instead if you prefer. In fact, saving the crosswalk file as a permanent file might prove useful for later work.

This will not handle any other kind of discrepancy between the two files' model nomenclature. You have already eliminated extraneous spaces, and perhaps some other characters. The other thing you should do in both of your data files is convert all of the strings to the same case. In your example data, everything is already in upper case in the cleaned file, but the uncleaned file has some lower case characters, at least in the extraneous parts. That won't hurt. But if the uncleaned file has lower case characters in the parts that are supposed to match, that will cause the code above to fail, because -strpos()- does case-sensitive comparisons.

Comment

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

19 Apr 2024, 01:16

Hello Clyde,

Thank you so much for your help! Yes exactly, it's a sort of prefix, followed by lots of other characters.

I'll implement your suggestion, and I'm sure it will work. However, I didn't know that -strpos()- was case-sensitive. Very good to know!

Thank you for your very clear explanations. Thank you again for your time.

You're really good at stata! It's always a pleasure to learn from you!

All the best,

Michael
Comment
Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#4

19 Apr 2024, 01:57

Dear Michael,

looking through your question and Clyde's answer, I got the impression that Clyde's focus here was mainly on aligning the "model" variable between both data sets and the solution using strpos is great and easy. But especially when it comes to merging the two data sets, I asked myself whether it wouldn't be better to use the variables "engine_cap" (and perhaps also "fiscal_power") because they apparently allow for more precise identification.

All the best,
Benno
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#5

19 Apr 2024, 02:22

Hi Benno Schoenberger:

Thanks for your feedback and help!

Yes, I noticed that at the end of this step:

Code:

// CREATE A CROSSWALK use brand model using `clean', clear rename model model1 duplicates drop joinby brand using `holding', unmatched(master) drop _merge keep if strpos(model, model1) == 1

Example. Those are the first four lines:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str23 brand str39 model1 str147 model "ALFA ROMEO" "4C" "4C SPIDER 1.7 TBI TCT 240" "ALFA ROMEO" "4C" "4C SPIDER 1.7 TBI TCT / 1.75 6V 240" "ALFA ROMEO" "4C SPIDER" "4C SPIDER 1.7 TBI TCT 240" "ALFA ROMEO" "4C SPIDER" "4C SPIDER 1.7 TBI TCT / 1.75 6V 240" end

The lines in red should be instead "4C". So, probably some models are not quite correct.
Well, that's not a problem, because afterwards, I might be able to correct it by hand. But, I suppose it can take a little bit of time.

Thank you for your feedback. I'll see what I can do with -engine_cap- and/or -fiscal_power-.
Should I add -engine_cap- and/or -fiscal_power- in the -joinby- process?

Have a nice day Benno. Thank you again for your help!
Michael

---
Edit: Perhaps the unique downside to this is that unfortunately sometimes these values are not necessarily consistent:
For example: sometimes in the registration data, I have a cubic capacity (-engine_cap-) for a brand-model of 12.26, and in the price register, this value is 12.25.

Or sometimes this value is say 11.197 in the registration database, and 11.2 in the price database. Unfortunately, this also happens with the -fiscal_power- variable.

Last edited by Michael Duarte Goncalves; 19 Apr 2024, 02:28.
Comment
Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#6

19 Apr 2024, 04:07

Maybe I haven't yet understood your problem right. I thought you want to merge the information from these two datasets together in order to analyze for example price development over time or something like that.
The more I think about your problem, the more questions I have unfortunately.
Of one the first would be why there are duplicates in your first provided dataset? Are these simply multiple registrations of the same vehicle and wouldn't these be combined in some way for an analysis anyway? If it's about numbers, you would probably form sums, I assume.

My second thought is that I wouldn't try to reduce the detailed model name from dataset2 to somehow match that from dataset1, because doing so would throw away valuable information that distinguishes the vehicles. I can well imagine that your data set contains even more discriminatory features than you used for this example. I'm thinking here of the manufacturer code numbers and type code numbers that are common in Germany, for example, and which always clearly identify the model of vehicles.
If you had such information in both data sets, you wouldn't have to work with model names in strings, which probably have a completely different structure across all manufacturers and would therefore require a lot of effort to harmonize in order to be suitable as identifier variable for merging.

A possible way to deal with inconsistencies between registration and price data could potentially be to round both variables to one decimal place or to the nearest full 100.

And as a final thought on this: The registration data of the vehicles probably does not include what exact equipment the vehicle has, unlike the data on the prices. In this respect, you would have to think about whether you should e.g. calculate the average price in dataset2 for the models "SANDERO 0.9 TCE Laureate 90" and "SANDERO 0.9 TCE Laureate Easy-R 90" and only keep "SANDERO 0.9 TCE" in the data. Unfortunately, you would have to go much deeper into breaking up the strings of model names and that would probably require an incredible amount of manual work to find out which models and different features are the same for all manufacturers.

Last edited by Benno Schoenberger; 19 Apr 2024, 04:19.
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#7

19 Apr 2024, 05:22

First of all, thank you again for your interest and willingness to help! Really appreciated!

The main idea is to aggregate as many models as possible into a single category, rather than having extremely precise model names. We then probably compute the average price, for a given year, model, etc.
Secondly, we'd like to see how the price evolves, as you've so rightly said. There were a lot of errors in the way the models were written in the registration cars dataset, which is why they have been harmonised.

My second thought is that I wouldn't try to reduce the detailed model name from dataset2 to somehow match that from dataset1, because doing so would throw away valuable information that distinguishes the vehicles. I can well imagine that your data set contains even more discriminatory features than you used for this example. I'm thinking here of the manufacturer code numbers and type code numbers that are common in Germany, for example, and which always clearly identify the model of vehicles.
If you had such information in both data sets, you wouldn't have to work with model names in strings, which probably have a completely different structure across all manufacturers and would therefore require a lot of effort to harmonize in order to be suitable as identifier variable for merging.

In the first dataset, duplicate values exist because a manual cleaning process had already been carried out beforehand in terms of models. Many models have been combined under a single name.

We have a number of characteristics that allow us to distinguish exact model, such as engine capacity, fuel, kilowatts, co2 per gram per kilometre, etc.

Unfortunately, the vehicle manufacturer code you are entering only exists in the registration database. We're looking for better data, and it looks like we'll have this information for vehicle prices in the near future.

A possible way to deal with inconsistencies between registration and price data could potentially be to round both variables to one decimal place or to the nearest full 100.

Yes, I've been thinking about rounding it off to deal with inconsistencies. Are -floor- or -ceil- functions suitable in this case?

And as a final thought on this: The registration data of the vehicles probably does not include what exact equipment the vehicle has, unlike the data on the prices. In this respect, you would have to think about whether you should e.g. calculate the average price in dataset2 for the models "SANDERO 0.9 TCE Laureate 90" and "SANDERO 0.9 TCE Laureate Easy-R 90" and only keep "SANDERO 0.9 TCE" in the data. Unfortunately, you would have to go much deeper into breaking up the strings of model names and that would probably require an incredible amount of manual work to find out which models and different features are the same for all manufacturers.

In our case, the opposite is true. The data in the registration database is much more detailed than that for prices. It contains the car manufacturer code you mentioned, which is not available in the price code.
What's more, in the price database - and this is of interest to us too - it's very difficult for us to name precisely whether a car is a hybrid HEV or PHEV, or an electric BEV or EREV, and so on.
Let see how we can deal with it...

Thank you again for your amazing explanations, and most importantly, for your time!

Michael
Comment
Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#8

19 Apr 2024, 07:33

For example: sometimes in the registration data, I have a cubic capacity (-engine_cap-) for a brand-model of 12.26, and in the price register, this value is 12.25.

Or sometimes this value is say 11.197 in the registration database, and 11.2 in the price database. Unfortunately, this also happens with the -fiscal_power- variable

Whether you use floor, ceil or round here is mainly a question of what your data looks like, what differences there are between the two data sets and what you want to achieve. In a first attempt, I would round to one decimal place and then check whether the models can be assigned to each other more clearly. Your engine_cap data from the first posted dataset looks like cubic centimeters to me, so I would probably go to the nearest whole hundred here. But if you have values similar to those of the fiscal_power variable, you will have to round to a certain decimal place.

Code:

gen engine_cap_rd = round(engine_cap,100) gen fiscal_power_rd=round(fiscal_power,.1)
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#9

19 Apr 2024, 08:41

Hi Benno Schoenberger:

Everything is clear!
Thank you so much for your help in this thread, and for your useful comments!

Best,

Michael
Comment

Announcement