Problems with rounded numbers when trying to merge two datasets

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 483

Problems with rounded numbers when trying to merge two datasets

14 Nov 2024, 09:53

Hi Statalist community,

I’m reaching out for assistance with a merging issue. I am trying to merge two datasets that appear to be very similar at a glance. However, I’m struggling to get a proper merge due to a type mismatch issue, specifically with the variable litres.

Below is a sample of my attempted merge, generated using -dataex-:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float ym_date str29 make str31 model str22 fueltype double litres str10 powertraintype byte _merge
692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
692 "HONDA" "JAZZ" "0"                1.3 "COMBUSTION" 2
end
format %tm ym_date
label values _merge _merge
label def _merge 1 "Master only (1)", modify
label def _merge 2 "Using only (2)", modify

When I browse through the datasets in Stata, litres appear identical across both files. However, during the merge, Stata adjusts the variable formats. Here’s my merge code:

Here is my code:

Code:

    // Car Prices -----------------------------------------------------
    
    use "${car_prices_15_20}/avg_prices_per_month_year_201501_202002.dta", clear
    
    /********************************************************************************
    2. Registrations data set preparing
    ********************************************************************************/
    
    use "${dgt_data_cleaned}/dgt_all_cars_cleaned_201501_202002.dta", clear
    
    rename marca_itv make
    
    rename modelo_itv model
    
    rename cod_propulsion_itv fueltype
    
    rename categoría_vehículo_eléctrico powertraintype
    
    rename moyr_matricula ym_date
        
    merge m:1 ym_date make model litres fueltype powertraintype using "${car_prices_15_20}/avg_prices_per_month_year_201501_202002.dta"

Here is what I obtained at the end of the -merge- process:

Code:

.         merge m:1 ym_date make model litres fueltype powertraintype using "`car_prices'"
(variable litres was float, now double to accommodate using data's values)
(variable fueltype was str2, now str22 to accommodate using data's values)

Could anyone help me with that? I tried -recast- and -round()-, but do not work neither.

Thank you very much in advance.

Tags: None

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2339
#2

14 Nov 2024, 19:26

After nearly 500 posts here, you would think to learn how to better structure your post to ask for help, especially the part about showing reproducible data examples and the code you used. Preferably, in an isolated context that does not require us to understand how the code snippets embed in your overall workflow. No (useful) data are presented so no code will be ventured.

There are two behaviours here you need to understand. First, if Stata changes a variable type, it is only to "upcast" it into a data type that will not result in any loss of precision in your data. For example, upcasting float to double. or str2 to str22.

Second, read the PDF entry on -help precision-. Stata has hinted what the problem is:

Code:

(variable litres was float, now double to accommodate using data's values)

What you see when you browse the data, for example, 1.3, is not represented internally as the same number when stored as a float versus double. You might try to recast the -double- litres as a float instead, since I doubt you need that level of precision for car data.
4 likes
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 483
#3

15 Nov 2024, 02:10

Hi Leonardo,

I apologise for not structuring my message correctly. I will read the PDF entry on -help precision- and come back with more questions (and reproducible datasets if needed). And hopefully a better post.

Again, sorry.
Michael
Comment

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 483

15 Nov 2024, 03:00

Hi again, so I think I found the problem in my code. Please let me know if it is not the case:

Code:

/*******************************************************************************
 1. Merging Brands and Models
********************************************************************************/    

    use "./0_cleaned_brands_DGT_data_201501_202002.dta", clear

    gen model_dup_original = modelo_itv
    
    drop modelo_itv
    
    label variable model_dup_original "Duplicated Original Model Name"
    
    replace model_dup_original = itrim(trim(model_dup_original))

    joinby marca_itv model_dup_original cod_propulsion_itv cilindrada_itv kw_itv categoría_vehículo_eléctrico using "./1_cleaned_models_DGT_data_201501_202002.dta", unmatched(both)
    
/*******************************************************************************
 2. Counting Models and Cleaning
********************************************************************************/    
 
    bys marca_itv : egen models_per_brand = count(num) // number of models per brand.
    
    // Just to check: total number of observations. Should be 6,503,711 (Date: 2024-11-14)
    
    egen total_cars = count(models_per_brand)
    
    *drop useless variables

    drop _merge models_per_brand num total_cars

    order fec_matricula_date fec_tramitacion_date year* month* moyr* marca_itv modelo_itv model_dup_original
    
    sort fec_matricula_date marca_itv
    
/*******************************************************************************
 3. Standardizing Vehicle Categories
********************************************************************************/    

    // Cleaning a little bit the categoría_vehiculo_electrico

    replace categoría_vehículo_eléctrico = "HEV" if categoría_vehículo_eléctrico == "HVE" | categoría_vehículo_eléctrico == "NOVC"
    
    replace categoría_vehículo_eléctrico = "COMBUSTION" if categoría_vehículo_eléctrico == "0000"| categoría_vehículo_eléctrico == ""
    
    replace categoría_vehículo_eléctrico = "EREV" if categoría_vehículo_eléctrico == "REEV"
    
/*******************************************************************************
 4. Convert engine capacity from cubic centimetres to litres
********************************************************************************/    

    gen double litres  = round(cilindrada_itv/1000, 0.1)
    
    order litres, after(cilindrada_itv)

/*******************************************************************************
 5. Final Save
********************************************************************************/    

    // Save dataset
    
    save "./dgt_cleaned_201501_202002/dgt_all_cars_cleaned_201501_202002.dta", replace

    exit, clear

Before your explanations, I have written it as gen litres = round(cilindrada_itv/1000, 0.1), transforming it directly into float type.

Thank you very much for your help and advices. And sorry again for for the wrong style of posting.

Michael

Announcement

Problems with rounded numbers when trying to merge two datasets

Comment

Comment

Comment