Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with rounded numbers when trying to merge two datasets

    Hi Statalist community,

    I’m reaching out for assistance with a merging issue. I am trying to merge two datasets that appear to be very similar at a glance. However, I’m struggling to get a proper merge due to a type mismatch issue, specifically with the variable litres.

    Below is a sample of my attempted merge, generated using -dataex-:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ym_date str29 make str31 model str22 fueltype double litres str10 powertraintype byte _merge
    692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
    692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
    692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
    692 "HONDA" "JAZZ" "0" 1.2999999523162842 "COMBUSTION" 1
    692 "HONDA" "JAZZ" "0"                1.3 "COMBUSTION" 2
    end
    format %tm ym_date
    label values _merge _merge
    label def _merge 1 "Master only (1)", modify
    label def _merge 2 "Using only (2)", modify
    When I browse through the datasets in Stata, litres appear identical across both files. However, during the merge, Stata adjusts the variable formats. Here’s my merge code:

    Here is my code:

    Code:
        // Car Prices -----------------------------------------------------
        
        use "${car_prices_15_20}/avg_prices_per_month_year_201501_202002.dta", clear
        
        /********************************************************************************
        2. Registrations data set preparing
        ********************************************************************************/
        
        use "${dgt_data_cleaned}/dgt_all_cars_cleaned_201501_202002.dta", clear
        
        rename marca_itv make
        
        rename modelo_itv model
        
        rename cod_propulsion_itv fueltype
        
        rename categoría_vehículo_eléctrico powertraintype
        
        rename moyr_matricula ym_date
            
        merge m:1 ym_date make model litres fueltype powertraintype using "${car_prices_15_20}/avg_prices_per_month_year_201501_202002.dta"
    Here is what I obtained at the end of the -merge- process:

    Code:
    .         merge m:1 ym_date make model litres fueltype powertraintype using "`car_prices'"
    (variable litres was float, now double to accommodate using data's values)
    (variable fueltype was str2, now str22 to accommodate using data's values)
    Could anyone help me with that? I tried -recast- and -round()-, but do not work neither.

    Thank you very much in advance.

  • #2
    After nearly 500 posts here, you would think to learn how to better structure your post to ask for help, especially the part about showing reproducible data examples and the code you used. Preferably, in an isolated context that does not require us to understand how the code snippets embed in your overall workflow. No (useful) data are presented so no code will be ventured.

    There are two behaviours here you need to understand. First, if Stata changes a variable type, it is only to "upcast" it into a data type that will not result in any loss of precision in your data. For example, upcasting float to double. or str2 to str22.

    Second, read the PDF entry on -help precision-. Stata has hinted what the problem is:

    Code:
    (variable litres was float, now double to accommodate using data's values)
    What you see when you browse the data, for example, 1.3, is not represented internally as the same number when stored as a float versus double. You might try to recast the -double- litres as a float instead, since I doubt you need that level of precision for car data.

    Comment


    • #3
      Hi Leonardo,

      I apologise for not structuring my message correctly. I will read the PDF entry on -help precision- and come back with more questions (and reproducible datasets if needed). And hopefully a better post.

      Again, sorry.
      Michael

      Comment


      • #4
        Hi again, so I think I found the problem in my code. Please let me know if it is not the case:

        Code:
        /*******************************************************************************
         1. Merging Brands and Models
        ********************************************************************************/    
        
            use "./0_cleaned_brands_DGT_data_201501_202002.dta", clear
        
            gen model_dup_original = modelo_itv
            
            drop modelo_itv
            
            label variable model_dup_original "Duplicated Original Model Name"
            
            replace model_dup_original = itrim(trim(model_dup_original))
        
            joinby marca_itv model_dup_original cod_propulsion_itv cilindrada_itv kw_itv categoría_vehículo_eléctrico using "./1_cleaned_models_DGT_data_201501_202002.dta", unmatched(both)
            
        /*******************************************************************************
         2. Counting Models and Cleaning
        ********************************************************************************/    
         
            bys marca_itv : egen models_per_brand = count(num) // number of models per brand.
            
            // Just to check: total number of observations. Should be 6,503,711 (Date: 2024-11-14)
            
            egen total_cars = count(models_per_brand)
            
            *drop useless variables
        
            drop _merge models_per_brand num total_cars
        
            order fec_matricula_date fec_tramitacion_date year* month* moyr* marca_itv modelo_itv model_dup_original
            
            sort fec_matricula_date marca_itv
            
        /*******************************************************************************
         3. Standardizing Vehicle Categories
        ********************************************************************************/    
        
            // Cleaning a little bit the categoría_vehiculo_electrico
        
            replace categoría_vehículo_eléctrico = "HEV" if categoría_vehículo_eléctrico == "HVE" | categoría_vehículo_eléctrico == "NOVC"
            
            replace categoría_vehículo_eléctrico = "COMBUSTION" if categoría_vehículo_eléctrico == "0000"| categoría_vehículo_eléctrico == ""
            
            replace categoría_vehículo_eléctrico = "EREV" if categoría_vehículo_eléctrico == "REEV"
            
        /*******************************************************************************
         4. Convert engine capacity from cubic centimetres to litres
        ********************************************************************************/    
        
            gen double litres  = round(cilindrada_itv/1000, 0.1)
            
            order litres, after(cilindrada_itv)
        
        /*******************************************************************************
         5. Final Save
        ********************************************************************************/    
        
            // Save dataset
            
            save "./dgt_cleaned_201501_202002/dgt_all_cars_cleaned_201501_202002.dta", replace
        
            exit, clear
        Before your explanations, I have written it as gen litres = round(cilindrada_itv/1000, 0.1), transforming it directly into float type.

        Thank you very much for your help and advices. And sorry again for for the wrong style of posting.

        Michael

        Comment

        Working...
        X