Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How Can I Clean Strings to Match Those from Another Dataset, with Different Name Conventions?

    Hi everyone,

    I need to do a challenging task that is taking so much time. I need to match string variables from one dataset to another one. Problem: String are written in different ways across datasets

    To do that, I have two different datasets:
    1. The first one is about the new vehicles that have been sold in Spain, from 2015 to 2019. Unfortunately, the name conventions of car models are not precise, so it's difficult to know which model households have bought (Demand).
    2. The second contains exhaustive information on the models available for purchase in Spain, over several years up to 2018. Here, the model names are extremely precise (Prices)
    My final aim is to match the two sets of data to the right car model. Demand contains important information about engine capacity, year of registration, and fuel used, which would help us to match the correct model in Prices, as some of this information is also available in Prices.

    Demand:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description str31 model str8 FEC_MATRICULA str1 COD_PROPULSION long cilindrada
    "ALFA ROMEO" "ALFA GIULIETTA" "12062017" "0" 1368
    "ALFA ROMEO" "STELVIO"        "29112019" "1" 2143
    "ALFA ROMEO" "ALFA GIULIETTA" "19122018" "6" 1368
    "ALFA ROMEO" "ALFA GIULIETTA" "15102015" "0" 1368
    "ALFA ROMEO" "ALFA GIULIETTA" "19072017" "1" 1598
    "ALFA ROMEO" "STELVIO"        "02012019" "1" 2143
    "ALFA ROMEO" "STELVIO"        "08012019" "1" 2143
    "ALFA ROMEO" "STELVIO"        "11022019" "0" 1995
    "ALFA ROMEO" "ALFA GIULIETTA" "17122015" "1" 1598
    "ALFA ROMEO" "STELVIO"        "15122017" "1" 2143
    "ALFA ROMEO" "ALFA GIULIETTA" "16082019" "0" 1368
    "ALFA ROMEO" "STELVIO"        "22062018" "1" 2143
    "ALFA ROMEO" "ALFA GIULIETTA" "16052019" "0" 1368
    "ALFA ROMEO" "GIULIA"         "21082019" "0" 1995
    "ALFA ROMEO" "ALFA GIULIETTA" "30112018" "0" 1368
    end
    • description: Car Brand
    • model: Car Model
    • FEC_MATRICULA: Date of Registration
    • COD_PROPULSION: 0 = Gasoline (Petrol); 1 = Diesel
    • cilindrada: Cubic Capacity
    Prices:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str13 description str77 model str9 periodocomercial int cc str3 gd long price
    "ALFA ROMEO" "GIULIETTA 1.4 TB Distinctive TCT 170"        "2010-2016" 1368 "G" 20600
    "ALFA ROMEO" "GIULIETTA 1.7 TBi QV"                        "2010-2016" 1742 "G" 22600
    "ALFA ROMEO" "GIULIETTA 1.6 JTD m2 S.E.Veloce"             "2010-2016" 1598 "D" 18300
    Code:
    "ALFA ROMEO" "Stelvio 2.0 PETROL 280 CV EXECUTIVE AWD"     "2018-" 1995 "G" 44600
    "ALFA ROMEO" "Stelvio 2.0 PETROL 280 CV SPECIALE AWD"      "2017-" 1995 "G" 45000
    "ALFA ROMEO" "Stelvio 2.0 Turbo 200 Executive Q4 Aut. 8V"  "2018-" 1995 "G" 39500
    "ALFA ROMEO" "Stelvio 2.0 Turbo 200 Super Q4 Aut."         "2017-" 1995 "G" 36600
    "ALFA ROMEO" "Stelvio 2.0 Turbo 280 Executive Q4 Aut. 8V"  "2018-" 1995 "G" 43200
    "ALFA ROMEO" "Stelvio 2.0 Turbo 280 First Edit.II Q4 Aut." "2017-" 1995 "G" 44600
    "ALFA ROMEO" "Stelvio 2.2 D 150 RWD Aut."                  "2017-" 2143 "D" 33000
    "ALFA ROMEO" "Stelvio 2.2 D 180 Q4 Aut."                   "2017-" 2143 "D" 36900
    "ALFA ROMEO" "Stelvio 2.2 D 180 RWD Aut."                  "2017-" 2143 "D" 35100
    "ALFA ROMEO" "Stelvio 2.2 D 180 Super Q4 Aut."             "2017-" 2143 "D" 38600
    "ALFA ROMEO" "Stelvio 2.2 D 180 Super RWD Aut."            "2017-" 2143 "D" 36700
    end
    • description: Car Brand
    • model: Car Model
    • periodocomercial: Commercial Period
    • cc: Cubic Capacity
    • gd: G = Gasoline (Petrol); D = Diesel
    • price: Car model prices
    Can anyone help me with this, please?
    Thank you in advance!

    Best,

    Michael

  • #2
    You can try either reclink (Wasi & Flaan, Stata Journal 2015) or matchit from Julio Raffo (statalist.org/forums/...) Both are available on SSC. Here's a post where Julio compares the two.
    Devra Golbe
    Professor Emerita, Dept. of Economics
    Hunter College, CUNY

    Comment


    • #3
      Hi Devra Golbe,

      Thank you very much for your answer and suggestions!
      Lovely day,

      Michael

      Comment


      • #4
        Perhaps one more question about this topic, please:
        • Is -strgroup- from SSC a suitable command to do that?
        Thank you in advance for your help.
        Michael

        Comment

        Working...
        X