Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • -collapse-: is there a way to come back with data cleaned with collapse to the whole disaggregated data?

    Hi everyone,

    Hello,

    I would like to know if it is possible to retrieve the changes made during a collapse of the initial database.

    This is a special case using the SSC -strgroup()- command. It uses the Levenstein distance. Here is my code:

    Code:
    // --- //
    // T ---
    cd "${path}/stata/demand"
    use "cardesc15-19_fordemand_cleaned_withenginecap.dta", clear
    
    keep if description == "T"
    replace model = itrim(trim(model))
    gen num = 1
    collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
    tab model [aw=num]
    sort model
    
    
    strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
    
    sort similar_model1
    
    
    replace model = "4 RUNNER" if inrange(similar_model1, 2, 3)
    
    replace model = "GT86" if similar_model1 == 4 ///
                                | inrange(similar_model1, 75, 76) ///
                                | similar_model1 == 148
    
    replace model = "AURIS" if inrange(similar_model1, 5, 29) ///
                                | similar_model1 == 138 ///
                                | similar_model1 == 202
    
    ...
    and I do that for all car models that I have. For example:

    Code:
    // P ---
    cd "${path}/stata/demand"
    use "cardesc15-19_fordemand_cleaned_withenginecap.dta", clear
    
    keep if description == "P"
    replace model = itrim(trim(model))
    gen num = 1
    collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
    tab model [aw=num]
    sort model
    
    strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
    
    sort similar_model1
    
    
    split model, parse(" ") g(split_)
    
    replace model = split_1 if inrange(similar_model1, 1, 2)
    replace model = split_1 if inrange(similar_model1, 8, 71)
    
    replace model = split_1 if inrange(similar_model1, 78, 661)
    replace model = split_2 if inrange(similar_model1, 1116, 1123)
    
    replace model = "108" if inrange(similar_model1, 1, 661) & ustrpos(model, "108")>0 ///
                        | inrange(similar_model1, 761, 1026) & ustrpos(model, "108")>0 
    
    ...
    Is there a way to retrieve those changes applied to my basic dataset?

    Here is my basic dataset, w/o cleaning anything. Let say for T car model:



    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long(weight_max muni_code)
    "T" "AURIS"             "0" 1798 12.49 1815 28005
    "T" "T COROLLA"    "0" 1987 13.27 1910 11020
    "T" "T YARIS"      "0" 1329 10.42 1490 38006
    "T" "T PRIUS PLUS" "0" 1798 12.49 2115  8019
    "T" "T RAV4"       "1" 1998 13.31 2135  4052
    end
    And here the changed dataset with collapse and cleaned:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long weight_max double num long similar_model1
    "T" "4 RUNNER" "0" 3956    24 2858 1 2
    "T" "4 RUNNER" "0" 3956  23.9 2575 3 2
    "T" "4 RUNNER" "1" 4000    24 2798 1 2
    "T" "4 RUNNER" "0" 2366  14.9 2100 1 2
    "T" "4 RUNNER" "0" 3956  23.9 2575 5 2
    "T" "4 RUNNER" "0" 4000    24 2743 1 2
    "T" "4 RUNNER" "0" 3400  21.8 2308 1 2
    "T" "4 RUNNER" "0" 3955    24 2440 3 2
    "T" "4 RUNNER" "0" 3956  23.9 2766 2 2
    "T" "4 RUNNER" "0" 3955 23.59 2600 1 3
    end

    Could anyone provide me some help, please?
    I really wouldn't want to have to do all that cleaning work again. Thanks in advance.

    Best,

    Michael

  • #2
    Save the collapsed results and then joinby the results on the relevant variables?

    Comment


    • #3
      Hi George Ford,

      Thank you for your help! I have two questions, please:
      1. What did you mean by joinby, please?
      2. When you wrote "save", did you mean using -tempfile-, for example?
      Thank you in advance for your answer.
      Lovely day!

      Michael

      Comment


      • #4
        joinby means the joinby command
        save means the save command. You can use a tempfile if you want, but that is not necessary

        If you don't know that a command like joinby does, then you type in Stata help joinby . Don't forget that the pdf manual contains more detailed information than the helpfile, so if you don't understand the help-file, look at the pdf-manual entry for that command. A link to that entry is at the top of the help-file.
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        Comment


        • #5
          Hi Maarten Buis,

          Thank you very much for your help!
          I will have a look into it.

          Lovely day!

          Michael

          Comment


          • #6
            What Maarten said--sorry for the vague reply.

            Collapse the data and save the results to collapsed.dta.

            Then

            joinby id1 id2 (or as many as you need) using collapsed.dta

            I prefer joinby to merge for nearly all purposes.

            Comment


            • #7
              Hi George Ford,

              No worries! Thank you for taking the time to answer me.
              Thank you also for your suggestion. I'll take a look at it soon, and get back to you if I hit a bone.

              Lovely day (and weekend!).
              Best,

              Michael

              Comment


              • #8
                I have been looking into -joinby-:

                I was not aware of this command. Thanks for sharing this Maarten and George!

                As a first attempt, I used the following code:

                Code:
                clear*
                use "cleaned_models.dta", clear
                joinby description COD_PROPULSION cilindrada potenciafiscal weight_max using "C:/Users/miduarte/Desktop/Ongoing_Projects/LEZ_Project/Data/stata/demand/cardesc15-19_fordemand_uncleaned.dta"
                Here are a few explanations to help you understand what I want an do:
                1. I have an uncleaned, unaggregated database with several car models registered between 2015 and 2019. Several models are the same but simply misspelt.
                2. I then use -collapse- on my database to get the number of models according to how they are written, according to several key characteristics: cubic capacity, weight, tax weight and fuel. The number of models with a given model's name and those characteristics are expressed by the variable -num- in my post #1. To clean, I use the SSC -strgroup- command, as many models are the same, but misspelt.
                I then clean this data and obtain a second dataset with harmonised model names. However, I would like to be able to go back to my using dataset, but with the clean models.
                After applying the code above, I end up with a total number of observations of 8,129,491, whereas this should only have 4,004,455 observations.
                • Does anyone have an explanation? It could be that several models have (i) the same displacement, (ii) tax weight, (iii) weight (iv) fuel, in case that might play a part.
                Finally, if it helps, here are two examples of my data sets. Let's label dataset master "cleaned_models.dta", and using "cardesc15-19_fordemand_uncleaned.dta", as above:

                cardesc15-19_fordemand_uncleaned.dta:

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str23 description str31 model long cilindrada double potenciafiscal long weight_max str1 COD_PROPULSION
                "DACIA" "SANDERO"  898  7.34 1550 "0"
                "DACIA" "LODGY"   1598 11.64 1842 "0"
                "DACIA" "SANDERO" 1461 11.02 1590 "1"
                "DACIA" "DUSTER"  1461 11.02 1785 "1"
                "DACIA" "LODGY"   1197  9.78 1810 "0"
                "DACIA" "LODGY"   1461 11.02 1958 "1"
                end
                and cleaned_models.dta:


                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str23 description str39 model long cilindrada double potenciafiscal long weight_max str1 COD_PROPULSION double num
                "DACIA" "DUSTER" 1461 11.02 1771 "1"  1739
                "DACIA" "DUSTER" 1461 11.02 1775 "1"  4286
                "DACIA" "DUSTER" 1461 11.02 1780 "1"    10
                "DACIA" "DUSTER" 1461 11.02 1785 "1"  3655
                "DACIA" "DUSTER" 1461 11.02 1795 "1" 10945
                end



                Thank you in advance for your help!
                Best,

                Michael

                Edit: I also used the option -unmatched(none)-, but it gives me the same number of observations, in other terms: 8,129,491.
                Last edited by Michael Duarte Goncalves; 23 Feb 2024, 09:41.

                Comment


                • #9
                  joinby will match everything that matches, so your join variables are not unique to an observation. Some of your rows are matching multiple rows in the using file.

                  I'd check for duplicates in the using file to search for why that's happening. There may be a simple solution, or not.

                  Also see
                  HTML Code:
                  https://libguides.princeton.edu/fuzzy-merge

                  Comment


                  • #10
                    Hi George,

                    Thank you for your prompt reply!
                    Thank you very much for the link. However, my PIs are not very keen on -matchit- and -reclink-, preferring me to clean up the data as much as possible, and then do the merge afterwards.

                    Thanks again for your help. I'll see what I can do with what I've got.

                    Best,

                    Michael

                    Comment


                    • #11
                      The other option is to try to clean up the mis-spellings, but I've worked with car registration data before and expect that you'll have a lot to clean up.

                      If there are not too many models, you may be able to search for a unique part of the name that will identify cleanly. How many ways can Camry be mis-spelled?

                      Comment


                      • #12
                        Thanks for the suggestion, George! Seems a good idea indeed!
                        Yes, there is a lot of cleaning! And many models, too...


                        I'll look into it and keep you posted.
                        Thanks a lot!

                        Michael

                        Comment


                        • #13
                          Hi George Ford,

                          A small update about what you suggested me in #11:
                          • I looked for a unique part of the name that identified cleanly models... And everything works perfectly!
                          Thank you very much again and again for your help. I didn't know this command -joinby-. And it is very useful indeed. I will use it more often from now on.

                          All the best,

                          Michael

                          Comment


                          • #14
                            Once you use joinby, you'll never go back.

                            Glad it worked.

                            Comment

                            Working...
                            X