  • String cleaning: strings that look identical in Stata but do not seem to be treated as identical

    Hi everyone,

    I have a string variable called -model-, and I want to solve a problem that occurs systematically in this dataset.
    I'll start with two small -dataex- examples:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str31 model str1 COD_PROPULSION long cilindrada double(potenciafiscal num)
    "ARTEON" "0" 1498 11.19 226
    "ARTEON" "0" 1498 11.19 150
    "ARTEON" "0" 1984 13.25 334
    "ARTEON" "0" 1984 13.25 113
    "ARTEON" "0" 1984 13.25  45
    "ARTEON" "0" 1984 13.25  19
    "ARTEON" "1" 1968 13.19 312
    "ARTEON" "1" 1968 13.19  29
    "ARTEON" "1" 1968 13.19 996
    "ARTEON" "1" 1968 13.19 240
    "ARTEON" "1" 1968 13.19  37
    "ARTEON" "1" 1968 13.19 237
    end
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str31 model str1 COD_PROPULSION long cilindrada double(potenciafiscal num)
    "BEETLE" "0" 1197  9.78  90
    "BEETLE" "0" 1197  9.78 520
    "BEETLE" "0" 1197  9.78  10
    "BEETLE" "0" 1197  9.78 108
    "BEETLE" "0" 1197  9.78   3
    "BEETLE" "0" 1197  9.79   1
    "BEETLE" "0" 1197  9.79   6
    "BEETLE" "0" 1390  10.7   4
    "BEETLE" "0" 1390  10.7   4
    "BEETLE" "0" 1390 10.71   1
    "BEETLE" "0" 1395     0   1
    "BEETLE" "0" 1395 10.72  47
    end

    The variable -num- is the count of observations for the model in that row.
    However, I don't understand why there are several separate counts for strings that look identical.

    My aim is to get one correct count for each model.
    I imagine this might be due to spaces or other characters, but I don't know how to harmonise them.

    Does anyone know how I can get around this issue, please?
    Thank you in advance for your help.

    Michael

    Edit: I already tried the following, but it does not work:

    Code:
    replace model = itrim(trim(model))
    Last edited by Michael Duarte Goncalves; 10 Jan 2024, 04:30.

  • #2
    I don't understand the problem.


    • #3
      There are now several threads, all around the same underlying problem:

      https://www.statalist.org/forums/for...me-conventions

      https://www.statalist.org/forums/for...other-data-set


      I have not responded to any of them mainly because I find them too hard to follow. My suggestion is this: post an example of a couple of (say four or so) observations from the two datasets that you want to combine. Then post the desired result, showing both observations that should and should not be matched. Explain why you want the matched ones matched and the non-matched ones not matched.

      As for this specific thread, please explain how exactly you have created num. The problem most certainly lies there because the strings you show here are indeed identical:

      Code:
      . * Example generated by -dataex-. For more info, type help dataex
      . clear
      
      . input str31 model str1 COD_PROPULSION long cilindrada double(potenciafiscal num)
      
                                     model  COD_PRO~N    cilindrada  potencia~l         num
        1. "ARTEON" "0" 1498 11.19 226
        2. "ARTEON" "0" 1498 11.19 150
        3. "ARTEON" "0" 1984 13.25 334
        4. "ARTEON" "0" 1984 13.25 113
        5. "ARTEON" "0" 1984 13.25  45
        6. "ARTEON" "0" 1984 13.25  19
        7. "ARTEON" "1" 1968 13.19 312
        8. "ARTEON" "1" 1968 13.19  29
        9. "ARTEON" "1" 1968 13.19 996
       10. "ARTEON" "1" 1968 13.19 240
       11. "ARTEON" "1" 1968 13.19  37
       12. "ARTEON" "1" 1968 13.19 237
       13. end
      
      . 
      . sort model
      
      . by model : assert model == model[1]
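
      If you still suspect invisible characters, a rough check like the following may help. It is only a sketch: len_bytes, len_chars and model_clean are made-up names, and the Unicode functions assume Stata 14 or newer.

      Code:
      * byte length versus character length; they differ if non-ASCII characters are present
      generate len_bytes = strlen(model)
      generate len_chars = ustrlen(model)
      tabulate model len_bytes
      * count observations that change when Unicode whitespace is trimmed
      generate model_clean = ustrtrim(model)
      count if model != model_clean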


      • #4
        Thank you for your help, Jared Greathouse.
        Sorry for the confusion.

        I don't understand why in my first -dataex-, I have this:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str31 model str1 COD_PROPULSION long cilindrada double(potenciafiscal num)
        "ARTEON" "0" 1498 11.19 226
        "ARTEON" "0" 1498 11.19 150
        "ARTEON" "0" 1984 13.25 334
        "ARTEON" "0" 1984 13.25 113
        "ARTEON" "0" 1984 13.25 45
        "ARTEON" "0" 1984 13.25 19
        "ARTEON" "1" 1968 13.19 312
        "ARTEON" "1" 1968 13.19 29
        "ARTEON" "1" 1968 13.19 996
        "ARTEON" "1" 1968 13.19 240
        "ARTEON" "1" 1968 13.19 37
        "ARTEON" "1" 1968 13.19 237
        end
        Let's take the first two lines above. Why do I obtain two separate values for what appears to be the same model, with the same -COD_PROPULSION-, -cilindrada- and -potenciafiscal-?

        Code:
        "ARTEON" "0" 1498 11.19 226
        "ARTEON" "0" 1498 11.19 150
        rather than this:

        Code:
        "ARTEON" "0" 1498 11.19 376
        These two lines are the same model, right? So I don't understand why Stata does not sum the two values.

        Maybe my code is badly written. Here is the critical chunk:

        Code:
        // ---- models cleaning ---- //
        
        keep if description == "VOLKSWAGEN"
        replace model = itrim(trim(model))
        gen num = 1
        collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal weight_max)
        tab model [aw=num]
        strgroup model, generate(similar_model1) threshold(0.15) first normalize(shorter) force
        
        
        sort similar_model1
        replace model = "CALIFORNIA" if inlist(similar_model1, 16, 43)
        
        replace model = "BEETLE" if inlist(similar_model1, 18, 19, 20, 21, 22)
        
        replace model = "BORA" if inlist(similar_model1, 23, 24, 25, 26, 27)
        ...
        I hope everything is clear now.
        Michael

        Edit: daniel klein, I apologise for not having clearly stated what I wanted, and for not having explained myself properly. I'll update the other posts with your feedback. Thank you.
        Last edited by Michael Duarte Goncalves; 10 Jan 2024, 07:55.


        • #5
          I have not looked into strgroup (probably from SSC) or any other details, but you have description and weight_max in your by() option. These variables do not appear in the example dataset here, so we cannot know whether they are also the same for the respective observations.
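
          One way to check this yourself (only a sketch; dup is a made-up variable name, and it assumes the collapsed data from #4 are in memory) is to tag the rows that agree on the four variables shown in your example and then list them together with the two extra by() variables:

          Code:
          * tag rows that share the four variables shown in the example
          duplicates tag model COD_PROPULSION cilindrada potenciafiscal, generate(dup)
          * list those rows; whichever of description or weight_max varies
          * within a block is what keeps the rows apart in -collapse-
          sort model COD_PROPULSION cilindrada potenciafiscal weight_max
          list model COD_PROPULSION cilindrada potenciafiscal description weight_max num if dup > 0, sepby(model COD_PROPULSION cilindrada potenciafiscal)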


          • #6
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str23 description str31 model str1 COD_PROPULSION long cilindrada double potenciafiscal long weight_max double num
            "VOLKSWAGEN" "ARTEON" "0" 1498 11.19 2030 226
            "VOLKSWAGEN" "ARTEON" "0" 1498 11.19 2080 150
            "VOLKSWAGEN" "ARTEON" "0" 1984 13.25 2130 334
            "VOLKSWAGEN" "ARTEON" "0" 1984 13.25 2240 113
            "VOLKSWAGEN" "ARTEON" "0" 1984 13.25 2250  45
            "VOLKSWAGEN" "ARTEON" "0" 1984 13.25 2260  19
            "VOLKSWAGEN" "ARTEON" "1" 1968 13.19 2140 312
            "VOLKSWAGEN" "ARTEON" "1" 1968 13.19 2150  29
            "VOLKSWAGEN" "ARTEON" "1" 1968 13.19 2170 996
            "VOLKSWAGEN" "ARTEON" "1" 1968 13.19 2180 240
            "VOLKSWAGEN" "ARTEON" "1" 1968 13.19 2250  37
            "VOLKSWAGEN" "ARTEON" "1" 1968 13.19 2360 237
            end
            Are the differences due to -weight_max-?


            • #7
              Well, there you have your answer. weight_max is 2030 in the first and 2080 in the second observation.
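
              If the weight is not needed, a minimal sketch of the fix (assuming the -dataex- data from #6 are in memory) is to collapse again without weight_max in by():

              Code:
              * sum the counts over weight_max
              collapse (sum) num, by(description model COD_PROPULSION cilindrada potenciafiscal)
              list, sepby(COD_PROPULSION)

              For the first two observations this gives 226 + 150 = 376, the value asked for in #4. For one count per model, by(model) alone should do.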


              • #8
                Ok, great! Thanks for that clarification. I got my brushes all tangled up. Really sorry about that.
