
  • How to replace Spanish letters and accents with English letters

    Hi everyone,

    Hope you can help. I need to replace Spanish letters and accents with English letters in the master dataset so that my observations will match the observations in my using dataset.
    Any thoughts?

    Thanks a lot.

    Elizabeth K

  • #2
    Forgot to mention that I am using Stata 14. Thanks


    • #3
      Code:
      loc s yourvariable
              replace `s' = subinstr(`s', "Á", "a", .)
              replace `s' = subinstr(`s', "É", "e", .)
              replace `s' = subinstr(`s', "Í", "i", .)
              replace `s' = subinstr(`s', "Ó", "o", .)
              replace `s' = subinstr(`s', "Ú", "u", .)
              replace `s' = subinstr(`s', "Ñ", "n", .)
      Also add the same lines for the lowercase accented letters áéíóúñ.
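
      If you would rather not write one line per letter, the same substitution can be looped. This is only a sketch: the variable name and the three sample values are made up, and unlike the lines above it maps uppercase accented letters to uppercase plain letters, which is an assumption about what you want.

      ```stata
      * Sketch only: -name- and its values are hypothetical sample data
      clear
      input str20 name
      "MUÑOZ"
      "Íñigo"
      "José"
      end

      * Parallel lists: the i-th word of -from- is replaced by the i-th word of -to-
      local from "Á É Í Ó Ú Ñ á é í ó ú ñ"
      local to   "A E I O U N a e i o u n"
      local n : word count `from'

      forvalues i = 1/`n' {
          local f : word `i' of `from'
          local t : word `i' of `to'
          quietly replace name = subinstr(name, "`f'", "`t'", .)
      }
      list
      ```

      To keep the lowercase output of the original lines, wrap the result in lower().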


      • #4
        If you are using Stata 14, the data are stored as Unicode (UTF-8). There are functions to convert to plain ASCII:

        Code:
        . dis ustrto(ustrnormalize("ÁÉÍÓÚÑáéíóúñ", "nfd"), "ascii", 2)
        AEIOUNaeioun
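
        To tie this back to the original question about matching master and using data, one possible approach is to build a plain-ASCII key and merge on it. In this sketch the variables citykey and code and the two tiny datasets are hypothetical; the 2 in ustrto() tells Stata to skip any character it cannot represent in ASCII, which is what drops the combining accents left over after "nfd" normalization.

        ```stata
        * Hypothetical using dataset whose names are already plain ASCII
        clear
        input str20 city byte code
        "Cordoba"  1
        "A Coruna" 2
        end
        gen citykey = lower(city)
        keep citykey code
        tempfile usingdata
        save `usingdata'

        * Hypothetical master dataset with accented names
        clear
        input str20 city
        "Córdoba"
        "A Coruña"
        end

        * Decompose accented letters (NFD), then skip everything that is not ASCII
        gen citykey = lower(ustrto(ustrnormalize(city, "nfd"), "ascii", 2))

        merge 1:1 citykey using `usingdata'
        list city citykey code _merge
        ```

        Merging on a derived key leaves the original spelling in city untouched, so nothing is lost if the conversion turns out to be too aggressive.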


        • #5
          Thanks very much Sergio and Robert. I really appreciate your help.


          • #6
            One more question. I also have another special character, one that Stata does not recognize: �. So, for a word that should be "bueno", I have "buen�".

            loc s city
            replace `s' = subinstr(`s', "�", "o", .)
            (0 real changes made)

            However, Stata does not make any changes after I run the command.

            Any thoughts?

            Thanks again.



            • #7
              I believe that this special character is the Unicode replacement character. You can run the following to display it:
              Code:
              dis ustrunescape("\ufffd")
              Again, if you use the proper Unicode functions, this character will be removed.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str50 city
              "Córdoba"
              "A Coruña"
              "buen�"
              end
              
              gen cityfix = ustrto(ustrnormalize(city, "nfd"), "ascii", 2)
              list
              and the results:
              Code:
              . list
              
                   +---------------------+
                   |     city    cityfix |
                   |---------------------|
                1. |  Córdoba    Cordoba |
                2. | A Coruña   A Coruna |
                3. |    buen�       buen |
                   +---------------------+


              • #8
                Hi Robert,


                Thanks a lot for replying. What I am trying to do is replace the special character with "o".

                However, this command does not make any changes:

                loc s city
                replace `s' = subinstr(`s', "�", "o", .)
                (0 real changes made)


                • #9
                  It could also be an invalid UTF-8 character. Try the following


                  Code:
                  loc s city
                  di tobytes(`s')
                  and report back the output.
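
                  Note that display with a variable name only evaluates the first observation, so it may miss the row that has the problem. A small sketch that stores the bytes for every observation and checks for a genuine U+FFFD follows; the data are simulated with char(200), and the variable names bytes and hasfffd are made up for illustration.

                  ```stata
                  * Simulated data: the stray byte 200 stands in for the unknown problem byte
                  clear
                  input str50 city
                  "Córdoba"
                  "buen"
                  end
                  replace city = city + char(200) in 2

                  * Escaped decimal byte values of every observation, not just the first
                  gen bytes = tobytes(city)

                  * 1 if the string holds a real Unicode replacement character (U+FFFD)
                  gen byte hasfffd = strpos(city, ustrunescape("\ufffd")) > 0

                  list city bytes hasfffd
                  ```

                  If hasfffd is 0 for a row that still displays "�", the string most likely holds an invalid byte sequence rather than a real replacement character, which would explain why subinstr() finds nothing to replace.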


                  • #10
                    I don't see why your code is not working. Here are two ways to convert the Unicode replacement character to the letter "o" and then convert to ascii:

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str50 city
                    "Córdoba"
                    "A Coruña"
                    "buen�"
                    end
                    
                    gen cityfix = subinstr(city, "�", "o", .)
                    replace cityfix = ustrto(ustrnormalize(cityfix, "nfd"), "ascii", 2)
                    list cityfix
                    
                    gen cityfix2 = subinstr(city, ustrunescape("\ufffd"), "o", .)
                    replace cityfix2 = ustrto(ustrnormalize(cityfix2, "nfd"), "ascii", 2)
                    list cityfix2


                    • #11
                      Hi Robert,

                      The first two commands do get rid of the "�".

                      However, the last two commands do not replace "�" with "o".
                      That's really strange. Basically, "buen�" becomes "buen".
                      Thanks again


                      • #12
                        If the code I posted does not work, then Hua is right: it must be due to an invalid UTF-8 sequence. When you ask Stata to show a string that contains such a sequence, it displays the Unicode replacement character because there is simply no character representation for the invalid bytes. Fortunately, the ustrfix() function can be used to fix these.

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input str50 city
                        "Córdoba"
                        "A Coruña"
                        "buen"
                        end
                        
                        replace city = city + char(200) in 3
                        list
                        
                        gen cityfix = ustrfix(city, "o")
                        list
                        
                        replace cityfix = ustrto(ustrnormalize(cityfix, "nfd"), "ascii", 2)
                        list cityfix
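
                        Before fixing anything, it can help to see which observations are affected. A minimal sketch, using the same simulated data as in the code above (the variable name nbad is made up): ustrinvalidcnt() counts the invalid UTF-8 sequences in each string, so the fix can be limited to the rows that need it.

                        ```stata
                        * Same simulated data as above: byte 200 is an invalid UTF-8 sequence
                        clear
                        input str50 city
                        "Córdoba"
                        "A Coruña"
                        "buen"
                        end
                        replace city = city + char(200) in 3

                        * Number of invalid UTF-8 sequences per observation
                        gen nbad = ustrinvalidcnt(city)
                        list if nbad > 0

                        * Replace each invalid sequence with "o" only where one was found
                        replace city = ustrfix(city, "o") if nbad > 0
                        list
                        ```

                        Listing the flagged rows first matters when the same invalid byte stands for different letters in different words, because ustrfix() applies one replacement to all of them.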


                        • #13
                          Hello! That last set of commands worked. Thanks so much to all of you for your help.


                          • #14
                            Dear Stata List,

                            I have a similar problem. The only difference is that I have several of these "�", and I would like to transform them into different letters, for example "c", "o", or "ue", depending on the word.

                            For example:

                            "Besan�on" > "Besancon"
                            "D�sseldorf" > "Duesseldorf"
                            .....

                            Here is some of my data:

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input str26 city
                            "Besan�on"
                            "Bourg en Bresse"
                            "Cambrai"
                            "Chamb�ry"
                            "S�lestat"
                            "Dessau"
                            "Dinkelsb�hl"
                            "D�beln"
                            "Dortmund"
                            "Dresden"
                            "D�ren"
                            "D�sseldorf"
                            "Eichst�tt"
                            "Glauchau"
                            "G�rlitz"
                            "Gotha"
                            "G�ttingen"
                            end
                            How could I solve this? Any help is much appreciated!

                            Thank you!!!


                            • #15
                              Posting on this topic in case someone has this problem and wants a user-written solution. Here is a command I wrote specifically for this purpose; it works with both Stata 13 and below (ASCII) and Stata 14 and up (Unicode).

                              As to Rick Lich's question, there is not much you can do at this point besides cleaning the names manually, since the data are already corrupted. This usually happens when moving between programs (or versions of Stata) that use different encodings. If what you have are city names, and these repeat multiple times in your code, then you can use regexm to match a portion of the corrupted name and replace it with the correct one.

                              For example:
                              Code:
                              replace city="Dusseldorf" if regexm(city,"sseldorf")
                              You'd have to do this city-by-city and make sure your regexm expressions don't create any false matches. Properly importing the original dataset is a much better solution.
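
                              If the strings still hold the original single bytes rather than real replacement characters, re-reading them with the right encoding may recover the letters without any city-by-city work. The sketch below is a guess at that situation, not a diagnosis of the data in #14: it assumes the source encoding was ISO-8859-1 (Latin-1), simulates two such names with char(), and uses made-up variable names cityutf8 and cityascii.

                              ```stata
                              * Simulate Latin-1 bytes sitting in a Unicode (Stata 14+) dataset
                              clear
                              input str26 city
                              "Besan"
                              "D"
                              "Dortmund"
                              end
                              replace city = city + char(231) + "on" in 1        // byte 231 is ç in ISO-8859-1
                              replace city = city + char(252) + "sseldorf" in 2  // byte 252 is ü in ISO-8859-1

                              * Convert only strings that are not valid UTF-8, so that names
                              * that are already correct are not converted a second time
                              gen cityutf8 = city
                              replace cityutf8 = ustrfrom(city, "ISO-8859-1", 1) if ustrinvalidcnt(city) > 0

                              * Spell ü as "ue" first, then strip any remaining accents
                              gen cityascii = subinstr(cityutf8, "ü", "ue", .)
                              replace cityascii = ustrto(ustrnormalize(cityascii, "nfd"), "ascii", 2)
                              list
                              ```

                              If the "�" in the data is already a genuine U+FFFD, the original byte is gone and this will not help; in that case the manual approach above, or re-importing the source file with the correct encoding, is what remains.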