How to replace Spanish letters and accents by English letters

Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#1


08 Nov 2016, 09:59

Hi everyone,

Hope you can help. I need to replace Spanish letters and accents with English letters in the master dataset so that my observations will match the observations in my using dataset.
Any thoughts?

Thanks a lot.

Elizabeth K
Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#2

08 Nov 2016, 10:42

Forgot to mention that I am using Stata 14. Thanks

Sergio Correia

Join Date: Apr 2014
Posts: 420

08 Nov 2016, 11:06

Code:

loc s yourvariable
replace `s' = subinstr(`s', "Á", "A", .)
replace `s' = subinstr(`s', "É", "E", .)
replace `s' = subinstr(`s', "Í", "I", .)
replace `s' = subinstr(`s', "Ó", "O", .)
replace `s' = subinstr(`s', "Ú", "U", .)
replace `s' = subinstr(`s', "Ñ", "N", .)

And also add the lowercase accents áéíóúñ
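
One way to cover both the uppercase and lowercase letters in a single pass is to loop over two parallel lists. This is only a sketch: it assumes Stata 14 or later (Unicode strings), it assumes you want to preserve case, and yourvariable is a placeholder for your own string variable.

```stata
* Sketch only: "yourvariable" is a placeholder for your string variable
local s yourvariable

* Parallel lists: the i-th accented letter maps to the i-th plain letter
local from "Á É Í Ó Ú Ñ á é í ó ú ñ"
local to   "A E I O U N a e i o u n"

local n : word count `from'
forvalues i = 1/`n' {
    local f : word `i' of `from'
    local t : word `i' of `to'
    * Replace every occurrence of the accented letter in the variable
    replace `s' = subinstr(`s', "`f'", "`t'", .)
}
```

Extending the mapping (for example, adding Ü and ü) only requires adding one entry to each list in the same position.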


Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

08 Nov 2016, 13:48

If you are using Stata 14, the data is in Unicode. There are functions to convert to plain ascii:

Code:

. dis ustrto(ustrnormalize("ÁÉÍÓÚÑáéíóúñ", "nfd"), "ascii", 2)
AEIOUNaeioun
Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#5

08 Nov 2016, 15:47

Thanks very much Sergio and Robert. I really appreciate your help.
Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#6

08 Nov 2016, 17:05

One more question. I also have another special character that Stata does not recognize: �. So, for a word that should be "bueno", I have "buen�".

loc s city
replace `s' = subinstr(`s', "�", "o", .)
(0 real changes made)

However, Stata does not make any changes after I run the command.

Any thoughts?

Thanks again.

Robert Picard

Join Date: Mar 2014
Posts: 1536

09 Nov 2016, 07:58

I believe that this special character is the Unicode replacement character. You can run the following to display it:

Code:

dis ustrunescape("\ufffd")

Again, if you use the proper Unicode functions, this character will be removed.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 city
"Córdoba"
"A Coruña"
"buen�"
end

gen cityfix = ustrto(ustrnormalize(city, "nfd"), "ascii", 2)
list

and the results:

Code:

. list

     +---------------------+
     |     city    cityfix |
     |---------------------|
  1. |  Córdoba    Cordoba |
  2. | A Coruña   A Coruna |
  3. |    buen�       buen |
     +---------------------+


Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#8

09 Nov 2016, 12:30

Hi Robert,

Thanks a lot for replying. What I am trying to do is replace the special character with "o".

However, this command does not make any changes:

loc s city
replace `s' = subinstr(`s', "�", "o", .)
(0 real changes made)
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 343
#9

09 Nov 2016, 12:57

It could also be an invalid UTF-8 character. Try the following

Code:

loc s city
di tobytes(`s')

and report back the output.
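
Note that di on a variable only shows the first observation. A possible way to inspect every affected observation is sketched below. It assumes Stata 14 or later; city is the variable from this thread, while nbad and bytes are hypothetical names for new variables.

```stata
* Sketch only: nbad and bytes are hypothetical variable names

* Count the invalid UTF-8 sequences in each observation
gen nbad = ustrinvalidcnt(city)

* Show the raw bytes in escaped decimal form (\dDDD) where something is invalid
gen bytes = tobytes(city) if nbad > 0

list city nbad bytes if nbad > 0
```

If nbad is zero everywhere, the strings are valid UTF-8 and the � is most likely a genuine Unicode replacement character (U+FFFD) stored in the data.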

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

09 Nov 2016, 13:10

I don't see why your code is not working. Here are two ways to convert the Unicode replacement character to the letter "o" and then convert to ascii:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 city
"Córdoba"
"A Coruña"
"buen�"
end

gen cityfix = subinstr(city, "�", "o", .)
replace cityfix = ustrto(ustrnormalize(cityfix, "nfd"), "ascii", 2)
list cityfix

gen cityfix2 = subinstr(city, ustrunescape("\ufffd"), "o", .)
replace cityfix2 = ustrto(ustrnormalize(cityfix2, "nfd"), "ascii", 2)
list cityfix2


Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#11

09 Nov 2016, 14:34

Hi Robert,

The first two commands do get rid of the "�".

However, the last two commands do not replace "�" with "o".
That's really strange. Basically, "buen�" becomes "buen".
Thanks again
Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

11 Nov 2016, 06:18

If the code I posted does not work, then Hua is right: it must be due to an invalid UTF-8 character. When you ask Stata to show a string that contains such a character, it displays the Unicode replacement character because there is simply no character representation for that invalid UTF-8 character. Fortunately, the ustrfix() function can be used to fix these.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 city
"Córdoba"
"A Coruña"
"buen"
end

replace city = city + char(200) in 3
list

gen cityfix = ustrfix(city, "o")
list

replace cityfix = ustrto(ustrnormalize(cityfix, "nfd"), "ascii", 2)
list cityfix
Elizabeth Kay

Join Date: Nov 2016

Posts: 14
#13

17 Nov 2016, 10:00

Hello! Those last commands worked. Thanks so much to all of you for your help.

Rike Lich

Join Date: Apr 2018
Posts: 16

#14

22 Aug 2018, 05:37

Dear Stata List,

I have a similar problem. The only difference is that I have several of these "�" and I would like to transform them to different letters for example "c" or "o" or "ue" depending on the word.

For example:
"Besan�on" > "Besancon"
"D�sseldorf" > "Duesseldorf"
.....

Here is some of my data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str26 city
"Besan�on"              
"Bourg en Bresse"       
"Cambrai"               
"Chamb�ry"                
"S�lestat"              
"Dessau"
"Dinkelsb�hl"           
"D�beln"                
"Dortmund"              
"Dresden"               
"D�ren"                 
"D�sseldorf"            
"Eichst�tt"             
"Glauchau"              
"G�rlitz"               
"Gotha"                 
"G�ttingen"             
end

How could I solve this? Any help is already appreciated!!

Thank you!!!


Daniel Jensen

Join Date: Nov 2016

Posts: 18
#15

04 Jan 2021, 18:07

Posting on this topic in case someone has this problem and wants a user-written solution. Here is a command I wrote specifically for this purpose; it works with both Stata 13 and earlier (ASCII) and Stata 14 and later (Unicode).

As to Rike Lich's question, there is not much you can do at this point besides manually cleaning them, since the data is already corrupted. This usually happens when moving between programs (or versions of Stata) that use different encodings. If what you have are city names, and these repeat multiple times in your data, then you can use regexm() to match a portion of the corrupted name and replace it with the correct one.

For example:

Code:

replace city="Dusseldorf" if regexm(city,"sseldorf")

You'd have to do this city-by-city and make sure your regexm expressions don't create any false matches. Properly importing the original dataset is a much better solution.
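
What "properly importing" looks like depends on where the data came from. The sketch below shows three alternatives (use only the one that fits). Everything specific in it is an assumption: the file names cities.csv and cities_old.dta are hypothetical, and Latin-1 is only a guess at the original encoding. It requires Stata 14 or later.

```stata
* Sketch only: file names and the Latin-1 encoding are assumptions

* Alternative A: the source is a text file, so declare its encoding at import
import delimited using "cities.csv", encoding("ISO-8859-1") clear

* Alternative B: the source is a .dta file saved by Stata 13 or earlier
* (run from the directory that contains the file, with no data in memory)
clear
unicode encoding set latin1
unicode analyze cities_old.dta
unicode translate cities_old.dta

* Alternative C: the original bytes are still in memory as invalid UTF-8
* (ustrinvalidcnt() > 0), so convert them directly
replace city = ustrfrom(city, "latin1", 1) if ustrinvalidcnt(city) > 0
```

None of these can recover letters that have already been stored as the Unicode replacement character; in that situation the city-by-city cleaning described above is what remains.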
Attached Files

accent.ado (1.9 KB, 1 view)

accent.sthlp (1.9 KB, 1 view)
