Remove accents from a string variable in Stata

Alexis Rodas

Join Date: Mar 2018

Posts: 42
#1


09 Mar 2019, 00:24

Dear everyone,

Does anyone know of Stata code that I can use to remove accents from a string variable?

My string data is the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str26 name_ocu
"Ánibal López"
"Juána del Arcoíris"
"Filómena Agustína"
"Anastació Doncristoldo"
"Federíco Rigobertino"
"Ana Cletá"
"Pacnhí Junací"
"Asgurtímo Galdó"
"Juán Filoméno"
"Ánibal López Tercerapío"
end

For each row of the variable "name_ocu", I want the result in lowercase and with the accents removed from the letters.

For instance: Ánibal López --> anibal lopez

Thanks a lot for your help

Alexis Rodas

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

09 Mar 2019, 02:29

Code:

gen name = ustrlower( ustrregexra( ustrnormalize( name_ocu, "nfd" ) , "\p{Mark}", "" )  )

Code:

. list

     +---------------------------------------------------+
     |                name_ocu                      name |
     |---------------------------------------------------|
  1. |            Ánibal López              anibal lopez |
  2. |      Juána del Arcoíris        juana del arcoiris |
  3. |       Filómena Agustína         filomena agustina |
  4. |  Anastació Doncristoldo    anastacio doncristoldo |
  5. |    Federíco Rigobertino      federico rigobertino |
     |---------------------------------------------------|
  6. |               Ana Cletá                 ana cleta |
  7. |           Pacnhí Junací             pacnhi junaci |
  8. |         Asgurtímo Galdó           asgurtimo galdo |
  9. |           Juán Filoméno             juan filomeno |
 10. | Ánibal López Tercerapío   anibal lopez tercerapio |
     +---------------------------------------------------+
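A quick sanity check one might run after the line above (a minimal sketch, not part of the original answer): in a string that contains only ASCII characters, the byte length returned by strlen() equals the character count returned by ustrlen(), so any leftover accented letter would make the assertion fail. The two-row dataset is just a shortened copy of the example data, and the do-file is assumed to be saved as UTF-8.

```stata
clear
input str26 name_ocu
"Ánibal López"
"Ana Cletá"
end

* Same expression as suggested above
gen name = ustrlower(ustrregexra(ustrnormalize(name_ocu, "nfd"), "\p{Mark}", ""))

* In a pure-ASCII string, bytes and Unicode characters are counted identically
assert strlen(name) == ustrlen(name)
```

If the assertion fails, chartab (from SSC) with the noascii option, or a plain list of the offending rows, should show which characters survived.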


Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#3

09 Mar 2019, 14:24

Some additional information on the solution suggested in #2:

Code:

gen name = ustrlower( ustrregexra( ustrnormalize( name_ocu, "nfd" ) , "\p{Mark}", "" ) )

To strip combining marks (accents, umlauts, etc.) with ustrregexra( varname , "\p{Mark}", "" ), the string must first be normalized to NFD using ustrnormalize().

A Unicode "character" (grapheme) with an accent can be represented either as a single precomposed code point or as a sequence of code points: the plain letter followed by a combining accent mark. Normalization chooses one of these forms. NFD is the fully decomposed form, in which each accented character is expanded into the code point for the plain letter followed by the combining accent mark.

From help ustrnormalize():

According to the Unicode standard, they [multiple code-point characters] should be treated as the same single character in Unicode string operations, such as in display, comparison, and selection. However, Stata does not support multiple code-point characters; each code point is considered a separate Unicode character.

The following illustrates some of these relations and the functions involved:

Code:

* Unicode Character 'LATIN SMALL LETTER O WITH ACUTE' (U+00F3)
assert uchar(243) == "ó" // LATIN SMALL LETTER O WITH ACUTE
assert ustrlen(uchar(243)) == 1
assert ustrregexm(uchar(243), "\p{L}" ) // is a Letter
assert ustrlen(ustrnormalize(uchar(243), "nfd")) == 2
assert ustrregexm(ustrnormalize(uchar(243), "nfd"), "\p{L}\p{M}")
assert ustrnormalize(uchar(243), "nfd") == ustrunescape("\u006f") + ustrunescape("\u0301")

* To reverse going from NFD to NFC:
* Unicode Character 'LATIN SMALL LETTER O' (U+006F)
* Unicode Character 'COMBINING ACUTE ACCENT' (U+0301)
assert uchar(243) == ustrnormalize(ustrunescape("\u006f\u0301"), "nfc" )
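To make the point about NFD concrete, here is a small sketch of what happens with and without the normalization step. The test string is built with uchar(243) so that it is certain to hold the precomposed "ó" regardless of how the do-file was saved; the local macro name s is only for illustration.

```stata
* "López" with a precomposed ó (U+00F3): 5 code points
local s = "L" + uchar(243) + "pez"

* Without normalization there is no separate combining mark to match,
* so the regular expression leaves the string unchanged
assert ustrregexra("`s'", "\p{Mark}", "") == "`s'"

* After NFD the accent is its own code point (U+0301) and gets removed
assert ustrlen(ustrnormalize("`s'", "nfd")) == 6
assert ustrregexra(ustrnormalize("`s'", "nfd"), "\p{Mark}", "") == "Lopez"
```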

References:
https://stackoverflow.com/questions/...tf-8-all-about
https://www.regular-expressions.info/unicode.html

Robert Picard

Join Date: Mar 2014
Posts: 1536

09 Mar 2019, 16:17

You can use chartab (from SSC) to tabulate all Unicode characters in a string variable (and from other sources as well).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str26 name_ocu
"Ánibal López"            
"Juána del Arcoíris"      
"Filómena Agustína"       
"Anastació Doncristoldo"   
"Federíco Rigobertino"     
"Ana Cletá"                
"Pacnhí Junací"           
"Asgurtímo Galdó"         
"Juán Filoméno"           
"Ánibal López Tercerapío"
end

chartab name_ocu, noascii

and the result:

Code:

. chartab name_ocu, noascii

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+---------------------------------------------------
       193       \u00c1       Á     |             2    LATIN CAPITAL LETTER A WITH ACUTE
       225       \u00e1       á     |             3    LATIN SMALL LETTER A WITH ACUTE
       233       \u00e9       é     |             1    LATIN SMALL LETTER E WITH ACUTE
       237       \u00ed       í     |             7    LATIN SMALL LETTER I WITH ACUTE
       243       \u00f3       ó     |             5    LATIN SMALL LETTER O WITH ACUTE
------------------------------------+---------------------------------------------------

                                    freq. count   distinct
ASCII characters              =               0          0
Multibyte UTF-8 characters    =              18          5
Unicode replacement character =               0          0
Total Unicode characters      =              18          5

Given the above, it is sufficient to simply convert the variable to ASCII after separating the accents from the letters:

Code:

. gen name = lower(ustrto(ustrnormalize(name_ocu, "nfd"), "ascii", 2))

. list

     +---------------------------------------------------+
     |                name_ocu                      name |
     |---------------------------------------------------|
  1. |            Ánibal López              anibal lopez |
  2. |      Juána del Arcoíris        juana del arcoiris |
  3. |       Filómena Agustína         filomena agustina |
  4. |  Anastació Doncristoldo    anastacio doncristoldo |
  5. |    Federíco Rigobertino      federico rigobertino |
     |---------------------------------------------------|
  6. |               Ana Cletá                 ana cleta |
  7. |           Pacnhí Junací             pacnhi junaci |
  8. |         Asgurtímo Galdó           asgurtimo galdo |
  9. |           Juán Filoméno             juan filomeno |
 10. | Ánibal López Tercerapío   anibal lopez tercerapio |
     +---------------------------------------------------+

.
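One caveat worth checking on other data (a hedged sketch, using a made-up value that is not in the thread's dataset): some letters, such as "ø" (U+00F8), have no decomposition into a plain letter plus a combining mark. As far as I can tell, converting to ASCII with mode 2 skips such characters entirely, whereas the regular-expression approach from #2 leaves them in place. The local macro names are hypothetical.

```stata
* Hypothetical test value "Søren Müller": ø = uchar(248), ü = uchar(252)
local s = "S" + uchar(248) + "ren M" + uchar(252) + "ller"

* Approach from #2: only combining marks are removed, so "ø" should survive
local via_regex = ustrlower(ustrregexra(ustrnormalize("`s'", "nfd"), "\p{Mark}", ""))

* Approach above: after NFD, every remaining non-ASCII character is skipped (mode 2)
local via_ascii = lower(ustrto(ustrnormalize("`s'", "nfd"), "ascii", 2))

display "`via_regex'"
display "`via_ascii'"
```

On this value I would expect the first line to show "søren muller" and the second "sren muller". For the names in this thread both approaches agree, because chartab shows only acute-accented vowels.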
