  • Remove accents from a string variable in Stata

    Dear everyone,

    Does anyone know Stata code that I can use to remove accents from a string variable?

    My string data is the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str26 name_ocu
    "Ánibal López"            
    "Juána del Arcoíris"      
    "Filómena Agustína"       
    "Anastació Doncristoldo"   
    "Federíco Rigobertino"     
    "Ana Cletá"                
    "Pacnhí Junací"           
    "Asgurtímo Galdó"         
    "Juán Filoméno"           
    "Ánibal López Tercerapío"
    end
    For each row of the variable "name_ocu", I want the result in lowercase, with the accents removed from the letters.

    For instance: Ánibal López --> anibal lopez

    Thanks a lot for your help

    Alexis Rodas

  • #2
    Code:
    gen name = ustrlower( ustrregexra( ustrnormalize( name_ocu, "nfd" ) , "\p{Mark}", "" )  )
    Code:
    . list
    
         +---------------------------------------------------+
         |                name_ocu                      name |
         |---------------------------------------------------|
      1. |            Ánibal López              anibal lopez |
      2. |      Juána del Arcoíris        juana del arcoiris |
      3. |       Filómena Agustína         filomena agustina |
      4. |  Anastació Doncristoldo    anastacio doncristoldo |
      5. |    Federíco Rigobertino      federico rigobertino |
         |---------------------------------------------------|
      6. |               Ana Cletá                 ana cleta |
      7. |           Pacnhí Junací             pacnhi junaci |
      8. |         Asgurtímo Galdó           asgurtimo galdo |
      9. |           Juán Filoméno             juan filomeno |
     10. | Ánibal López Tercerapío   anibal lopez tercerapio |
         +---------------------------------------------------+



    • #3

      Some additional information on the solution suggested in #2:
      Code:
      gen name = ustrlower( ustrregexra( ustrnormalize( name_ocu, "nfd" ) , "\p{Mark}", "" ) )
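      To strip combining marks (accents, umlauts, etc.) with ustrregexra( varname , "\p{Mark}", "" ), the string must first be normalized to NFD with ustrnormalize(). Otherwise a precomposed character such as "ó" is a single letter code point, and there is no separate mark for the regular expression to match.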

      A Unicode "character" (grapheme) with an accent can be represented either as a single code point or as a code point for the plain letter followed by a combining accent mark. Normalization picks one of these forms: NFD is the fully decomposed form, a code-point sequence consisting of the plain letter followed by the combining accent mark, while NFC is the composed form.
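
      As a minimal sketch of why the NFD step matters (it uses only functions already shown in this thread; the test character "ó" is chosen for illustration), compare stripping marks with and without normalization:

      ```stata
      * Precomposed "ó" (U+00F3) is a single Letter code point,
      * so \p{Mark} matches nothing and the string is returned unchanged
      assert ustrregexra(uchar(243), "\p{Mark}", "") == uchar(243)

      * After NFD it is "o" followed by U+0301; the combining accent is removed
      assert ustrregexra(ustrnormalize(uchar(243), "nfd"), "\p{Mark}", "") == "o"
      ```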

      From help ustrnormalize():
      According to the Unicode standard, they [multiple code-point characters] should be treated as the same single character in Unicode string operations, such as in display, comparison, and selection. However, Stata does not support multiple code-point characters; each code point is considered a separate Unicode character.
      The following assertions illustrate these relations and functions:
      Code:
      * Unicode Character 'LATIN SMALL LETTER O WITH ACUTE' (U+00F3)
      
      assert uchar(243) == "ó" // LATIN SMALL LETTER O WITH ACUTE
      assert ustrlen(uchar(243)) == 1
      assert ustrregexm(uchar(243), "\p{L}" ) // is a Letter
      
      assert ustrlen(ustrnormalize(uchar(243), "nfd")) == 2
      assert ustrregexm(ustrnormalize(uchar(243), "nfd"), "\p{L}\p{M}")
      assert ustrnormalize(uchar(243), "nfd") == ustrunescape("\u006f") + ustrunescape("\u0301") 
      
      * To reverse, going from NFD back to NFC:
      
      * Unicode Character 'LATIN SMALL LETTER O' (U+006F)
      * Unicode Character 'COMBINING ACUTE ACCENT' (U+0301)
      
      assert uchar(243) == ustrnormalize(ustrunescape("\u006f\u0301"), "nfc" )
      References:
      https://stackoverflow.com/questions/...tf-8-all-about
      https://www.regular-expressions.info/unicode.html



      • #4
        You can use chartab (from SSC) to tabulate all Unicode characters in a string variable (and from other sources as well).
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str26 name_ocu
        "Ánibal López"            
        "Juána del Arcoíris"      
        "Filómena Agustína"       
        "Anastació Doncristoldo"   
        "Federíco Rigobertino"     
        "Ana Cletá"                
        "Pacnhí Junací"           
        "Asgurtímo Galdó"         
        "Juán Filoméno"           
        "Ánibal López Tercerapío"
        end
        
        chartab name_ocu, noascii
        and the result:
        Code:
        . chartab name_ocu, noascii
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+---------------------------------------------------
               193       \u00c1       Á     |             2    LATIN CAPITAL LETTER A WITH ACUTE
               225       \u00e1       á     |             3    LATIN SMALL LETTER A WITH ACUTE
               233       \u00e9       é     |             1    LATIN SMALL LETTER E WITH ACUTE
               237       \u00ed       í     |             7    LATIN SMALL LETTER I WITH ACUTE
               243       \u00f3       ó     |             5    LATIN SMALL LETTER O WITH ACUTE
        ------------------------------------+---------------------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =               0          0
        Multibyte UTF-8 characters    =              18          5
        Unicode replacement character =               0          0
        Total Unicode characters      =              18          5
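        Given the above (every non-ASCII character in the data is a Latin letter with an acute accent), it is sufficient to convert the variable to ASCII after separating the accents from the letters. Mode 2 of ustrto() skips the characters that ASCII cannot represent, which here are the combining marks: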
        Code:
        . gen name = lower(ustrto(ustrnormalize(name_ocu, "nfd"), "ascii", 2))
        
        . list
        
             +---------------------------------------------------+
             |                name_ocu                      name |
             |---------------------------------------------------|
          1. |            Ánibal López              anibal lopez |
          2. |      Juána del Arcoíris        juana del arcoiris |
          3. |       Filómena Agustína         filomena agustina |
          4. |  Anastació Doncristoldo    anastacio doncristoldo |
          5. |    Federíco Rigobertino      federico rigobertino |
             |---------------------------------------------------|
          6. |               Ana Cletá                 ana cleta |
          7. |           Pacnhí Junací             pacnhi junaci |
          8. |         Asgurtímo Galdó           asgurtimo galdo |
          9. |           Juán Filoméno             juan filomeno |
         10. | Ánibal López Tercerapío   anibal lopez tercerapio |
             +---------------------------------------------------+
        
        .
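
        One caveat, offered as a hedged sketch rather than as part of the solution above: the ASCII conversion relies on every non-ASCII character decomposing into an ASCII letter plus a combining mark. A letter with no canonical decomposition, such as "ø" (U+00F8, an illustrative example that does not occur in the data), is skipped entirely by mode 2, whereas the regular-expression approach from #2 leaves it in place:

        ```stata
        * Build "Søren" from code points so the do-file encoding does not matter
        local s = "S" + uchar(248) + "ren"

        * "ø" has no canonical decomposition, so NFD leaves it unchanged
        assert ustrnormalize(uchar(248), "nfd") == uchar(248)

        * Mode 2 skips characters that ASCII cannot represent: the letter is lost
        assert ustrto(ustrnormalize("`s'", "nfd"), "ascii", 2) == "Sren"

        * Removing only combining marks keeps the letter
        assert ustrregexra(ustrnormalize("`s'", "nfd"), "\p{Mark}", "") == "`s'"
        ```

        Running chartab first, as above, shows whether any such characters are present before choosing between the two approaches.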
