Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • In my humble opinion unexpected behavior of trim()

    Dear Stata list,

    I thought over this a lot of times as I did not expect to find anything unexpected with such a basic string function as trim(), but hear me out on this one.

    While not necessary to understand what I find to be unexpected behavior, here is some background to what I am working on. I am working on a command that is used to test ODK based questionnaire forms. Read more about it here as well if you want. Command is still in development so not all documentation is not yet at the level it will be at the time we publish this, but I always want to give some context to when I am asking a question.

    Since the questionnaire are written in Excel files I use import excel to import several columns from an excel sheet, many of them consist of string values. Some of the test my command run are sensitive to leading and trailing spaces so I must remove spaces so that "ABC " becomes "ABC". I use the function trim() for that. However, in the data set that can be accessed here I have one string variable with 10 observation for which 3 has values for which I do not think trim() works as I expected it to do.

    When I used
    Code:
    replace name = trim(name)
    I get that no real changed were made but when browsing or tabbing the variable it is obvious that three values have multiple trailing spaces. I used Nick Cox's command charlist to investigate and saw that some of the spaces were regular spaces with char code 32, but some of the spaces hare char code 160. I have found on the internet that those are so called non-breaking spaces. See for example here. Since UTF-8 characters have been introduced to Stata in later versions I would have expected that trim() would work on those alternative spaces too.

    If you load the data set linked to above and run this code you can see what I see:

    Code:
    *Open data and show string values
    use trim_example.dta
    tab
    
    *Replacing leading and trailing spaces
    replace name = trim(name)
    tab //check that spaces where not removed
    
    *Keep only one of these values for less cluttered charlist result
    keep in 7
    
    *Install charlist and return all char codes used in the string value.
    ssc install charlist
    charlist name
    return list //See that regular space 32 and non-breaking space 160 are both used.
    I am using Stata 13.1 but my colleague who first reported this bug during testing was using Stata 15, and I am pretty sure it is the same bug that I reproduced.

  • #2
    Yes. Actually, if you run these commands with the browser open, you will see that the ASCII 160 characters show up in the browser as a little box, and they persist after the application of -trim()-. To be honest, this doesn't surprise me as -trim()- was definitely designed with the 0-128 ASCII character set in mind, and it is described in the -help- file as specifically removing only char(32).

    What does surprise me is that the newer function ustrtrim() also leaves char(160) unaltered. ustrtrim() is described as having an expanded scope and removes char(10) through char(13), all of which are forms of whitespace. But it does not touch char(160). I really don't know why it was designed that way. Perhaps in some Unicode encodings, char(160) is something other than a "non-breaking space" and is worthy of preserving. I don't really deal with Unicode much and can't say.

    All of that said, it is a recurring theme here in the Forum that people get tripped up by the presence of non-printing characters in string variables. And given the frequency with which Stata data sets arise by importing data from sources that include non-printing characters (spreadsheets, word processing documents, etc.) it surprises me that nobody has added to the Wish List thread a request for a function that purges strings of all of them. Maybe I'll go and do that myself.

    Comment


    • #3
      EDIT: This crossed with Clyde's post.

      So I tried downloading the trim_example.dta file (which most people won't do, BTW, far better to paste data here using Stata's dataex command)., but got the following error:

      Code:
      . use "C:\Downloads\Statalist\trim_example.dta"
      file C:\Downloads\Statalist\trim_example.dta not Stata format
      r(610);
      A couple of things though:
      1) If any of the words in name had interior spaces in them, you would still have char(32) in the text after running the trim function (since trim() only removes leading & trailing spaces. In fact, you would still have them after running itrim() as well, since that only removes *extra* interior spaces, but will leave once space between words. I also note this for those coming afterwards that trim() and itrim() have been replaced by strtrim(s) and stritrim(s)

      2) Under help strtrim, Stata states,
      Description: s without leading and trailing blanks (ASCII space character char(32));
      3) Under help ustrtrim, Stata states:
      ustrtrim(s)
      Description: removes leading and trailing Unicode whitespace characters and blanks from the Unicode string s

      Note that, in addition to char(32), ASCII characters char(9), char(10), char(11), char(12), and char(13) are
      considered whitespace characters in the Unicode standard.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte line str33 orig_text
      1 " Mary had   a little lamb  "      
      2 "Fleece as white as snow  "        
      3 "  Everywhere   that  Mary  went "
      4 "          The lamb was sure to go"
      end
      
      gen long_text2 = char(160) + orig_text  // adding char(160) to the beginning of each string
      . charlist long_text
       EFMTabcdeghilmnorstuvwy�
      
      . di r(ascii)
      32 69 70 77 84 97 98 99 100 101 103 104 105 108 109 110 111 114 115 116 117 118 119 121 160 
      
      dataex line orig_text long_text2
      input byte line str33 orig_text str34 long_text2
      1 " Mary had   a little lamb  "       "� Mary had   a little lamb  "      
      2 "Fleece as white as snow  "         "�Fleece as white as snow  "        
      3 "  Everywhere   that  Mary  went "  "�  Everywhere   that  Mary  went "
      4 "          The lamb was sure to go" "�          The lamb was sure to go"
      end
      
      replace long_text2 = trim( long_text2)
      dataex line orig_text long_text2
      _________________________________________________________________
      1 " Mary had   a little lamb  "       "� Mary had   a little lamb"        
      2 "Fleece as white as snow  "         "�Fleece as white as snow"          
      3 "  Everywhere   that  Mary  went "  "�  Everywhere   that  Mary  went"  
      4 "          The lamb was sure to go" "�          The lamb was sure to go"
      4) In the code window above, since trim() considers char(160) just another character, trim(long_text2) will remove leading spaces in front of char(160) , but doesn't remove the 6 spaces between char(160) and "The lamb was sure to go" in line 4.

      5) I often run the following cleaning command for text variables in Stata
      Code:
      *** Looping over ALL string variables
      ds, has(type string)
      foreach var of varlist `r(varlist)' {
           replace `var' = ustrtrim(`var')
           replace `var' =  itrim(`var')
      }
      
      
      *** Looping over specified variables (& characters)
      foreach var of varlist notes exit_notes industry_comments company_desc {
          foreach  i in 9 10 13 146 147 148 150 153 160 161 174 226 {
                replace `var' = subinstr(`var', char(`i'), " ",.)
           }
      }
      Last edited by David Benson; 04 Jan 2019, 17:32.

      Comment


      • #4
        3 other quick things:
        1) Clyde Schechter please do add something to the Stata wishlist . For example, Excel has a CLEAN() function that removes the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31). It doesn't remove some of the higher ones (like 160 or 161), but it would be nice if Stata had the same. Although char(160) might need to be omitted. Note that the characters "†" and "‡" often used to indicate statistical significance are char 160, 161, 226. (It looks like "‡" is char 160 & 161 combined). See here.

        2) William Lisowski has a clever regex method for cleaning strings. His method uses a regular expression that contains a list of acceptable characters and deletes all others. But it requires Stata 14 or later. See here and here.
        Code:
        * NOTE: This one removes ALL spaces
        generate str clean =  ustrregexra(dirty,"[^a-zA-Z0-9]","")
        
        * Someone asked him if it could be modified to retain spaces, hypens, and ampersands:
        * Note that he puts the hyphen at the front of the list, so that it doesn't look like part of a character range like a-z
        generate str clean =  ustrregexra(dirty,"[^-& a-zA-Z0-9]","")

        3) If you'd rather do the cleaning in Excel, I've attached some VBA code I had written to do that. In Excel, hit ALT + F11 and paste the below into a module. You can also attach a keyboard shortcut to it. For example, my shortcut for this is CTRL + SHIFT + T.
        ​​​​​​​
        Code:
        '*** NOTE: This is Excel VBA, *not* Stata code
        Sub Clean_text()
        ' Trim is just like above (removes leading & ending spaces, also removes extra interior spaces)
        ' Clean removes ASCII characters 1-32 (so ALT+ENTER, Line Break, etc)
        
            Application.ScreenUpdating = False
            Dim cell As Range
            On Error Resume Next
            
            For Each cell In Selection.Cells
                If cell.HasFormula = False Then
                        cell.Value = WorksheetFunction.Clean(Trim(cell.Value))  ' Trims AND cleans!
                        cell.Value = Replace(cell.Value, Chr(160), "")  
                End If
            Next cell
            On Error GoTo 0
            
            Application.ScreenUpdating = True
        End Sub​​​​​​​
        *****************************************************

        Comment


        • #5
          I do not think this is a bug. The function ustrtrim() will strip off NO-BREAK SPACE.

          And, I don't think one can blame the ustrtrim() function here, because this file (may be called corrupt because it) has elements which is not a code point in UTF-8, the UTF-8 code point decimal 160 do not exist. If the file was correct translated to UTF-8 NO-BREAK SPACE would be translated to the two byte UTF-8 U+00A0 decimal 194 160 see https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec

          Well, at some point some things where mixed up (this happens). Below follows some code to illustrate:
          Code:
          gen name2 = ustrtrim(name) /* dont understand codepoint "160" because it dont exist ! */
          
          * stripp spaces and controll characters:
          gen name3 = ustrregexra(name,"[\p{Separator} \p{Cc} ]","")
          
          * NB. THIS WILL DESTROY UTF-8 2-4 byte charachters where 160 is used:
           
          gen name5 = subinstr(name, char(160) , char(194)+char(160), . )
          
          * stripp spaces and controll characters:
          gen name6 = ustrregexra(name5,"[\p{Separator} \p{Cc}]","")
          
          * SAFER to define valid chars:
          
          gen name7 = ustrtrim( ustrregexra(name,"[^\p{Letter}\p{Mark}\p{Nd}_]","") )


          https://www.regular-expressions.info/unicode.html

          Code:
          list name            
          
               +-------------------+
               |              name |
               |-------------------|
            1. |         module_k2 |
            2. |    module_kitchen |
            3. |    module_kitchen |
            4. | module_laborbasic |
            5. | module_laborbasic |
               |-------------------|
            6. |         module_m2 |
            7. |   n1_1��������� � |
            8. |   n1_2��������� � |
            9. |   n1_3��������� � |
           10. |              n_hh |
               +-------------------+
          
          . list name2                            
          
               +-------------------+
               |             name2 |
               |-------------------|
            1. |         module_k2 |
            2. |    module_kitchen |
            3. |    module_kitchen |
            4. | module_laborbasic |
            5. | module_laborbasic |
               |-------------------|
            6. |         module_m2 |
            7. |   n1_1��������� � |
            8. |   n1_2��������� � |
            9. |   n1_3��������� � |
           10. |              n_hh |
               +-------------------+
          
          . list name3                            
          
               +-------------------+
               |             name3 |
               |-------------------|
            1. |         module_k2 |
            2. |    module_kitchen |
            3. |    module_kitchen |
            4. | module_laborbasic |
            5. | module_laborbasic |
               |-------------------|
            6. |         module_m2 |
            7. |    n1_1���������� |
            8. |    n1_2���������� |
            9. |    n1_3���������� |
           10. |              n_hh |
               +-------------------+
          
          . list name5                            
          
               +-------------------+
               |             name5 |
               |-------------------|
            1. |         module_k2 |
            2. |    module_kitchen |
            3. |    module_kitchen |
            4. | module_laborbasic |
            5. | module_laborbasic |
               |-------------------|
            6. |         module_m2 |
            7. |   n1_1            |
            8. |   n1_2            |
            9. |   n1_3            |
           10. |              n_hh |
               +-------------------+
          
          . list name6                            
          
               +-------------------+
               |             name6 |
               |-------------------|
            1. |         module_k2 |
            2. |    module_kitchen |
            3. |    module_kitchen |
            4. | module_laborbasic |
            5. | module_laborbasic |
               |-------------------|
            6. |         module_m2 |
            7. |              n1_1 |
            8. |              n1_2 |
            9. |              n1_3 |
           10. |              n_hh |
               +-------------------+
          
          . list name7                            
          
               +-------------------+
               |             name7 |
               |-------------------|
            1. |         module_k2 |
            2. |    module_kitchen |
            3. |    module_kitchen |
            4. | module_laborbasic |
            5. | module_laborbasic |
               |-------------------|
            6. |         module_m2 |
            7. |              n1_1 |
            8. |              n1_2 |
            9. |              n1_3 |
           10. |              n_hh |
               +-------------------+
          Last edited by Bjarte Aagnes; 05 Jan 2019, 14:18.

          Comment


          • #6
            First, above I think I did not use the term "code point" correctly. In character encoding terminology, a code point or code position is any of the numerical values that make up the code space. So, I think the following should be correct use of terms: UTF-8 code point 0xA0 (160) holds the byte sequences C2 A0 (194 160).

            If -import excel- was used, a fix for the problem could be:
            Code:
            version 15.1    
            
            cd "`c(tempdir)'"
            
            local dta "trim_example.dta"
            
            copy https://github.com/kbjarkefur/trim_example_data/raw/master/`dta' .
            
            ********************************************************************************
            * Lets asume this was the result after import from excel:
            
            use "`dta'", clear
            datasignature
            
            gen n_invalid_UTF8 = ustrinvalidcnt(name)
            
            gen invalid_UTF8 = ustrfrom(name, "utf-8", 4)
            
            gen name_valid = ustrtrim(subinstr(invalid_UTF8,"%XA0","",.))
            
            list name_valid n_invalid_UTF8 invalid_UTF8 name , abbrev(15)
            
            ********************************************************************************
            Code:
                +--------------------------------------------------------------------------------------------------------+
                 |        name_valid   n_invalid_UTF8                                    invalid_UTF8                name |
                 |--------------------------------------------------------------------------------------------------------|
              1. |         module_k2                0                                       module_k2           module_k2 |
              2. |    module_kitchen                0                                  module_kitchen      module_kitchen |
              3. |    module_kitchen                0                                  module_kitchen      module_kitchen |
              4. | module_laborbasic                0                               module_laborbasic   module_laborbasic |
              5. | module_laborbasic                0                               module_laborbasic   module_laborbasic |
                 |--------------------------------------------------------------------------------------------------------|
              6. |         module_m2                0                                       module_m2           module_m2 |
              7. |              n1_1               10   n1_1%XA0%XA0%XA0%XA0%XA0%XA0%XA0%XA0%XA0 %XA0     n1_1��������� � |
              8. |              n1_2               10   n1_2%XA0%XA0%XA0%XA0%XA0%XA0%XA0%XA0%XA0 %XA0     n1_2��������� � |
              9. |              n1_3               10   n1_3%XA0%XA0%XA0%XA0%XA0%XA0%XA0%XA0%XA0 %XA0     n1_3��������� � |
             10. |              n_hh                0                                            n_hh                n_hh |
                 +--------------------------------------------------------------------------------------------------------+
            Alternatively, if a dta or UTF-8 text file exists, use -unicode analyze- and -unicode translate- to fix the illegal byte sequences:
            Code:
            clear
            capture unicode restore "`dta'" , replace
            
            unicode analyze "`dta'"
            unicode encoding set "Windows-1252"
            unicode retranslate `dta', invalid(ignore) replace
            
            use "`dta'" , clear
            gen name_valid = ustrtrim(name)
            list
            Code:
                 +---------------------------------------+
                 |              name          name_valid |
                 |---------------------------------------|
              1. |         module_k2           module_k2 |
              2. |    module_kitchen      module_kitchen |
              3. |    module_kitchen      module_kitchen |
              4. | module_laborbasic   module_laborbasic |
              5. | module_laborbasic   module_laborbasic |
                 |---------------------------------------|
              6. |         module_m2           module_m2 |
              7. |   n1_1                           n1_1 |
              8. |   n1_2                           n1_2 |
              9. |   n1_3                           n1_3 |
             10. |              n_hh                n_hh |
                 +---------------------------------------+
            Although "No-Break Space" is not usually typed in by users, it is often copy/pasted into fields. To avoid importing data not following the UTF-8, you may validate the data early in the process.

            1) Validate input to ODK: You might want to set constraints on the length, character set allowed
            and various other fields in the data input by the user
            .

            2) Validate export from ODK to Excel or possibly skip Excel exporting to UTF-8 text files? or reading from the ODK database.

            3) Confirm that string data are valid; either one of the "ANSI" (ISO/IEC 8859) character sets, or UTF-8, before import to Stata.


            Refs:

            Table 3-7. Well-Formed UTF-8 Byte Sequences

            https://www.stata.com/manuals/dunicode.pdf

            Last edited by Bjarte Aagnes; 07 Jan 2019, 06:42.

            Comment


            • #7
              Thanks Clyde for suggesting that to the wish list! I saw your post in the other thread!

              I was not suggesting that it necessarily was a bug. I just expected there to be an all purpose utility function for this developed by Stata. I understand now that trim() was developed with an older context in mind and has not been updated since then. A related question to that, what is your best recommendation to learn more about such a thing? In the documentation it says that trim() returns s without leading and trailing blanks. The word blanks is ambiguous to me.

              EDIT: I just found the answer to my own question above, It is more exactly documented in newer documentation. I work in Stata 13 as that is the version we target, but I have access to the latest version of Stata and found that when reading the documentation for for trim() it clearly says that it is only char 32 that will be considered a space.


              Anyways, thanks for all your suggestions! This gives a lot of inspiration to work with. We are targeting Stata 13 as the organizations we support and work with cannot afford to update Stata to the newest version all the time and the ustr functions that you do mention seem to be newer than that. And since we work in many different language contexts we will have many more special cases than just ASCII 160, at the same time we must allow more letters and symbols than only English alphanumeric characters.

              I should have used dataex in the first place, but in case someone would still want that here is the code from dataex to generate the data set:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str21 name
              "module_k2"        
              "module_kitchen"  
              "module_kitchen"  
              "module_laborbasic"
              "module_laborbasic"
              "module_m2"        
              "n1_1           "  
              "n1_2           "  
              "n1_3           "  
              "n_hh"            
              end
              Thanks again!
              Last edited by Kristoffer Bjarkefur; 07 Jan 2019, 08:54.

              Comment


              • #8
                Seems like downloading the dta is neccessary to preserve the problem on Stata15.1 (datasignatures differ).

                Comment


                • #9
                  Interesting, seems like this forum has an input handler that replaces ASCII code 160 and similar whites paces into regular ASCII 32 spaces. If exactly that function was available in Stata an not only on Statalist my problem would be solved!

                  I know I can replace all ASCII 160 with ASCII 32 myself, but next time it will be another type of white space that causes error... In other languages there are such utility functions. It is probably what the web developer who implemented this forum is using.

                  EDIT: When I copy and pasted the dataex output from my Stata result window and run it in the do-file editor the ASCII 160 space was not replaced. I tested that before posting, so it must be the javascript or the backend of the forum that replaces them.
                  Last edited by Kristoffer Bjarkefur; 07 Jan 2019, 13:15.

                  Comment


                  • #10
                    Kristoffer Bjarkefur , you can utilize Java in a modern Stata, which would open up a whitespace-probing function for you. See this thread in a different forum. Note, however, that as per documentation 160(dec) = 0xA0(hex) is not present in the list of whitespace chars.

                    PS: ASCII covers only 128 characters, but additional code pages exist, though the meaning of character with code 160 would vary. For example in Cyrillic it is a "Cyrillic lower case a", not a whitespace.

                    Comment

                    Working...
                    X