
  • removing special characters from a string var

    Dear Statalist Members,

    I have a variable called region (please see below). It is a string variable whose values contain special characters.

    I would like to get rid of these special characters. I have used
    replace region = subinstr(region, "`=char(160)'", "", .)
    However, it says (0 real changes made).

    Can you please help me?

    Thanks a lot!!


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str14 region int var1
    "cabinda"         571
    "zaire"           740
    "u�ge"          818
    "luanda"         1255
    "cuanza norte"    647
    "cuanza sul"      753
    "malanje"         808
    "lunda norte"     792
    "benguela"        870
    "huambo"          951
    "bi�"           834
    "moxico"          559
    "cuando cubango"  659
    "namibe"          787
    "hu�la"         931
    "cunene"          858
    "lunda sul"       901
    "bengo"           588
    end
    Last edited by Cansu Oymak; 26 Mar 2024, 04:46.

  • #2
    First, I'd note that you can simply use -char()- rather than -"`=char()'"-.

    Further, the solution depends on which characters count as "special" in your dataset and what their codes (ASCII values or Unicode code points) are. Suppose the special characters are those with ASCII codes < 33 or > 127, which on my machine don't display on the screen. Note that this range includes the space, char(32), so use 0/31 instead of 0/32 if spaces should be kept. In that case you could do this:
    Code:
    foreach i of numlist 0/32, 128/255 {
      quiet replace region = subinstr(region, char(`i'), "", .)
    }
    If instead you have non-ASCII characters that are legitimate Unicode characters in your dataset, you might have something like this:
    Code:
    // Removing the characters between 0 and 255 that on my machine don't display with Unicode uchar()
    foreach i of numlist  1, 9, 10, 13, 28/32, 128/160  {
      quiet replace region = subinstr(region, uchar(`i'), "", .)
    }
    Given that I'm not very knowledgeable about Unicode, corrections to my suggestions on that front would be welcome.
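
    One possible reason the original -replace- reported 0 changes is that char() and uchar() do not return the same thing. The following is a minimal sketch, assuming Stata 14 or later (where strings are stored as UTF-8); the value in the local macro s is a hypothetical stand-in for one of the affected region names, not taken from the real data.

    ```stata
    * char() returns a single byte; uchar() returns a UTF-8 encoded Unicode character
    display strlen(char(160))       // byte length of char(160)
    display strlen(uchar(160))      // byte length of U+00A0 (no-break space)
    display strlen(uchar(65533))    // byte length of U+FFFD (replacement character)

    * Hypothetical value containing U+FFFD: it has no byte equal to char(160)
    local s = "u" + uchar(65533) + "ge"
    display strpos("`s'", char(160))      // position of byte 160, 0 if absent
    display strpos("`s'", uchar(65533))   // position of U+FFFD, 0 if absent
    ```

    If the first strpos() is 0 and the second is not, then subinstr() with char(160) has nothing to replace, which would be consistent with the message in #1.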



    • #3
      I guess the question is what makes you think that the offending characters are char(160)? They could be lots of things. The way to figure out what they are is to install Robert Picard's -chartab- from SSC and run -chartab region-. This will give you a list of all the characters in variable region along with their decimal and hexadecimal character codes. Then you can use -subinstr()- to remove whatever they actually are.

      When I run this in your example data, I find:
      Code:
      . chartab region
      
         decimal  hexadecimal   character |     frequency    unique name
      ------------------------------------+---------------------------------------
              32       \u0020             |             5    SPACE
              97       \u0061       a     |            19    LATIN SMALL LETTER A
              98       \u0062       b     |             7    LATIN SMALL LETTER B
              99       \u0063       c     |             7    LATIN SMALL LETTER C
             100       \u0064       d     |             5    LATIN SMALL LETTER D
             101       \u0065       e     |            11    LATIN SMALL LETTER E
             103       \u0067       g     |             4    LATIN SMALL LETTER G
             104       \u0068       h     |             2    LATIN SMALL LETTER H
             105       \u0069       i     |             5    LATIN SMALL LETTER I
             106       \u006a       j     |             1    LATIN SMALL LETTER J
             108       \u006c       l     |             8    LATIN SMALL LETTER L
             109       \u006d       m     |             4    LATIN SMALL LETTER M
             110       \u006e       n     |            16    LATIN SMALL LETTER N
             111       \u006f       o     |             8    LATIN SMALL LETTER O
             114       \u0072       r     |             3    LATIN SMALL LETTER R
             115       \u0073       s     |             2    LATIN SMALL LETTER S
             116       \u0074       t     |             2    LATIN SMALL LETTER T
             117       \u0075       u     |            14    LATIN SMALL LETTER U
             120       \u0078       x     |             1    LATIN SMALL LETTER X
             122       \u007a       z     |             3    LATIN SMALL LETTER Z
          65,533       \ufffd       �     |             3    REPLACEMENT CHARACTER
      ------------------------------------+---------------------------------------
      
                                          freq. count   distinct
      ASCII characters              =             127         20
      Multibyte UTF-8 characters    =               0          0
      Unicode replacement character =               3          1
      Total Unicode characters      =             130         21
      So the offending characters are all Unicode code point 65533 (U+FFFD, the replacement character). And I can eliminate them with:
      Code:
      . replace region = subinstr(region, "`=uchar(65533)'", "", .)
      (3 real changes made)
      But you need to run -chartab- in your actual data, because it is possible that in the course of rendering your post here in the Forum, the software has changed the actual coding of the non-printable characters. So they may be something else in your real data.
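
      Before removing anything, it may help to see which observations are affected and to clean a copy rather than the original. This is only a sketch, assuming Stata 14 or later and assuming the offending character really is U+FFFD as in the example data. The three-observation dataset and the names has_repl and region_clean are made up for illustration.

      ```stata
      * Hypothetical stand-in for the example data
      clear
      set obs 3
      generate str14 region = "luanda"
      replace region = "u" + uchar(65533) + "ge" in 1
      replace region = "bi" + uchar(65533) in 2

      * Flag the observations that contain U+FFFD
      generate byte has_repl = strpos(region, uchar(65533)) > 0
      list region if has_repl

      * Clean a copy so the original values stay available for comparison
      generate region_clean = subinstr(region, uchar(65533), "", .)
      list region region_clean if has_repl
      ```

      Removing the character leaves the affected names one letter short, so the flagged observations may still need to be corrected by hand.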

      Added: Crossed with #2.



      • #4
        Aha, -chartab- is just the thing here -- it saves trying to eliminate the problem characters with a brute-force loop across a potentially huge character space. I had forgotten about it.



        • #5
          The following will show unsupported characters using an escaped hex digit sequence (4 or 8 digits):
          Code:
          gen region_invalid = ustrto(region, "Windows-1252", 4)
          list if region != region_invalid
          At some point translation to UTF-8 went wrong. The following can help in finding the problem:
          Code:
          help unicode_translate##analyze
          help unicode translate
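          And it is possible to remove the characters with a regular expression:
          Code:
          // \P{Latin} alone would also match spaces, so whitespace is kept explicitly
          gen region_replace = ustrregexra(region, "[^\p{Latin}\s]", "")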
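
          For a quick look at a single value, ustrtohex() shows each character as an escaped hex code point. A minimal sketch, assuming Stata 14 or later; the string below is a hypothetical stand-in for one affected value, not a value read from the real data.

          ```stata
          * Build a hypothetical value that contains U+FFFD
          local s = "bi" + uchar(65533)

          * Show the escaped hex code points of each character in the value
          display ustrtohex("`s'")

          * The same idea applied to a variable would be:
          * generate region_hex = ustrtohex(region)
          ```

          Any code point outside the expected letters and spaces stands out in the escaped string and can then be passed to uchar() inside subinstr().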
