removing special characters from a string var

Cansu Oymak

Join Date: Feb 2017
Posts: 135

26 Mar 2024, 03:43

Dear Statalist Members,

I have a variable called region (please see below). It is a string variable whose values contain special characters.

I would like to get rid of these special characters. I have used:
replace region = subinstr(region, "`=char(160)'", "", .)
However, it says (0 real changes made).

Can you please help me?

Thanks a lot!!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str14 region int var1
"cabinda"         571
"zaire"           740
"u�ge"          818
"luanda"         1255
"cuanza norte"    647
"cuanza sul"      753
"malanje"         808
"lunda norte"     792
"benguela"        870
"huambo"          951
"bi�"           834
"moxico"          559
"cuando cubango"  659
"namibe"          787
"hu�la"         931
"cunene"          858
"lunda sul"       901
"bengo"           588
end

Last edited by Cansu Oymak; 26 Mar 2024, 03:46.

Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

26 Mar 2024, 08:52

First, I'd note that you want simply -char()- rather than -"`=char()'"-.

Further: The solution here depends on which characters count as "special" in your data set, and what their numbers (ASCII values or Unicode code points) are. Let's suppose that the special characters are those with character codes < 33 or > 127 (the latter lie outside ASCII proper), which in my implementation don't display on the screen, in which case you could do this:

Code:

* Note: code 32 is the blank, so this loop also removes spaces
foreach i of numlist 0/32, 128/255 {
    quietly replace region = subinstr(region, char(`i'), "", .)
}

If instead you have non-ASCII characters that are legitimate Unicode characters in your dataset, you might have something like this:

Code:

// Removing the characters between 0 and 255 that on my machine don't display with Unicode uchar()
foreach i of numlist 1, 9, 10, 13, 28/32, 128/160 {
    quietly replace region = subinstr(region, uchar(`i'), "", .)
}
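
A small sketch of why -char(160)- may have produced 0 changes, assuming Stata 14 or later, where strings are stored as UTF-8. The code point 65533 is used only as an illustration here; in UTF-8 it is stored as the three bytes EF BF BD, none of which is byte 160.

```stata
* Sketch, assuming Stata 14 or later (UTF-8 strings)
display strlen(uchar(65533))             // length in bytes
display ustrlen(uchar(65533))            // length in Unicode characters
display strpos(uchar(65533), char(160))  // 0 means byte 160 is not part of this character
```

Because -char()- works on single bytes and -uchar()- on whole code points, a byte-wise loop and a code-point loop can behave quite differently on the same multibyte character.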

Given that I'm not very knowledgeable about Unicode, corrections to my suggestions on that front would be welcome.

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

26 Mar 2024, 09:05

I guess the question is what makes you think that the offending characters are char(160)? They could be lots of things. The way to figure out what they are is to install Robert Picard's -chartab- from SSC and run -chartab region-. This will give you a list of all the characters in variable region along with their decimal and hexadecimal character codes. Then you can use -subinstr()- to remove whatever they actually are.

When I run this in your example data, I find:

Code:

. chartab region

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+---------------------------------------
        32       \u0020             |             5    SPACE
        97       \u0061       a     |            19    LATIN SMALL LETTER A
        98       \u0062       b     |             7    LATIN SMALL LETTER B
        99       \u0063       c     |             7    LATIN SMALL LETTER C
       100       \u0064       d     |             5    LATIN SMALL LETTER D
       101       \u0065       e     |            11    LATIN SMALL LETTER E
       103       \u0067       g     |             4    LATIN SMALL LETTER G
       104       \u0068       h     |             2    LATIN SMALL LETTER H
       105       \u0069       i     |             5    LATIN SMALL LETTER I
       106       \u006a       j     |             1    LATIN SMALL LETTER J
       108       \u006c       l     |             8    LATIN SMALL LETTER L
       109       \u006d       m     |             4    LATIN SMALL LETTER M
       110       \u006e       n     |            16    LATIN SMALL LETTER N
       111       \u006f       o     |             8    LATIN SMALL LETTER O
       114       \u0072       r     |             3    LATIN SMALL LETTER R
       115       \u0073       s     |             2    LATIN SMALL LETTER S
       116       \u0074       t     |             2    LATIN SMALL LETTER T
       117       \u0075       u     |            14    LATIN SMALL LETTER U
       120       \u0078       x     |             1    LATIN SMALL LETTER X
       122       \u007a       z     |             3    LATIN SMALL LETTER Z
    65,533       \ufffd       �     |             3    REPLACEMENT CHARACTER
------------------------------------+---------------------------------------

                                    freq. count   distinct
ASCII characters              =             127         20
Multibyte UTF-8 characters    =               0          0
Unicode replacement character =               3          1
Total Unicode characters      =             130         21

So the offending characters are all Unicode code point 65533 (U+FFFD, the replacement character). And I can eliminate them with:

Code:

. replace region = subinstr(region, "`=uchar(65533)'", "", .)
(3 real changes made)

But you need to run -chartab- in your actual data, because it is possible that in the course of rendering your post here in the Forum, the software has changed the actual coding of the non-printable characters. So they may be something else in your real data.
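
The steps described above can be sketched as one short sequence. This is only an illustration: the code point 65533 is what the example data show, and your real data may report something else in -chartab-.

```stata
* ssc install chartab                      // one-time install from SSC
chartab region                             // list every character and its code point
* Substitute whatever code point(s) chartab reports for 65533
replace region = subinstr(region, uchar(65533), "", .)
assert ustrpos(region, uchar(65533)) == 0  // stops with an error if any remain
```

Note that removing the character does not restore the letter it stood for: the three affected values would become "uge", "bi", and "hula". If the intended names are the Angolan provinces Uíge, Bié, and Huíla, replacing those whole values by hand may be preferable to deletion.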

Added: Crossed with #2.

Mike Lacy

Join Date: Apr 2014

Posts: 2404
#4

26 Mar 2024, 11:16

Aha, -chartab- is just the thing here -- saves trying to eliminate the problem characters by a brute force loop across a potentially huge character space. I had forgotten about it.

Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#5

26 Mar 2024, 16:04

The following will show unsupported characters using an escaped hex digit sequence (4 or 8 digits):

Code:

gen region_invalid = ustrto(region, "Windows-1252", 4)
list if region != region_invalid

At some point translation to UTF-8 went wrong. The following can help in finding the problem:

Code:

help unicode_translate##analyze
help unicode translate
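
A hedged sketch of how those commands are typically combined. It assumes the original, pre-translation file is still available; the file name mydata.dta and the choice of Windows-1252 as source encoding are assumptions for illustration, not something the thread establishes.

```stata
* Hypothetical file name; run from the directory containing the original file
clear                                // work with no data in memory
unicode analyze mydata.dta           // report whether the file needs translation
unicode encoding set Windows-1252    // assumed source encoding
unicode translate mydata.dta         // convert the file's strings to UTF-8
```

If the only copy of the data already contains the replacement character, the original letters cannot be recovered this way, since that character no longer records which byte it replaced.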

And, it is possible to replace using:

Code:

gen region_replace = ustrregexra(region, "\P{Latin}", "")
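
Since a blank is not a Latin-script character, the pattern above may also strip the spaces in values such as "cuanza norte". Two possible variants, offered as untested sketches; the variable names region_clean and region_latin are hypothetical.

```stata
* Remove only the replacement character, leaving blanks and everything else alone
gen region_clean = ustrregexra(region, uchar(65533), "")

* Or keep Latin letters and blanks, dropping every other character
gen region_latin = ustrregexra(region, "[^\p{Latin} ]", "")
```

Comparing either result with the original, for example with -list if region != region_clean-, shows exactly which observations were changed.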