How to delete special characters in string variables?

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#1

19 Dec 2023, 07:50

Hi everyone,

I would like to know how to delete special characters in string variables.

I used chartab from SSC, but some of the special characters remained. Here is my code:

Code:

chartab description, noascii
chartab model, noascii

And here are the data afterwards:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description str25 model
"FERROSITE" "C-ELYSEE ECO GL"
"FERROSITE" "C-ELYSEE ECO-GLV"
"FERROSITE" "CITR™EN C-ELYSE ECO"
"FERROSITE" "CITR™EN C-ELYSE ECO"
"FERROSITE" "CITR™EN C-ELYSE ECO-G"
end

I should have an "O" instead of the TM sign in some of the values. Everything should be kept in capital letters, please.
I should also have an "E" instead of "".

Thank you in advance for your help.
Best,

Michael

Last edited by Michael Duarte Goncalves; 19 Dec 2023, 07:53.

Andrew Musau

Join Date: Oct 2014
Posts: 10195
#2

19 Dec 2023, 09:19

chartab just shows you what those non-ASCII characters are; it does not replace anything. Here, we have:

Code:

. chartab model, noascii

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+---------------------------------------
       144       \u0090            |             3    DEVICE CONTROL STRING
     8,482       \u2122       ™     |             3    TRADE MARK SIGN
------------------------------------+---------------------------------------

So you replace these as follows:

Code:

replace model= subinstr(model, "`=ustrunescape("\u2122")'", "O", .)
replace model= subinstr(model, "`=ustrunescape("\u0090")'", "E", .)

Result:

Code:

. l

     +------------------------------------+
     | descrip~n                    model |
     |------------------------------------|
  1. | FERROSITE          C-ELYSEE ECO GL |
  2. | FERROSITE         C-ELYSEE ECO-GLV |
  3. | FERROSITE     CITROEN C-ELYSEE ECO |
  4. | FERROSITE     CITROEN C-ELYSEE ECO |
  5. | FERROSITE   CITROEN C-ELYSEE ECO-G |
     +------------------------------------+
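
With more characters to fix, the same subinstr() call can be driven from two parallel lists rather than written out once per character. This is only a minimal sketch: the variable name model and the first two pairs come from the example above, while the local macro names (codes, repls) are illustrative and the lists would have to be extended by hand from your own chartab output.

Code:

* Sketch only: codes and repls are illustrative names; keep both lists in the same order
local codes \u2122 \u0090
local repls O E
local n : word count `codes'
forvalues i = 1/`n' {
    * Pick the i-th code point and its intended replacement
    local c : word `i' of `codes'
    local r : word `i' of `repls'
    replace model = subinstr(model, ustrunescape("`c'"), "`r'", .)
}

Each replacement still has to be decided by a person, because nothing in the data says which letter a stray symbol was meant to be.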

But all this might have been unnecessary if you had chosen the correct encoding when importing these data. See

Code:

help import delimited##encoding
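
As a rough illustration of that option, assuming a hypothetical file name (cars.csv) and a hypothetical source encoding, neither of which is known from this thread:

Code:

* List the encoding names Stata recognises
unicode encoding list
* Hypothetical file name and encoding; substitute the ones that apply to your data
import delimited using "cars.csv", encoding("ISO-8859-1") clear

Running chartab again after the import is a quick check: if the encoding was right, the control characters and stray symbols should no longer appear.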


Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

19 Dec 2023, 09:25

Hi Andrew Musau:

Thank you very much for your suggestion!
I didn't know about encoding! Thank you so much.

Unfortunately, these data come from a source from which we cannot recover a CSV file.

Lovely day.
Michael

Last edited by Michael Duarte Goncalves; 19 Dec 2023, 09:29.

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500
#4

19 Dec 2023, 09:29

I have just one question:

Is there another way to do that? My "true" sample is really big. Here is the output of chartab model:

Code:

. chartab model, noascii

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+-------------------------------------------------------
       144       \u0090            |            59    DEVICE CONTROL STRING
       161       \u00a1       ¡     |             6    INVERTED EXCLAMATION MARK
       166       \u00a6       ¦     |            12    BROKEN BAR
       179       \u00b3       ³     |             2    SUPERSCRIPT THREE
       192       \u00c0       À     |             4    LATIN CAPITAL LETTER A WITH GRAVE
       193       \u00c1       Á     |            16    LATIN CAPITAL LETTER A WITH ACUTE
       195       \u00c3       Ã     |             7    LATIN CAPITAL LETTER A WITH TILDE
       199       \u00c7       Ç     |             3    LATIN CAPITAL LETTER C WITH CEDILLA
       200       \u00c8       È     |             2    LATIN CAPITAL LETTER E WITH GRAVE
       201       \u00c9       É     |         1,896    LATIN CAPITAL LETTER E WITH ACUTE
       203       \u00cb       Ë     |             8    LATIN CAPITAL LETTER E WITH DIAERESIS
       204       \u00cc       Ì     |             2    LATIN CAPITAL LETTER I WITH GRAVE
       205       \u00cd       Í     |            64    LATIN CAPITAL LETTER I WITH ACUTE
       209       \u00d1       Ñ     |            49    LATIN CAPITAL LETTER N WITH TILDE
       210       \u00d2       Ò     |            47    LATIN CAPITAL LETTER O WITH GRAVE
       211       \u00d3       Ó     |            50    LATIN CAPITAL LETTER O WITH ACUTE
       214       \u00d6       Ö     |             6    LATIN CAPITAL LETTER O WITH DIAERESIS
       218       \u00da       Ú     |            17    LATIN CAPITAL LETTER U WITH ACUTE
       233       \u00e9       é     |             6    LATIN SMALL LETTER E WITH ACUTE
       237       \u00ed       í     |             6    LATIN SMALL LETTER I WITH ACUTE
     8,218       \u201a       ‚     |            29    SINGLE LOW-9 QUOTATION MARK
     8,482       \u2122       ™     |             3    TRADE MARK SIGN
    65,533       \ufffd       �     |             1    REPLACEMENT CHARACTER
------------------------------------+-------------------------------------------------------

                                    freq. count   distinct
ASCII characters              =               0          0
Multibyte UTF-8 characters    =           2,294         22
Unicode replacement character =               1          1
Total Unicode characters      =           2,295         23

Thank you in advance for your help!

Last edited by Michael Duarte Goncalves; 19 Dec 2023, 09:36.


Andrew Musau

Join Date: Oct 2014

Posts: 10195
#5

19 Dec 2023, 09:58

See https://www.statalist.org/forums/for...place-subinstr for one way to keep only ASCII characters, but it will not re-encode the text for you. If you want the text fully recoded, you have to put in the work following #2.
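
For reference, one common way to reduce a string to ASCII in Stata is to decompose accented letters and then drop whatever is still non-ASCII. This is a sketch under assumptions, not necessarily the method in the linked thread; model_ascii is a hypothetical variable name.

Code:

* Work on a copy so the original text can still be inspected
clonevar model_ascii = model
* NFD splits accented letters into base letter plus accent;
* ustrto() with mode 2 then skips every character that is not ASCII
replace model_ascii = ustrto(ustrnormalize(model_ascii, "nfd"), "ascii", 2)

Accented letters keep their base letter this way, but symbols that stand in for a letter, such as the trade mark sign or the control character, are simply dropped, so those cases still need the explicit replacements from #2.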
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#6

19 Dec 2023, 10:05

Thanks for your advice Andrew Musau and for the link.

I'll do what you suggested in #2. Thanks again for your help.
All the best,