
  • How to delete special characters in string variables?

    Hi everyone,

    I would like to know how to delete special characters in string variables.

    I used chartab from SSC, but some special characters are still there. Here is my code:

    Code:
    chartab description, noascii
    chartab model, noascii
    And here is the data after running it:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description str25 model
    "FERROSITE" "C-ELYSEE ECO GL"          
    "FERROSITE" "C-ELYSEE ECO-GLV"        
    "FERROSITE" "CITR™EN C-ELYSE ECO"  
    "FERROSITE" "CITR™EN C-ELYSE ECO"  
    "FERROSITE" "CITR™EN C-ELYSE ECO-G"
    end
    I should have an "O" instead of the TM sign in some values of model. Everything should be kept in capital letters, please.
    I should also have an "E" in place of the non-printing character in "C-ELYSE", so that it reads "C-ELYSEE".

    Thank you in advance for your help.
    Best,

    Michael
    Last edited by Michael Duarte Goncalves; 19 Dec 2023, 07:53.

  • #2
    chartab only shows you which non-ASCII characters are present; it does not replace anything. Here, we have:

    Code:
    . chartab model, noascii
    
       decimal  hexadecimal   character |     frequency    unique name
    ------------------------------------+---------------------------------------
           144       \u0090            |             3    DEVICE CONTROL STRING
         8,482       \u2122       ™     |             3    TRADE MARK SIGN
    ------------------------------------+---------------------------------------
    So you replace these as:

    Code:
    replace model= subinstr(model, "`=ustrunescape("\u2122")'", "O", .)
    replace model= subinstr(model, "`=ustrunescape("\u0090")'", "E", .)
    Result:

    Code:
    . l
    
         +------------------------------------+
         | descrip~n                    model |
         |------------------------------------|
      1. | FERROSITE          C-ELYSEE ECO GL |
      2. | FERROSITE         C-ELYSEE ECO-GLV |
      3. | FERROSITE     CITROEN C-ELYSEE ECO |
      4. | FERROSITE     CITROEN C-ELYSEE ECO |
      5. | FERROSITE   CITROEN C-ELYSEE ECO-G |
         +------------------------------------+
    But all this might have been unnecessary if you had chosen the correct encoding when importing these data. See

    Code:
    help import delimited##encoding
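
    As a rough illustration of what that help entry covers, an import with an explicit encoding might look like the sketch below. The file name cars.csv and the encoding ISO-8859-1 are placeholders, not taken from this thread; the right encoding depends on how the source file was written.

    ```stata
    * Hypothetical file name and encoding: adjust both to the actual source file
    import delimited using "cars.csv", encoding("ISO-8859-1") clear

    * Re-check for unexpected non-ASCII characters after importing (chartab is from SSC)
    chartab model, noascii
    ```

    If chartab still reports odd characters after this, the assumed encoding is probably not the one the file was saved in.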



    • #3
      Hi Andrew Musau:

      Thank you very much for your suggestion!
      I didn't know about encoding! Thank you so much.

      Unfortunately, these data come from a source from which we cannot recover a CSV file.

      Lovely day.
      Michael
      Last edited by Michael Duarte Goncalves; 19 Dec 2023, 09:29.



      • #4
        I have just one question:

        Is there another way to do that? My "true" sample is really big. Here is the output of chartab model:

        Code:
        . chartab model, noascii
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+-------------------------------------------------------
               144       \u0090            |            59    DEVICE CONTROL STRING
               161       \u00a1       ¡     |             6    INVERTED EXCLAMATION MARK
               166       \u00a6       ¦     |            12    BROKEN BAR
               179       \u00b3       ³     |             2    SUPERSCRIPT THREE
               192       \u00c0       À     |             4    LATIN CAPITAL LETTER A WITH GRAVE
               193       \u00c1       Á     |            16    LATIN CAPITAL LETTER A WITH ACUTE
               195       \u00c3       Ã     |             7    LATIN CAPITAL LETTER A WITH TILDE
               199       \u00c7       Ç     |             3    LATIN CAPITAL LETTER C WITH CEDILLA
               200       \u00c8       È     |             2    LATIN CAPITAL LETTER E WITH GRAVE
               201       \u00c9       É     |         1,896    LATIN CAPITAL LETTER E WITH ACUTE
               203       \u00cb       Ë     |             8    LATIN CAPITAL LETTER E WITH DIAERESIS
               204       \u00cc       Ì     |             2    LATIN CAPITAL LETTER I WITH GRAVE
               205       \u00cd       Í     |            64    LATIN CAPITAL LETTER I WITH ACUTE
               209       \u00d1       Ñ     |            49    LATIN CAPITAL LETTER N WITH TILDE
               210       \u00d2       Ò     |            47    LATIN CAPITAL LETTER O WITH GRAVE
               211       \u00d3       Ó     |            50    LATIN CAPITAL LETTER O WITH ACUTE
               214       \u00d6       Ö     |             6    LATIN CAPITAL LETTER O WITH DIAERESIS
               218       \u00da       Ú     |            17    LATIN CAPITAL LETTER U WITH ACUTE
               233       \u00e9       é     |             6    LATIN SMALL LETTER E WITH ACUTE
               237       \u00ed       í     |             6    LATIN SMALL LETTER I WITH ACUTE
             8,218       \u201a       ‚     |            29    SINGLE LOW-9 QUOTATION MARK
             8,482       \u2122       ™     |             3    TRADE MARK SIGN
            65,533       \ufffd       �     |             1    REPLACEMENT CHARACTER
        ------------------------------------+-------------------------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =               0          0
        Multibyte UTF-8 characters    =           2,294         22
        Unicode replacement character =               1          1
        Total Unicode characters      =           2,295         23
        Thank you in advance for your help!
        Last edited by Michael Duarte Goncalves; 19 Dec 2023, 09:36.



        • #5
          See https://www.statalist.org/forums/for...place-subinstr for one way to get all ASCII characters, but it won't encode the text for you. If you want complete encoding, you have to put in the work following #2.
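
          A minimal sketch of one ASCII-only approach, which may or may not match what the linked thread does: decompose each accented letter into base letter plus combining mark, then drop everything that is not ASCII. The variable name model_ascii is made up for the example.

          ```stata
          * Split accented letters into base letter + combining mark (NFD),
          * then convert to ASCII, skipping whatever cannot be represented (mode 2)
          generate model_ascii = ustrto(ustrnormalize(model, "nfd"), "ascii", 2)

          * Confirm that no non-ASCII characters remain (chartab is from SSC)
          chartab model_ascii, noascii
          ```

          This turns É into E and Ñ into N, but characters with no ASCII base letter (the TM sign, \u0090, the broken bar and so on) are simply deleted, so "CITR™EN" would come out as "CITREN" rather than "CITROEN". Those cases still need explicit replacements as in #2.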



          • #6
            Thanks for your advice Andrew Musau and for the link.

            I'll do what you suggested in #2. Thanks again for your help.
            All the best,
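
            For a long list of characters, the two replace lines in #2 could be wrapped in a loop along these lines. This is only a sketch: the pairs 2122/O and 0090/E come from #2, while the remaining pairs are assumed mappings (accented letter to base letter) that would need checking against the data.

            ```stata
            * One "hex code point, replacement" pair after another
            * 2122 -> O and 0090 -> E are from #2; the other pairs are assumptions
            local pairs "2122 O 0090 E 00c9 E 00d3 O 00d1 N"
            local n : word count `pairs'

            forvalues i = 1(2)`n' {
                local code : word `i' of `pairs'
                local j = `i' + 1
                local repl : word `j' of `pairs'
                * Build the character from its code point and replace every occurrence
                replace model = subinstr(model, ustrunescape("\u`code'"), "`repl'", .)
            }
            ```

            Extending the list only means adding more pairs to the local, using the hexadecimal column of the chartab output.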

