Detecting Non-ASCII and other special characters

Mutanen Lau

Join Date: Sep 2016

Posts: 19
#1

21 May 2021, 14:27

Dear Stata list,

We are working on a household survey.

Initially, interviewers entered the district names themselves, but we later realized that this produced many inconsistencies, because the same district could be written differently by different interviewers. We then preloaded all the districts within the sampled areas.

But this leaves us with initial data that contain a lot of typos.

I am therefore soliciting your support on how to clean up district names that contain non-ASCII and special characters.

We want to apply the following rules:
Only the characters A-Z are allowed

No trailing/leading or embedded spaces are allowed

A single space, dash (-) or underscore can be used to separate compound words.

Here are examples of district names that need to be cleaned:

UNG-BAKO-Ã€.,
K-YAMMA-Ã€
JAURO-SULEI-(CHAKAMIDARI)
KAIKABAYAS-Ã€.
JOSÃ©
AÃ®DUN-MANGWARO
etc.

Our aim is to isolate all district names that do not conform to the above rules.

We are using Stata 15 MP.

Thanks in anticipation of your support.
Andrew Musau

Join Date: Oct 2014
Posts: 10191

21 May 2021, 15:38

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 text
"UNG-BAKO-Ã€.,"         
"K-YAMMA-Ã€"            
"JAURO-SULEI-(CHAKAMIDARI)"
"KAIKABAYAS-Ã€."        
"JOSÃ©"                  
"AÃ®DUN-MANGWARO"        
end

gen wanted=trim(itrim(ustrregexra(ustrto(text, "ascii", 2), "[^a-zA-Z0-9]", " ", .)))

Res.:

Code:

. l, sep(0)

     +-----------------------------------------------------+
     |                      text                    wanted |
     |-----------------------------------------------------|
  1. |             UNG-BAKO-Ã€.,                  UNG BAKO |
  2. |                K-YAMMA-Ã€                   K YAMMA |
  3. | JAURO-SULEI-(CHAKAMIDARI)   JAURO SULEI CHAKAMIDARI |
  4. |            KAIKABAYAS-Ã€.                KAIKABAYAS |
  5. |                     JOSÃ©                       JOS |
  6. |           AÃ®DUN-MANGWARO             ADUN MANGWARO |
     +-----------------------------------------------------+
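The one-liner above nests four functions, which can make it hard to see what each one contributes. As a rough sketch (the variable names step1, step2 and step3 are only illustrative and not part of the original answer), the same steps can be unpacked as follows, assuming the example data above are in memory:

```stata
* Step 1: skip every character that cannot be represented in ASCII
* (the third argument, 2, tells ustrto() to skip such characters)
gen step1 = ustrto(text, "ascii", 2)

* Step 2: replace anything that is not a letter or digit with a space
gen step2 = ustrregexra(step1, "[^a-zA-Z0-9]", " ")

* Step 3: collapse runs of internal spaces into one, then strip
* leading and trailing spaces
gen step3 = trim(itrim(step2))

list text step1 step2 step3, sep(0)
```

Listing the intermediate variables side by side shows which step removes which character, which may help when deciding whether a name should be corrected by hand instead.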

Mutanen Lau

Join Date: Sep 2016

Posts: 19
#3

22 May 2021, 00:54

Dear Andrew,

Thanks for your response.

The district names are cleaned up nicely.
But the caveat here is that we don't know whether the non-ASCII characters should have been something else.

Therefore, rather than just replacing them with an empty string, we want to isolate all district names with irregularities into a separate list.

However, your response helped me come up with the following solution.

Code:

gen wanted = trim(itrim(ustrregexra(ustrto(text, "ascii", 2), "[^a-zA-Z0-9]", " ", .))) // your solution
gen flag = wanted == text // 1 if the cleaned name equals the original, 0 otherwise
preserve
keep if flag == 0
save "invalid_names.dta" // district names changed by the cleaning, to be reviewed
restore
drop if flag == 0
save "cleaned_names.dta" // district names that needed no cleaning

This is what I was able to come up with,
but I still feel there is a better way to handle it.

Please feel free to offer any suggestions that may improve the solution.

Regards
Andrew Musau

Join Date: Oct 2014
Posts: 10191

22 May 2021, 04:16

To directly identify strings with non-ASCII characters:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 text
"UNG-BAKO-Ã€.,"        
"K-YAMMA-Ã€"            
"JAURO-SULEI-(CHAKAMIDARI)"
"KAIKABAYAS-Ã€."        
"JOSÃ©"                  
"AÃ®DUN-MANGWARO"
"THIS IS OK!"        
end

gen toreview=ustrto(text, "ascii", 2)!=text

Res.:

Code:

. l, sep(0)

     +--------------------------------------+
     |                      text   toreview |
     |--------------------------------------|
  1. |             UNG-BAKO-Ã€.,          1 |
  2. |                K-YAMMA-Ã€          1 |
  3. | JAURO-SULEI-(CHAKAMIDARI)          0 |
  4. |            KAIKABAYAS-Ã€.          1 |
  5. |                     JOSÃ©          1 |
  6. |           AÃ®DUN-MANGWARO          1 |
  7. |               THIS IS OK!          0 |
     +--------------------------------------+
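Note that this flag only catches non-ASCII characters, so ASCII punctuation such as the parentheses in row 3 or the "!" in row 7 is not flagged. A minimal sketch of a stricter check follows, assuming the rules in the first post are read as "upper-case A-Z words, joined by exactly one space, dash or underscore, with nothing before or after". The variable name conforms is hypothetical, and the regular expression is one possible reading of those rules, not something stated in the thread:

```stata
* 1 if the name consists only of A-Z words joined by a single
* space, dash or underscore; 0 otherwise
gen byte conforms = ustrregexm(text, "^[A-Z]+([ _\-][A-Z]+)*$")

* names that break at least one of the rules
list text if !conforms, sep(0)
```

Under this reading a name such as AIDUN-MANGWARO passes, while any name with punctuation, digits, lower-case letters, doubled separators, or leading or trailing spaces is listed for review.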

Then you can use contract in case of duplicate entries. See

Code:

help contract
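As a rough illustration of that step (the frequency variable name n_entries is hypothetical, and the sketch assumes the toreview variable created above), the flagged names could be reduced to one row per distinct spelling like this:

```stata
preserve                          // contract replaces the data in memory
keep if toreview                  // keep only the names flagged for review
contract text, freq(n_entries)    // one row per distinct name, with its count
gsort -n_entries                  // most frequent spellings first
list, sep(0)
restore                           // bring back the full dataset
```

Sorting by frequency puts the most common misspellings at the top, which is usually where manual correction pays off first.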

Mutanen Lau

Join Date: Sep 2016

Posts: 19
#5

24 May 2021, 03:54

Dear Andrew,

Thank you very much.

The solution you provided works very well for me.