  • Detecting Non-ASCII and other special characters

    Dear Stata list,

    We are working on a household survey.

    Initially, interviewers entered the district names themselves, but we later realized that this produced many inconsistencies, because the same district can be written differently by different interviewers. We then preloaded all the districts within the sampled areas.

    But this leaves us with initial data that contain a lot of typos.

    I am therefore soliciting your support on how to clean up district names that contain non-ASCII and special characters.

    We want to use the following rules in dealing with the situation:
    1. Only the characters A-Z are allowed.
    2. No trailing/leading or embedded spaces are allowed.
    3. A single space, dash (-) or underscore can be used to separate compound words.
    Here are examples of district names that need to be cleaned:

    UNG-BAKO-À.,
    K-YAMMA-À
    JAURO-SULEI-(CHAKAMIDARI)
    KAIKABAYAS-À.
    JOSé
    AîDUN-MANGWARO
    etc.

    Our aim is to isolate all district names that do not conform to the above rules.

    We are using Stata 15 MP.

    Thanks in anticipation of your support.

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str29 text
    "UNG-BAKO-À.,"         
    "K-YAMMA-À"            
    "JAURO-SULEI-(CHAKAMIDARI)"
    "KAIKABAYAS-À."        
    "JOSé"                  
    "AîDUN-MANGWARO"        
    end
    
    gen wanted=trim(itrim(ustrregexra(ustrto(text, "ascii", 2), "[^a-zA-Z0-9]", " ", .)))
    Res.:

    Code:
    . l, sep(0)
    
         +-----------------------------------------------------+
         |                      text                    wanted |
         |-----------------------------------------------------|
      1. |             UNG-BAKO-À.,                  UNG BAKO |
      2. |                K-YAMMA-À                   K YAMMA |
      3. | JAURO-SULEI-(CHAKAMIDARI)   JAURO SULEI CHAKAMIDARI |
      4. |            KAIKABAYAS-À.                KAIKABAYAS |
      5. |                     JOSé                       JOS |
      6. |           AîDUN-MANGWARO             ADUN MANGWARO |
         +-----------------------------------------------------+
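
    The one-liner above chains four functions. As a reading aid, here is a minimal sketch that splits it into one step per variable, using three of the example names. The variable names s1, s2 and s3 are illustrative only and not part of the original answer; the optional fourth argument of ustrregexra() is omitted here.

    ```stata
    * Sketch: the one-liner unpacked, one step per variable (s1-s3 are illustrative names)
    clear
    input str29 text
    "UNG-BAKO-À.,"
    "JOSé"
    "AîDUN-MANGWARO"
    end

    * Step 1: convert to ASCII; mode 2 skips characters that cannot be represented
    gen s1 = ustrto(text, "ascii", 2)
    * Step 2: replace every remaining character that is not a letter or digit with a space
    gen s2 = ustrregexra(s1, "[^a-zA-Z0-9]", " ")
    * Step 3: collapse runs of internal spaces, then strip leading and trailing spaces
    gen s3 = trim(itrim(s2))
    list, sep(0)
    ```

    Note that step 1 silently drops the accented character, which is why "JOSé" ends up as "JOS".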
    


    • #3
      Dear Andrew,

      Thanks for your response.

      The district names are cleaned up nicely,
      but the caveat is that we don't know what the non-ASCII character should have been.

      Therefore, rather than just dropping it, we wanted to isolate all district names with irregularities into a separate list.

      However, your response helped me come up with the following solution.

      Code:
      gen wanted=trim(itrim(ustrregexra(ustrto(text, "ascii", 2), "[^a-zA-Z0-9]", " ", .))) //your solution
      
      gen flag = wanted==text //1 if cleaning left the name unchanged, 0 otherwise (dashes and punctuation also yield 0)
      
      preserve
      keep if flag == 0
      save "invalid_names.dta" //list of district names changed by cleaning
      restore
      
      keep if flag == 1
      save "cleaned_names.dta" //list of district names that passed unchanged
      This is what I was able to come up with,
      but I still feel there is a better way to handle it.

      Please feel free to offer any suggestions that may improve the solution.
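
      One possible alternative, sketched here on the assumption that the three rules in #1 are taken literally, is to test each name against a single regular expression instead of comparing it with a cleaned copy. The variable name valid and the two extra rows "UNG-BAKO" and "NEW  TOWN" are made up for illustration.

      ```stata
      * Sketch: flag names that satisfy the three rules from #1 (valid is a hypothetical name)
      clear
      input str29 text
      "UNG-BAKO-À.,"
      "JAURO-SULEI-(CHAKAMIDARI)"
      "JOSé"
      "AîDUN-MANGWARO"
      "UNG-BAKO"
      "NEW  TOWN"
      end

      * valid = 1 only if the whole name is blocks of A-Z joined by exactly one
      * space, dash, or underscore; anything else (accents, punctuation,
      * leading/trailing or repeated spaces) gives 0
      gen byte valid = ustrregexm(text, "^[A-Z]+([ _\-][A-Z]+)*$")
      list, sep(0)
      ```

      With this pattern a name such as "UNG-BAKO" passes, whereas a comparison of wanted with text would mark it as changed because the dash becomes a space.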

      Regards



      • #4
        To directly identify strings with non-ASCII characters:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str29 text
        "UNG-BAKO-À.,"        
        "K-YAMMA-À"            
        "JAURO-SULEI-(CHAKAMIDARI)"
        "KAIKABAYAS-À."        
        "JOSé"                  
        "AîDUN-MANGWARO"
        "THIS IS OK!"        
        end
        
        gen toreview=ustrto(text, "ascii", 2)!=text
        Res.:

        Code:
        . l, sep(0)
        
             +--------------------------------------+
             |                      text   toreview |
             |--------------------------------------|
          1. |             UNG-BAKO-À.,          1 |
          2. |                K-YAMMA-À          1 |
          3. | JAURO-SULEI-(CHAKAMIDARI)          0 |
          4. |            KAIKABAYAS-À.          1 |
          5. |                     JOSé          1 |
          6. |           AîDUN-MANGWARO          1 |
          7. |               THIS IS OK!          0 |
             +--------------------------------------+
        Then you can use contract in case of duplicate entries. See

        Code:
        help contract
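
        As a rough illustration of that suggestion, the sketch below reduces the flagged names to one row per distinct spelling with a count. The duplicated row and the frequency variable name n are assumptions for the example; contract replaces the data in memory, so it is wrapped in preserve and restore.

        ```stata
        * Sketch: one row per distinct flagged name, with its frequency (n is a hypothetical name)
        clear
        input str29 text
        "K-YAMMA-À"
        "K-YAMMA-À"
        "JOSé"
        "THIS IS OK!"
        end

        gen toreview = ustrto(text, "ascii", 2) != text

        * contract overwrites the dataset, so work on a temporary copy
        preserve
        contract text if toreview, freq(n)
        list, sep(0)
        restore
        ```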



        • #5
          Dear Andrew,

          Thank you very much.

          The solution you provided works very well for me.
