
  • Delete English letters in strings

    Dear Stata users,

    I have data consisting of strings that contain Chinese characters, English letters, and digits. I want to remove all English letters so that only the Chinese characters and digits are kept (the spaces between these characters should also be kept). Below, the first code block is an example of my data, and the second is what I want to achieve. Can anyone tell me how to get this? Thank you very much.

    Code:
    煤炭开采和洗选业 Mining and Washing of Coal 3 1 6.06 0.21
    石油和天然气开采业 Extraction of Petroleum and Natural Gas 3 15.96 0.04
    黑色金属矿采选业 Mining and Processing of Ferrous Metal Ores 13 1 119.43 0.81
    有色金属矿采选业 Mining and Processing of Non-Ferrous Metal Ores
    非金属矿采选业 Mining and Processing of Non-metal Ores 7 3 11.24 0.04
    Code:
    煤炭开采和洗选业 3 1 6.06 0.21
    石油和天然气开采业 3 15.96 0.04
    黑色金属矿采选业 13 1 119.43 0.81
    有色金属矿采选业 
    非金属矿采选业 7 3 11.24 0.04

  • #2
    This is untested, as your example needs some tinkering before it can be read into Stata. The following exploits the fact that English letters are exactly the characters A-Z and a-z.

    Code:
    gen wanted = ustrregexra(stringvar, "[a-zA-Z]", "")
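    Removing only the letters leaves behind the hyphens inside words such as "Non-Ferrous" and runs of blanks where the English text used to be. A minimal sketch of one way to tidy that up, assuming the strings are in a variable called stringvar as above, and that hyphens only need removing when they sit between English letters:

    ```stata
    * remove whole English words, including hyphenated ones such as Non-Ferrous
    gen wanted = ustrregexra(stringvar, "[A-Za-z]+(-[A-Za-z]+)*", "")

    * strtrim() drops leading and trailing blanks;
    * stritrim() collapses consecutive internal blanks to a single blank
    replace wanted = stritrim(strtrim(wanted))
    ```

    This is also untested on the full data. If a hyphen can appear elsewhere (for example as a minus sign before a number), it is left untouched by this pattern.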



    • #3
      Thank you very much, Andrew Musau. The code above does what I wanted. I am really not good at regular expressions.



      • #4
        An alternative using named capture groups (modify the replacement string to get only the Chinese and numeric parts):
        Code:
        clear
        input str85 example
        "煤炭开采和洗选业 Mining and Washing of Coal 3 1 6.06 0.21"
        "石油和天然气开采业 Extraction of Petroleum and Natural Gas 3 15.96 0.04"
        "黑色金属矿采选业 Mining and Processing of Ferrous Metal Ores 13 1 119.43 0.81"
        "有色金属矿采选业 Mining and Processing of Non-Ferrous Metal Ores"
        "非金属矿采选业 Mining and Processing of Non-metal Ores 7 3 11.24 0.04"
        end 
        
        #delim;
        
        replace example = 
        
            ustrregexrf(example, 
            
                "(?x) (?# flag allowing comments and ignoring white space)
              
                (?<Han>         \p{script=Han}+ )            (?# 1. capture group)
                (?<NOTdigits>   \D+             )            (?# 2. capture group)
                (?<digits>      [\d\s.]*        )            (?# 3. capture group)
                
                ",
                "\${Han} || \${digits} || ( \${NOTdigits} )" /* replacement string */
                )
        ;
        #delim cr
        
        format %-100s example
        list
        Code:
            example                                                                                    
            煤炭开采和洗选业 || 3 1 6.06 0.21 || (  Mining and Washing of Coal  )                      
            石油和天然气开采业 || 3 15.96 0.04 || (  Extraction of Petroleum and Natural Gas  )        
            黑色金属矿采选业 || 13 1 119.43 0.81 || (  Mining and Processing of Ferrous Metal Ores  )  
            有色金属矿采选业 ||  || (  Mining and Processing of Non-Ferrous Metal Ores )               
            非金属矿采选业 || 7 3 11.24 0.04 || (  Mining and Processing of Non-metal Ores  )
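        As a sketch of the modification mentioned at the top of this post, the replacement string can refer to just the Han and digits groups. The version below writes the same three named groups on one line without the (?x) flag. The two-row input is only there to make the snippet self-contained, and it assumes named-group references behave as in the code above.

        ```stata
        clear
        input str85 example
        "煤炭开采和洗选业 Mining and Washing of Coal 3 1 6.06 0.21"
        "有色金属矿采选业 Mining and Processing of Non-Ferrous Metal Ores"
        end

        * keep only the Han and digits groups; \$ stops Stata from reading
        * ${Han} and ${digits} as global macros
        replace example = ustrregexrf(example, "(?<Han>\p{script=Han}+)(?<NOTdigits>\D+)(?<digits>[\d\s.]*)", "\${Han} \${digits}")

        * rows without any numbers end in a blank, so trim
        replace example = strtrim(example)
        list
        ```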
        Maybe split the "||"-delimited string on "||":
        Code:
        . list example? , clean noobs
        
                       example1             example2                                                example3  
              煤炭开采和洗选业        3 1 6.06 0.21                         (  Mining and Washing of Coal  )  
            石油和天然气开采业         3 15.96 0.04            (  Extraction of Petroleum and Natural Gas  )  
              黑色金属矿采选业     13 1 119.43 0.81        (  Mining and Processing of Ferrous Metal Ores  )  
              有色金属矿采选业                          (  Mining and Processing of Non-Ferrous Metal Ores )  
                非金属矿采选业       7 3 11.24 0.04            (  Mining and Processing of Non-metal Ores  )
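        The split step itself is not shown in the listing above. A minimal sketch, assuming example still holds the "||"-delimited strings produced by the ustrregexrf() call:

        ```stata
        * creates example1, example2, example3 from the parts between "||"
        split example, parse("||")

        * split keeps the blanks around each part, so trim them
        foreach v of varlist example1-example3 {
            replace `v' = strtrim(`v')
        }

        list example?, clean noobs
        ```

        The wildcard example? matches only the three new variables, since ? stands for exactly one character.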



        • #5
          Thank you very much, Bjarte Aagnes.
