Delete English letters in strings

Chen Samulsion

Join Date: Jan 2018
Posts: 789

26 May 2024, 03:33

Dear Stata users,

I have data consisting of strings that contain Chinese characters, English letters, and numeric characters. I want to remove all English letters so that only the Chinese and numeric characters are kept (the spaces between these characters should also be kept). Below, the first code block is an example of my data, and the second is what I want to achieve. Can anyone tell me how to do this? Thank you very much.

Code:

煤炭开采和洗选业 Mining and Washing of Coal 3 1 6.06 0.21
石油和天然气开采业 Extraction of Petroleum and Natural Gas 3 15.96 0.04
黑色金属矿采选业 Mining and Processing of Ferrous Metal Ores 13 1 119.43 0.81
有色金属矿采选业 Mining and Processing of Non-Ferrous Metal Ores
非金属矿采选业 Mining and Processing of Non-metal Ores 7 3 11.24 0.04

Code:

煤炭开采和洗选业 3 1 6.06 0.21
石油和天然气开采业 3 15.96 0.04
黑色金属矿采选业 13 1 119.43 0.81
有色金属矿采选业 
非金属矿采选业 7 3 11.24 0.04

Andrew Musau

Join Date: Oct 2014

Posts: 9948
#2

26 May 2024, 03:49

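This is untested, as your example needs some tinkering before it can be read in. The following exploits the fact that the English alphabet consists of the letters A-Z, in upper and lower case; "stringvar" stands for the name of your string variable.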

Code:

gen wanted= ustrregexra(stringvar, "[a-zA-Z]", "")
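Removing only the letters leaves behind the runs of blanks that separated the English words, and also the hyphen in "Non-Ferrous" and "Non-metal". A minimal sketch of one way to tidy this up, assuming the variable is called stringvar as above, that no value contains a minus sign that must be preserved, and that single blanks between the remaining pieces are acceptable. Only two of the example rows are reproduced here.

```stata
* Sketch only: "stringvar" and "wanted" are placeholder names
clear
input str85 stringvar
"煤炭开采和洗选业 Mining and Washing of Coal 3 1 6.06 0.21"
"有色金属矿采选业 Mining and Processing of Non-Ferrous Metal Ores"
end

* Remove ASCII letters and the hyphen (assumes no minus signs need keeping)
gen wanted = ustrregexra(stringvar, "[a-zA-Z-]", "")

* Collapse consecutive internal blanks to one, then trim both ends
replace wanted = strtrim(stritrim(wanted))

list wanted
```

If negative numbers can occur in the data, drop the hyphen from the character class and handle "Non-Ferrous" style hyphens separately.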
Chen Samulsion

Join Date: Jan 2018

Posts: 789
#3

26 May 2024, 04:14

Thank you very much, Andrew Musau. The code above gets what I wanted. I am really not good at regular expressions.

Bjarte Aagnes

Join Date: Apr 2014
Posts: 782
#4

26 May 2024, 13:06

An alternative using named capture groups (modify the replacement string to get the Chinese + number part only):

Code:

clear
input str85 example
"煤炭开采和洗选业 Mining and Washing of Coal 3 1 6.06 0.21"
"石油和天然气开采业 Extraction of Petroleum and Natural Gas 3 15.96 0.04"
"黑色金属矿采选业 Mining and Processing of Ferrous Metal Ores 13 1 119.43 0.81"
"有色金属矿采选业 Mining and Processing of Non-Ferrous Metal Ores"
"非金属矿采选业 Mining and Processing of Non-metal Ores 7 3 11.24 0.04"
end 

#delim;

replace example = 

    ustrregexrf(example, 
    
        "(?x) (?# flag allowing comments and ignoring white space)
      
        (?<Han>         \p{script=Han}+ )            (?# 1. capture group)
        (?<NOTdigits>   \D+             )            (?# 2. capture group)
        (?<digits>      [\d\s.]*        )            (?# 3. capture group)
        
        ",
        "\${Han} || \${digits} || ( \${NOTdigits} )" /* replacement string */
        )
;
#delim cr

format %-100s example
list

Code:

    example                                                                                    
    煤炭开采和洗选业 || 3 1 6.06 0.21 || (  Mining and Washing of Coal  )                      
    石油和天然气开采业 || 3 15.96 0.04 || (  Extraction of Petroleum and Natural Gas  )        
    黑色金属矿采选业 || 13 1 119.43 0.81 || (  Mining and Processing of Ferrous Metal Ores  )  
    有色金属矿采选业 ||  || (  Mining and Processing of Non-Ferrous Metal Ores )               
    非金属矿采选业 || 7 3 11.24 0.04 || (  Mining and Processing of Non-metal Ores  )
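As the post suggests, changing the replacement string is enough to keep only the Chinese and numeric parts. A minimal sketch, assuming it is run on the original example variable (before the replace above) and written on one line without the (?x) flag, so the pattern contains no extra white space. The variable name wanted2 is made up for illustration.

```stata
* Sketch only: same three named groups as above, but the replacement
* string references just the Han and digits groups
gen wanted2 = ustrregexrf(example, "(?<Han>\p{script=Han}+)(?<NOTdigits>\D+)(?<digits>[\d\s.]*)", "\${Han} \${digits}")

* Rows without a numeric part keep a trailing blank; trim it
replace wanted2 = strtrim(wanted2)

list wanted2
```

The backslash before each dollar sign stops Stata from treating the group reference as a global macro, as in the original code.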

Then, perhaps, split on "||":

Code:

. list example? , clean noobs

               example1             example2                                                example3  
      煤炭开采和洗选业        3 1 6.06 0.21                         (  Mining and Washing of Coal  )  
    石油和天然气开采业         3 15.96 0.04            (  Extraction of Petroleum and Natural Gas  )  
      黑色金属矿采选业     13 1 119.43 0.81        (  Mining and Processing of Ferrous Metal Ores  )  
      有色金属矿采选业                          (  Mining and Processing of Non-Ferrous Metal Ores )  
        非金属矿采选业       7 3 11.24 0.04            (  Mining and Processing of Non-metal Ores  )
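The split command itself is not shown in the post. A minimal sketch of what it might look like, assuming example already holds the "||"-delimited strings produced by the replace above and that every row yields three pieces, so split creates example1, example2, and example3 (its default naming):

```stata
* Sketch only: split on the "||" separator into example1-example3
split example, parse("||")

* Remove the blanks left around each piece
foreach v of varlist example1-example3 {
    replace `v' = strtrim(`v')
}

list example?, clean noobs
```

The wildcard example? matches only the three new variables, since it requires exactly one character after the stub.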

Chen Samulsion

Join Date: Jan 2018

Posts: 789
#5

28 May 2024, 20:49

Thank you very much, Bjarte Aagnes.