Adding "," to separate characters in a string

ChungLi Wu

Join Date: May 2020

Posts: 13
#1

Adding "," to separate characters in a string

11 Aug 2020, 14:35

Hi guys,

I have a string variable, say have, as follows

id have
1
2
3 6
4 2
5 125
6
7 123456
8 136
9 45
10

As you can see, sometimes it can be "", sometimes it can be "123456". what I want to do is to separate the string that has more than (includes) two characters with ",". The results would be like ths want variable below:

id have want
1
2
3 6 6
4 2 2
5 125 1,2,5
6
7 123456 1,2,3,4,5,6
8 136 1,3,6
9 45. 4,5
10

Any solution/suggestion would be appreciated.

Thanks in advance,
Ali
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10089

11 Aug 2020, 15:43

Please use dataex to present data examples in the future (FAQ Advice #12).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 id str6 have
"1"  ""      
"2"  ""      
"3"  "6"    
"4"  "2"    
"5"  "125"  
"6"  ""      
"7"  "123456"
"8"  "136"  
"9"  "45"    
"10" ""      
end

gen length= length(have)
gen wanted= cond(length>1, substr(have, 1, 1) + ",", have) 
qui sum length
forval i=2/`r(max)'{
     replace wanted= wanted + substr(have, `i', 1) + "," if length>1
}
replace wanted= ustrregexs(1) if ustrregexm(wanted,"([0-9\,]+[0-9])[\,]+")

Code:

. l, sep(0)

     +------------------------------------+
     | id     have   length        wanted |
     |------------------------------------|
  1. |  1                 0               |
  2. |  2                 0               |
  3. |  3        6        1             6 |
  4. |  4        2        1             2 |
  5. |  5      125        3         1,2,5 |
  6. |  6                 0               |
  7. |  7   123456        6   1,2,3,4,5,6 |
  8. |  8      136        3         1,3,6 |
  9. |  9       45        2           4,5 |
 10. | 10                 0               |
     +------------------------------------+

Last edited by Andrew Musau; 11 Aug 2020, 15:47.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Aug 2020, 17:26

Another approach using regular expressions. I have to admit that while I knew this approach was possible, I don't believe I've ever used it before.

Code:

. generate wanted = ustrregexra(have,"(.)(?=.)","$1,")
(4 missing values generated)

. list, sep(0)

     +---------------------------+
     | id     have        wanted |
     |---------------------------|
  1. |  1                        |
  2. |  2                        |
  3. |  3        6             6 |
  4. |  4        2             2 |
  5. |  5      125         1,2,5 |
  6. |  6                        |
  7. |  7   123456   1,2,3,4,5,6 |
  8. |  8      136         1,3,6 |
  9. |  9       45           4,5 |
 10. | 10                        |
     +---------------------------+

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

11 Aug 2020, 18:40

For what it's worth, here's a symmetrical solution that uses look-behind pattern matching instead of look-ahead. The way to think about these is that look-ahead matches something that has at least one more character following it, while look-behind matches something that followed a previous character.

Code:

. generate wanted = ustrregexra(have,"(?<=.)(.)",",$1") (4 missing values generated) . list, sep(0) +---------------------------+ | id have wanted | |---------------------------| 1. | 1 | 2. | 2 | 3. | 3 6 6 | 4. | 4 2 2 | 5. | 5 125 1,2,5 | 6. | 6 | 7. | 7 123456 1,2,3,4,5,6 | 8. | 8 136 1,3,6 | 9. | 9 45 4,5 | 10. | 10 | +---------------------------+

I'm glad to have had an excuse to learn these. I've hacked my way around these sorts of problems in the past, it's good to learn a tool to deal with them elegantly.

Last edited by William Lisowski; 11 Aug 2020, 18:42.
1 like
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

12 Aug 2020, 07:21

Thanks William for providing examples of both look-ahead and look-behind pattern matching. Below are som timings to compare those, also adding a !mi() restriction. (A substr solution is also included (here assuming max str len is known))

Data are expanded (10^6) and each generate is done 10 times.

Code:

11: 170.63 / 10 = 17.06 ustrregexra(have, "(.)(?=.)", "$1,")
12: 161.46 / 10 = 16.14 ustrregexra(have, "(?<=.)(.)", ",$1")

21: 147.03 / 10 = 14.70 ustrregexra(have, "(.)(?=.)","$1,") if !mi(have)
22: 135.07 / 10 = 13.50 ustrregexra(have, "(?<=.)(.)", ",$1") if !mi(have)

99:  77.26 /  10 =  7.72  subinstr(trim( substr(have,1,1) + " " + ...

Code:

    gen wanted_99 =  subinstr(trim( substr(have,1,1) + " " ///
        + substr(have,2,1) + " " ///
        + substr(have,3,1) + " " ///
        + substr(have,4,1) + " " ///
        + substr(have,5,1) + " " ///
        + substr(have,6,1) + " ")," ",",",.)

Code:

clear all

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 id str6 have
"1"  ""      
"2"  ""      
"3"  "6"    
"4"  "2"    
"5"  "125"  
"6"  ""      
"7"  "123456"
"8"  "136"  
"9"  "45"    
"10" ""      
end

expand `= 10^6 '

forvalues i = 1/10 {

    keep id have
    
    timer on 11
    generate wanted_rx11 = ustrregexra(have, "(.)(?=.)", "$1,")
    timer off 11

    timer on 12
    generate wanted_rx12 = ustrregexra(have, "(?<=.)(.)", ",$1")
    timer off 12
    
    timer on 21    
    generate wanted_rx21 = ustrregexra(have, "(.)(?=.)","$1,") if !mi(have)
    timer off 21

    timer on 22
    generate wanted_rx22 = ustrregexra(have, "(?<=.)(.)", ",$1") if !mi(have)
    timer off 22
    
    timer on 99
    
    gen wanted_99 =  subinstr(trim( substr(have,1,1) + " " ///
        + substr(have,2,1) + " " ///
        + substr(have,3,1) + " " ///
        + substr(have,4,1) + " " ///
        + substr(have,5,1) + " " ///
        + substr(have,6,1) + " ")," ",",",.)

    timer off 99    
}

assert wanted_rx21 == wanted_rx11          
assert wanted_rx22 == wanted_rx11
assert wanted_99   == wanted_rx11
  
timer list

Last edited by Bjarte Aagnes; 12 Aug 2020, 07:33.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

12 Aug 2020, 07:55

Interesting. I'm pleased that I don't deal with datasets of 10^7 observations so that in my planning I can weigh programming time and clarity over processing speed. Of course both programming time and processing speed vary from user to user, so this has to be a personal calculation.

In any event, on my low-end iMac Amateur (as in, not iMac Pro) with Stata/SE I find that the times for rx21, rx22, and wanted_99 are very nearly identical, and rx11 and rx12 take just 25% longer.

Code:

. timer list 11: 84.61 / 10 = 8.4609 12: 84.03 / 10 = 8.4032 21: 69.97 / 10 = 6.9973 22: 69.28 / 10 = 6.9282 99: 68.56 / 10 = 6.8555
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#7

12 Aug 2020, 09:39

Thanks for providing timings with details on hardware/OS and Stata flavor. The main differences (#5 Stata/MP Windows vs #6 Stata/SE Mac) might even be due to differences in the implementation of the ICU libraries or the underlying ICU libraries (for different OS). (BTW on Windows 10 restrictiong processors to one did not make any substantial difference (to the regex vs substr)). I will ask tech support about this.
1 like
Comment
ChungLi Wu

Join Date: May 2020

Posts: 13
#8

12 Aug 2020, 10:22

Thanks, Andrew, William, and Bjarte!

Always learn so much form you guys here.
Comment
Chinh Nguyen (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 177
#9

13 Aug 2020, 08:31

I'll share the response we sent to Bjarte through tech support:

Stata uses the same ICU library on all platforms. String functions in Stata which use ICU use the same code on all platforms. The only differences are the compilers, OS, and hardware used to build them. ICU and string function performance can vary widely depending on the OS. However, our developers will look into whether we can improve the ICU performance on Windows.

-Chinh Nguyen
2 likes
Comment

Announcement