Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Adding "," to separate characters in a string

    Hi guys,

    I have a string variable, say have, as follows

    id have
    1
    2
    3 6
    4 2
    5 125
    6
    7 123456
    8 136
    9 45
    10

    As you can see, sometimes it can be "", sometimes it can be "123456". what I want to do is to separate the string that has more than (includes) two characters with ",". The results would be like ths want variable below:

    id have want
    1
    2
    3 6 6
    4 2 2
    5 125 1,2,5
    6
    7 123456 1,2,3,4,5,6
    8 136 1,3,6
    9 45. 4,5
    10


    Any solution/suggestion would be appreciated.

    Thanks in advance,
    Ali

  • #2
    Please use dataex to present data examples in the future (FAQ Advice #12).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str2 id str6 have
    "1"  ""      
    "2"  ""      
    "3"  "6"    
    "4"  "2"    
    "5"  "125"  
    "6"  ""      
    "7"  "123456"
    "8"  "136"  
    "9"  "45"    
    "10" ""      
    end
    
    gen length= length(have)
    gen wanted= cond(length>1, substr(have, 1, 1) + ",", have) 
    qui sum length
    forval i=2/`r(max)'{
         replace wanted= wanted + substr(have, `i', 1) + "," if length>1
    }
    replace wanted= ustrregexs(1) if ustrregexm(wanted,"([0-9\,]+[0-9])[\,]+")
    Code:
    . l, sep(0)
    
         +------------------------------------+
         | id     have   length        wanted |
         |------------------------------------|
      1. |  1                 0               |
      2. |  2                 0               |
      3. |  3        6        1             6 |
      4. |  4        2        1             2 |
      5. |  5      125        3         1,2,5 |
      6. |  6                 0               |
      7. |  7   123456        6   1,2,3,4,5,6 |
      8. |  8      136        3         1,3,6 |
      9. |  9       45        2           4,5 |
     10. | 10                 0               |
         +------------------------------------+
    Last edited by Andrew Musau; 11 Aug 2020, 15:47.

    Comment


    • #3
      Another approach using regular expressions. I have to admit that while I knew this approach was possible, I don't believe I've ever used it before.
      Code:
      . generate wanted = ustrregexra(have,"(.)(?=.)","$1,")
      (4 missing values generated)
      
      . list, sep(0)
      
           +---------------------------+
           | id     have        wanted |
           |---------------------------|
        1. |  1                        |
        2. |  2                        |
        3. |  3        6             6 |
        4. |  4        2             2 |
        5. |  5      125         1,2,5 |
        6. |  6                        |
        7. |  7   123456   1,2,3,4,5,6 |
        8. |  8      136         1,3,6 |
        9. |  9       45           4,5 |
       10. | 10                        |
           +---------------------------+

      Comment


      • #4
        For what it's worth, here's a symmetrical solution that uses look-behind pattern matching instead of look-ahead. The way to think about these is that look-ahead matches something that has at least one more character following it, while look-behind matches something that followed a previous character.
        Code:
        . generate wanted = ustrregexra(have,"(?<=.)(.)",",$1")
        (4 missing values generated)
        
        . list, sep(0)
        
             +---------------------------+
             | id     have        wanted |
             |---------------------------|
          1. |  1                        |
          2. |  2                        |
          3. |  3        6             6 |
          4. |  4        2             2 |
          5. |  5      125         1,2,5 |
          6. |  6                        |
          7. |  7   123456   1,2,3,4,5,6 |
          8. |  8      136         1,3,6 |
          9. |  9       45           4,5 |
         10. | 10                        |
             +---------------------------+
        I'm glad to have had an excuse to learn these. I've hacked my way around these sorts of problems in the past, it's good to learn a tool to deal with them elegantly.
        Last edited by William Lisowski; 11 Aug 2020, 18:42.

        Comment


        • #5
          Thanks William for providing examples of both look-ahead and look-behind pattern matching. Below are som timings to compare those, also adding a !mi() restriction. (A substr solution is also included (here assuming max str len is known))

          Data are expanded (10^6) and each generate is done 10 times.

          Code:
          11: 170.63 / 10 = 17.06 ustrregexra(have, "(.)(?=.)", "$1,")
          12: 161.46 / 10 = 16.14 ustrregexra(have, "(?<=.)(.)", ",$1")
          
          21: 147.03 / 10 = 14.70 ustrregexra(have, "(.)(?=.)","$1,") if !mi(have)
          22: 135.07 / 10 = 13.50 ustrregexra(have, "(?<=.)(.)", ",$1") if !mi(have)
          
          99:  77.26 /  10 =  7.72  subinstr(trim( substr(have,1,1) + " " + ...
          Code:
              gen wanted_99 =  subinstr(trim( substr(have,1,1) + " " ///
                  + substr(have,2,1) + " " ///
                  + substr(have,3,1) + " " ///
                  + substr(have,4,1) + " " ///
                  + substr(have,5,1) + " " ///
                  + substr(have,6,1) + " ")," ",",",.)
          Code:
          clear all
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str2 id str6 have
          "1"  ""      
          "2"  ""      
          "3"  "6"    
          "4"  "2"    
          "5"  "125"  
          "6"  ""      
          "7"  "123456"
          "8"  "136"  
          "9"  "45"    
          "10" ""      
          end
          
          expand `= 10^6 '
          
          forvalues i = 1/10 {
          
              keep id have
              
              timer on 11
              generate wanted_rx11 = ustrregexra(have, "(.)(?=.)", "$1,")
              timer off 11
          
              timer on 12
              generate wanted_rx12 = ustrregexra(have, "(?<=.)(.)", ",$1")
              timer off 12
              
              timer on 21    
              generate wanted_rx21 = ustrregexra(have, "(.)(?=.)","$1,") if !mi(have)
              timer off 21
          
              timer on 22
              generate wanted_rx22 = ustrregexra(have, "(?<=.)(.)", ",$1") if !mi(have)
              timer off 22
              
              timer on 99
              
              gen wanted_99 =  subinstr(trim( substr(have,1,1) + " " ///
                  + substr(have,2,1) + " " ///
                  + substr(have,3,1) + " " ///
                  + substr(have,4,1) + " " ///
                  + substr(have,5,1) + " " ///
                  + substr(have,6,1) + " ")," ",",",.)
          
              timer off 99    
          }
          
          assert wanted_rx21 == wanted_rx11          
          assert wanted_rx22 == wanted_rx11
          assert wanted_99   == wanted_rx11
            
          timer list
          Last edited by Bjarte Aagnes; 12 Aug 2020, 07:33.

          Comment


          • #6
            Interesting. I'm pleased that I don't deal with datasets of 10^7 observations so that in my planning I can weigh programming time and clarity over processing speed. Of course both programming time and processing speed vary from user to user, so this has to be a personal calculation.

            In any event, on my low-end iMac Amateur (as in, not iMac Pro) with Stata/SE I find that the times for rx21, rx22, and wanted_99 are very nearly identical, and rx11 and rx12 take just 25% longer.
            Code:
            . timer list
              11:     84.61 /       10 =       8.4609
              12:     84.03 /       10 =       8.4032
              21:     69.97 /       10 =       6.9973
              22:     69.28 /       10 =       6.9282
              99:     68.56 /       10 =       6.8555

            Comment


            • #7
              Thanks for providing timings with details on hardware/OS and Stata flavor. The main differences (#5 Stata/MP Windows vs #6 Stata/SE Mac) might even be due to differences in the implementation of the ICU libraries or the underlying ICU libraries (for different OS). (BTW on Windows 10 restrictiong processors to one did not make any substantial difference (to the regex vs substr)). I will ask tech support about this.

              Comment


              • #8
                Thanks, Andrew, William, and Bjarte!

                Always learn so much form you guys here.

                Comment


                • #9
                  I'll share the response we sent to Bjarte through tech support:

                  Stata uses the same ICU library on all platforms. String functions in Stata which use ICU use the same code on all platforms. The only differences are the compilers, OS, and hardware used to build them. ICU and string function performance can vary widely depending on the OS. However, our developers will look into whether we can improve the ICU performance on Windows.
                  -Chinh Nguyen

                  Comment

                  Working...
                  X