  • String Concatenation

    Hello! I am working with a database that tracks health workers month by month as they move through the units they work in. For example, a worker may move to another unit, be on work leave for 4 months, and leave the job in the last month. I need to combine 12 monthly variables into one. I tried concatenating the variables:

    egen t_2012= concat( T1201 T1202 T1203 T1204 T1205 T1206 T1207 T1208 T1209 T1210 T1211 T1212)
    encode t_2012, generate(nt_2012)
    gen per2012=.
    replace per2012=0 if nt_2012==1
    replace per2012=1 if nt_2012==11
    replace per2012=2 if (nt_2012==2 | nt_2012==3 | nt_2012==4 | nt_2012==5 | nt_2012==6 | nt_2012==7 | nt_2012==8 | nt_2012==9 | nt_2012==10 | nt_2012==12 | nt_2012==13 | nt_2012==14 | nt_2012==15 | nt_2012==16 | nt_2012==17 | nt_2012==18)
    label define per2012 0 "Absent" 1 "Present at work" 2 "Turnover"
    label values per2012 per2012
    label var per2012 "worker's journey 2012"

    However, I need to summarize the 12 months as a single run-length code, such as 6P/1M/4A/1S.
    I need help.
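
    One caution about the recode above: encode numbers the categories in alphabetical order of the strings, so which pattern gets code 1 or code 11 depends on the patterns present in the data. A minimal sketch that classifies on the concatenated string itself, assuming (as the tabulation in #3 suggests) that code 1 is twelve months of A and code 11 is twelve months of P; the name per2012_alt is made up for illustration:

    Code:
    * Sketch only: classify from the string rather than from encode's codes
    gen byte per2012_alt = 2 if !missing(t_2012)
    replace per2012_alt = 0 if t_2012 == "AAAAAAAAAAAA"
    replace per2012_alt = 1 if t_2012 == "PPPPPPPPPPPP"
    label define per2012_alt 0 "Absent" 1 "Present at work" 2 "Turnover"
    label values per2012_alt per2012_alt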

  • #2
    You can concatenate strings with +, adding separator characters between them:

    Code:
    g date_group = strvar1 + "/" + strvar2 + "/" + strvar3
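
    Applied to the variables in #1, the same idea might look like the sketch below. It assumes T1201 to T1212 are string variables holding one letter each, which the thread does not state outright; the new variable names are illustrative. Note that this only puts a separator between months (P/P/P/...); it does not count runs.

    Code:
    * Sketch only: join the first three months by hand
    gen t_first3 = T1201 + "/" + T1202 + "/" + T1203

    * egen's concat() takes punct() to place a separator between all pieces
    egen t_2012_sep = concat(T1201 T1202 T1203 T1204 T1205 T1206 ///
        T1207 T1208 T1209 T1210 T1211 T1212), punct("/")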


    • #3
      Dear Mr. Ford, I was not successful with this code.
      The intention is to summarize the categories with numbers and letters. For example, the second row below would become 7A/5P.

      tab t_2012

            t_2012 |      Freq.     Percent        Cum.
      -------------+-----------------------------------
      AAAAAAAAAAAA |        641       44.58       44.58
      AAAAAAAPPPPP |          2        0.14       44.71
      AAAAAAPPPPPP |          1        0.07       44.78
      AAAAAPPPPPPP |          4        0.28       45.06
      AAAAPPPPPPPP |          1        0.07       45.13
      APPPPPPPPPPP |          2        0.14       45.27
      PPMPPPMPPPPP |          1        0.07       45.34
      PPMPPPPMPPPP |          2        0.14       45.48
      PPPPPMPPPPPP |          1        0.07       45.55
      PPPPPPAPPPPP |          1        0.07       45.62
      PPPPPPPPPPPP |        770       53.55       99.17
      PPPPPPPPPPPS |          1        0.07       99.24
      PPPPPPPSAAAA |          2        0.14       99.37
      PPPPPPSAAAAA |          4        0.28       99.65
      PPPPPSAAAAAA |          2        0.14       99.79
      PPPSAAAAAAAA |          1        0.07       99.86
      PPSAAAAAAAAA |          1        0.07       99.93
      PSAAAAAAAAAA |          1        0.07      100.00
      -------------+-----------------------------------
             Total |      1,438      100.00


      I also tried:

      split t_2012, parse ("/") generate (a2012)

      ...but that did not work either.


      • #4
        My guess is there's a more elegant way to do this in Mata, or maybe using regular expressions. But this will get the job done:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str12 var1
        "AAAAAAAAAAAA"
        "AAAAAAAPPPPP"
        "AAAAAAPPPPPP"
        "AAAAAPPPPPPP"
        "AAAAPPPPPPPP"
        "APPPPPPPPPPP"
        "PPMPPPMPPPPP"
        "PPMPPPPMPPPP"
        "PPPPPMPPPPPP"
        "PPPPPPAPPPPP"
        "PPPPPPPPPPPP"
        "PPPPPPPPPPPS"
        "PPPPPPPSAAAA"
        "PPPPPPSAAAAA"
        "PPPPPSAAAAAA"
        "PPPSAAAAAAAA"
        "PPSAAAAAAAAA"
        "PSAAAAAAAAAA"
        end
        
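        * One observation per character of each string
        gen expander = strlen(var1)
        gen `c(obs_t)' obs_no = _n
        expand expander
        sort obs_no, stable
        by obs_no: gen one_char = substr(var1, _n, 1)
        * Number the runs: the counter advances whenever the character changes
        by obs_no: gen run = sum(one_char != one_char[_n-1])
        * Keep one observation per run, carrying the run length
        by obs_no run, sort: gen char_count = _N
        by obs_no run: keep if _n == 1
        * Chain the runs together as letter + count, separated by "/"
        by obs_no (run): gen wanted = one_char + string(char_count) if _n == 1
        by obs_no (run): replace wanted = wanted[_n-1] + "/" + one_char ///
            + string(char_count) if _n > 1
        * Return to one observation per original string
        by obs_no (run): keep if _n == _N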
        drop expander obs_no one_char run char_count
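
        The Mata route mentioned above could look like the following minimal sketch. The function names rle() and rle_var() and the new variable wanted_mata are illustrative assumptions, not part of the thread. The loop walks each string once and closes a run whenever the character changes.

        Code:
        mata:
        // rle() and rle_var() are hypothetical helper names, not built-in functions
        string scalar rle(string scalar s)
        {
            string scalar out, ch
            real scalar i, runlen

            if (strlen(s) == 0) return("")
            out = ""
            ch = substr(s, 1, 1)
            runlen = 1
            for (i = 2; i <= strlen(s); i++) {
                if (substr(s, i, 1) == ch) {
                    runlen++
                }
                else {
                    // close the current run and start a new one
                    out = out + ch + strofreal(runlen) + "/"
                    ch = substr(s, i, 1)
                    runlen = 1
                }
            }
            return(out + ch + strofreal(runlen))
        }

        void rle_var(string scalar src, string scalar dest)
        {
            string colvector x, y
            real scalar i

            x = st_sdata(., src)
            y = J(rows(x), 1, "")
            for (i = 1; i <= rows(x); i++) y[i] = rle(x[i])
            // create a str# variable just wide enough, then fill it
            (void) st_addvar(max((1 \ strlen(y))), dest)
            st_sstore(., dest, y)
        }
        end

        * Apply to the example data; the variable wanted_mata must not already exist
        mata: rle_var("var1", "wanted_mata")

        For "AAAAAAAPPPPP" this yields A7/P5, the same letter-then-count layout as wanted above, and the data stay at one observation per worker throughout.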


        • #5
          Oh, thank you. I will try this way and also try using REGEX
          Thank you very much!


          • #6
            A variant splitting on spaces after inserting spaces using regex:
            Code:
            tempvar tosplit
            tempname token  
            
            gen `tosplit' = trim(regexreplaceall(var1, "([A-Z])\1*", "$0 "))
            split `tosplit', gen(`token')
            
            gen new = ""
             
            foreach v of varlist `token'?* {
            
                * the local sep is still empty on the first pass, so no leading "/" is added
                replace new = new ///
                    + "`sep'" ///
                    + substr(`v',1,1) + strofreal(strlen(`v')) ///
                    if !mi(`v')
                    
                local sep = "/"
            }


            • #7
              Also, you could identify the sequences then count the length of the runs.

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str12 var1
              "AAAAAAAAAAAA"
              "AAAAAAAPPPPP"
              "AAAAAAPPPPPP"
              "AAAAAPPPPPPP"
              "AAAAPPPPPPPP"
              "APPPPPPPPPPP"
              "PPMPPPMPPPPP"
              "PPMPPPPMPPPP"
              "PPPPPMPPPPPP"
              "PPPPPPAPPPPP"
              "PPPPPPPPPPPP"
              "PPPPPPPPPPPS"
              "PPPPPPPSAAAA"
              "PPPPPPSAAAAA"
              "PPPPPSAAAAAA"
              "PPPSAAAAAAAA"
              "PPSAAAAAAAAA"
              "PSAAAAAAAAAA"
              end
              
              gen seq= ustrregexra(var1, "(.)\1+", "$1")
              gen wanted=""
              gen residual= var1
              forval i= 1/`=strlen(var1[1])'{
                  qui gen run= cond(`i'==strlen(seq), residual, substr(residual, strpos(residual, substr(seq, `i', 1)), strpos(residual, substr(seq,`i'+1, 1))- strpos(residual, substr(seq, `i', 1))))
                  qui replace wanted= wanted + substr(seq, `i', 1)+ string(strlen(run)) + cond(`i'<strlen(seq), "/", "") if !missing(residual)
                  qui replace residual= subinstr(residual, run, "", 1)
                  drop run
              }
              Result:


              Code:
              . l, sep(0)
              
                   +--------------------------------------------------+
                   |         var1     seq           wanted   residual |
                   |--------------------------------------------------|
                1. | AAAAAAAAAAAA       A              A12            |
                2. | AAAAAAAPPPPP      AP            A7/P5            |
                3. | AAAAAAPPPPPP      AP            A6/P6            |
                4. | AAAAAPPPPPPP      AP            A5/P7            |
                5. | AAAAPPPPPPPP      AP            A4/P8            |
                6. | APPPPPPPPPPP      AP           A1/P11            |
                7. | PPMPPPMPPPPP   PMPMP   P2/M1/P3/M1/P5            |
                8. | PPMPPPPMPPPP   PMPMP   P2/M1/P4/M1/P4            |
                9. | PPPPPMPPPPPP     PMP         P5/M1/P6            |
               10. | PPPPPPAPPPPP     PAP         P6/A1/P5            |
               11. | PPPPPPPPPPPP       P              P12            |
               12. | PPPPPPPPPPPS      PS           P11/S1            |
               13. | PPPPPPPSAAAA     PSA         P7/S1/A4            |
               14. | PPPPPPSAAAAA     PSA         P6/S1/A5            |
               15. | PPPPPSAAAAAA     PSA         P5/S1/A6            |
               16. | PPPSAAAAAAAA     PSA         P3/S1/A8            |
               17. | PPSAAAAAAAAA     PSA         P2/S1/A9            |
               18. | PSAAAAAAAAAA     PSA        P1/S1/A10            |
                   +--------------------------------------------------+
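
              As a quick sanity check on the listing above, the run lengths in each code should add up to the number of months in the original string. A minimal sketch, assuming the variables wanted and var1 from the code above and that no existing variable name starts with piece (a stub chosen here for illustration):

              Code:
              * Sketch only: split the code on "/" and add up the run lengths
              split wanted, parse("/") generate(piece)
              gen total = 0
              foreach v of varlist piece* {
                  * characters from position 2 onward hold the run length
                  qui replace total = total + real(substr(`v', 2, .)) if `v' != ""
              }
              * Stops with an error if any row's runs do not cover the whole string
              assert total == strlen(var1)
              drop piece* total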


              • #8
                Code:
                gen res = ""
                gen tokens = regexreplaceall(var1, "([A-Z])\1*", "$0 ")
                gen ntokens = wordcount(tokens)
                qui su ntokens, meanonly
                
                forvalues token = 1/`r(max)' {
                
                        replace res = res ///
                                     + "`sep'" ///
                                     + strofreal(strlen(word(tokens, `token'))) ///
                                     + substr(word(tokens, `token'), 1, 1) ///
                                     if (`token' <= ntokens)
                
                        local sep /
                }
                
                drop tokens ntokens
                Last edited by Bjarte Aagnes; 24 Dec 2023, 06:28.
