String Concatenation

Inez Calazans

Join Date: Dec 2023

Posts: 3
#1

22 Dec 2023, 06:24

Hello! I am working with a database that tracks health workers month by month as they move through the units they work in. Example: a worker moved to another unit, was on work leave for 4 months, and left the job in the last month. I need to combine 12 variables into one. I tried concatenating the variables:

egen t_2012= concat( T1201 T1202 T1203 T1204 T1205 T1206 T1207 T1208 T1209 T1210 T1211 T1212)
encode t_2012, generate(nt_2012)
gen per2012=.
replace per2012=0 if nt_2012==1
replace per2012=1 if nt_2012==11
replace per2012=2 if (nt_2012==2 | nt_2012==3 | nt_2012==4 | nt_2012==5 | nt_2012==6 | nt_2012==7 | nt_2012==8 | nt_2012==9 | nt_2012==10 | nt_2012==12 | nt_2012==13 | nt_2012==14 | nt_2012==15 | nt_2012==16 | nt_2012==17 | nt_2012==18)
label define per2012 0 "Absent" 1 "Present at work" 2 "Turnover"
label values per2012 per2012
label var per2012 "worker's journey 2012"

However, I need to concatenate the 12 months into a code like this: 6P/1M/4A/1S.
I need help with this.
George Ford

Join Date: Aug 2014

Posts: 3151
#2

22 Dec 2023, 08:25

You can concatenate strings with the + operator, adding separators as needed:

Code:

g date_group = strvar1 + "/" + strvar2 + "/" + strvar3
Inez Calazans

Join Date: Dec 2023

Posts: 3
#3

22 Dec 2023, 11:15

Dear Mr. Ford, I was not successful with this code.
The intention is to summarize each category with numbers and letters. For example, the second line below would become 7A/5P.

tab t_2012

      t_2012 |      Freq.     Percent        Cum.
-------------+-----------------------------------
AAAAAAAAAAAA |        641       44.58       44.58
AAAAAAAPPPPP |          2        0.14       44.71
AAAAAAPPPPPP |          1        0.07       44.78
AAAAAPPPPPPP |          4        0.28       45.06
AAAAPPPPPPPP |          1        0.07       45.13
APPPPPPPPPPP |          2        0.14       45.27
PPMPPPMPPPPP |          1        0.07       45.34
PPMPPPPMPPPP |          2        0.14       45.48
PPPPPMPPPPPP |          1        0.07       45.55
PPPPPPAPPPPP |          1        0.07       45.62
PPPPPPPPPPPP |        770       53.55       99.17
PPPPPPPPPPPS |          1        0.07       99.24
PPPPPPPSAAAA |          2        0.14       99.37
PPPPPPSAAAAA |          4        0.28       99.65
PPPPPSAAAAAA |          2        0.14       99.79
PPPSAAAAAAAA |          1        0.07       99.86
PPSAAAAAAAAA |          1        0.07       99.93
PSAAAAAAAAAA |          1        0.07      100.00
-------------+-----------------------------------
       Total |      1,438      100.00

I also tried:

split t_2012, parse ("/") generate (a2012)

...but that did not work either.

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

22 Dec 2023, 11:29

My guess is there's a more elegant way to do this in Mata, or maybe using regular expressions. But this will get the job done:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str12 var1
"AAAAAAAAAAAA"
"AAAAAAAPPPPP"
"AAAAAAPPPPPP"
"AAAAAPPPPPPP"
"AAAAPPPPPPPP"
"APPPPPPPPPPP"
"PPMPPPMPPPPP"
"PPMPPPPMPPPP"
"PPPPPMPPPPPP"
"PPPPPPAPPPPP"
"PPPPPPPPPPPP"
"PPPPPPPPPPPS"
"PPPPPPPSAAAA"
"PPPPPPSAAAAA"
"PPPPPSAAAAAA"
"PPPSAAAAAAAA"
"PPSAAAAAAAAA"
"PSAAAAAAAAAA"
end

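* Create one observation per character of each string
gen expander = strlen(var1)
gen `c(obs_t)' obs_no = _n
expand expander
sort obs_no, stable
by obs_no: gen one_char = substr(var1, _n, 1)
* Number the runs: the counter increases whenever the character changes
by obs_no: gen run = sum(one_char != one_char[_n-1])
* Record the length of each run, then keep one observation per run
by obs_no run, sort: gen char_count = _N
by obs_no run: keep if _n == 1
* Build the result run by run (e.g. A7/P5); the last observation holds the full string
by obs_no (run): gen wanted = one_char + string(char_count) if _n == 1
by obs_no (run): replace wanted = wanted[_n-1] + "/" + one_char ///
    + string(char_count) if _n > 1
by obs_no (run): keep if _n == _N
drop expander obs_no one_char run char_count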
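
For comparison, the same run counting could probably also be done without -expand-, by looping over character positions and keeping a running count. This is only a minimal sketch, assuming the string variable is called var1 as in the example above. The names rle, cur, prev, run_len and maxlen are illustrative and not part of the original data. It writes the count before the letter (7A/5P), as requested in the original question.

```stata
* Sketch only: assumes a string variable var1 such as "AAAAAAAPPPPP".
* The names rle, cur, prev, run_len and maxlen are illustrative.
gen rle = ""
gen str1 cur = ""
gen str1 prev = ""
gen run_len = 0
gen maxlen = strlen(var1)
summarize maxlen, meanonly
local last = r(max)

forvalues i = 1/`last' {
    quietly {
        replace cur = substr(var1, `i', 1)
        * When the character changes, write out the run that just ended
        replace rle = rle + cond(rle == "", "", "/") + string(run_len) + prev ///
            if cur != prev & run_len > 0
        * Start a new run, extend the current one, or reset past the end of the string
        replace run_len = cond(cur == "", 0, cond(cur == prev, run_len + 1, 1))
        replace prev = cur
    }
}
* Write out the final run
replace rle = rle + cond(rle == "", "", "/") + string(run_len) + prev if run_len > 0
drop cur prev run_len maxlen
```

Unlike the -expand- approach, this keeps the dataset at one observation per worker throughout, at the cost of one pass over the data per character position.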


Inez Calazans

Join Date: Dec 2023

Posts: 3
#5

22 Dec 2023, 12:02

Oh, thank you. I will try this approach and also try using regular expressions.
Thank you very much!

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

22 Dec 2023, 14:09

A variant splitting on spaces after inserting spaces using regex:

Code:

tempvar tosplit
tempname token  

gen `tosplit' = trim(regexreplaceall(var1, "([A-Z])\1*", "$0 "))
split `tosplit', gen(`token')

gen new = ""
 
foreach v of varlist `token'?* {

    replace new = new ///
        + "`sep'" ///
        + substr(`v',1,1) + strofreal(strlen(`v')) ///
        if !mi(`v')
        
        local sep = "/"
}


Andrew Musau

Join Date: Oct 2014
Posts: 10192

23 Dec 2023, 03:59

Also, you could identify the sequences then count the length of the runs.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str12 var1
"AAAAAAAAAAAA"
"AAAAAAAPPPPP"
"AAAAAAPPPPPP"
"AAAAAPPPPPPP"
"AAAAPPPPPPPP"
"APPPPPPPPPPP"
"PPMPPPMPPPPP"
"PPMPPPPMPPPP"
"PPPPPMPPPPPP"
"PPPPPPAPPPPP"
"PPPPPPPPPPPP"
"PPPPPPPPPPPS"
"PPPPPPPSAAAA"
"PPPPPPSAAAAA"
"PPPPPSAAAAAA"
"PPPSAAAAAAAA"
"PPSAAAAAAAAA"
"PSAAAAAAAAAA"
end

gen seq= ustrregexra(var1, "(.)\1+", "$1")
gen wanted=""
gen residual= var1
forval i= 1/`=strlen(var1[1])'{
    qui gen run= cond(`i'==strlen(seq), residual, substr(residual, strpos(residual, substr(seq, `i', 1)), strpos(residual, substr(seq,`i'+1, 1))- strpos(residual, substr(seq, `i', 1))))
    qui replace wanted= wanted + substr(seq, `i', 1)+ string(strlen(run)) + cond(`i'<strlen(seq), "/", "") if !missing(residual)
    qui replace residual= subinstr(residual, run, "", 1)
    drop run
}

Results:

Code:

. l, sep(0)

     +--------------------------------------------------+
     |         var1     seq           wanted   residual |
     |--------------------------------------------------|
  1. | AAAAAAAAAAAA       A              A12            |
  2. | AAAAAAAPPPPP      AP            A7/P5            |
  3. | AAAAAAPPPPPP      AP            A6/P6            |
  4. | AAAAAPPPPPPP      AP            A5/P7            |
  5. | AAAAPPPPPPPP      AP            A4/P8            |
  6. | APPPPPPPPPPP      AP           A1/P11            |
  7. | PPMPPPMPPPPP   PMPMP   P2/M1/P3/M1/P5            |
  8. | PPMPPPPMPPPP   PMPMP   P2/M1/P4/M1/P4            |
  9. | PPPPPMPPPPPP     PMP         P5/M1/P6            |
 10. | PPPPPPAPPPPP     PAP         P6/A1/P5            |
 11. | PPPPPPPPPPPP       P              P12            |
 12. | PPPPPPPPPPPS      PS           P11/S1            |
 13. | PPPPPPPSAAAA     PSA         P7/S1/A4            |
 14. | PPPPPPSAAAAA     PSA         P6/S1/A5            |
 15. | PPPPPSAAAAAA     PSA         P5/S1/A6            |
 16. | PPPSAAAAAAAA     PSA         P3/S1/A8            |
 17. | PPSAAAAAAAAA     PSA         P2/S1/A9            |
 18. | PSAAAAAAAAAA     PSA        P1/S1/A10            |
     +--------------------------------------------------+


Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

24 Dec 2023, 06:15

Code:

gen res = ""
gen tokens = regexreplaceall(var1, "([A-Z])\1*", "$0 ")
gen ntokens = wordcount(tokens)
qui su ntokens, meanonly

forvalues token = 1/`r(max)' {

        replace res = res ///
                     + "`sep'" ///
                     + strofreal(strlen(word(tokens, `token'))) ///
                     + substr(word(tokens, `token'), 1, 1) ///
                     if (`token' <= ntokens)

        local sep /
}

drop tokens ntokens

Last edited by Bjarte Aagnes; 24 Dec 2023, 06:28.
