Concatenating Strings when collapsing data

Milan Quentel

Join Date: Nov 2016

Posts: 52
#1

Concatenating Strings when collapsing data

20 Nov 2016, 04:14

I would like to collapse variables by an ID. For numeric variables everything is working fine. For string variables I realized that there is no option to concatenate strings. So I am looking for some kind of work around (possibly including bysort and egen) before collapsing variables.

This is how my data looks like.
ID Text

1 AD AR

1 BD KL AD

2 AD SJ

2 FD WE RS

And I would like to have:
ID Text Concat

1 AD AR AD AR BD KL AD

1 BD KL AD AD AR BD KL AD

2 AD SJ AD SJ FD WE RS

2 FD WE RS AD SJ FD WE RS

So that when collapsing by ID, I can simply use the first concat value.

Even better would be if repeating values in Text would not be added to Concat, so that for ID 1 Concat would be "AD AR BD KL" and not "AD AR BD KL AD".

How do I do this in Stata?

Many thanks,
Milan
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 34719

20 Nov 2016, 05:23

Please use dataex (SSC) and CODE delimiters as requested in the Statalist FAQ Advice #12 for data examples.

You need to concatenate before you can collapse.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str8 text
1 "AD AR"  
1 "BD KL AD"
2 "AD SJ"  
2 "FD WE RS"
end

sort id, stable
by id : gen alltext = text[1]
by id : replace alltext = alltext[_n-1] + " " + text if _n > 1
by id : replace alltext = alltext[_N]


list, sepby(id)

     +--------------------------------+
     | id       text          alltext |
     |--------------------------------|
  1. |  1      AD AR   AD AR BD KL AD |
  2. |  1   BD KL AD   AD AR BD KL AD |
     |--------------------------------|
  3. |  2      AD SJ   AD SJ FD WE RS |
  4. |  2   FD WE RS   AD SJ FD WE RS |
     +--------------------------------+

Removing duplicates seems harder work. This strategy should work, but I would to like to see a better one.

Code:

save mydata
keep id text
split text
drop text
bysort id : gen which = _n
reshape long text , i(id which)
drop if missing(text)
bysort id text (_j) : keep if _n == 1

after which the code is as before, capped by a merge with the main data.

Comment

Milan Quentel

Join Date: Nov 2016

Posts: 52
#3

21 Nov 2016, 02:55

Thank you, the first part works beautifully.

For removing the duplicates, I thought about something like

Code:

for i in "AD" "AR" "BD" "KL" "SJ" { k = (strlen(alltext)-strlen(subinstr(alltext,"`i'","",1)))/strlen("`i'") replace text=substr(alltext, "`i'", "", k-1) }

i.e. calculate how often "AD" occurs in text and then replace text with a substring were the first k-1 instances of "AD" are removed and then run the loop for all individual strings.

However, I do not know if you can use variables (like k in this case) in stata in this way..
Comment

Nick Cox

Join Date: Mar 2014
Posts: 34719

21 Nov 2016, 04:36

That might be the start of a strategy but I can't see yet that it is better than my own (!), not least because you have to spell out all the possible elements. Perhaps you just did that, but a touch more generality in the code is a good idea when possible.

What you sketch is some distance from legal syntax, as you'd need to generate a variable explicitly.

You have another problem, not mentioned earlier, in that Stata will regard "foo bar" and "bar foo" as different strings, but presumably you regard them as identical.

But my code works on that too.

This spells out what my previous left tacit:

Code:

. * sandbox 
. clear

. input byte id str8 text something 

           id       text  something
  1. 1 "AD AR"           42
  2. 1 "BD KL AD"       666   
  3. 2 "AD SJ"          789 
  4. 2 "FD WE RS"        64 
  5. end

. save milan, replace 
file milan.dta saved

. * row concatenation of text
. * -- remove duplicates too 
. * -- sort alphabetically too 
 
. keep id text

. split text
variables created as string: 
text1  text2  text3

. drop text

. bysort id : gen which = _n

. reshape long text , i(id which)
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        4   ->      12
Number of variables                   5   ->       4
j variable (3 values)                     ->   _j
xij variables:
                      text1 text2 text3   ->   text
-----------------------------------------------------------------------------

. drop if missing(text)
(2 observations deleted)

. bysort id text (_j) : keep if _n == 1
(1 observation deleted)

. by id : gen alltext = text[1] 

. by id : replace alltext = alltext[_n-1] + " " + text if _n > 1 
variable alltext was str11 now str14
(7 real changes made)

. by id : keep if _n == _N
(7 observations deleted)

. drop text 

. rename alltext text  

. * merge with original data
. merge 1:m id using milan

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 4  (_merge==3)
    -----------------------------------------

. sort id 

. list, sepby(id)

     +-----------------------------------------------------------+
     | id   which   _j             text   someth~g        _merge |
     |-----------------------------------------------------------|
  1. |  1       2    2      AD AR BD KL        666   matched (3) |
  2. |  1       2    2      AD AR BD KL         42   matched (3) |
     |-----------------------------------------------------------|
  3. |  2       2    2   AD FD RS SJ WE         64   matched (3) |
  4. |  2       2    2   AD FD RS SJ WE        789   matched (3) |
     +-----------------------------------------------------------+

. 
. * now we can collapse 
. collapse something, by(id text) 

. 
. list, sepby(id) 

     +--------------------------------+
     | id             text   someth~g |
     |--------------------------------|
  1. |  1      AD AR BD KL        354 |
     |--------------------------------|
  2. |  2   AD FD RS SJ WE      426.5 |
     +--------------------------------+

Here is the code in one to make it easier to run. If you want to see more of what is happening step by step, insert more list statements.

Code:

* sandbox 
clear

input byte id str8 text something 
1 "AD AR"           42
1 "BD KL AD"       666   
2 "AD SJ"          789 
2 "FD WE RS"        64 
end

save milan, replace 

* row concatenation of text
* -- remove duplicates too 
* -- sort alphabetically too 

keep id text
split text
drop text
bysort id : gen which = _n
reshape long text , i(id which)
drop if missing(text)
bysort id text (_j) : keep if _n == 1
by id : gen alltext = text[1] 
by id : replace alltext = alltext[_n-1] + " " + text if _n > 1 
by id : keep if _n == _N
drop text 
rename alltext text  

* merge with original data
merge 1:m id using milan
sort id 
list, sepby(id)

* now we can collapse 
collapse something, by(id text) 

list, sepby(id)

Comment

Milan Quentel

Join Date: Nov 2016

Posts: 52
#5

22 Nov 2016, 03:07

You're code is more general, I agree. At first I felt a bit reluctant to reshape the data (the actual data is a bit more complex, more variables etc.), but now it is working.

This being said, I did not put up my idea because I felt it was better (it is not even working yet, as you rightly pointed out) but because you were looking for alternative ideas/strategies. Possibly, the idea of approaching the problem in this way might help someone who later finds this thread and is interested in a similar problem.

Thank you also for the details. Makes it a lot easier for me to study the code and understand and appreciate what's going on.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 34719
#6

22 Nov 2016, 03:13

That's all fine by me. Note that the reshape is just a device to get the string elements concatenated, and the reshape long is in essence reversed.

There probably should be a concatenation command that works on blocks of observations, and I might write one if no-one else does.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 34719

22 Nov 2016, 05:39

Here's a start. It just works for Milan's example, although some generalisation is possible, e.g. to strings with internal spaces.

It will not be fast for large datasets.

Code:

*! 1.0.0 NJC 22 Nov 2016 
program blockconcat, sortpreserve 
    version 11.2 
    syntax varname(string) [if] [in] , by(varname) Generate(str) ///
    [ Distinct Sort ] 

    local g `generate' 

    quietly { 
        marksample touse, novarlist 
        count if `touse' 
        if r(N) == 0 error 2000 

        confirm new var `generate' 

        sort `touse' `by', stable 
        by `touse' `by': gen `g' = `varlist'[1] if `touse' 

        by `touse' `by': replace `g' = ///
        `g'[_n-1] + " " + `varlist' if _n > 1 & `touse' 
        by `touse' `by': replace `g' = ///
        trim(itrim(`g'[_N]))  
    

        if "`distinct'`sort'" != "" { 
            by `touse' `by': replace `touse' = -(`touse' & _n == 1) 
            sort `touse' 
            count if `touse'

            forval j = 1/`r(N)' { 
                local work = `g'[`j'] 
                if "`distinct'" != "" { 
                    local work : list uniq work 
                }
                if "`sort'" != "" { 
                    local work : list sort work 
                } 
                replace `g' = "`work'" in `j' 
            } 
            
            bysort `by' (`touse') : replace `g' = `g'[1] 
        } 
    } 
end

Here's the example above done this way:

Code:

* sandbox 
clear

input byte id str8 text something 
1 "AD AR"           42
1 "BD KL AD"       666   
2 "AD SJ"          789 
2 "FD WE RS"        64 
end 

blockconcat text , by(id) generate(alltext2) 

blockconcat text , by(id) sort distinct generate(alltext3) 

list 

     +------------------------------------------------------------+
     | id       text   someth~g         alltext2         alltext3 |
     |------------------------------------------------------------|
  1. |  1      AD AR         42   AD AR BD KL AD      AD AR BD KL |
  2. |  1   BD KL AD        666   AD AR BD KL AD      AD AR BD KL |
  3. |  2      AD SJ        789   AD SJ FD WE RS   AD FD RS SJ WE |
  4. |  2   FD WE RS         64   AD SJ FD WE RS   AD FD RS SJ WE |
     +------------------------------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014
Posts: 34719

22 Nov 2016, 07:41

With one tweak to the code and one preparatory step, this could be extended so that (for example) string values "some stuff" and "yet more" get concatenated as "some stuff" "yet more".

This is posted here largely so anyone interested has a record and some code to improve on if their problem is different. Separators other than spaces are harder, I guess, without a different approach, but earlier posts in this thread remain pertinent.

Revised program:

Code:

*! 1.0.1 NJC 22 Nov 2016
*! 1.0.0 NJC 22 Nov 2016
program blockconcat, sortpreserve
    version 11.2
    syntax varname(string) [if] [in] , by(varname) Generate(str) ///
    [ Distinct Sort ]

    local g `generate'

    quietly {
        marksample touse, novarlist
        count if `touse'
        if r(N) == 0 error 2000

        confirm new var `generate'

        sort `touse' `by', stable
        by `touse' `by': gen `g' = `varlist'[1] if `touse'
        by `touse' `by': replace `g' = ///
        `g'[_n-1] + " " + `varlist' if _n > 1 & `touse'
        by `touse' `by': replace `g' = trim(itrim(`g'[_N]))  
    
        if "`distinct'`sort'" != "" {
            by `touse' `by': replace `touse' = -(`touse' & _n == 1)
            sort `touse'
            count if `touse'

            forval j = 1/`r(N)' {
                local work = `g'[`j']
                if "`distinct'" != "" {
                    local work : list uniq work
                }
                if "`sort'" != "" {
                    local work : list sort work
                }
                replace `g' = `"`work'"' in `j'
            }
            
            bysort `by' (`touse') : replace `g' = `g'[1]
        }
    }
end

Test script:

Code:

* sandbox
clear

input byte id str8 text something
1 "AD AR"           42
1 "BD KL AD"       666  
2 "AD SJ"          789
2 "FD WE RS"        64
end

blockconcat text , by(id) generate(alltext2)

blockconcat text , by(id) sort distinct generate(alltext3)

list

sysuse auto, clear

egen tag = tag(rep78)

blockconcat make, by(rep78) gen(allmake)
l allmake if tag

gen make2 = `"""' + make + `"""'

blockconcat make2, by(rep78) gen(allmake2)
l allmake2 if tag

blockconcat make2, by(rep78) gen(allmake2_2) sort distinct
l allmake2_2 if tag

Last edited by Nick Cox; 22 Nov 2016, 07:50.

Comment

Straso Jovanovski

Join Date: Feb 2016

Posts: 70
#9

28 May 2018, 03:07

I was only wondering what the code would look like if one wanted to concatenate a string variable by another id variable, but before that remove duplicates - not only parts of a string variable separated by spaces like in the case above, but if entire contents of the string variable repeated.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 34719

#10

28 May 2018, 03:24

This works, but I think you need to concatenate before you can (easily) find duplicates. I didn't try to rewrite the code. I just wrote fresh code. I see that independently I chose a similar approach.

Code:

clear
input ID str42 Text
1    "AD AR"
1    "BD KL AD"
2    "AD SJ"
2    "FD WE RS"
end 

sort ID, stable 
by ID : gen AllText = Text[1] 
by ID : replace AllText = AllText[_n-1] + " " + Text if _n > 1 
by ID : replace AllText = AllText[_N] 

* insist on identifiers 1 up, even though that is true for example 
egen NewID = group(ID) 
su NewID, meanonly 
gen long ObsNo = _n 

quietly forval i = 1/`r(max)' { 
    su ObsNo if NewID == `i', meanonly 
    local this = AllText[r(min)] 
    local this : list uniq this 
    local this : list sort this 
    replace AllText = "`this'" if NewID == `i' 
} 

tabdisp ID, c(AllText) 

--------------------------
       ID |        AllText
----------+---------------
        1 |    AD AR BD KL
        2 | AD FD RS SJ WE
--------------------------

Comment

Straso Jovanovski

Join Date: Feb 2016

Posts: 70
#11

28 May 2018, 17:59

I appreciate, only how would the code look like when the duplication was over the entire contents of variable values. Something like with the following for example:

Code:

clear input byte ID str4 Text 1 "ABCD" 1 "ABCD" 2 "ABCD" 2 "EFGH" 3 "IJKL" 3 "MNOP" end
Comment
Nick Cox

Join Date: Mar 2014

Posts: 34719
#12

28 May 2018, 18:02

Sorry, but you'll have to explain to me what results you want there, unless the answer is levelsof
Comment

Straso Jovanovski

Join Date: Feb 2016
Posts: 70

#13

28 May 2018, 18:14

For example:

Code:

. clear

. input byte ID str4 Text

           ID       Text
  1. 1    "ABCD"
  2. 1    "ABCD"
  3. 2    "ABCD"
  4. 2    "EFGH"
  5. 3    "IJKL"
  6. 3    "MNOP"
  7. end 

. 
. sort ID, stable 

. by ID : gen AllText = Text[1] 

. by ID : replace AllText = AllText[_n-1] + " " + Text if _n > 1 
(3 real changes made)

. by ID : replace AllText = AllText[_N]
(3 real changes made)

. 
. collapse (firstnm) AllText, by(ID)

. 
.  tabdisp ID, c(AllText) 

-----------------------------
       ID | (firstnm) AllText
----------+------------------
        1 |         ABCD ABCD
        2 |         ABCD EFGH
        3 |         IJKL MNOP
-----------------------------

I would like not to have the extra "ABCD" repeated for the first group above (ID==1).

Comment

Nick Cox

Join Date: Mar 2014

Posts: 34719
#14

28 May 2018, 18:23

Doesn't #10 already do this?

Last edited by Nick Cox; 28 May 2018, 18:26.
Comment
Philipp Poppitz

Join Date: Jan 2016

Posts: 5
#15

21 Nov 2018, 09:22

Thank you Nick for the program, which worked like charm for me. However, I would suggest to replace the option by(varname) with by(varlist) in case blocks are not identified by a single id variable. Of course on could generate such a variable, but this way it would be more comparable to the collapse command.
Comment

ID	Text	Concat
1	AD AR	AD AR BD KL AD
1	BD KL AD	AD AR BD KL AD
2	AD SJ	AD SJ FD WE RS
2	FD WE RS	AD SJ FD WE RS

Announcement