Vertical 'concat' of string observations

Victoria Sevcenko

Join Date: May 2015

Posts: 2
#1

Vertical 'concat' of string observations

22 May 2015, 06:21

Hi,

I am using Stata/SE 13.0 on Windows.

I am struggling with vertically combining string variables into a single string (combining several ‘observations’ into a single one). I have text data that I had to import into stata in a way that puts each line of text as a separate observation. This is both useful for isolating certain lines of text that I need to isolate (so I do not want to merge all text into a single observation before importing into stata), but also bad because I do have to now combine some of the lines back together.

I am trying to combine them using the following command (and a loop around it, as I will describe shortly):

gen var2 = var1[_n]+var1[_n+1]

This works fine, but the problem is that the number of lines of text that need to be combined vary unpredictably. I want to combine all lines of text until an empty line is encountered, after which I want to start over, as there are many such lines. In other words, my data looks like var1 and I want to create a new variable, var2, that will look as follows, without knowing how many lines will need to be added in each case (here I first need to combine 3 lines and then 4 lines):

Var 1 Var 2
Some text describing Some text describing something but only until some point
something but only until
some point

Then again some text Then again some text Describing something But this time it runs until Line four
Describing something
But this time it runs until
Line four

Is there a way to do this with a loop? E.g. ‘carry out gen var2 = var1[_n]+var1[_n+1]+…+var1[_n+X] if observations 1-X are non-empty and observation X+1 is empty’.

I think I need a loop with a stopping rule, but I do not know how to write one that will do exactly this.

Thank you so much for all your help in advance!!

All the best,
Victoria
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

22 May 2015, 06:34

This resembles a question asked yesterday by someone else: http://www.statalist.org/forums/foru...vations-per-id

The twist is doing it in blocks, which start either at the start or after a blank line. It's easy to mark the start of blocks and from that define blocks as such:

Code:

gen block = missing(var1[_n-1]) replace block = sum(block)

Then it's the same problem, more or less:

Code:

sort block, stable by block : gen var2 = var1 if _n == 1 by block : replace var2 = var2[_n-1] + var1 if _n > 1

Last edited by Nick Cox; 22 May 2015, 07:30.
Comment
Victoria Sevcenko

Join Date: May 2015

Posts: 2
#3

22 May 2015, 07:09

Thank you so much Nick! This is exactly what I needed!

Best,
Victoria
Comment
Tom Konzack

Join Date: Jul 2018

Posts: 16
#4

20 Jul 2018, 07:24

Version: Stata/SE 12.0

Dear Statalist,

if I may I would like to post a seemingly related question. I have been unable to implement the aforementioned solution to my case.
I have a dataset with approx. 3,200 groups. Each group has an ID and one primary name. Some groups have one or multiple name alias’. The primary name is marked by variable ‘primary’=1, the alias names by ‘primary’ = 0. ‘orgname’ is a string variable. The current dataset looks similar to the following:
orgname ID primary

A 23 1

B 23 0

C 23 0

D 23 0

E 24 0

F 24 0

G 24 1

H 25 0

I 25 1

This is how it should look like:
orgname ID orgalias

A 23 B, C, D

G 24 E, F

I 25 H

I have read all I could find, including ‘concat’ and ‘bysort id: replace…’, but it has been unsuccessful.

I would very much appreciate your advice. If you need more information, I am happy to provide them.

Thanks, Tom.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

20 Jul 2018, 07:43

I can't see any unsuccessful code here to explain what you're doing wrong. But I would call this a tricky problem. Here is some technique.

Code:

clear
input str1 orgname    ID    primary
A    23    1
B    23    0
C    23    0
D    23    0
E    24    0
F    24    0
G    24    1
H    25    0
I    25    1
end

replace primary = -primary 
sort ID primary orgname 
by ID : gen orgalias = orgname[2] if _n == 1 
by ID : replace orgalias = orgalias[_n-1] + ", " + orgname[_n+1] if inrange(_n, 2, _N-1) 
by ID : replace orgalias = orgalias[_N-1] 
by ID : keep if _n == 1 
drop primary 
list  

     +-------------------------+
     | orgname   ID   orgalias |
     |-------------------------|
  1. |       A   23    B, C, D |
  2. |       G   24       E, F |
  3. |       I   25          H |
     +-------------------------+

Comment

Tom Konzack

Join Date: Jul 2018

Posts: 16
#6

20 Jul 2018, 11:12

Dear Nick, thanks so much for the quick reply. This is exactly the solution to my problem, which I was unable to solve. Best, Tom.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

08 Jun 2020, 04:06

Now written up at https://www.stata-journal.com/articl...article=pr0071
Comment

Announcement

Vertical 'concat' of string observations

Comment

Comment

Comment

Comment

Comment

Comment