Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Vertical 'concat' of string observations

    Hi,

    I am using Stata/SE 13.0 on Windows.

    I am struggling with vertically combining string variables into a single string (combining several ‘observations’ into a single one). I have text data that I had to import into stata in a way that puts each line of text as a separate observation. This is both useful for isolating certain lines of text that I need to isolate (so I do not want to merge all text into a single observation before importing into stata), but also bad because I do have to now combine some of the lines back together.

    I am trying to combine them using the following command (and a loop around it, as I will describe shortly):

    gen var2 = var1[_n]+var1[_n+1]

    This works fine, but the problem is that the number of lines of text that need to be combined vary unpredictably. I want to combine all lines of text until an empty line is encountered, after which I want to start over, as there are many such lines. In other words, my data looks like var1 and I want to create a new variable, var2, that will look as follows, without knowing how many lines will need to be added in each case (here I first need to combine 3 lines and then 4 lines):

    Var 1 Var 2
    Some text describing Some text describing something but only until some point
    something but only until
    some point

    Then again some text Then again some text Describing something But this time it runs until Line four
    Describing something
    But this time it runs until
    Line four


    Is there a way to do this with a loop? E.g. ‘carry out gen var2 = var1[_n]+var1[_n+1]+…+var1[_n+X] if observations 1-X are non-empty and observation X+1 is empty’.

    I think I need a loop with a stopping rule, but I do not know how to write one that will do exactly this.

    Thank you so much for all your help in advance!!

    All the best,
    Victoria

  • #2
    This resembles a question asked yesterday by someone else: http://www.statalist.org/forums/foru...vations-per-id

    The twist is doing it in blocks, which start either at the start or after a blank line. It's easy to mark the start of blocks and from that define blocks as such:

    Code:
    gen block = missing(var1[_n-1])
    replace block = sum(block)
    Then it's the same problem, more or less:

    Code:
    sort block, stable
    by block : gen var2 = var1 if _n == 1
    by block : replace var2 = var2[_n-1] + var1 if _n > 1



    Last edited by Nick Cox; 22 May 2015, 08:30.

    Comment


    • #3
      Thank you so much Nick! This is exactly what I needed!

      Best,
      Victoria

      Comment


      • #4
        Version: Stata/SE 12.0

        Dear Statalist,

        if I may I would like to post a seemingly related question. I have been unable to implement the aforementioned solution to my case.
        I have a dataset with approx. 3,200 groups. Each group has an ID and one primary name. Some groups have one or multiple name alias’. The primary name is marked by variable ‘primary’=1, the alias names by ‘primary’ = 0. ‘orgname’ is a string variable. The current dataset looks similar to the following:
        orgname ID primary
        A 23 1
        B 23 0
        C 23 0
        D 23 0
        E 24 0
        F 24 0
        G 24 1
        H 25 0
        I 25 1
        This is how it should look like:
        orgname ID orgalias
        A 23 B, C, D
        G 24 E, F
        I 25 H
        I have read all I could find, including ‘concat’ and ‘bysort id: replace…’, but it has been unsuccessful.

        I would very much appreciate your advice. If you need more information, I am happy to provide them.

        Thanks, Tom.

        Comment


        • #5
          I can't see any unsuccessful code here to explain what you're doing wrong. But I would call this a tricky problem. Here is some technique.


          Code:
          clear
          input str1 orgname    ID    primary
          A    23    1
          B    23    0
          C    23    0
          D    23    0
          E    24    0
          F    24    0
          G    24    1
          H    25    0
          I    25    1
          end
          
          replace primary = -primary 
          sort ID primary orgname 
          by ID : gen orgalias = orgname[2] if _n == 1 
          by ID : replace orgalias = orgalias[_n-1] + ", " + orgname[_n+1] if inrange(_n, 2, _N-1) 
          by ID : replace orgalias = orgalias[_N-1] 
          by ID : keep if _n == 1 
          drop primary 
          list  
          
               +-------------------------+
               | orgname   ID   orgalias |
               |-------------------------|
            1. |       A   23    B, C, D |
            2. |       G   24       E, F |
            3. |       I   25          H |
               +-------------------------+

          Comment


          • #6
            Dear Nick, thanks so much for the quick reply. This is exactly the solution to my problem, which I was unable to solve. Best, Tom.

            Comment


            • #7
              Now written up at https://www.stata-journal.com/articl...article=pr0071

              Comment

              Working...
              X