Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Concatenating Strings when collapsing data

    I would like to collapse variables by an ID. For numeric variables everything is working fine. For string variables I realized that there is no option to concatenate strings. So I am looking for some kind of work around (possibly including bysort and egen) before collapsing variables.

    This is how my data looks like.
    ID Text
    1 AD AR
    1 BD KL AD
    2 AD SJ
    2 FD WE RS

    And I would like to have:
    ID Text Concat
    1 AD AR AD AR BD KL AD
    1 BD KL AD AD AR BD KL AD
    2 AD SJ AD SJ FD WE RS
    2 FD WE RS AD SJ FD WE RS

    So that when collapsing by ID, I can simply use the first concat value.

    Even better would be if repeating values in Text would not be added to Concat, so that for ID 1 Concat would be "AD AR BD KL" and not "AD AR BD KL AD".

    How do I do this in Stata?

    Many thanks,
    Milan

  • #2
    Please use dataex (SSC) and CODE delimiters as requested in the Statalist FAQ Advice #12 for data examples.

    You need to concatenate before you can collapse.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str8 text
    1 "AD AR"  
    1 "BD KL AD"
    2 "AD SJ"  
    2 "FD WE RS"
    end
    
    sort id, stable
    by id : gen alltext = text[1]
    by id : replace alltext = alltext[_n-1] + " " + text if _n > 1
    by id : replace alltext = alltext[_N]
    
    
    list, sepby(id)
    
         +--------------------------------+
         | id       text          alltext |
         |--------------------------------|
      1. |  1      AD AR   AD AR BD KL AD |
      2. |  1   BD KL AD   AD AR BD KL AD |
         |--------------------------------|
      3. |  2      AD SJ   AD SJ FD WE RS |
      4. |  2   FD WE RS   AD SJ FD WE RS |
         +--------------------------------+
    Removing duplicates seems harder work. This strategy should work, but I would to like to see a better one.

    Code:
    save mydata
    keep id text
    split text
    drop text
    bysort id : gen which = _n
    reshape long text , i(id which)
    drop if missing(text)
    bysort id text (_j) : keep if _n == 1
    after which the code is as before, capped by a merge with the main data.

    Comment


    • #3
      Thank you, the first part works beautifully.

      For removing the duplicates, I thought about something like

      Code:
      for i in "AD" "AR" "BD" "KL" "SJ" {
         k = (strlen(alltext)-strlen(subinstr(alltext,"`i'","",1)))/strlen("`i'")
         replace text=substr(alltext, "`i'", "", k-1)
      }
      i.e. calculate how often "AD" occurs in text and then replace text with a substring were the first k-1 instances of "AD" are removed and then run the loop for all individual strings.

      However, I do not know if you can use variables (like k in this case) in stata in this way..

      Comment


      • #4
        That might be the start of a strategy but I can't see yet that it is better than my own (!), not least because you have to spell out all the possible elements. Perhaps you just did that, but a touch more generality in the code is a good idea when possible.

        What you sketch is some distance from legal syntax, as you'd need to generate a variable explicitly.

        You have another problem, not mentioned earlier, in that Stata will regard "foo bar" and "bar foo" as different strings, but presumably you regard them as identical.

        But my code works on that too.

        This spells out what my previous left tacit:

        Code:
        . * sandbox 
        . clear
        
        . input byte id str8 text something 
        
                   id       text  something
          1. 1 "AD AR"           42
          2. 1 "BD KL AD"       666   
          3. 2 "AD SJ"          789 
          4. 2 "FD WE RS"        64 
          5. end
        
        . save milan, replace 
        file milan.dta saved
        
        . * row concatenation of text
        . * -- remove duplicates too 
        . * -- sort alphabetically too 
         
        . keep id text
        
        . split text
        variables created as string: 
        text1  text2  text3
        
        . drop text
        
        . bysort id : gen which = _n
        
        . reshape long text , i(id which)
        (note: j = 1 2 3)
        
        Data                               wide   ->   long
        -----------------------------------------------------------------------------
        Number of obs.                        4   ->      12
        Number of variables                   5   ->       4
        j variable (3 values)                     ->   _j
        xij variables:
                              text1 text2 text3   ->   text
        -----------------------------------------------------------------------------
        
        . drop if missing(text)
        (2 observations deleted)
        
        . bysort id text (_j) : keep if _n == 1
        (1 observation deleted)
        
        . by id : gen alltext = text[1] 
        
        . by id : replace alltext = alltext[_n-1] + " " + text if _n > 1 
        variable alltext was str11 now str14
        (7 real changes made)
        
        . by id : keep if _n == _N
        (7 observations deleted)
        
        . drop text 
        
        . rename alltext text  
        
        . * merge with original data
        . merge 1:m id using milan
        
            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                                 4  (_merge==3)
            -----------------------------------------
        
        . sort id 
        
        . list, sepby(id)
        
             +-----------------------------------------------------------+
             | id   which   _j             text   someth~g        _merge |
             |-----------------------------------------------------------|
          1. |  1       2    2      AD AR BD KL        666   matched (3) |
          2. |  1       2    2      AD AR BD KL         42   matched (3) |
             |-----------------------------------------------------------|
          3. |  2       2    2   AD FD RS SJ WE         64   matched (3) |
          4. |  2       2    2   AD FD RS SJ WE        789   matched (3) |
             +-----------------------------------------------------------+
        
        . 
        . * now we can collapse 
        . collapse something, by(id text) 
        
        . 
        . list, sepby(id) 
        
             +--------------------------------+
             | id             text   someth~g |
             |--------------------------------|
          1. |  1      AD AR BD KL        354 |
             |--------------------------------|
          2. |  2   AD FD RS SJ WE      426.5 |
             +--------------------------------+
        Here is the code in one to make it easier to run. If you want to see more of what is happening step by step, insert more list statements.

        Code:
        * sandbox 
        clear
        
        input byte id str8 text something 
        1 "AD AR"           42
        1 "BD KL AD"       666   
        2 "AD SJ"          789 
        2 "FD WE RS"        64 
        end
        
        save milan, replace 
        
        * row concatenation of text
        * -- remove duplicates too 
        * -- sort alphabetically too 
        
        keep id text
        split text
        drop text
        bysort id : gen which = _n
        reshape long text , i(id which)
        drop if missing(text)
        bysort id text (_j) : keep if _n == 1
        by id : gen alltext = text[1] 
        by id : replace alltext = alltext[_n-1] + " " + text if _n > 1 
        by id : keep if _n == _N
        drop text 
        rename alltext text  
        
        * merge with original data
        merge 1:m id using milan
        sort id 
        list, sepby(id)
        
        * now we can collapse 
        collapse something, by(id text) 
        
        list, sepby(id)

        Comment


        • #5
          You're code is more general, I agree. At first I felt a bit reluctant to reshape the data (the actual data is a bit more complex, more variables etc.), but now it is working.

          This being said, I did not put up my idea because I felt it was better (it is not even working yet, as you rightly pointed out) but because you were looking for alternative ideas/strategies. Possibly, the idea of approaching the problem in this way might help someone who later finds this thread and is interested in a similar problem.

          Thank you also for the details. Makes it a lot easier for me to study the code and understand and appreciate what's going on.

          Comment


          • #6
            That's all fine by me. Note that the reshape is just a device to get the string elements concatenated, and the reshape long is in essence reversed.

            There probably should be a concatenation command that works on blocks of observations, and I might write one if no-one else does.

            Comment


            • #7
              Here's a start. It just works for Milan's example, although some generalisation is possible, e.g. to strings with internal spaces.

              It will not be fast for large datasets.


              Code:
              *! 1.0.0 NJC 22 Nov 2016 
              program blockconcat, sortpreserve 
                  version 11.2 
                  syntax varname(string) [if] [in] , by(varname) Generate(str) ///
                  [ Distinct Sort ] 
              
                  local g `generate' 
              
                  quietly { 
                      marksample touse, novarlist 
                      count if `touse' 
                      if r(N) == 0 error 2000 
              
                      confirm new var `generate' 
              
                      sort `touse' `by', stable 
                      by `touse' `by': gen `g' = `varlist'[1] if `touse' 
              
                      by `touse' `by': replace `g' = ///
                      `g'[_n-1] + " " + `varlist' if _n > 1 & `touse' 
                      by `touse' `by': replace `g' = ///
                      trim(itrim(`g'[_N]))  
                  
              
                      if "`distinct'`sort'" != "" { 
                          by `touse' `by': replace `touse' = -(`touse' & _n == 1) 
                          sort `touse' 
                          count if `touse'
              
                          forval j = 1/`r(N)' { 
                              local work = `g'[`j'] 
                              if "`distinct'" != "" { 
                                  local work : list uniq work 
                              }
                              if "`sort'" != "" { 
                                  local work : list sort work 
                              } 
                              replace `g' = "`work'" in `j' 
                          } 
                          
                          bysort `by' (`touse') : replace `g' = `g'[1] 
                      } 
                  } 
              end
              Here's the example above done this way:

              Code:
              * sandbox 
              clear
              
              input byte id str8 text something 
              1 "AD AR"           42
              1 "BD KL AD"       666   
              2 "AD SJ"          789 
              2 "FD WE RS"        64 
              end 
              
              blockconcat text , by(id) generate(alltext2) 
              
              blockconcat text , by(id) sort distinct generate(alltext3) 
              
              list 
              
                   +------------------------------------------------------------+
                   | id       text   someth~g         alltext2         alltext3 |
                   |------------------------------------------------------------|
                1. |  1      AD AR         42   AD AR BD KL AD      AD AR BD KL |
                2. |  1   BD KL AD        666   AD AR BD KL AD      AD AR BD KL |
                3. |  2      AD SJ        789   AD SJ FD WE RS   AD FD RS SJ WE |
                4. |  2   FD WE RS         64   AD SJ FD WE RS   AD FD RS SJ WE |
                   +------------------------------------------------------------+

              Comment


              • #8
                With one tweak to the code and one preparatory step, this could be extended so that (for example) string values "some stuff" and "yet more" get concatenated as "some stuff" "yet more".

                This is posted here largely so anyone interested has a record and some code to improve on if their problem is different. Separators other than spaces are harder, I guess, without a different approach, but earlier posts in this thread remain pertinent.

                Revised program:


                Code:
                *! 1.0.1 NJC 22 Nov 2016
                *! 1.0.0 NJC 22 Nov 2016
                program blockconcat, sortpreserve
                    version 11.2
                    syntax varname(string) [if] [in] , by(varname) Generate(str) ///
                    [ Distinct Sort ]
                
                    local g `generate'
                
                    quietly {
                        marksample touse, novarlist
                        count if `touse'
                        if r(N) == 0 error 2000
                
                        confirm new var `generate'
                
                        sort `touse' `by', stable
                        by `touse' `by': gen `g' = `varlist'[1] if `touse'
                        by `touse' `by': replace `g' = ///
                        `g'[_n-1] + " " + `varlist' if _n > 1 & `touse'
                        by `touse' `by': replace `g' = trim(itrim(`g'[_N]))  
                    
                        if "`distinct'`sort'" != "" {
                            by `touse' `by': replace `touse' = -(`touse' & _n == 1)
                            sort `touse'
                            count if `touse'
                
                            forval j = 1/`r(N)' {
                                local work = `g'[`j']
                                if "`distinct'" != "" {
                                    local work : list uniq work
                                }
                                if "`sort'" != "" {
                                    local work : list sort work
                                }
                                replace `g' = `"`work'"' in `j'
                            }
                            
                            bysort `by' (`touse') : replace `g' = `g'[1]
                        }
                    }
                end


                Test script:

                Code:
                * sandbox
                clear
                
                input byte id str8 text something
                1 "AD AR"           42
                1 "BD KL AD"       666  
                2 "AD SJ"          789
                2 "FD WE RS"        64
                end
                
                blockconcat text , by(id) generate(alltext2)
                
                blockconcat text , by(id) sort distinct generate(alltext3)
                
                list
                
                sysuse auto, clear
                
                egen tag = tag(rep78)
                
                blockconcat make, by(rep78) gen(allmake)
                l allmake if tag
                
                gen make2 = `"""' + make + `"""'
                
                blockconcat make2, by(rep78) gen(allmake2)
                l allmake2 if tag
                
                blockconcat make2, by(rep78) gen(allmake2_2) sort distinct
                l allmake2_2 if tag
                Last edited by Nick Cox; 22 Nov 2016, 07:50.

                Comment


                • #9
                  I was only wondering what the code would look like if one wanted to concatenate a string variable by another id variable, but before that remove duplicates - not only parts of a string variable separated by spaces like in the case above, but if entire contents of the string variable repeated.

                  Comment


                  • #10
                    This works, but I think you need to concatenate before you can (easily) find duplicates. I didn't try to rewrite the code. I just wrote fresh code. I see that independently I chose a similar approach.

                    Code:
                    clear
                    input ID str42 Text
                    1    "AD AR"
                    1    "BD KL AD"
                    2    "AD SJ"
                    2    "FD WE RS"
                    end 
                    
                    sort ID, stable 
                    by ID : gen AllText = Text[1] 
                    by ID : replace AllText = AllText[_n-1] + " " + Text if _n > 1 
                    by ID : replace AllText = AllText[_N] 
                    
                    * insist on identifiers 1 up, even though that is true for example 
                    egen NewID = group(ID) 
                    su NewID, meanonly 
                    gen long ObsNo = _n 
                    
                    quietly forval i = 1/`r(max)' { 
                        su ObsNo if NewID == `i', meanonly 
                        local this = AllText[r(min)] 
                        local this : list uniq this 
                        local this : list sort this 
                        replace AllText = "`this'" if NewID == `i' 
                    } 
                    
                    tabdisp ID, c(AllText) 
                    
                    --------------------------
                           ID |        AllText
                    ----------+---------------
                            1 |    AD AR BD KL
                            2 | AD FD RS SJ WE
                    --------------------------

                    Comment


                    • #11
                      I appreciate, only how would the code look like when the duplication was over the entire contents of variable values. Something like with the following for example:
                      Code:
                      clear
                      input byte ID str4 Text
                      1    "ABCD"
                      1    "ABCD"
                      2    "ABCD"
                      2    "EFGH"
                      3    "IJKL"
                      3    "MNOP"
                      end

                      Comment


                      • #12
                        Sorry, but you'll have to explain to me what results you want there, unless the answer is levelsof

                        Comment


                        • #13
                          For example:
                          Code:
                          . clear
                          
                          . input byte ID str4 Text
                          
                                     ID       Text
                            1. 1    "ABCD"
                            2. 1    "ABCD"
                            3. 2    "ABCD"
                            4. 2    "EFGH"
                            5. 3    "IJKL"
                            6. 3    "MNOP"
                            7. end 
                          
                          . 
                          . sort ID, stable 
                          
                          . by ID : gen AllText = Text[1] 
                          
                          . by ID : replace AllText = AllText[_n-1] + " " + Text if _n > 1 
                          (3 real changes made)
                          
                          . by ID : replace AllText = AllText[_N]
                          (3 real changes made)
                          
                          . 
                          . collapse (firstnm) AllText, by(ID)
                          
                          . 
                          .  tabdisp ID, c(AllText) 
                          
                          -----------------------------
                                 ID | (firstnm) AllText
                          ----------+------------------
                                  1 |         ABCD ABCD
                                  2 |         ABCD EFGH
                                  3 |         IJKL MNOP
                          -----------------------------
                          I would like not to have the extra "ABCD" repeated for the first group above (ID==1).

                          Comment


                          • #14
                            Doesn't #10 already do this?
                            Last edited by Nick Cox; 28 May 2018, 18:26.

                            Comment


                            • #15
                              Thank you Nick for the program, which worked like charm for me. However, I would suggest to replace the option by(varname) with by(varlist) in case blocks are not identified by a single id variable. Of course on could generate such a variable, but this way it would be more comparable to the collapse command.

                              Comment

                              Working...
                              X