Generate a variable that combines string-observations based on two conditions (date constraint by group)

Patrick Figge

Join Date: May 2014

Posts: 6
#1

24 Jul 2015, 05:52

Hi everyone,

I am struggling with a list of persons, each uniquely identified by the variable "name". The persons are observed at different points in time ("date_str" / "date"). I want to generate a new variable that includes all values of the variable "classes" for a given person within the past 5 years. Example data: see below.

Code:

clear
input str25 name str15 date_str str15 classes
"Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
"Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
"Lastname 1, First name 1" "May 3, 2006" "C12Q"
"Lastname 1, First name 1" "July 8, 2009" "C08K"
"Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
"Lastname 2, First name 2" "May 20, 2003" "F22B"
"Lastname 2, First name 2" "April 2, 2007" "G01N"
end
gen date = date(date_str, "MDY")
order name date_str date classes

For instance, in line 4 the new variable would have the value "C08K C12Q B65D G01N". In more general terms: I am trying to combine string-observations based on two conditions: (1) same "name" and (2) "date" has to be within the past 5 years of the focal date.

This might be related to a previous post: www.statalist.org/forums/forum/general-stata-discussion/general/1295115-how-to-summarize-multiple-observations-per-id. However, I am struggling to adapt it: it is not enough to look at the previous line. The evaluation of dates has to consider the whole group defined by "name". [Repeated classes - e.g., "F22B H04Q F22B" - are not an issue: I can discard them afterwards.]

I'm thankful for any help or suggestions!
Patrick

Nick Cox

Join Date: Mar 2014
Posts: 35414
#2

24 Jul 2015, 06:16

Well posed problem!

Consider this:

Code:

 
clear
input str25 name str15 date_str str15 classes
"Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
"Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
"Lastname 1, First name 1" "May 3, 2006" "C12Q"
"Lastname 1, First name 1" "July 8, 2009" "C08K"
"Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
"Lastname 2, First name 2" "May 20, 2003" "F22B"
"Lastname 2, First name 2" "April 2, 2007" "G01N"
end
gen date = date(date_str, "MDY")
order name date_str date classes

gen wc = wordcount(classes) 
expand wc 
sort name date 
bysort name date : gen count = _n 
gen class = word(classes, count) 
by name (date) : drop if (date[_N] - date) > (5 * 365.25) 
bysort name class : drop if _n > 1 
by name : gen wanted = class[1] if _n == 1 
by name : replace wanted = wanted[_n-1] + " " + class if _n > 1 
by name : keep if _n == _N 
list 



     +-------------------------------------------------------------------+
  1. |                     name |      date_str |  date |   classes | wc |
     | Lastname 1, First name 1 | July 15, 2004 | 16267 | B65D G01N |  2 |
     |-------------------------------------------------------------------|
     |      count      |      class      |                   wanted      |
     |          2      |       G01N      |      B65D C08K C12Q G01N      |
     +-------------------------------------------------------------------+

     +-------------------------------------------------------------------+
  2. |                     name |      date_str |  date |   classes | wc |
     | Lastname 2, First name 2 | April 2, 2007 | 17258 |      G01N |  1 |
     |-------------------------------------------------------------------|
     |      count      |      class      |                   wanted      |
     |          1      |       G01N      |                F22B G01N      |
     +-------------------------------------------------------------------+


Patrick Figge

Join Date: May 2014

Posts: 6
#3

24 Jul 2015, 06:55

Thank you a lot for the quick reply: That's already a big step forward!

There are two things that separate this solution from "making my day":

1. I do not want to change the original data structure: I want the new variable ("wanted") to have a value for each observation, i.e. for all 7 lines of the example (the original value of "classes" plus any value from other observations/lines that satisfies the two conditions "same name" and "within past 5 years").

2. The result for "Lastname 1, First name 1" is not correct: the classes "C12Q" and "C08K" are from 2006 and 2009, i.e. after the focal date July 15, 2004. "Lastname 2, First name 2" is correct (wanted contains "F22B" from 2003 and "G01N" from 2007, not "F16J B06B H04R" from 1999).

Nick Cox

Join Date: Mar 2014

Posts: 35414
#4

24 Jul 2015, 07:30

1. Just merge back with the original.

2. Sorry, but I have no idea of what the focal date is. I can't see an explanation either in #1 or in #3.

I guessed at the last date and within 5 years previous to that date.

How is the programmer supposed to know that July 15, 2004 is focal for the first person???
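
One way to read "merge back with the original" in point 1, as a minimal sketch rather than Nick's own code: build the per-person result from #2 inside preserve/restore, save it to a temporary file, and merge it onto the untouched data. The tempfile name `wanted_by_name' is made up for illustration, and the sketch keeps the guess from #2 that the reference date is each person's last date, so every row of a person receives the same value.

```stata
clear
input str25 name str15 date_str str15 classes
"Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
"Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
"Lastname 1, First name 1" "May 3, 2006" "C12Q"
"Lastname 1, First name 1" "July 8, 2009" "C08K"
"Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
"Lastname 2, First name 2" "May 20, 2003" "F22B"
"Lastname 2, First name 2" "April 2, 2007" "G01N"
end
gen date = date(date_str, "MDY")

* hypothetical tempfile holding one row per person
tempfile wanted_by_name

preserve
    * one row per class per observation
    gen wc = wordcount(classes)
    expand wc
    bysort name date : gen count = _n
    gen class = word(classes, count)
    * keep classes dated within 5 years of each person's last date
    bysort name (date) : drop if (date[_N] - date) > (5 * 365.25)
    * drop repeated classes, then concatenate within person
    bysort name class : drop if _n > 1
    by name : gen wanted = class if _n == 1
    by name : replace wanted = wanted[_n-1] + " " + class if _n > 1
    by name : keep if _n == _N
    keep name wanted
    save `wanted_by_name'
restore

* attach the per-person result to every original observation
merge m:1 name using `wanted_by_name', nogenerate
list name date_str wanted, sepby(name)
```

The original seven observations are left intact; only the new variable "wanted" is added.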
Patrick Figge

Join Date: May 2014

Posts: 6
#5

24 Jul 2015, 07:45

Ah, OK, (2.) was a misunderstanding. I was confused by the dates in your listing: for person 1 it shows "July 15, 2004" rather than the last date, which is the one you evaluated at. So the solution is actually exactly what I want!

With regard to (1.):
I am still not sure how I can get a result for every date, not only for the last date per person. I should have made that clearer in my description. The variable should have a value in every line. For instance, for the second-last date of person 1 (focal date: May 3, 2006), the result would list the classes from 2006, 2004 and 2003: "C12Q B65D G01N F22B H04Q F04C".

Thank you a lot for your repeated replies.
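
The per-row requirement described above can be sketched as a self-join. This is a different technique from the ones posted in this thread, shown only as an illustration: each observation is paired with every observation of the same person via joinby, and pairs are kept when the other date is the focal date itself or falls in the 5 years before it. The names `others', date2, classes2 and negdate2 are made up for the example, and the sketch assumes that name and date jointly identify observations.

```stata
clear
input str25 name str15 date_str str15 classes
"Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
"Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
"Lastname 1, First name 1" "May 3, 2006" "C12Q"
"Lastname 1, First name 1" "July 8, 2009" "C08K"
"Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
"Lastname 2, First name 2" "May 20, 2003" "F22B"
"Lastname 2, First name 2" "April 2, 2007" "G01N"
end
gen date = date(date_str, "MDY")
format %td date

* assumption: one observation per person and date
isid name date

* copy of the data with renamed variables (hypothetical names)
tempfile others
preserve
    keep name date classes
    rename (date classes) (date2 classes2)
    save `others'
restore

* pair every focal row with every row of the same person
joinby name using `others'

* keep the focal row itself and rows from the 5 years before it
keep if inrange(date - date2, 0, 5 * 365.25)

* newest first within each focal row, then concatenate
gen negdate2 = -date2
bysort name date (negdate2) : gen wanted = classes2 if _n == 1
bysort name date (negdate2) : replace wanted = wanted[_n-1] + " " + classes2 if _n > 1
bysort name date (negdate2) : keep if _n == _N
drop date2 classes2 negdate2

list name date_str wanted, sepby(name)
```

For the example data this gives "C08K C12Q B65D G01N" for July 8, 2009 and "C12Q B65D G01N F22B H04Q F04C" for May 3, 2006, the two values stated in #1 and #5. The join creates one row per pair of observations within a person, so memory use grows quickly when persons have many observations.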

Roberto Ferrer

Join Date: Apr 2014
Posts: 449
#6

24 Jul 2015, 07:45

This is not efficient, so it might or might not work for you depending on the dataset size. It loops through all observations. I added example observations.

Code:

clear
set more off

*----- example data -----

input str25 name str15 date_str str15 classes
"Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
"Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
"Lastname 1, First name 1" "May 3, 2006" "C12Q"
"Lastname 1, First name 1" "July 8, 2009" "C08K"
"Lastname 1, First name 1" "Sep 13, 2018" "X56P"
"Lastname 1, First name 1" "Sep 19, 2019" "X39P"
"Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
"Lastname 2, First name 2" "May 20, 2003" "F22B"
"Lastname 2, First name 2" "April 2, 2007" "G01N"
end
gen date = date(date_str, "MDY")
format %td date
drop date_str
order name date classes

list

*----- what you want -----

gen near = .
gen wanted = ""

// check identifier and sort
isid name date, sort
quietly forvalues i = 1/`=_N' {

    // variable to hold concatenations
    gen allclass = classes
    
    // observations within the 5-year range
    replace near = inrange(date[`i'] - date, 1, `=5*365.25') ///
        & name[`i'] == name
    
    // concatenate classes
    replace allclass = ///
        trim(classes + " " + allclass[_n-1]) if near[_n-1]
    
    // desired variable
    replace wanted = allclass in `i'
    
    // drop for next loop
    drop allclass

}

Last edited by Roberto Ferrer; 24 Jul 2015, 07:52.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.


Patrick Figge

Join Date: May 2014

Posts: 6
#7

24 Jul 2015, 07:52

Thank you both a lot. Solution works like a charm!

Robert Picard

Join Date: Mar 2014
Posts: 1536
#8

24 Jul 2015, 09:12

Here's another approach derived from the one I suggested recently here

Code:

clear
input str24 name int date str14 classes
"Lastname 1, First name 1" 15872 "F22B H04Q F04C" 
"Lastname 1, First name 1" 16267 "B65D G01N" 
"Lastname 1, First name 1" 16924 "C12Q" 
"Lastname 1, First name 1" 18086 "C08K" 
"Lastname 2, First name 2" 14339 "F16J B06B H04R" 
"Lastname 2, First name 2" 15845 "F22B" 
"Lastname 2, First name 2" 17258 "G01N" 
end
format %td date

sort name date classes
gen s = classes
local more 1
local i 0
local window = 365 * 5
while `more' {
    local `++i'
    by name: replace s = s + " " + classes[_n-`i'] ///
        if (date - date[_n-`i']) < `window'
    qui count if (date - date[_n-`i']) < `window'
    local more = r(N)
}


Patrick Figge

Join Date: May 2014

Posts: 6
#9

24 Jul 2015, 09:30

Thank you! This solution works as well.
I'm going through both pieces of code right now to fully understand them.

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

24 Jul 2015, 10:13

I worked too fast to adapt my example and introduced an error in the condition that ends the loop. Here's a revised version that I think takes care of the problem.

Code:

clear
input str24 name int date str14 classes
"Lastname 1, First name 1" 15872 "F22B H04Q F04C" 
"Lastname 1, First name 1" 16267 "B65D G01N" 
"Lastname 1, First name 1" 16924 "C12Q" 
"Lastname 1, First name 1" 18086 "C08K" 
"Lastname 2, First name 2" 14339 "F16J B06B H04R" 
"Lastname 2, First name 2" 15845 "F22B" 
"Lastname 2, First name 2" 17258 "G01N" 
end
format %td date

sort name date classes
gen s = classes
local more 1
local i 0
local window = 365 * 5
while `more' {
    local `++i'
    by name: gen doit = (date - date[_n-`i']) < `window'
    by name: replace s = s + " " + classes[_n-`i'] if doit
    count if doit
    local more = r(N)
    drop doit
}
