Creating row ids indicating when value of a given variable is equal to value of another variable in *any other row*

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#1

Creating row ids indicating when value of a given variable is equal to value of another variable in *any other row*

08 Jan 2022, 04:38

I have a dataset with the start and end date of different events. A major event may constitute of different events happening in sequence (if the end date of an event = start date of another event, then they belong to the same major event). I am trying to create a code that identifies major events.

I have created a toy example of a dataset with two variables (var1, var2), where var1 refers to the start date and var2 to the end date in my real data.

Code:

input var1 var2 1 10 5 9 6 11 10 15 16 17 end list

I would like to give each row an id which grows in ascending order of var1 and, most importantly, is kept constant if the value of var1 coincides with var2 of *any other row*.
The output I would be looking for is the following:

Code:

input var1 var2 new_var 1 10 1 10 15 1 5 9 2 6 11 3 16 17 4 end list

Many thanks!
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 9948

08 Jan 2022, 07:49

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(var1 var2)
 1 10
 5  9
 6 11
10 15
16 17
end

gen id=_n
expand 2, g(new)
gen var=cond(new, var2, var1)
bys var(id): replace id= id[1]
keep if !new
egen wanted = group(id)
keep var1 var2 wanted

Res.:

Code:

. sort wanted var1

. l

     +----------------------+
     | var1   var2   wanted |
     |----------------------|
  1. |    1     10        1 |
  2. |   10     15        1 |
  3. |    5      9        2 |
  4. |    6     11        3 |
  5. |   16     17        4 |
     +----------------------+

Last edited by Andrew Musau; 08 Jan 2022, 07:57.

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#3

08 Jan 2022, 10:34

This is awesome, Andrew Musau ! Many thanks.

There is just a small issue: the code would output the same value for variable -wanted- in case of rows that have the same value of -var1-. I would like the variable -wanted- to have the same value *only if* -var1- coincides with the value of -var2- in any of the rows.

Consider the example below, where I added an extra row to the previous example (with the same value of -var1- as the previous row):

Code:

clear input var1 var2 1 10 5 9 6 11 10 15 16 17 16 18 end sort var1 list

Then, the code would output:

Code:

. list +----------------------+ | var1 var2 wanted | |----------------------| 1. | 1 10 1 | 2. | 10 15 1 | 3. | 5 9 2 | 4. | 6 11 3 | 5. | 16 18 4 | |----------------------| 6. | 16 17 4 | +----------------------+

I would be looking to get 5 as the value for -wanted- in the last row.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9948

08 Jan 2022, 13:05

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(var1 var2)
 1 10
 5  9
 6 11
10 15
16 17
16 18
end

gen id=_n
expand 2, g(new)
gen var=cond(new, var2, var1)
bys var (new): gen in1=_N>=2 & new[1]==new[_N]
bys var (id): replace id= cond(in1, id, id[1])
keep if !new
egen wanted = group(id)
keep var1 var2 wanted

Res.:

Code:

. sort wanted var1

. l, sepby(wanted)

     +----------------------+
     | var1   var2   wanted |
     |----------------------|
  1. |    1     10        1 |
  2. |   10     15        1 |
     |----------------------|
  3. |    5      9        2 |
     |----------------------|
  4. |    6     11        3 |
     |----------------------|
  5. |   16     17        4 |
     |----------------------|
  6. |   16     18        5 |
     +----------------------+

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

08 Jan 2022, 14:51

Thanks a lot, Andrew Musau.

It seems to be an issue now in case the repeated value of var1 across rows also shows up in var2. Take this example below (where I replaced 15 with 16 in the previous example)

Code:

clear
input var1 var2
1 10
5 9
6 11
10 16
16 17
16 18
end

By running the previous code we would get the following:

Code:

     +----------------------+
     | var1   var2   wanted |
     |----------------------|
  1. |    1     10        1 |
  2. |   10     16        1 |
     |----------------------|
  3. |    5      9        2 |
     |----------------------|
  4. |    6     11        3 |
     |----------------------|
  5. |   16     17        4 |
  6. |   16     18        4 |
     +----------------------+

Again we would have the last two observations with the same value for -wanted-

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9948

08 Jan 2022, 15:03

Before modifying the code, as 16 is present in var2, your expectation is

Code:

     +----------------------+
     | var1   var2   wanted |
     |----------------------|
  1. |    1     10        1 |
  2. |   10     16        1 |
  3. |   16     17        1 |  
  4. |   16     18        1 |
     |----------------------|
  5. |    5      9        2 |
     |----------------------|
  6. |    6     11        3 |
     +----------------------+

? If not, explain the logic of considering 10 and not 16 in this example.

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#7

08 Jan 2022, 15:20

You are absolutely right, Andrew Musau. My expectation is exactely what you flagged - thanks for catching this.

I will also need to add another layer of complexity to the code. So far we considered the case where we give the same identifier if var1 in a given row is equal to var2 in any other row. A next step which I will need to implement will be to extend the code to also consider the same identifier in cases where var1 in a given row is equal to var2 + 1 in any other row. I thought this would be relatively easy to implement once we have the first part of the code, but now I am a little less confident. Would you have any thoughts on this?
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9948

09 Jan 2022, 01:55

Install group_id from SSC by Robert Picard.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(var1 var2)
 1 10
 5  9
 6 11
10 16
16 17
16 18
end

gen id=_n
expand 2, g(new)
gen var=cond(new, var2, var1)
group_id id, matchby(var)
keep if !new
keep id var1 var2

Res.:

Code:

. sort id var1

. l, sepby(id)

     +---------------------+
     | var1   var2      id |
     |---------------------|
  1. |    1     10       1 |
  2. |   10     16       1 |
  3. |   16     17       1 |
  4. |   16     18       1 |
     |---------------------|
  5. |    5      9       2 |
     |---------------------|
  6. |    6     11       3 |
     +---------------------+

next step which I will need to implement will be to extend the code to also consider the same identifier in cases where var1 in a given row is equal to var2 + 1 in any other row. I thought this would be relatively easy to implement once we have the first part of the code, but now I am a little less confident. Would you have any thoughts on this?

I cannot follow this. Provide a data example, creating a wanted variable.

Last edited by Andrew Musau; 09 Jan 2022, 01:57.

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#9

09 Jan 2022, 05:00

Many thanks Andrew Musau, this is very helpful!

Because I needed to provide further explanation as well as another data example + desired output, I thought it would be good practice to create another post (link below). I would be really grateful if you could have a look. Please let me know if further clarification is needed.

https://www.statalist.org/forums/for...-any-other-row
Comment
Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#10

09 Jan 2022, 16:20

Andrew Musau I am sorry - I have just realised that the code in #8 does not work when the same variable (var1 or var2) coincides across rows.

The code seems to be outputting the same id variable if the value of var1 of a given row coincides with values of var2 in any other row *OR* if values of var1 coincide across rows *OR* if values of var2 coincide across rows (same value for the same variable across rows). I would be looking to get the same id *only* for the first case (var1 in a given row coinciding with var2 in any other row), but *not* if values of the same variable coincide across rows in case the former condition does not hold.

Take the example below.

Code:

clear input float(var1 var2) 2 7 7 9 7 11 6 8 6 9 10 11 end

I would be looking to get:

Code:

clear input float(var1 var2 id) 2 7 1 7 9 1 7 11 1 6 8 2 6 9 3 10 11 4 end

The code in #8 outputs id=1 to every single row.

PS: I would then be looking to expand the problem with two extra layers of complexity explained here: https://www.statalist.org/forums/for...-any-other-row
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9948

#11

10 Jan 2022, 01:06

You just need to select a sample based on your criterion and run group_id on this. Then append to the rest of the dataset.

Code:

clear
input float(var1 var2)
2 7
7 9
7 11
6 8
6 9
10 11
end

levelsof var1, local(v1)
levelsof var2, local(v2)
local in12: list v1& v2
local in12: subinstr local in12 " " ",", all
gen sample= inlist(var1, `in12') | inlist(var2, `in12')
gen id=_n
expand 2, g(new)
gen var=cond(new, var2, var1)
preserve
keep if sample
group_id id, matchby(var)
tempfile sample
save `sample'
restore
drop if sample
append using `sample'
keep if !new
keep var1 var2 id 
sort id var1
replace id= sum(id!=id[_n-1])
l, sepby(id)

Res.:

Code:

. 
. l, sepby(id)

     +------------------+
     | var1   var2   id |
     |------------------|
  1. |    2      7    1 |
  2. |    7      9    1 |
  3. |    7     11    1 |
     |------------------|
  4. |    6      8    2 |
     |------------------|
  5. |    6      9    3 |
     |------------------|
  6. |   10     11    4 |
     +------------------+

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#12

10 Jan 2022, 11:36

Thanks Andrew Musau !!!

I am sorry for this long exchange - I now realised that `group_id` changes the ids in cascade mode in a way that we do not wish here. Take the example below. We should aim for id=1 for the first 4 observations and id=2 for the last two observations.

Code:

clear input float(var1 var2) 2 7 7 9 7 11 11 12 6 8 8 9 end list, sep(0)

Because of the first observation (var1= 2 & var2 = 7) which has originally id=1, all rows in the sample where either variable (var1 or var2) is equal to 2 or 7 receives id=1. This includes observation where var1= 7 & var2 = 9. All ok up to now, the problem is that this rationale follows as a cascade. Now all observations where either variable equals 9 is also given id=1. This includes the observation where var1 = 8 and var = 9. [This should be given id=2 and not id=1 in our desired output.] In a sequence, now all observations that have 8 in either variable also gets id=1. This includes observation where var1=6 and var2=8. [Again, this should be given id=2 instead of id=1.]
1 like
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9948
#13

11 Jan 2022, 02:58

I have to think about this as you still need to do the complicated matching done by group_id at the same time introduce an isolation condition based on one of the variables. I would suggest that you start a new thread with the full description of the problem as in #12. Maybe someone else in the forum has a simpler solution that I am not seeing.
Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

#14

11 Jan 2022, 06:32

Andrew Musau it is looking like this will be the last message to this thread (with two presumably successful codes!).

The issue above was that the sample was not restrictive enough. When redefining the sample & replicating the rows in a more restrictive way before grouping ids, the codes below seem to do the job. I came to the first one adding on the code you sent earlier. The alternative code below was kindly provided by Robert Picard, the author of `group_by` command, with whom I exchanged a couple of emails.

Consider the data example below (I appended a few extra obs to the previous example).

Code:

clear
input float(var1 var2)
2 7
7 9
7 11
11 12
6 8
8 9
10 15
15 19
50 51
end

Code 1

Code:

// selecting sample where var1 in a given row coincides with var2 in any other given row and tagging if var1 or var2 is of interest in each specific row
levelsof var1, local(v1) // all values taken by var1
levelsof var2, local(v2) // all values taken by var2
local int: list v1 & v2 // finds intersect
local int: subinstr local int " " ",", all // replaces spaces with commas in all occurences
gen d_v1= (inlist(var1, `int'))
gen d_v2= (inlist(var2, `int'))

gen id=_n // id of rows
gen sample = (d_v1==1 | d_v2==1)

preserve
    // select sample to work with
    keep if sample
    expand 2 if d_v1==1 & d_v2==1, g(new) // adds replicated row if condition holds
    gen var = cond(d_v1==1, var1, var2)
    replace var = var2 if new==1
    group_id id, matchby(var) // grouping ids according to variable of interest
    keep if !new // delete replicated rows
    tempfile sample
    save `sample'
restore

// replacing tagged rows with grouped id
drop if sample
append using `sample'
drop d_v* sample var new

// adding sequential numbers for id 
sort id var1
gen id_final = sum(id!=id[_n-1]) // cumulative sum if the condition holds
drop id
rename id_final id

list, sepby(id)

Code 2

Code:

* create unique identifier and save starting dataset
gen long id = _n
sort id
save "statalist1.dta", replace

* make a list of distinct event start dates
bysort var1: keep if _n == 1
keep var1

* match these to the event end dates in the original sample
rename var1 var2
merge 1:m var2 using "statalist1.dta"

sort _merge id
order id var1 var2
list, sepby(_merge)

* ignore distinct start dates that do not match any end date
drop if _merge == 1

sort id
list, sep(0)

* duplicate only observations where there is a matching start date in the data
expand 2 if _merge == 3, gen(dup)
sort id dup
gen startend = cond(dup, var2, var1)

drop _merge
group_id id, matchby(startend)
gen id_final = sum(id!=id[_n-1])
drop id
rename id_final id 

* drop copies
drop if dup
drop dup startend

list, sepby(id)

Both codes provide the correct output, copied below:

Code:

. list, sepby(id)

     +------------------+
     | var1   var2   id |
     |------------------|
  1. |    2      7    1 |
  2. |    7      9    1 |
  3. |    7     11    1 |
  4. |   11     12    1 |
     |------------------|
  5. |    6      8    2 |
  6. |    8      9    2 |
     |------------------|
  7. |   10     15    3 |
  8. |   15     19    3 |
     |------------------|
  9. |   50     51    4 |
     +------------------+

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 9948
#15

11 Jan 2022, 09:55

I think that you have the right modification to the code in #11 which was not apparent to me. Robert's coding is excellent as always, so thanks for sharing. There is a lot to learn from it.
1 like
Comment

Announcement