Creating row ids indicating when value of a given variable is equal to value of another variable in *any other row*

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

#16

12 Jan 2022, 04:35

I find it is still necessary to clarify the logic behind the wanted id. Consider the below example:

Code:

clear
input byte(var1 var2)
 2  7
 7  9
 7 11
11 12
 6  8
 8  9
10 15
15 19
50 51
 9 14
end

Note the added observation in red, which starts at 9 that appears to be the end of 2 different obs (obs 2 and obs 9).

For that specific sample, code 1 and code 2 give different outputs. I show below the output for code 2, which seems ... undesirable.

Code:

     +------------------+
     | var1   var2   id |
     |------------------|
  1. |    2      7    1 |
  2. |    7      9    1 |
  3. |    7     11    1 |
  4. |   11     12    1 |
  5. |    6      8    1 |
  6. |    8      9    1 |
     |------------------|
  7. |   10     15    2 |
  8. |   15     19    2 |
     |------------------|
  9. |   50     51    3 |
     |------------------|
 10. |    9     14    4 |
     +------------------+

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

#17

12 Jan 2022, 08:36

Thanks very much, Romalpa Akzo. I have adjusted both codes with some incremental improvements (which I no longer remember at this stage) and both of them seem now to provide the desired output! I tested them on different cases of overlaps between rows and they seem to be robust now but please do let me know in case you come across another example for which the output is incorrect. Thanks again!

Code 1

Code:

levelsof var1, local(v1) // all values taken by var1
levelsof var2, local(v2) // all values taken by var2
local int: list v1 & v2 // finds intersect
local int: subinstr local int " " ",", all // replaces spaces with commas in all occurences
gen d_v1= (inlist(var1, `int'))
gen d_v2= (inlist(var2, `int'))

gen id=_n // id of rows
gen sample = (d_v1==1 | d_v2==1)

preserve
    // select sample to work with
    keep if sample
    expand 2 if d_v1==1 & d_v2==1, g(new) // adds replicated row if condition holds
    gen var = cond(d_v1==1, var1, var2)
    replace var = var2 if new==1
    group_id id, matchby(var) // grouping ids according to variable of interest
    keep if !new // delete replicated rows
    tempfile sample
    save `sample'
restore

// replacing tagged rows with grouped id
drop if sample
append using `sample'
drop d_v* sample var new

// adding sequential numbers for id 
sort id var1
gen id_final = sum(id!=id[_n-1]) // cumulative sum if the condition holds
drop id
rename id_final id

list, sepby(id)

Code 2

Code:

* create unique identifier and save starting dataset
sort var1 var2
gen long id = _n
sort id
save "statalist1.dta", replace

* make a list of distinct event end dates
bysort var2: keep if _n == 1
keep var2

* tag rows where start date = end date in another row
rename var2 var1
merge 1:m var1 using "statalist1.dta", keep(using match)
rename _merge _merge_v1 // == 3 if row's start date = end date elsewhere
tempfile temp
save `temp', replace

* tag rows where end date = start date elsewhere
bysort var1: keep if _n==1 & _merge==3
keep var1
rename var1 var2
merge 1:m var2 using `temp', keep(using match)
rename _merge _merge_v2
order var1 var2
list, sep(0)

* group id of subsequent events
gen sample = (_merge_v1==3 | _merge_v2==3)    
preserve
    keep if sample==1
    expand 2 if _merge_v1==3 & _merge_v2==3, gen(dup)
    gen int_var = cond(_merge_v1==3, var1, var2)
    replace int_var = var2 if dup==1
    group_id id, matchby(int_var)
    drop if dup
    drop dup int_var
    tempfile sample
    save `sample', replace
restore
drop if sample==1
append using `sample'
drop sample
    
* reorder id in sequential order
sort id
gen id_final = sum(id!=id[_n-1])
drop id
rename id_final id

sort id var1 var2
list, sepby(id)

Output of both codes

Code:

. list, sepby(id)

     +------------------+
     | var1   var2   id |
     |------------------|
  1. |    2      7    1 |
  2. |    6      8    1 |
  3. |    7     11    1 |
  4. |    7      9    1 |
  5. |    8      9    1 |
  6. |    9     14    1 |
  7. |   11     12    1 |
     |------------------|
  8. |   10     15    2 |
  9. |   15     19    2 |
     |------------------|
 10. |   50     51    3 |
     +------------------+

Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

#18

13 Jan 2022, 17:40

The below code might also reflect correctly the logic behind the id connection.

Code:

expand 2, gen(ex)

gen a = cond(ex, var1,var2)
bys a (ex): replace a = var2 if ex[1]
bys a (ex): replace a = var1 if !ex[_N] & !ex[1]

group_id a, matchby(var2)

bys ex (a var1): gen id = sum(a!= a[_n-1])

drop if ex
drop ex a

Last edited by Romalpa Akzo; 13 Jan 2022, 17:57.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 9948
#19

13 Jan 2022, 23:07

Amazing Romalpa Akzo!
Comment

Announcement

Comment

Comment

Comment

Comment