  • Creating ids, with groups, if value of a given variable = exact value or value +1 of another variable in *any other row*

    I have a dataset with the start & end dates of different events across regions. Within a region, a major event consists of different events happening in sequence. I identify events as happening in sequence if the start date of a given event = the end date of another event (i.e., same day) OR the end date of another event + 1 (i.e., the next day). I am trying to write code that identifies major events (within regions).

    In another post, Andrew Musau helped me find a solution to this exercise, but one that considered data at the same level (no region grouping) and treated events as subsequent only when the end date and start date match exactly (I added two layers of complexity here, emphasized above): https://www.statalist.org/forums/for...-any-other-row

    I have created a toy example of a dataset with three variables (region, var1, var2), where var1 and var2 hold the start and end dates.
    Code:
    clear
    input str1 region float(var1 var2)
    "A" 1 10
    "A" 5  9
    "A" 6 11
    "A" 10 16
    "A" 16 17
    "A" 16 18
    "B" 1 30
    "B" 2 29
    "B" 30 32
    "B" 31 38
    "B" 33 33
    "B" 35 38
    "C" 7 9
    "C" 2 7
    "C" 7 11
    "C" 10 11
    "C" 6 8
    "C" 6 9
    "C" 50 50
    "C" 50 57
    end
    list, sepby(region)
    I would like to give each row, within a region, an id that is shared across rows whenever the value of `var1` (start) coincides with `var2` (end) OR `var2` + 1 in *any other row*.

    The output I would be looking for is the following:
    Code:
    clear
    input str1 region float(var1 var2 NEW)
    "A" 1 10 1
    "A" 10 16 1
    "A" 16 17 1
    "A" 16 18 1
    "A" 5  9 2
    "A" 6 11 3
    "B" 1 30 1
    "B" 30 32 1
    "B" 31 38 1
    "B" 33 33 1
    "B" 2 29 2
    "B" 35 38 3
    "C" 2 7 1
    "C" 7 9 1
    "C" 7 11 1
    "C" 10 11 1
    "C" 6 8 2
    "C" 6 9 3
    "C" 50 50 4
    "C" 50 57 4
    end
    list, sepby(region)

  • #2
    Based on #11 of https://www.statalist.org/forums/for...=1641798134734, you can expand the dataset to include the values of var1+1. Then concatenate the group identifier with the values of both var1 and var2 to get group-specific values, and select the sample based on your criterion (as shown in the linked thread). Finally, run group_id on the sample and append it to the rest of the dataset. My output disagrees with your wanted output in places, but an inspection (see below) shows that this approach gives the correct groups under your stated criterion.
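
    group_id is a community-contributed command by Robert Picard; if it is not already installed, it should be available from SSC:
    Code:
    ssc install group_id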

    Code:
    clear
    input str1 region float(var1 var2 NEW)
    "A" 1 10 1
    "A" 10 16 1
    "A" 16 17 1
    "A" 16 18 1
    "A" 5  9 2
    "A" 6 11 3
    "B" 1 30 1
    "B" 30 32 1
    "B" 31 38 1
    "B" 33 33 1
    "B" 2 29 2
    "B" 35 38 3
    "C" 2 7 1
    "C" 7 9 1
    "C" 7 11 1
    "C" 10 11 1
    "C" 6 8 2
    "C" 6 9 3
    "C" 50 50 4
    "C" 50 57 4
    end
    
    * duplicate each observation and shift var1 up by 1 in the copies
    expand 2, g(new)
    replace var1 = var1+1 if new
    * region-specific string versions of var1 and var2
    gen svar1 = region + string(var1)
    gen svar2 = region + string(var2)
    * values appearing both as a var1 and as a var2 within region
    levelsof svar1, local(v1) clean
    levelsof svar2, local(v2) clean
    local in12: list v1 & v2
    local in12: subinstr local in12 " " "|", all
    * flag observations whose var1 or var2 lies in that intersection
    gen sample = regexm(svar1, "(`in12')") | regexm(svar2, "(`in12')")
    gen id = _n
    * stack svar1 and svar2 into one variable for matching
    expand 2, g(new2)
    gen var = cond(new2, svar2, svar1)
    * run group_id on the flagged sample, then recombine with the rest
    preserve
    keep if sample
    group_id id, matchby(var)
    tempfile sample
    save `sample'
    restore
    drop if sample
    append using `sample'
    * keep only the original observations
    keep if !new
    keep region var1 var2 NEW id
    sort region id var1
    l, sepby(region id)
    Res.:

    Code:
    . l, sepby(region id)
    
         +---------------------------------+
         | region   var1   var2   NEW   id |
         |---------------------------------|
      1. |      A      1     10     1    1 |
      2. |      A      1     10     1    1 |
      3. |      A      6     11     3    1 |
      4. |      A      6     11     3    1 |
      5. |      A     10     16     1    1 |
      6. |      A     10     16     1    1 |
      7. |      A     16     18     1    1 |
      8. |      A     16     18     1    1 |
      9. |      A     16     17     1    1 |
     10. |      A     16     17     1    1 |
         |---------------------------------|
     11. |      A      5      9     2    5 |
     12. |      A      5      9     2    5 |
         |---------------------------------|
     13. |      B      1     30     1    7 |
     14. |      B      1     30     1    7 |
     15. |      B     30     32     1    7 |
     16. |      B     30     32     1    7 |
         |---------------------------------|
     17. |      B     31     38     1    9 |
     18. |      B     31     38     1    9 |
         |---------------------------------|
     19. |      B     33     33     1   10 |
     20. |      B     33     33     1   10 |
         |---------------------------------|
     21. |      B      2     29     2   11 |
     22. |      B      2     29     2   11 |
         |---------------------------------|
     23. |      B     35     38     3   12 |
     24. |      B     35     38     3   12 |
         |---------------------------------|
     25. |      C      2      7     1   13 |
     26. |      C      2      7     1   13 |
     27. |      C      6      8     2   13 |
     28. |      C      6      8     2   13 |
     29. |      C      7      9     1   13 |
     30. |      C      7     11     1   13 |
     31. |      C      7     11     1   13 |
     32. |      C      7      9     1   13 |
     33. |      C     10     11     1   13 |
     34. |      C     10     11     1   13 |
         |---------------------------------|
     35. |      C      6      9     3   18 |
     36. |      C      6      9     3   18 |
         |---------------------------------|
     37. |      C     50     57     4   19 |
     38. |      C     50     50     4   19 |
     39. |      C     50     50     4   19 |
     40. |      C     50     57     4   19 |
         +---------------------------------+

    The inconsistencies are in observations 3-4 and 27-28 above. For observations 3 and 4, your wanted variable assigns them a different group, but that cannot be the case, as 10 (observations 5 and 6) is one less than 11, so they must be in the same group under your criterion. The same applies to observations 27 and 28, as 6 is one less than 7 (from observations 25 and 26).
    Last edited by Andrew Musau; 10 Jan 2022, 01:41.



    • #3
      Hi Andrew Musau, thanks for this. On the point you raised in #2: we would like rows to be assigned the same group id if var1 in a given row is equal either to var2 or to var2+1 in any other row. The inconsistencies you raised were based on the different criterion of var1 being equal to var2 - 1 instead of var2 + 1.
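
      As a sanity check on the rule, here is a minimal sketch that lists every within-region pair of rows satisfying it (it assumes the toy dataset from #1 is in memory; the names i, j, var1_j and var2_j are mine):
      Code:
      preserve
      keep region var1 var2
      gen long i = _n
      tempfile rows
      save `rows'
      * pair every row (j) with every other row (i) in the same region
      rename (i var1 var2) (j var1_j var2_j)
      joinby region using `rows'
      * the rule: var1 of row i equals var2 or var2+1 of row j
      gen byte linked = i != j & (var1 == var2_j | var1 == var2_j + 1)
      list region i j var1 var2_j if linked, noobs sepby(region)
      restore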

      I have, however, corrected the desired output in another aspect. A row *may* also receive the same group identifier if its var2 is equal to var1 (or var1 - 1) in any other row, even if its own var1 does not coincide with var2 (or var2 + 1) in any other row within that group. This occurs in the special case where one observation has var1 coinciding with var2 (or var2+1) of two different rows which would otherwise not belong to the same group. Take the three observations var1=1 & var2=10 / var1=5 & var2=9 / var1=10 & var2=16. The first and third observations should receive the same id because var1 in the third observation is equal to var2 in the first observation. The third observation should also share a group id with the second observation because var1 in the third observation is equal to var2+1 of the second observation. Thus, all three should have the same group id, as the sketch below illustrates.
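
      To make the special case concrete, here is a minimal sketch of just those three observations (the region label "A" is arbitrary):
      Code:
      clear
      input str1 region float(var1 var2)
      "A"  1 10
      "A"  5  9
      "A" 10 16
      end
      * row 3 links to row 1 (var1 = 10 = var2 of row 1) and to row 2
      * (var1 = 10 = var2 + 1 of row 2), so all three rows form one group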

      The code below seems to provide the desired output (and handles a battery of other overlapping situations, but please do let me know if you can think of an example that invalidates it). It is based on code kindly provided by Robert Picard, the author of `group_id`, with some incremental additions of my own.

      Code:
      * create unique identifier and save starting dataset
      sort var1 var2
      gen long id = _n
      sort id
      
      * add duplicate observations with end date + 1
      expand 2, gen(nextday)
      bysort region id: replace var2 = var2 + 1 if nextday
      
      * add duplicate observations with start date - 1
      expand 2, gen(previousday)
      bysort region id: replace var1 = var1 - 1 if previousday
      
      save "statalist_nextday.dta", replace
      
      * tag rows where start date = [end date, end date + 1] in another row
      keep region var2
      bysort region var2: keep if _n == 1
      rename var2 var1
      merge 1:m region var1 using "statalist_nextday.dta", keep(using match)
      rename _merge _merge_v1 // == 3 if row's start date = [end date, end date + 1] elsewhere
      tempfile temp
      save `temp', replace
      
      * tag rows where [end date, end date + 1] = start date elsewhere
      keep region var1
      bysort region var1: keep if _n==1
      rename var1 var2
      merge 1:m region var2 using `temp', keep(using match)
      rename _merge _merge_v2 // == 3 if row's [end date, end date + 1] = start date elsewhere
      order var1 var2
      list, sep(0)
      
      * group id of subsequent events
      gen sample = (_merge_v1==3 | _merge_v2==3)    
      preserve
          keep if sample==1
          expand 2 if _merge_v1==3 & _merge_v2==3, gen(dup)
          gen int_var = cond(_merge_v1==3, var1, var2)
          replace int_var = var2 if dup==1
          group_id id, matchby(region int_var)
          drop if dup
          drop dup int_var
          tempfile sample
          save `sample', replace
      restore
      drop if sample==1
      append using `sample'
      drop if nextday | previousday    
      drop sample nextday previousday
      
      * reorder id in sequential order
      bys region (id): gen id_final = sum(id != id[_n-1]) // running count of id changes: 1, 2, 3, ...
      drop id
      rename id_final id
      
      sort region id var1 var2
      list region var1 var2 id, sepby(id)
      Output
      Code:
           +---------------------------+
           | region   var1   var2   id |
           |---------------------------|
        1. |      A      1     10    1 |
        2. |      A      5      9    1 |
        3. |      A     10     16    1 |
        4. |      A     16     17    1 |
        5. |      A     16     18    1 |
           |---------------------------|
        6. |      A      6     11    2 |
           |---------------------------|
        7. |      B      1     30    1 |
        8. |      B      2     29    1 |
        9. |      B     30     32    1 |
       10. |      B     31     38    1 |
       11. |      B     33     33    1 |
           |---------------------------|
       12. |      B     35     38    2 |
           |---------------------------|
       13. |      C      2      7    1 |
       14. |      C      6      9    1 |
       15. |      C      7      9    1 |
       16. |      C      7     11    1 |
       17. |      C     10     11    1 |
           |---------------------------|
       18. |      C      6      8    2 |
           |---------------------------|
       19. |      C     50     50    3 |
       20. |      C     50     57    3 |
           +---------------------------+
      Last edited by Paula de Souza Leao Spinola; 12 Jan 2022, 09:11.



      • #4
        Looks good. #2 can be modified as follows to take into account your conditions.

        Code:
        clear
        input str1 region float(var1 var2 NEW)
        "A" 1 10 1
        "A" 10 16 1
        "A" 16 17 1
        "A" 16 18 1
        "A" 5  9 2
        "A" 6 11 3
        "B" 1 30 1
        "B" 30 32 1
        "B" 31 38 1
        "B" 33 33 1
        "B" 2 29 2
        "B" 35 38 3
        "C" 2 7 1
        "C" 7 9 1
        "C" 7 11 1
        "C" 10 11 1
        "C" 6 8 2
        "C" 6 9 3
        "C" 50 50 4
        "C" 50 57 4
        end
        
        gen id = _n
        * duplicate each observation; in the copies shift var1 down and var2 up by 1
        expand 2, g(new)
        replace var1 = var1-1 if new
        replace var2 = var2+1 if new
        * region-specific string values and their intersection, as in #2
        gen svar1 = region + string(var1)
        gen svar2 = region + string(var2)
        levelsof svar1, local(v1) clean
        levelsof svar2, local(v2) clean
        local in12: list v1 & v2
        local in12: subinstr local in12 " " "|", all
        gen sample = regexm(svar1, "(`in12')") | regexm(svar2, "(`in12')")
        * stack svar1 and svar2 for matching, then group ids on the flagged sample
        expand 2 if sample, g(new2)
        gen var = cond(new2, svar2, svar1)
        preserve
        keep if sample
        group_id id, matchby(var)
        tempfile sample
        save `sample'
        restore
        drop if sample
        append using `sample'
        * return to the original observations
        keep if !new
        keep region var1 var2 id
        duplicates drop *, force
        sort region id var1
        l, sepby(region id)
        Res.:

        Code:
        . l, sepby(region id)
        
             +---------------------------+
             | region   var1   var2   id |
             |---------------------------|
          1. |      A      1     10    1 |
          2. |      A      5      9    1 |
          3. |      A     10     16    1 |
          4. |      A     16     18    1 |
          5. |      A     16     17    1 |
             |---------------------------|
          6. |      A      6     11    6 |
             |---------------------------|
          7. |      B      1     30    7 |
          8. |      B      2     29    7 |
          9. |      B     30     32    7 |
         10. |      B     31     38    7 |
         11. |      B     33     33    7 |
             |---------------------------|
         12. |      B     35     38   12 |
             |---------------------------|
         13. |      C      2      7   13 |
         14. |      C      6      9   13 |
         15. |      C      7      9   13 |
         16. |      C      7     11   13 |
         17. |      C     10     11   13 |
             |---------------------------|
         18. |      C      6      8   17 |
             |---------------------------|
         19. |      C     50     50   19 |
         20. |      C     50     57   19 |
             +---------------------------+
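
        The raw ids returned by group_id (1, 6, 7, ...) are arbitrary labels; if you prefer sequential ids within region, as in the output of #3, they can be relabeled with the same idiom used there, e.g.:
        Code:
        * relabel ids as 1, 2, 3, ... within region (borrowing the idiom from #3)
        bysort region (id): gen id_seq = sum(id != id[_n-1])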
