  • Creating ids, with groups, if value of a given variable = exact value or value +1 of another variable in *any other row*

    I have a dataset with the start & end dates of different events across regions. Within a region, a major event consists of different events happening in sequence. I identify events as happening in sequence if the start date of a given event = the end date of another event (i.e., same day) OR the end date of another event + 1 (i.e., the next day). I am trying to write code that identifies major events (within regions).

    In another post, Andrew Musau helped me find a solution to this exercise, but one that considered data at the same level (no region grouping) and treated events as subsequent only when the end date and start date match exactly (I added two layers of complexity here, emphasized above): https://www.statalist.org/forums/for...-any-other-row

    I have created a toy example of a dataset with three variables (region, var1, var2), where var1 and var2 hold the start and end dates.
    Code:
    clear
    input str1 region float(var1 var2)
    "A" 1 10
    "A" 5  9
    "A" 6 11
    "A" 10 16
    "A" 16 17
    "A" 16 18
    "B" 1 30
    "B" 2 29
    "B" 30 32
    "B" 31 38
    "B" 33 33
    "B" 35 38
    "C" 7 9
    "C" 2 7
    "C" 7 11
    "C" 10 11
    "C" 6 8
    "C" 6 9
    "C" 50 50
    "C" 50 57
    end
    list, sepby(region)
    I would like to give each row, within a region, an id that is shared across rows whenever the value of `var1` (start) coincides with `var2` (end) OR `var2` + 1 in *any other row*.

    The output I would be looking for is the following:
    Code:
    clear
    input str1 region float(var1 var2 NEW)
    "A" 1 10 1
    "A" 10 16 1
    "A" 16 17 1
    "A" 16 18 1
    "A" 5  9 2
    "A" 6 11 3
    "B" 1 30 1
    "B" 30 32 1
    "B" 31 38 1
    "B" 33 33 1
    "B" 2 29 2
    "B" 35 38 3
    "C" 2 7 1
    "C" 7 9 1
    "C" 7 11 1
    "C" 10 11 1
    "C" 6 8 2
    "C" 6 9 3
    "C" 50 50 4
    "C" 50 57 4
    end
    list, sepby(region)

  • #2
    Based on #11 of https://www.statalist.org/forums/for...=1641798134734, you can expand the dataset to include the values of var1+1. Then concatenate the group identifier with the values of both var1 and var2 to get group-specific values, and select the sample based on your criterion (as shown in the linked thread). Finally, run group_id on the sample and append it to the rest of the dataset. My output disagrees with your wanted output in places, but an inspection (see below) shows that this approach gives the correct groups under your stated criterion.
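
    group_id is a community-contributed command by Robert Picard; if it is not already installed, it should be available from SSC:
    Code:
    ssc install group_id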

    Code:
    clear
    input str1 region float(var1 var2 NEW)
    "A" 1 10 1
    "A" 10 16 1
    "A" 16 17 1
    "A" 16 18 1
    "A" 5  9 2
    "A" 6 11 3
    "B" 1 30 1
    "B" 30 32 1
    "B" 31 38 1
    "B" 33 33 1
    "B" 2 29 2
    "B" 35 38 3
    "C" 2 7 1
    "C" 7 9 1
    "C" 7 11 1
    "C" 10 11 1
    "C" 6 8 2
    "C" 6 9 3
    "C" 50 50 4
    "C" 50 57 4
    end
    
    * duplicate each observation and shift var1 up by 1 in the copies
    expand 2, g(new)
    replace var1 = var1+1 if new
    * region-specific string versions of var1 and var2
    gen svar1 = region + string(var1)
    gen svar2 = region + string(var2)
    * values appearing both as a var1 and as a var2 within region
    levelsof svar1, local(v1) clean
    levelsof svar2, local(v2) clean
    local in12: list v1 & v2
    local in12: subinstr local in12 " " "|", all
    * flag observations whose var1 or var2 lies in that intersection
    gen sample = regexm(svar1, "(`in12')") | regexm(svar2, "(`in12')")
    gen id = _n
    * stack svar1 and svar2 into one variable for matching
    expand 2, g(new2)
    gen var = cond(new2, svar2, svar1)
    * run group_id on the flagged sample, then recombine with the rest
    preserve
    keep if sample
    group_id id, matchby(var)
    tempfile sample
    save `sample'
    restore
    drop if sample
    append using `sample'
    * keep only the original observations
    keep if !new
    keep region var1 var2 NEW id
    sort region id var1
    l, sepby(region id)
    Res.:

    Code:
    . l, sepby(region id)
    
         +---------------------------------+
         | region   var1   var2   NEW   id |
         |---------------------------------|
      1. |      A      1     10     1    1 |
      2. |      A      1     10     1    1 |
      3. |      A      6     11     3    1 |
      4. |      A      6     11     3    1 |
      5. |      A     10     16     1    1 |
      6. |      A     10     16     1    1 |
      7. |      A     16     18     1    1 |
      8. |      A     16     18     1    1 |
      9. |      A     16     17     1    1 |
     10. |      A     16     17     1    1 |
         |---------------------------------|
     11. |      A      5      9     2    5 |
     12. |      A      5      9     2    5 |
         |---------------------------------|
     13. |      B      1     30     1    7 |
     14. |      B      1     30     1    7 |
     15. |      B     30     32     1    7 |
     16. |      B     30     32     1    7 |
         |---------------------------------|
     17. |      B     31     38     1    9 |
     18. |      B     31     38     1    9 |
         |---------------------------------|
     19. |      B     33     33     1   10 |
     20. |      B     33     33     1   10 |
         |---------------------------------|
     21. |      B      2     29     2   11 |
     22. |      B      2     29     2   11 |
         |---------------------------------|
     23. |      B     35     38     3   12 |
     24. |      B     35     38     3   12 |
         |---------------------------------|
     25. |      C      2      7     1   13 |
     26. |      C      2      7     1   13 |
     27. |      C      6      8     2   13 |
     28. |      C      6      8     2   13 |
     29. |      C      7      9     1   13 |
     30. |      C      7     11     1   13 |
     31. |      C      7     11     1   13 |
     32. |      C      7      9     1   13 |
     33. |      C     10     11     1   13 |
     34. |      C     10     11     1   13 |
         |---------------------------------|
     35. |      C      6      9     3   18 |
     36. |      C      6      9     3   18 |
         |---------------------------------|
     37. |      C     50     57     4   19 |
     38. |      C     50     50     4   19 |
     39. |      C     50     50     4   19 |
     40. |      C     50     57     4   19 |
         +---------------------------------+

    The inconsistencies are in observations 3-4 and 27-28 above. For observations 3 and 4, your wanted variable assigns them a different group, but that cannot be the case, as 10 (observations 5 and 6) is one less than 11, so they must be in the same group under your criterion. The same applies to observations 27 and 28, as 6 is one less than 7 (from observations 25 and 26).
    Last edited by Andrew Musau; 10 Jan 2022, 01:41.



    • #3
      Hi Andrew Musau, thanks for this. On the point you raised in #2: we would like rows to be assigned the same group id if var1 in a given row is equal either to var2 or to var2+1 in any other row. The inconsistencies you raised were based on the different criterion of var1 being equal to var2 - 1 instead of var2 + 1.
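
      As a sanity check on the rule, here is a minimal sketch that lists every within-region pair of rows satisfying it (it assumes the toy dataset from #1 is in memory; the names i, j, var1_j and var2_j are mine):
      Code:
      preserve
      keep region var1 var2
      gen long i = _n
      tempfile rows
      save `rows'
      * pair every row (j) with every other row (i) in the same region
      rename (i var1 var2) (j var1_j var2_j)
      joinby region using `rows'
      * the rule: var1 of row i equals var2 or var2+1 of row j
      gen byte linked = i != j & (var1 == var2_j | var1 == var2_j + 1)
      list region i j var1 var2_j if linked, noobs sepby(region)
      restore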

      I have, however, corrected the desired output in another aspect. A row *may* also receive the same group identifier if its var2 is equal to var1 (or var1 - 1) in any other row, even if its own var1 does not coincide with var2 (or var2 + 1) in any other row within that group. This occurs in the special case where one observation has var1 coinciding with var2 (or var2+1) of two different rows which would otherwise not belong to the same group. Take the three observations var1=1 & var2=10 / var1=5 & var2=9 / var1=10 & var2=16. The first and third observations should receive the same id because var1 in the third observation is equal to var2 in the first observation. The third observation should also share a group id with the second observation because var1 in the third observation is equal to var2+1 of the second observation. Thus, all three should have the same group id, as the sketch below illustrates.
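
      To make the special case concrete, here is a minimal sketch of just those three observations (the region label "A" is arbitrary):
      Code:
      clear
      input str1 region float(var1 var2)
      "A"  1 10
      "A"  5  9
      "A" 10 16
      end
      * row 3 links to row 1 (var1 = 10 = var2 of row 1) and to row 2
      * (var1 = 10 = var2 + 1 of row 2), so all three rows form one group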

      The code below seems to provide the desired output (and handles a battery of other overlapping situations, but please do let me know if you can think of an example that invalidates it). It is based on code kindly provided by Robert Picard, the author of `group_id`, with some incremental additions of my own.

      Code:
      * create unique identifier and save starting dataset
      sort var1 var2
      gen long id = _n
      sort id
      
      * add duplicate observations with end date + 1
      expand 2, gen(nextday)
      bysort region id: replace var2 = var2 + 1 if nextday
      
      * add duplicate observations with start date - 1
      expand 2, gen(previousday)
      bysort region id: replace var1 = var1 - 1 if previousday
      
      save "statalist_nextday.dta", replace
      
      * tag rows where start date = [end date, end date + 1] in another row
      keep region var2
      bysort region var2: keep if _n == 1
      rename var2 var1
      merge 1:m region var1 using "statalist_nextday.dta", keep(using match)
      rename _merge _merge_v1 // == 3 if row's start date = [end date, end date + 1] elsewhere
      tempfile temp
      save `temp', replace
      
      * tag rows where [end date, end date + 1] = start date elsewhere
      keep region var1
      bysort region var1: keep if _n==1
      rename var1 var2
      merge 1:m region var2 using `temp', keep(using match)
      rename _merge _merge_v2 // == 3 if row's [end date, end date + 1] = start date elsewhere
      order var1 var2
      list, sep(0)
      
      * group id of subsequent events
      gen sample = (_merge_v1==3 | _merge_v2==3)    
      preserve
          keep if sample==1
          expand 2 if _merge_v1==3 & _merge_v2==3, gen(dup)
          gen int_var = cond(_merge_v1==3, var1, var2)
          replace int_var = var2 if dup==1
          group_id id, matchby(region int_var)
          drop if dup
          drop dup int_var
          tempfile sample
          save `sample', replace
      restore
      drop if sample==1
      append using `sample'
      drop if nextday | previousday    
      drop sample nextday previousday
      
      * reorder id in sequential order
      bys region (id): gen id_final = sum(id != id[_n-1]) // running count of id changes: 1, 2, 3, ...
      drop id
      rename id_final id
      
      sort region id var1 var2
      list region var1 var2 id, sepby(id)
      Output
      Code:
           +---------------------------+
           | region   var1   var2   id |
           |---------------------------|
        1. |      A      1     10    1 |
        2. |      A      5      9    1 |
        3. |      A     10     16    1 |
        4. |      A     16     17    1 |
        5. |      A     16     18    1 |
           |---------------------------|
        6. |      A      6     11    2 |
           |---------------------------|
        7. |      B      1     30    1 |
        8. |      B      2     29    1 |
        9. |      B     30     32    1 |
       10. |      B     31     38    1 |
       11. |      B     33     33    1 |
           |---------------------------|
       12. |      B     35     38    2 |
           |---------------------------|
       13. |      C      2      7    1 |
       14. |      C      6      9    1 |
       15. |      C      7      9    1 |
       16. |      C      7     11    1 |
       17. |      C     10     11    1 |
           |---------------------------|
       18. |      C      6      8    2 |
           |---------------------------|
       19. |      C     50     50    3 |
       20. |      C     50     57    3 |
           +---------------------------+
      Last edited by Paula de Souza Leao Spinola; 12 Jan 2022, 09:11.



      • #4
        Looks good. #2 can be modified as follows to take into account your conditions.

        Code:
        clear
        input str1 region float(var1 var2 NEW)
        "A" 1 10 1
        "A" 10 16 1
        "A" 16 17 1
        "A" 16 18 1
        "A" 5  9 2
        "A" 6 11 3
        "B" 1 30 1
        "B" 30 32 1
        "B" 31 38 1
        "B" 33 33 1
        "B" 2 29 2
        "B" 35 38 3
        "C" 2 7 1
        "C" 7 9 1
        "C" 7 11 1
        "C" 10 11 1
        "C" 6 8 2
        "C" 6 9 3
        "C" 50 50 4
        "C" 50 57 4
        end
        
        gen id = _n
        * duplicate each observation; in the copies shift var1 down and var2 up by 1
        expand 2, g(new)
        replace var1 = var1-1 if new
        replace var2 = var2+1 if new
        * region-specific string values and their intersection, as in #2
        gen svar1 = region + string(var1)
        gen svar2 = region + string(var2)
        levelsof svar1, local(v1) clean
        levelsof svar2, local(v2) clean
        local in12: list v1 & v2
        local in12: subinstr local in12 " " "|", all
        gen sample = regexm(svar1, "(`in12')") | regexm(svar2, "(`in12')")
        * stack svar1 and svar2 for matching, then group ids on the flagged sample
        expand 2 if sample, g(new2)
        gen var = cond(new2, svar2, svar1)
        preserve
        keep if sample
        group_id id, matchby(var)
        tempfile sample
        save `sample'
        restore
        drop if sample
        append using `sample'
        * return to the original observations
        keep if !new
        keep region var1 var2 id
        duplicates drop *, force
        sort region id var1
        l, sepby(region id)
        Res.:

        Code:
        . l, sepby(region id)
        
             +---------------------------+
             | region   var1   var2   id |
             |---------------------------|
          1. |      A      1     10    1 |
          2. |      A      5      9    1 |
          3. |      A     10     16    1 |
          4. |      A     16     18    1 |
          5. |      A     16     17    1 |
             |---------------------------|
          6. |      A      6     11    6 |
             |---------------------------|
          7. |      B      1     30    7 |
          8. |      B      2     29    7 |
          9. |      B     30     32    7 |
         10. |      B     31     38    7 |
         11. |      B     33     33    7 |
             |---------------------------|
         12. |      B     35     38   12 |
             |---------------------------|
         13. |      C      2      7   13 |
         14. |      C      6      9   13 |
         15. |      C      7      9   13 |
         16. |      C      7     11   13 |
         17. |      C     10     11   13 |
             |---------------------------|
         18. |      C      6      8   17 |
             |---------------------------|
         19. |      C     50     50   19 |
         20. |      C     50     57   19 |
             +---------------------------+
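
        The raw ids returned by group_id (1, 6, 7, ...) are arbitrary labels; if you prefer sequential ids within region, as in the output of #3, they can be relabeled with the same idiom used there, e.g.:
        Code:
        * relabel ids as 1, 2, 3, ... within region (borrowing the idiom from #3)
        bysort region (id): gen id_seq = sum(id != id[_n-1])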
