Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    I find it is still necessary to clarify the logic behind the wanted id. Consider the below example:
    Code:
    clear
    input byte(var1 var2)
     2  7
     7  9
     7 11
    11 12
     6  8
     8  9
    10 15
    15 19
    50 51
     9 14
    end
    Note the added observation in red, which starts at 9 that appears to be the end of 2 different obs (obs 2 and obs 9).

    For that specific sample, code 1 and code 2 give different outputs. I show below the output for code 2, which seems ... undesirable.

    Code:
         +------------------+
         | var1   var2   id |
         |------------------|
      1. |    2      7    1 |
      2. |    7      9    1 |
      3. |    7     11    1 |
      4. |   11     12    1 |
      5. |    6      8    1 |
      6. |    8      9    1 |
         |------------------|
      7. |   10     15    2 |
      8. |   15     19    2 |
         |------------------|
      9. |   50     51    3 |
         |------------------|
     10. |    9     14    4 |
         +------------------+

    Comment


    • #17
      Thanks very much, Romalpa Akzo. I have adjusted both codes with some incremental improvements (which I no longer remember at this stage) and both of them seem now to provide the desired output! I tested them on different cases of overlaps between rows and they seem to be robust now but please do let me know in case you come across another example for which the output is incorrect. Thanks again!

      Code 1
      Code:
      levelsof var1, local(v1) // all values taken by var1
      levelsof var2, local(v2) // all values taken by var2
      local int: list v1 & v2 // finds intersect
      local int: subinstr local int " " ",", all // replaces spaces with commas in all occurences
      gen d_v1= (inlist(var1, `int'))
      gen d_v2= (inlist(var2, `int'))
      
      gen id=_n // id of rows
      gen sample = (d_v1==1 | d_v2==1)
      
      preserve
          // select sample to work with
          keep if sample
          expand 2 if d_v1==1 & d_v2==1, g(new) // adds replicated row if condition holds
          gen var = cond(d_v1==1, var1, var2)
          replace var = var2 if new==1
          group_id id, matchby(var) // grouping ids according to variable of interest
          keep if !new // delete replicated rows
          tempfile sample
          save `sample'
      restore
      
      // replacing tagged rows with grouped id
      drop if sample
      append using `sample'
      drop d_v* sample var new
      
      // adding sequential numbers for id 
      sort id var1
      gen id_final = sum(id!=id[_n-1]) // cumulative sum if the condition holds
      drop id
      rename id_final id
      
      list, sepby(id)
      Code 2
      Code:
      * create unique identifier and save starting dataset
      sort var1 var2
      gen long id = _n
      sort id
      save "statalist1.dta", replace
      
      * make a list of distinct event end dates
      bysort var2: keep if _n == 1
      keep var2
      
      * tag rows where start date = end date in another row
      rename var2 var1
      merge 1:m var1 using "statalist1.dta", keep(using match)
      rename _merge _merge_v1 // == 3 if row's start date = end date elsewhere
      tempfile temp
      save `temp', replace
      
      * tag rows where end date = start date elsewhere
      bysort var1: keep if _n==1 & _merge==3
      keep var1
      rename var1 var2
      merge 1:m var2 using `temp', keep(using match)
      rename _merge _merge_v2
      order var1 var2
      list, sep(0)
      
      * group id of subsequent events
      gen sample = (_merge_v1==3 | _merge_v2==3)    
      preserve
          keep if sample==1
          expand 2 if _merge_v1==3 & _merge_v2==3, gen(dup)
          gen int_var = cond(_merge_v1==3, var1, var2)
          replace int_var = var2 if dup==1
          group_id id, matchby(int_var)
          drop if dup
          drop dup int_var
          tempfile sample
          save `sample', replace
      restore
      drop if sample==1
      append using `sample'
      drop sample
          
      * reorder id in sequential order
      sort id
      gen id_final = sum(id!=id[_n-1])
      drop id
      rename id_final id
      
      sort id var1 var2
      list, sepby(id)
      Output of both codes
      Code:
      . list, sepby(id)
      
           +------------------+
           | var1   var2   id |
           |------------------|
        1. |    2      7    1 |
        2. |    6      8    1 |
        3. |    7     11    1 |
        4. |    7      9    1 |
        5. |    8      9    1 |
        6. |    9     14    1 |
        7. |   11     12    1 |
           |------------------|
        8. |   10     15    2 |
        9. |   15     19    2 |
           |------------------|
       10. |   50     51    3 |
           +------------------+

      Comment


      • #18
        The below code might also reflect correctly the logic behind the id connection.
        Code:
        expand 2, gen(ex)
        
        gen a = cond(ex, var1,var2)
        bys a (ex): replace a = var2 if ex[1]
        bys a (ex): replace a = var1 if !ex[_N] & !ex[1]
        
        group_id a, matchby(var2)
        
        bys ex (a var1): gen id = sum(a!= a[_n-1])
        
        drop if ex
        drop ex a
        Last edited by Romalpa Akzo; 13 Jan 2022, 17:57.

        Comment


        • #19
          Amazing Romalpa Akzo!

          Comment

          Working...
          X