Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating row ids indicating when value of a given variable is equal to value of another variable in *any other row*

    I have a dataset with the start and end date of different events. A major event may constitute of different events happening in sequence (if the end date of an event = start date of another event, then they belong to the same major event). I am trying to create a code that identifies major events.

    I have created a toy example of a dataset with two variables (var1, var2), where var1 refers to the start date and var2 to the end date in my real data.

    Code:
    input var1 var2
    1 10
    5 9
    6 11
    10 15
    16 17
    end
    list
    I would like to give each row an id which grows in ascending order of var1 and, most importantly, is kept constant if the value of var1 coincides with var2 of *any other row*.
    The output I would be looking for is the following:
    Code:
    input var1 var2 new_var
    1 10 1
    10 15 1
    5 9 2
    6 11 3
    16 17 4
    end
    list
    Many thanks!

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(var1 var2)
     1 10
     5  9
     6 11
    10 15
    16 17
    end
    
    gen id=_n
    expand 2, g(new)
    gen var=cond(new, var2, var1)
    bys var(id): replace id= id[1]
    keep if !new
    egen wanted = group(id)
    keep var1 var2 wanted
    Res.:

    Code:
    . sort wanted var1
    
    . l
    
         +----------------------+
         | var1   var2   wanted |
         |----------------------|
      1. |    1     10        1 |
      2. |   10     15        1 |
      3. |    5      9        2 |
      4. |    6     11        3 |
      5. |   16     17        4 |
         +----------------------+
    Last edited by Andrew Musau; 08 Jan 2022, 07:57.

    Comment


    • #3
      This is awesome, Andrew Musau ! Many thanks.

      There is just a small issue: the code would output the same value for variable -wanted- in case of rows that have the same value of -var1-. I would like the variable -wanted- to have the same value *only if* -var1- coincides with the value of -var2- in any of the rows.

      Consider the example below, where I added an extra row to the previous example (with the same value of -var1- as the previous row):
      Code:
      clear
      input var1 var2
      1 10
      5 9
      6 11
      10 15
      16 17
      16 18
      end
      sort var1
      list
      Then, the code would output:
      Code:
      . list
      
           +----------------------+
           | var1   var2   wanted |
           |----------------------|
        1. |    1     10        1 |
        2. |   10     15        1 |
        3. |    5      9        2 |
        4. |    6     11        3 |
        5. |   16     18        4 |
           |----------------------|
        6. |   16     17        4 |
           +----------------------+
      I would be looking to get 5 as the value for -wanted- in the last row.

      Comment


      • #4
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(var1 var2)
         1 10
         5  9
         6 11
        10 15
        16 17
        16 18
        end
        
        gen id=_n
        expand 2, g(new)
        gen var=cond(new, var2, var1)
        bys var (new): gen in1=_N>=2 & new[1]==new[_N]
        bys var (id): replace id= cond(in1, id, id[1])
        keep if !new
        egen wanted = group(id)
        keep var1 var2 wanted
        Res.:

        Code:
        . sort wanted var1
        
        . l, sepby(wanted)
        
             +----------------------+
             | var1   var2   wanted |
             |----------------------|
          1. |    1     10        1 |
          2. |   10     15        1 |
             |----------------------|
          3. |    5      9        2 |
             |----------------------|
          4. |    6     11        3 |
             |----------------------|
          5. |   16     17        4 |
             |----------------------|
          6. |   16     18        5 |
             +----------------------+

        Comment


        • #5
          Thanks a lot, Andrew Musau.

          It seems to be an issue now in case the repeated value of var1 across rows also shows up in var2. Take this example below (where I replaced 15 with 16 in the previous example)

          Code:
          clear
          input var1 var2
          1 10
          5 9
          6 11
          10 16
          16 17
          16 18
          end
          By running the previous code we would get the following:
          Code:
               +----------------------+
               | var1   var2   wanted |
               |----------------------|
            1. |    1     10        1 |
            2. |   10     16        1 |
               |----------------------|
            3. |    5      9        2 |
               |----------------------|
            4. |    6     11        3 |
               |----------------------|
            5. |   16     17        4 |
            6. |   16     18        4 |
               +----------------------+
          Again we would have the last two observations with the same value for -wanted-

          Comment


          • #6
            Before modifying the code, as 16 is present in var2, your expectation is


            Code:
                 +----------------------+
                 | var1   var2   wanted |
                 |----------------------|
              1. |    1     10        1 |
              2. |   10     16        1 |
              3. |   16     17        1 |  
              4. |   16     18        1 |
                 |----------------------|
              5. |    5      9        2 |
                 |----------------------|
              6. |    6     11        3 |
                 +----------------------+
            ? If not, explain the logic of considering 10 and not 16 in this example.

            Comment


            • #7
              You are absolutely right, Andrew Musau. My expectation is exactely what you flagged - thanks for catching this.

              I will also need to add another layer of complexity to the code. So far we considered the case where we give the same identifier if var1 in a given row is equal to var2 in any other row. A next step which I will need to implement will be to extend the code to also consider the same identifier in cases where var1 in a given row is equal to var2 + 1 in any other row. I thought this would be relatively easy to implement once we have the first part of the code, but now I am a little less confident. Would you have any thoughts on this?


              Comment


              • #8
                Install group_id from SSC by Robert Picard.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input float(var1 var2)
                 1 10
                 5  9
                 6 11
                10 16
                16 17
                16 18
                end
                
                gen id=_n
                expand 2, g(new)
                gen var=cond(new, var2, var1)
                group_id id, matchby(var)
                keep if !new
                keep id var1 var2
                Res.:

                Code:
                . sort id var1
                
                . l, sepby(id)
                
                     +---------------------+
                     | var1   var2      id |
                     |---------------------|
                  1. |    1     10       1 |
                  2. |   10     16       1 |
                  3. |   16     17       1 |
                  4. |   16     18       1 |
                     |---------------------|
                  5. |    5      9       2 |
                     |---------------------|
                  6. |    6     11       3 |
                     +---------------------+
                next step which I will need to implement will be to extend the code to also consider the same identifier in cases where var1 in a given row is equal to var2 + 1 in any other row. I thought this would be relatively easy to implement once we have the first part of the code, but now I am a little less confident. Would you have any thoughts on this?
                I cannot follow this. Provide a data example, creating a wanted variable.
                Last edited by Andrew Musau; 09 Jan 2022, 01:57.

                Comment


                • #9
                  Many thanks Andrew Musau, this is very helpful!

                  Because I needed to provide further explanation as well as another data example + desired output, I thought it would be good practice to create another post (link below). I would be really grateful if you could have a look. Please let me know if further clarification is needed.

                  https://www.statalist.org/forums/for...-any-other-row

                  Comment


                  • #10
                    Andrew Musau I am sorry - I have just realised that the code in #8 does not work when the same variable (var1 or var2) coincides across rows.

                    The code seems to be outputting the same id variable if the value of var1 of a given row coincides with values of var2 in any other row *OR* if values of var1 coincide across rows *OR* if values of var2 coincide across rows (same value for the same variable across rows). I would be looking to get the same id *only* for the first case (var1 in a given row coinciding with var2 in any other row), but *not* if values of the same variable coincide across rows in case the former condition does not hold.

                    Take the example below.
                    Code:
                    clear
                    input float(var1 var2)
                    2 7
                    7 9
                    7 11
                    6 8
                    6 9
                    10 11
                    end
                    I would be looking to get:
                    Code:
                    clear
                    input float(var1 var2 id)
                    2 7 1
                    7 9 1
                    7 11 1
                    6 8 2
                    6 9 3
                    10 11 4
                    end
                    The code in #8 outputs id=1 to every single row.

                    PS: I would then be looking to expand the problem with two extra layers of complexity explained here: https://www.statalist.org/forums/for...-any-other-row

                    Comment


                    • #11
                      You just need to select a sample based on your criterion and run group_id on this. Then append to the rest of the dataset.

                      Code:
                      clear
                      input float(var1 var2)
                      2 7
                      7 9
                      7 11
                      6 8
                      6 9
                      10 11
                      end
                      
                      levelsof var1, local(v1)
                      levelsof var2, local(v2)
                      local in12: list v1& v2
                      local in12: subinstr local in12 " " ",", all
                      gen sample= inlist(var1, `in12') | inlist(var2, `in12')
                      gen id=_n
                      expand 2, g(new)
                      gen var=cond(new, var2, var1)
                      preserve
                      keep if sample
                      group_id id, matchby(var)
                      tempfile sample
                      save `sample'
                      restore
                      drop if sample
                      append using `sample'
                      keep if !new
                      keep var1 var2 id 
                      sort id var1
                      replace id= sum(id!=id[_n-1])
                      l, sepby(id)
                      Res.:

                      Code:
                      . 
                      . l, sepby(id)
                      
                           +------------------+
                           | var1   var2   id |
                           |------------------|
                        1. |    2      7    1 |
                        2. |    7      9    1 |
                        3. |    7     11    1 |
                           |------------------|
                        4. |    6      8    2 |
                           |------------------|
                        5. |    6      9    3 |
                           |------------------|
                        6. |   10     11    4 |
                           +------------------+

                      Comment


                      • #12
                        Thanks Andrew Musau !!!

                        I am sorry for this long exchange - I now realised that `group_id` changes the ids in cascade mode in a way that we do not wish here. Take the example below. We should aim for id=1 for the first 4 observations and id=2 for the last two observations.

                        Code:
                        clear
                        input float(var1 var2)
                        2 7
                        7 9
                        7 11
                        11 12
                        6 8
                        8 9
                        end
                        list, sep(0)
                        Because of the first observation (var1= 2 & var2 = 7) which has originally id=1, all rows in the sample where either variable (var1 or var2) is equal to 2 or 7 receives id=1. This includes observation where var1= 7 & var2 = 9. All ok up to now, the problem is that this rationale follows as a cascade. Now all observations where either variable equals 9 is also given id=1. This includes the observation where var1 = 8 and var = 9. [This should be given id=2 and not id=1 in our desired output.] In a sequence, now all observations that have 8 in either variable also gets id=1. This includes observation where var1=6 and var2=8. [Again, this should be given id=2 instead of id=1.]

                        Comment


                        • #13
                          I have to think about this as you still need to do the complicated matching done by group_id at the same time introduce an isolation condition based on one of the variables. I would suggest that you start a new thread with the full description of the problem as in #12. Maybe someone else in the forum has a simpler solution that I am not seeing.

                          Comment


                          • #14
                            Andrew Musau it is looking like this will be the last message to this thread (with two presumably successful codes!).

                            The issue above was that the sample was not restrictive enough. When redefining the sample & replicating the rows in a more restrictive way before grouping ids, the codes below seem to do the job. I came to the first one adding on the code you sent earlier. The alternative code below was kindly provided by Robert Picard, the author of `group_by` command, with whom I exchanged a couple of emails.

                            Consider the data example below (I appended a few extra obs to the previous example).

                            Code:
                            clear
                            input float(var1 var2)
                            2 7
                            7 9
                            7 11
                            11 12
                            6 8
                            8 9
                            10 15
                            15 19
                            50 51
                            end
                            Code 1
                            Code:
                            // selecting sample where var1 in a given row coincides with var2 in any other given row and tagging if var1 or var2 is of interest in each specific row
                            levelsof var1, local(v1) // all values taken by var1
                            levelsof var2, local(v2) // all values taken by var2
                            local int: list v1 & v2 // finds intersect
                            local int: subinstr local int " " ",", all // replaces spaces with commas in all occurences
                            gen d_v1= (inlist(var1, `int'))
                            gen d_v2= (inlist(var2, `int'))
                            
                            gen id=_n // id of rows
                            gen sample = (d_v1==1 | d_v2==1)
                            
                            preserve
                                // select sample to work with
                                keep if sample
                                expand 2 if d_v1==1 & d_v2==1, g(new) // adds replicated row if condition holds
                                gen var = cond(d_v1==1, var1, var2)
                                replace var = var2 if new==1
                                group_id id, matchby(var) // grouping ids according to variable of interest
                                keep if !new // delete replicated rows
                                tempfile sample
                                save `sample'
                            restore
                            
                            // replacing tagged rows with grouped id
                            drop if sample
                            append using `sample'
                            drop d_v* sample var new
                            
                            // adding sequential numbers for id 
                            sort id var1
                            gen id_final = sum(id!=id[_n-1]) // cumulative sum if the condition holds
                            drop id
                            rename id_final id
                            
                            list, sepby(id)

                            Code 2
                            Code:
                            * create unique identifier and save starting dataset
                            gen long id = _n
                            sort id
                            save "statalist1.dta", replace
                            
                            * make a list of distinct event start dates
                            bysort var1: keep if _n == 1
                            keep var1
                            
                            * match these to the event end dates in the original sample
                            rename var1 var2
                            merge 1:m var2 using "statalist1.dta"
                            
                            sort _merge id
                            order id var1 var2
                            list, sepby(_merge)
                            
                            * ignore distinct start dates that do not match any end date
                            drop if _merge == 1
                            
                            sort id
                            list, sep(0)
                            
                            * duplicate only observations where there is a matching start date in the data
                            expand 2 if _merge == 3, gen(dup)
                            sort id dup
                            gen startend = cond(dup, var2, var1)
                            
                            drop _merge
                            group_id id, matchby(startend)
                            gen id_final = sum(id!=id[_n-1])
                            drop id
                            rename id_final id 
                            
                            * drop copies
                            drop if dup
                            drop dup startend
                            
                            list, sepby(id)
                            Both codes provide the correct output, copied below:
                            Code:
                            . list, sepby(id)
                            
                                 +------------------+
                                 | var1   var2   id |
                                 |------------------|
                              1. |    2      7    1 |
                              2. |    7      9    1 |
                              3. |    7     11    1 |
                              4. |   11     12    1 |
                                 |------------------|
                              5. |    6      8    2 |
                              6. |    8      9    2 |
                                 |------------------|
                              7. |   10     15    3 |
                              8. |   15     19    3 |
                                 |------------------|
                              9. |   50     51    4 |
                                 +------------------+

                            Comment


                            • #15
                              I think that you have the right modification to the code in #11 which was not apparent to me. Robert's coding is excellent as always, so thanks for sharing. There is a lot to learn from it.

                              Comment

                              Working...
                              X