Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data manipulation - sorting variables in a large dataset of entries over time

    Hi, i've been attempting to sort a large dataset which contains data collected from participants over time.

    This includes identifying data (indexno) and entries over time. Entries are shown by the "count" variable and the dates of the entries are given by the variable "date1".

    There are several binary variables and it is important whether these occur before or after the entry shown by the "newevent" variable.

    For each of the participants identified by the indexno; if a variable occurs before, but not at the same time as the "newevent" variable, I'm trying to add this to the same entry as the "newevent" and delete the prior entries.

    Examples of the datasets are given below which I hope will make this clearer:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float indexno long id float(count newevent date1 var1 var2 var3 var4)
    256789 4 1 0 18447 1 1 0 1
    256789 4 2 0 18773 0 1 0 1
    256789 4 3 1 19726 0 0 0 1
    256789 4 4 0 19799 0 0 1 0
    256789 4 5 0 20777 0 0 1 1
    256789 4 6 0 21383 0 0 1 0
    347652 2 1 1 19084 0 0 0 0
    347652 2 2 0 19701 0 0 0 0
    347652 2 3 0 20768 0 0 0 0
    347652 2 4 0 20830 0 0 0 1
    347652 2 5 0 20894 0 0 0 1
    347652 2 6 0 21189 0 0 0 1
    347652 2 7 0 21455 0 0 0 0
    347652 2 8 0 21883 0 0 0 0
    347652 2 9 0 22277 0 0 0 0
    456736 1 1 0 13808 0 1 0 1
    456736 1 2 0 13889 0 1 0 1
    456736 1 3 0 14280 0 1 0 1
    456736 1 4 0 14722 0 1 0 0
    456736 1 6 0 15215 0 1 0 0
    456736 1 7 1 15663 0 1 0 0
    456736 1 8 0 16061 0 1 0 0
    456736 1 9 0 17550 0 1 0 0
    156782 3 1 0 16576 0 0 0 1
    156782 3 2 0 17705 1 0 0 1
    156782 3 3 0 18101 1 0 0 1
    156782 3 4 1 18479 1 0 0 0
    156782 3 5 0 18825 0 0 0 0
    end
    format %td date1
    label values id id
    label def id 1 "83hf9390", modify
    label def id 2 "88829bb3", modify
    label def id 3 "8hgh838", modify
    label def id 4 "d348fjf3", modify

    Below is what I am aiming for:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float indexno long id float(count newevent date1 var1 var2 var3 var4 var1a var2a var4a)
    256789 4 3 1 19726 0 0 0 1 1 1 1
    256789 4 4 0 19799 0 0 1 0 0 0 0
    256789 4 5 0 20777 0 0 1 1 0 0 0
    256789 4 6 0 21383 0 0 1 0 0 0 0
    347652 2 1 1 19084 0 0 0 0 0 0 0
    347652 2 2 0 19701 0 0 0 0 0 0 0
    347652 2 3 0 20768 0 0 0 0 0 0 0
    347652 2 4 0 20830 0 0 0 1 0 0 0
    347652 2 5 0 20894 0 0 0 1 0 0 0
    347652 2 6 0 21189 0 0 0 1 0 0 0
    347652 2 7 0 21455 0 0 0 0 0 0 0
    347652 2 8 0 21883 0 0 0 0 0 0 0
    347652 2 9 0 22277 0 0 0 0 0 0 0
    456736 1 7 1 15663 0 1 0 0 0 1 1
    456736 1 8 0 16061 0 1 0 0 0 0 0
    456736 1 9 0 17550 0 1 0 0 0 0 0
    156782 3 4 1 18479 1 0 0 0 1 0 1
    156782 3 5 0 18825 0 0 0 0 0 0 0
    end
    format %td date1
    label values id id
    label def id 1 "83hf9390", modify
    label def id 2 "88829bb3", modify
    label def id 3 "8hgh838", modify
    label def id 4 "d348fjf3", modify

    Many thanks for your advice!
    Last edited by Rhys Davies; 15 Mar 2022, 12:16.

  • #2
    I only partly understand what you want.

    First, there are some assumptions that must be true in order not just for this code to work, but for your problem to be solvable in principle. It must be the case that indexno and count jointly identify unique observations in the data. And it must be that in every group of observations having the same indexno, there must be one, and only one, observation with newevent == 1.

    If those are true, the following code almost works:
    Code:
    //  CHECK ASSUMPTIONS
    isid indexno count, sort
    by indexno (count): assert date1 <= date[_n+1]
    by indexno (count): egen n_new_events = total(newevent)
    assert n_new_events == 1
    drop n_new_events
    
    //  UPDATE THE NEW-EVENT OBSERVATION TO INCLUDE ALL EVENTS (var1-var4)
    //  THAT OCCUR BEFORE IT; THEN DELETE PRIOR ENTRIES
    by indexno (count): egen new_event_date = max(cond(newevent, date1, .))
    foreach v of varlist var1-var4 {
        by indexno (count): egen `v'a = max(cond(date1 <= new_event_date, `v', .))
        replace `v'a = `v' if date1 > new_event_date
    }
    drop if date1 < new_event_date
    The problem is that you do not state how to populate the variables var1a-var4a for the observations where date1 is after the new event. A quick glance at your examples suggest that most of the time you want to just preserve the original values of var1-var4 for those, and that is what the above code does. But looking more closely, there are a number of instances where that is not the case:

    Code:
        indexno   count   wanted2   var2  
         456736       8         0      1  
         456736       9         0      1  
    
    
        indexno   count   wanted4   var4  
         256789       5         0      1  
         347652       4         0      1  
         347652       5         0      1  
         347652       6         0      1
    where wanted2 and wanted4 are the var2a and var4a values shown in your post. I could not discern any pattern that explains why you do something different in these particular instances. So I just left the code as is, in the hope that perhaps you made some mistakes creating these variables "by hand." If that's not so, please post back explaining how to get your desired values of var1a-var4a when date1 is after the new event date.

    By the way, is there a reason for not having any var3a? I created one anyway.
    Last edited by Clyde Schechter; 15 Mar 2022, 13:48.

    Comment


    • #3
      Clyde this is brilliant! Thank you. Your assumptions were correct and yes I must have introduced so errors entering the data, apologies.

      The code worked perfectly, i'm very grateful.

      Comment


      • #4
        Hi Clyde,

        I have a further final challenge manipulating a set of similar data. Based on your previous recommendations i've taken it most of the way there but struggling with a final step.

        The data is in long format with participant identifying data (indexno) and a new event. Some of the participants then have further entries over time shown by the count variable with a corresponding date and several binary variables.

        For those variables with more than one "count" i'm attempting to bring this into a wide format with new variables corresponding to the additional date and binary variables. The total number of new entries over time may be up to 7 to 8, i've given an example of 4 here.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float indexno str8 id float(count newevent date1 var1 var2 var3 var4)
        663943 "5fjrkw53" 1 1 18703 1 0 0 0
        847583 "92ddwg3"  1 1 19278 0 0 1 1
        847583 "92ddwg3"  2 0 19440 1 0 0 0
        357345 "987fhdj1" 1 1 18796 0 1 0 0
        356802 "2gdldh93" 1 1 19923 1 0 1 0
        356802 "2gdldh93" 2 0 21843 0 0 0 1
        356802 "2gdldh93" 3 0 21936 0 1 1 0
        692950 "8nnaekj"  1 1 20076 1 0 0 0
        440524 "49fhfn3"  1 1 20890 0 0 1 0
        440524 "49fhfn3"  2 0 21318 1 0 1 0
        957381 "82rnd3"   1 1 20777 1 1 0 0
        end
        format %td date1
        I am aiming to get the data into the following format (different names of the new variables may be more straightforward?).

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float indexno str8 id float(count newevent date1 var1 var2 var3 var4 date2 var1a var2a var3a var4a date3 var1b var2b var3b var4b date4 var1c var2c var3c var4c)
        663943 "5fjrkw53" 1 1 18703 1 0 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
        847583 "92ddwg3"  1 1 19278 0 0 1 1 19440 1 0 0 0     0 0 0 0 0 0 0 0 0 0
        357345 "987fhdj1" 1 1 18796 0 1 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
        356802 "2gdldh93" 1 1 19923 1 0 1 0 21843 0 0 0 1 21936 0 1 1 0 0 0 0 0 0
        692950 "8nnaekj"  1 1 20076 1 0 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
        440524 "49fhfn3"  1 1 20890 0 0 1 0 21318 1 0 1 0     0 0 0 0 0 0 0 0 0 0
        957381 "82rnd3"   1 1 20777 1 1 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
        end
        format %td date1
        I'd be very grateful for any advice about the best way to do this.

        Many thanks

        Comment


        • #5
          Well, yes, the names you have chosen for the new variables are possible, but not very convenient. So I'll choose to rename them var1_1 (original var1), var1_2, var1_3, etc.

          The only other obstacle to going wide here is the variable newevent. In your example data, newevent == 1 when and only when count == 1. If this is true throughout the data, then it contains only redundant information and can be dropped. Then the reshape goes forward easily.

          Code:
          assert newevent == (count == 1)
          drop newevent
          
          rename date1 date_
          rename (date var*) =_
          reshape wide date_ var1_ var2_ var3_ var4_, i(indexno) j(count)
          If this code breaks at the -assert- statement, then newevent would have to be added to the list of variables in -rename (date var*) =_-, and newevent_ would have to be added to the list of variables in the -reshape- command so that the appropriate value of newevent would be carried across.

          Comment


          • #6
            Understood, yes that's excellent, thank you very much Clyde!

            Comment

            Working...
            X