Data manipulation - sorting variables in a large dataset of entries over time

Rhys Davies

Join Date: Dec 2021
Posts: 19

Data manipulation - sorting variables in a large dataset of entries over time

15 Mar 2022, 10:53

Hi, i've been attempting to sort a large dataset which contains data collected from participants over time.

This includes identifying data (indexno) and entries over time. Entries are shown by the "count" variable and the dates of the entries are given by the variable "date1".

There are several binary variables and it is important whether these occur before or after the entry shown by the "newevent" variable.

For each of the participants identified by the indexno; if a variable occurs before, but not at the same time as the "newevent" variable, I'm trying to add this to the same entry as the "newevent" and delete the prior entries.

Examples of the datasets are given below which I hope will make this clearer:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float indexno long id float(count newevent date1 var1 var2 var3 var4)
256789 4 1 0 18447 1 1 0 1
256789 4 2 0 18773 0 1 0 1
256789 4 3 1 19726 0 0 0 1
256789 4 4 0 19799 0 0 1 0
256789 4 5 0 20777 0 0 1 1
256789 4 6 0 21383 0 0 1 0
347652 2 1 1 19084 0 0 0 0
347652 2 2 0 19701 0 0 0 0
347652 2 3 0 20768 0 0 0 0
347652 2 4 0 20830 0 0 0 1
347652 2 5 0 20894 0 0 0 1
347652 2 6 0 21189 0 0 0 1
347652 2 7 0 21455 0 0 0 0
347652 2 8 0 21883 0 0 0 0
347652 2 9 0 22277 0 0 0 0
456736 1 1 0 13808 0 1 0 1
456736 1 2 0 13889 0 1 0 1
456736 1 3 0 14280 0 1 0 1
456736 1 4 0 14722 0 1 0 0
456736 1 6 0 15215 0 1 0 0
456736 1 7 1 15663 0 1 0 0
456736 1 8 0 16061 0 1 0 0
456736 1 9 0 17550 0 1 0 0
156782 3 1 0 16576 0 0 0 1
156782 3 2 0 17705 1 0 0 1
156782 3 3 0 18101 1 0 0 1
156782 3 4 1 18479 1 0 0 0
156782 3 5 0 18825 0 0 0 0
end
format %td date1
label values id id
label def id 1 "83hf9390", modify
label def id 2 "88829bb3", modify
label def id 3 "8hgh838", modify
label def id 4 "d348fjf3", modify

Below is what I am aiming for:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float indexno long id float(count newevent date1 var1 var2 var3 var4 var1a var2a var4a)
256789 4 3 1 19726 0 0 0 1 1 1 1
256789 4 4 0 19799 0 0 1 0 0 0 0
256789 4 5 0 20777 0 0 1 1 0 0 0
256789 4 6 0 21383 0 0 1 0 0 0 0
347652 2 1 1 19084 0 0 0 0 0 0 0
347652 2 2 0 19701 0 0 0 0 0 0 0
347652 2 3 0 20768 0 0 0 0 0 0 0
347652 2 4 0 20830 0 0 0 1 0 0 0
347652 2 5 0 20894 0 0 0 1 0 0 0
347652 2 6 0 21189 0 0 0 1 0 0 0
347652 2 7 0 21455 0 0 0 0 0 0 0
347652 2 8 0 21883 0 0 0 0 0 0 0
347652 2 9 0 22277 0 0 0 0 0 0 0
456736 1 7 1 15663 0 1 0 0 0 1 1
456736 1 8 0 16061 0 1 0 0 0 0 0
456736 1 9 0 17550 0 1 0 0 0 0 0
156782 3 4 1 18479 1 0 0 0 1 0 1
156782 3 5 0 18825 0 0 0 0 0 0 0
end
format %td date1
label values id id
label def id 1 "83hf9390", modify
label def id 2 "88829bb3", modify
label def id 3 "8hgh838", modify
label def id 4 "d348fjf3", modify

Many thanks for your advice!

Last edited by Rhys Davies; 15 Mar 2022, 11:16.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

15 Mar 2022, 12:45

I only partly understand what you want.

First, there are some assumptions that must be true in order not just for this code to work, but for your problem to be solvable in principle. It must be the case that indexno and count jointly identify unique observations in the data. And it must be that in every group of observations having the same indexno, there must be one, and only one, observation with newevent == 1.

If those are true, the following code almost works:

Code:

// CHECK ASSUMPTIONS isid indexno count, sort by indexno (count): assert date1 <= date[_n+1] by indexno (count): egen n_new_events = total(newevent) assert n_new_events == 1 drop n_new_events // UPDATE THE NEW-EVENT OBSERVATION TO INCLUDE ALL EVENTS (var1-var4) // THAT OCCUR BEFORE IT; THEN DELETE PRIOR ENTRIES by indexno (count): egen new_event_date = max(cond(newevent, date1, .)) foreach v of varlist var1-var4 { by indexno (count): egen `v'a = max(cond(date1 <= new_event_date, `v', .)) replace `v'a = `v' if date1 > new_event_date } drop if date1 < new_event_date

The problem is that you do not state how to populate the variables var1a-var4a for the observations where date1 is after the new event. A quick glance at your examples suggest that most of the time you want to just preserve the original values of var1-var4 for those, and that is what the above code does. But looking more closely, there are a number of instances where that is not the case:

Code:

indexno count wanted2 var2 456736 8 0 1 456736 9 0 1 indexno count wanted4 var4 256789 5 0 1 347652 4 0 1 347652 5 0 1 347652 6 0 1

where wanted2 and wanted4 are the var2a and var4a values shown in your post. I could not discern any pattern that explains why you do something different in these particular instances. So I just left the code as is, in the hope that perhaps you made some mistakes creating these variables "by hand." If that's not so, please post back explaining how to get your desired values of var1a-var4a when date1 is after the new event date.

By the way, is there a reason for not having any var3a? I created one anyway.

Last edited by Clyde Schechter; 15 Mar 2022, 12:48.
Comment
Rhys Davies

Join Date: Dec 2021

Posts: 19
#3

16 Mar 2022, 08:58

Clyde this is brilliant! Thank you. Your assumptions were correct and yes I must have introduced so errors entering the data, apologies.

The code worked perfectly, i'm very grateful.
Comment

Rhys Davies

Join Date: Dec 2021
Posts: 19

31 Mar 2022, 11:54

Hi Clyde,

I have a further final challenge manipulating a set of similar data. Based on your previous recommendations i've taken it most of the way there but struggling with a final step.

The data is in long format with participant identifying data (indexno) and a new event. Some of the participants then have further entries over time shown by the count variable with a corresponding date and several binary variables.

For those variables with more than one "count" i'm attempting to bring this into a wide format with new variables corresponding to the additional date and binary variables. The total number of new entries over time may be up to 7 to 8, i've given an example of 4 here.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float indexno str8 id float(count newevent date1 var1 var2 var3 var4)
663943 "5fjrkw53" 1 1 18703 1 0 0 0
847583 "92ddwg3"  1 1 19278 0 0 1 1
847583 "92ddwg3"  2 0 19440 1 0 0 0
357345 "987fhdj1" 1 1 18796 0 1 0 0
356802 "2gdldh93" 1 1 19923 1 0 1 0
356802 "2gdldh93" 2 0 21843 0 0 0 1
356802 "2gdldh93" 3 0 21936 0 1 1 0
692950 "8nnaekj"  1 1 20076 1 0 0 0
440524 "49fhfn3"  1 1 20890 0 0 1 0
440524 "49fhfn3"  2 0 21318 1 0 1 0
957381 "82rnd3"   1 1 20777 1 1 0 0
end
format %td date1

I am aiming to get the data into the following format (different names of the new variables may be more straightforward?).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float indexno str8 id float(count newevent date1 var1 var2 var3 var4 date2 var1a var2a var3a var4a date3 var1b var2b var3b var4b date4 var1c var2c var3c var4c)
663943 "5fjrkw53" 1 1 18703 1 0 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
847583 "92ddwg3"  1 1 19278 0 0 1 1 19440 1 0 0 0     0 0 0 0 0 0 0 0 0 0
357345 "987fhdj1" 1 1 18796 0 1 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
356802 "2gdldh93" 1 1 19923 1 0 1 0 21843 0 0 0 1 21936 0 1 1 0 0 0 0 0 0
692950 "8nnaekj"  1 1 20076 1 0 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
440524 "49fhfn3"  1 1 20890 0 0 1 0 21318 1 0 1 0     0 0 0 0 0 0 0 0 0 0
957381 "82rnd3"   1 1 20777 1 1 0 0     0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
end
format %td date1

I'd be very grateful for any advice about the best way to do this.

Many thanks

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

31 Mar 2022, 13:12

Well, yes, the names you have chosen for the new variables are possible, but not very convenient. So I'll choose to rename them var1_1 (original var1), var1_2, var1_3, etc.

The only other obstacle to going wide here is the variable newevent. In your example data, newevent == 1 when and only when count == 1. If this is true throughout the data, then it contains only redundant information and can be dropped. Then the reshape goes forward easily.

Code:

assert newevent == (count == 1) drop newevent rename date1 date_ rename (date var*) =_ reshape wide date_ var1_ var2_ var3_ var4_, i(indexno) j(count)

If this code breaks at the -assert- statement, then newevent would have to be added to the list of variables in -rename (date var*) =_-, and newevent_ would have to be added to the list of variables in the -reshape- command so that the appropriate value of newevent would be carried across.
Comment
Rhys Davies

Join Date: Dec 2021

Posts: 19
#6

31 Mar 2022, 14:46

Understood, yes that's excellent, thank you very much Clyde!
Comment

Announcement

Data manipulation - sorting variables in a large dataset of entries over time

Comment

Comment

Comment

Comment

Comment