Sorting data: filling in missing data, duplicating data and adding an additional variable

Rhys Davies

Join Date: Dec 2021
Posts: 19

Sorting data: filling in missing data, duplicating data and adding an additional variable

04 Feb 2022, 10:17

Hi

I'm attempting to clean and organise a large set of data that has been presented in an unusual way, i've included an example. It's slightly difficult to explain, thanks for your help!

I have an identifying code for a person "identity", and multiple rows of data for each person which are shown by the "count" variable. For each person there is an "event" code. I would like to fill in the missing data in the event column, using the event code which corresponds to the same person.

Each "event" is unique to one person (you'll never have the same event code for a different person), but a person can have more than one event (more than one event code corresponding to a person).

If they have more than one event all the remaining data for the corresponding identity should be duplicated.

The position of the event code should also be noted as an additional variable. To help make it clearer i've also included an example of the outcome i'm looking for!

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(event count) str8 identity float(var4 var5)
     . 1 "g19937g"   37   67
     . 2 "g19937g"   54 3832
256789 3 "g19937g"   23  653
     . 4 "g19937g"   34    3
     . 5 "g19937g"   35   45
     . 6 "g19937g"   14    6
347652 1 "d18793"    75   67
     . 2 "d18793"    56   34
     . 3 "d18793"    23    1
     . 4 "d18793"    55  568
     . 6 "d18793"    87  134
456736 7 "d18793"     3    .
     . 8 "d18793"    58   76
     . 9 "d18793"  2337  567
     . 1 "e34863"    23  223
     . 2 "e34863"    14  787
     . 3 "e34863"    76  342
156782 4 "e34863"   745  753
     . 5 "e34863"   135   26
end

So it should go on to look like the data below.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(newevent event count) str8 identity float(var4 var5)
0 256789 1 "g19937g"   37   67
0 256789 2 "g19937g"   54 3832
1 256789 3 "g19937g"   23  653
0 256789 4 "g19937g"   34    3
0 256789 5 "g19937g"   35   45
0 256789 6 "g19937g"   14    6
1 347652 1 "d18793"    75   67
0 347652 2 "d18793"    56   34
0 347652 3 "d18793"    23    1
0 347652 4 "d18793"    55  568
0 347652 6 "d18793"    87  134
0 347652 7 "d18793"     3    .
0 347652 8 "d18793"    58   76
0 347652 9 "d18793"  2337  567
0 456736 1 "d18793"    75   67
0 456736 2 "d18793"    56   34
0 456736 3 "d18793"    23    1
0 456736 4 "d18793"    55  568
0 456736 6 "d18793"    87  134
1 456736 7 "d18793"     3    .
0 456736 8 "d18793"    58   76
0 456736 9 "d18793"  2337  567
0 156782 1 "e34863"    23  223
0 156782 2 "e34863"    14  787
0 156782 3 "e34863"    76  342
1 156782 4 "e34863"   745  753
0 156782 5 "e34863"   135   26
end

Thank you.

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29792

04 Feb 2022, 13:21

Code:

preserve

keep identity event count
drop if missing(event)
duplicates drop
rename count count2
tempfile events
save `events'

restore
drop event
joinby identity using `events'
isid identity event count, sort
gen new_event = (count == count2)
drop count2
order new_event event, first

Note: You cannot run this code line by line. Because of the -preserve- and -restore- commands and use of tempfiles, it must be run in one fell swoop.

Comment

Rhys Davies

Join Date: Dec 2021

Posts: 19
#3

05 Feb 2022, 04:20

Thank you, that's brilliant!
Comment
Rhys Davies

Join Date: Dec 2021

Posts: 19
#4

09 Feb 2022, 11:30

Hi Clyde

You solution worked very well on the example dataset, when I come to use it on the main dataset I get the following error:

isid identity event count, sort
"variables identity event count do not uniquely identify the observations"

I've double checked the data and there are no duplicates of this nature prior to running the code. There are lots of additional variables but I can't see how this would influence things.
I've tried to work out what's going on and have re-run the code with the following additional line prior to isid identity even count, sort:

duplicates drop identity event count, force

This seems to do the job, but wonder if i'm missing something? Apologies I can't share the output directly, the data is confidential and for the work I access the data and STATA through a secure server.

Many thanks for your help!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#5

09 Feb 2022, 11:39

This seems to do the job, but wonder if i'm missing something?

As do I. And a clue that something is going seriously wrong is that you used the -force- option on your -duplicates drop- command. If you can run it with -duplicates drop identity event count- without the -force- option, then you can be sure that you are not losing any information. But if you actually need the -force- option to avoid an error message and a break, then it means that you not only have multiple instances of some combination(s) of identity event and count, but those instances disagree about the values of other variables. When you use -force-, one of the multiples is arbitrarily picked for keeping. And it's not even reproducible if you rerun the code. So that's big trouble.

Unfortunately, without example data, I cannot directly troubleshoot for you. What I recommend you do is start over, and immediately after the joinby command run:

Code:

duplicates tag identity event count, gen(flag) browse if flag

That will enable you to see what the multiple observations are, and perhaps then you can figure out where they are coming from and what to do about it.
Comment
Rhys Davies

Join Date: Dec 2021

Posts: 19
#6

09 Feb 2022, 12:44

Thanks Clyde, this is v helpful.

The only variable that is not duplicated at that stage seems to be in the new "count2" column. Everything else throughout the dataset is identical, there is a single new row for a portion of the rows with n+1 from the corresponding row above in the count2 column. This seems rather odd?

When I drop this duplicate, it seems to function fine.

I'm not sure if this is due to the platform on which i'm having to access the data and STATA that's causing an issue.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#7

09 Feb 2022, 13:14

Well, again, it's hard to do this flying blind. But I think what you're telling me in #6 is that after -joinby- you have some observations that agree on id, event, and count, but have different values of count2. Now, by the way the code works, event comes from the temporary `events' data set, not the original data. And because the construction of `events' has -duplicates drop- and the only variables at that time are identiy, event, and count, there cannot be duplicates of combination of identity, event, and count in there.

So the duplicates must be arising from the original data set. But that would imply that the original data set contains observations that have the same value of identity, event, and count--which you said in #1 should not happen. So I think you need to go back all the way to the start and before you run any of the code from #2, run

Code:

duplicates tag identity event count, gen(flag) browse if flag

to find those. Those are, I believe, the source of the problem, and you will need to fix that error in the data.
Comment

Announcement

Sorting data: filling in missing data, duplicating data and adding an additional variable

Comment

Comment

Comment

Comment

Comment

Comment