Merging with multiple (intended) duplicate observations

Jessica Choi

Join Date: Nov 2019

Posts: 15
#1

Merging with multiple (intended) duplicate observations

09 Apr 2022, 11:50

Hello Stata experts,

I am trying to merge two datasets (pre- and post- of an intervention). I understand that this cannot be done without each observation in the selected variable being unique, however, I actually need to retain those duplicate observations as these are the primary identifier of each row.

The survey was not created at an academic standardized extent as it is for a small non-profit organization, therefore other identifying variables were not required. The same participant can participate in multiple sessions of the intervention in a given time (e.g., a month), and the identifier is obviously the same for that participant (e.g., phone number). We do have another secondary identifier, which is the date of the session. Additionally, the survey participation is completely voluntary, and therefore, some participants do not always complete both pre- and post- surveys. I am still trying to use those two variables to merge the two datasets, but I am not sure how to approach assigning a variable to each observation that can be used as an identifier. I know this isn't ideal, but I have to work with what I have.

The ideal combined dataset would be matched by the primary identifier (e.g., phone number), and then also by the secondary identifier (e.g., date of the session).

I have tried these two approaches, but I am still not sure if there would be an easier way to give each observation a unique value without relying on the row number as these values will be different in the two datasets due to the fact that participants do not always complete both surveys:

https://www.statalist.org/forums/for...other-variable
https://www.statalist.org/forums/for...plicate-values

Would you be able to suggest a direction for me to resolve the issue?

Thank you so much,
Jessica
Tags: None
Jessica Choi

Join Date: Nov 2019

Posts: 15
#2

09 Apr 2022, 11:54

Originally posted by Jessica Choi View Post

I am not sure how to approach assigning a variable to each observation that can be used as an identifier.

I don't think this make sense. I basically meant to say "I am not sure how to assign unique identifier for each observation".
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

09 Apr 2022, 12:37

I understand that this cannot be done without each observation in the selected variable being unique

This suggests that you do not understand that the merge command will match on a variable list, not just one selected variable, so it would seem possible that

Code:

merge 1:1 identifier date using ...

would solve your problem.

If not, the for me at least your question really isn't clear without more detail, or at a minimum it is too difficult to guess at a good answer from what you have shared. Please help us help you. Show example data. If you've run a merge that failed, show your code and show us what Stata told you. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output.
1 like
Comment
Jessica Choi

Join Date: Nov 2019

Posts: 15
#4

09 Apr 2022, 13:33

Hello William,
Thank you for your response. By "selected variable" I meant the variable that would be used to match the datasets by. The Stata lingo is still confusing to me, so please excuse me for the lack thereof.

I did try including both variables (email and date), but still getting the same error message.

#

Code:

merge 1:1 email date using "Post Data `c(current_date)'.dta" variables email date do not uniquely identify observations in the master data r(459);

I am unsure how to protect the confidentiality if I were to show the dataset, so not including it in the post at this time.

Thank you,
Jessica
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29812
#5

09 Apr 2022, 13:59

I understand the problems with data confidentiality, and I agree that it is not easy to anonymize this kind of data. It's actually more complicated, I think, than solving the -merge- problem! Still, without seeing the data it is difficult to troubleshoot the problem.

But let me venture a guess right now. Your data are wrong! Or you are misunderstanding them in some way. In your original description, you stated that each person is uniquely identified by an email address, and that there are two surveys, pre and post which are represented by the date variable, and each person might participate in one survey or both. Now, if that is all true, there should never be two observations with the same email and date because that would represent the same person completing the same survey twice. But the error message Stata is giving you says that is exactly what's happening: there are some situations where the same email and date appear together in more than one observation. And by your description of the data generating process, that should never happen.

So you have to find those observations:

Code:

duplicates tag email date, gen(flag) browse if flag

will show them to you. You will then have to figure out how these data errors crept in.

It may be that the observations that duplicate email & date are complete duplicates on all variables. In that case, -duplicates drop- will eliminate them. But then you have to wonder why they are there in the first place: usually it reflects an error in the data management that created the data set, and where there has been one error, others may well lurk. So before just proceeding, you would need to thoroughly review the creation of this data set and fix the problem(s) you find.

On the other hand, you may find that the observations with the same email and date disagree on other variables. Then you will have to figure out how that happened and fix it. Possible routes: they actually represent pre and post surveys from the same person but one of the dates got recorded incorrectly; they are surveys on the same date from different people, but one person's email address got recorded incorrectly. Or the same person actually did hand in two surveys--that's a real dilemma unless there is a way to figure out which one was the "correct" survey: you might have to invalidate that person's surveys for that date, or pick one at random or something like that.

Another possibility is that if the data were actually imported from a spreadsheet, the import process brought in a bunch of empty spreadsheet rows that are not really observations at all. This is easy to spot because the Browser will be showing you observations with nothing entered for the data--just blank rows. The solution is to then drop those extraneous rows from the data set.

You should also check the other data set to see if it exhibits the same problems.

Anyway, even without seeing the data, I am fairly confident that they are wrong and that this is the path you should head down from here.
Comment

Jessica Choi

Join Date: Nov 2019
Posts: 15

09 Apr 2022, 14:20

Hi Clyde,

Thank you for your response. I think my statement about each person being unique was misleading. I didn't mean "unique" by the Stata term, but really meant to say that each person only uses one email address. I do agree that the data show be examined closely, and they were, and there really aren't any human error with entering the date. So I really think this is an issue of the lack of my knowledge.

I created a set of brief data that mimics what I am working with:

Pre data
#

Code:

 Email
Date

Email_1
Date_1

Email_2
Date_1

Email_2
Date_2

Email_2
Date_3

Email_3
Date_1

Email_3
Date_3

Email_4
Date_2

Email_5
Date_2

Email_5
Date_3

Post data
#

Code:

 Email
Date

Email_1
Date_1

Email_2
Date_1

Email_2
Date_3

Email_3
Date_3

Email_4
Date_2

Email_5
Date_2

Email_5
Date_4

Email_6
Date_1

And this is what I am hoping I can produce in the end:
#

Code:

 Email
Date
Match

Email_1
Date_1
1

Email_2
Date_1
1

Email_2
Date_2
0

Email_2
Date_3
1

Email_3
Date_1
0

Email_3
Date_3
1

Email_4
Date_2
1

Email_5
Date_2
1

Email_5
Date_3
0

Email_5
Date_4
1

Email_6
Date_1
0

I hope this clears up what I am trying to do. If there aren't any solutions after presenting these tables, I will re-examine the data and see if I need to discard any observations.

Thank you so much!
Jessica

Comment

Jessica Choi

Join Date: Nov 2019

Posts: 15
#7

09 Apr 2022, 14:21

Originally posted by Jessica Choi View Post

I do agree that the data show be examined closely,

"should" not "show". Sorry for the typo.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29812
#8

09 Apr 2022, 14:31

Thank you for the tables, but I think it was clear from the start that this is what you are trying to do. If the actual data were like what you show, your -merge 1:1 email date using "Post Data `c(current_date)'.dta"- command would have run without error messages and produced the results you desire. Stata is telling you, in no uncertain terms, that the actual data are not like that. I have been using Stata since 1994 and I have never known it to be wrong about this particular kind of error message.

I urge you to use the approach outlined in #5 to identify the surplus observations and deal with them. I will just modify what I said there in one respect. There is another possibility. It may be that the data are actually correct and that those duplicate observations are not errors and are supposed to be there. But in that case, there must be some other variable or variables that, combined with email and date, will jointly identify unique observations.
Comment
Jessica Choi

Join Date: Nov 2019

Posts: 15
#9

09 Apr 2022, 14:42

Hi Clyde,

Thank you again for your response- and I hope you didn't think I was devaluing your suggestion! I have referred to Statalist posts for years, and have definitely learned a lot from your respectful feedback for everyone !

I will examine the original datasets further and re-read your first response.

Thank you (and William, as well)!
Jessica
Comment

Announcement