Merging 3 data sets

Amanda Wiebe

Join Date: Dec 2018

Posts: 4
#1

Merging 3 data sets

29 Dec 2018, 21:42

Good evening all. I am looking for some help merging 3 data sets.
All 3 data sets are sorted by patient ID (ptid) and I would like to merge by ptid. The issue is that the master data set has one row per ptid, but the other two have multiple rows of data with the same ptid.
I was able to merge the master data set with one of the 2 other data sets no problem using 1:m merge.
code:
use "apap_analysis"
merge 1:m ptid using "apap_meds"

Now I am unsure of how to merge in the 3rd data set, which contains the same ptid variable, but otherwise contains different variables than the first two datasets.
I tried using the m:m merge, but it created issues in the data, mainly duplicating rows of variables that I do not want to be duplicated.

Does anyone know how I can merge in the 3rd data set? Can I tag the ptid in all 3 data sets and merge based on the tag ptid?

I can provide more detail/clarification if needed.

Thanks!!
Tags: None
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#2

30 Dec 2018, 00:33

Originally posted by Amanda Wiebe View Post

I can provide more detail/clarification if needed.

Welcome to Statalist.

My initial take is

1. you cannot merge the latter two datasets (without first reshaping them wide), and

2. it doesn't make sense to try to.

But, yes, providing more detail would be helpful, in particular, the variables (and their contents) that uniquely identify the observations in each of the two datasets that have multiple observations per patient.
Comment
Amanda Wiebe

Join Date: Dec 2018

Posts: 4
#3

30 Dec 2018, 10:32

The master dataset has baseline characteristics for each patient (age, sex, race, etc), and only 1 row per patient (ptid).

The 2nd dataset includes doses of medications by time for each patient. There are multiple rows because I want to compare the use of the medications over time, so I do not want to collapse the rows down to fewer rows.

Simplified example:
Ptid Date/Time Tylenol Oxycodone Morphine
1 1/1/17 01:30 650
1 1/1/17 02:25 10
1 1/1/17 03:50 2
2 2/4/17 09:20 5
2 2/4/17 10:30 650
2 2/4/17 14:30 5
...

The 3rd dataset includes assessment data by time for each patient. Again there are multiple rows for each patient to evaluate the assessments over time.

Simplified example:
Ptid Date/Time Pain score Nausea
1 1/1/17 01:20 5
1 1/1/17 02:20 6
1 1/1/17 03:00 1
2 2/4/17 08:05 2
2 2/4/17 11:00 1
2 2/4/17 14:00 4
...

Would it work to change both the 2^nd and 3^rd datasets to wide, merge them first 1:1 to create a 4^th dataset, then change that back to long and merge it into the master dataset using 1:m?
Comment
Amanda Wiebe

Join Date: Dec 2018

Posts: 4
#4

30 Dec 2018, 10:35

I didn't realize the spacing would collapse when I posted that, but in the 2nd and 3rd dataset, there is only data for 1 of the variables after date/time (i.e. Tylenol, Oxycodone, or Morphine) for each row. So in the 2nd data set example, for Ptid 1, the 650 goes under Tylenol variable, the 10 goes under the oxycodone variable, and the 2 goes under the morphine variable.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#5

30 Dec 2018, 10:50

So, I think you need to stop trying to write code and spend some time thinking about what you want the final resulting data set to look like. Reshaping wide, merging and then going long is a possible way to combine these, but it may leave you with a data set that is not well-organized for your longer-term purposes.

This is difficult data to organize. Your second and third datasets appear to be identified by the combination of patient id and date-time. But, the date-times for a given patient in dataset 2 do not match those in dataset 3. While there is some sense that for example, the first observation in the medication data set "goes with" the first observation in the second data set (same patient, and the times are fairly close), when we look at the data for patient 2, it is unclear if there is any relationship between any particular event in dataset 2 and any particular event in dataset 3.

It may be that you want to pair up each observation for a given patient in dataset 2 with the observation in dataset 3 that is closest to it in time (or perhaps with the closest one that precedes it) or something like that. (If so, -joinby- rather than -merge- is involved, followed by an appropriate -keep if-.) Or perhaps you want a different data organization in the end where you have a data set in which each observation is an "event" in the patient's history, and some events are pain asessments and others are medication administration, etc. That's might be a more effective way to handle this data. Maybe there are other ways of combining them that I have not yet thought of as well. The choice ultimately depends on what you want to do with the data. So I would say, step away from the keyboard, and start sketching out in your mind, or on paper, what the data set should look like when you're done. Once you have that clear in your mind, the path to code should be clearer. If you want help with coding after you are sure what result you want, post back showing example data from all three data sets, and a mock-up of what the result should look like (as well as an explanation.)

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

By the way, I am delighted that you recognized that -merge m:m- is not a solution to your problem. All too often, people use -merge m:m- and, because they are so relieved that Stata did not throw an error message at them, they then blunder ahead using results that are just data salad. Good for you for taking the time to actually look at the resulting data and recognizing that it was wrong! I wish everybody were as careful about their data.
Comment

David Benson

Join Date: Oct 2018
Posts: 489

30 Dec 2018, 13:57

Hi Amanda, and as the others have said, welcome to Statalist!

1) So let me echo what Clyde said, and the key thing for you is to decide what you want the data (in the end) to look like, which will depend on what analysis you are wanting to do.

2) In reading through your comments, it seems likely that dataset #2 or #3 will actually become the master dataset, and that you will merge in patient time-invariant characteristics from dataset #1 using m:1. Also, I suspect you will want to keep them in a long format (it's easier to do most regression analysis that way).

3) Regarding whether which dataset (of #2 & 3) is the master and how to merge them depends on what you are trying to measure (and what the DV is). Are you trying to measure nausea or pain after taking a particular medication? Are you trying to predict whether a patient is given Tylenol vs Oxycodone vs Morphine (and how many times over the course of a hospital stay)?

So you could imagine ending up with data looking like this:

Code:

dataex, varlabel
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ptid double(date time datetime) str10 type str9 medicine int dosage byte(pain_score nausea)
1 20820  4800000 1.7988528e+12 "Assessment" ""            . 5 0
1 20820  5400000 1.7988534e+12 "Medicine"   "Tylenol"   650 . .
1 20820  8400000 1.7988564e+12 "Assessment" ""            . 6 1
1 20820  8700000 1.7988567e+12 "Medicine"   "Oxycodone"  10 . .
1 20820 10800000 1.7988588e+12 "Assessment" ""            . 1 1
1 20820 13800000 1.7988618e+12 "Medicine"   "Morphine"    2 . .
2 20854 29100000 1.8018147e+12 "Assessment" ""            . 2 0
2 20854 33600000 1.8018192e+12 "Medicine"   "Oxycodone"   5 . .
2 20854 37800000 1.8018234e+12 "Medicine"   "Tylenol"   650 . .
2 20854 39600000 1.8018252e+12 "Assessment" ""            . 1 1
2 20854 50400000  1.801836e+12 "Assessment" ""            . 4 0
2 20854 52200000 1.8018378e+12 "Medicine"   "Oxycodone"   5 . .
end
format %tdN/D/CY date
format %tcHH:MM time
format %tcNN/DD/CCYY_HH:MM:SS datetime
label var ptid "Ptid"
label var date "Date of visit or eval"
label var time "Time of visit or evaluation:"
label var datetime "Date & time combined in one"
label var medicine "Medicine"
label var dosage "Dosage"

bysort ptid (datetime): gen n = _n
gen time_diff = datetime - datetime[_n-1] if ptid == ptid[_n-1]
replace time_diff = time_diff / 60000  // converting from milliseconds (how Stata measures time diff) to minutes
label var time_diff "Time from previous event (in minutes)"
order n time_diff, after(datetime)
. list, sepby(ptid ) noobs abbrev(12)

  +-------------------------------------------------------------------------------------------------------------------------+
  | ptid         date    time              datetime   n   time_diff         type    medicine   dosage   pain_score   nausea |
  |-------------------------------------------------------------------------------------------------------------------------|
  |    1   01/01/2017   01:20   01/01/2017 01:20:00   1           .   Assessment                    .            5        0 |
  |    1   01/01/2017   01:30   01/01/2017 01:30:00   2          10     Medicine     Tylenol      650            .        . |
  |    1   01/01/2017   02:20   01/01/2017 02:20:00   3          50   Assessment                    .            6        1 |
  |    1   01/01/2017   02:25   01/01/2017 02:25:00   4           5     Medicine   Oxycodone       10            .        . |
  |    1   01/01/2017   03:00   01/01/2017 03:00:00   5          35   Assessment                    .            1        1 |
  |    1   01/01/2017   03:50   01/01/2017 03:50:00   6          50     Medicine    Morphine        2            .        . |
  |-------------------------------------------------------------------------------------------------------------------------|
  |    2   02/04/2017   08:05   02/04/2017 08:05:00   1           .   Assessment                    .            2        0 |
  |    2   02/04/2017   09:20   02/04/2017 09:20:00   2          75     Medicine   Oxycodone        5            .        . |
  |    2   02/04/2017   10:30   02/04/2017 10:30:00   3          70     Medicine     Tylenol      650            .        . |
  |    2   02/04/2017   11:00   02/04/2017 11:00:00   4          30   Assessment                    .            1        1 |
  |    2   02/04/2017   14:00   02/04/2017 14:00:00   5         180   Assessment                    .            4        0 |
  |    2   02/04/2017   14:30   02/04/2017 14:30:00   6          30     Medicine   Oxycodone        5            .        . |
  +-------------------------------------------------------------------------------------------------------------------------+

If the above is what you want, you would want to append dataset #3 to #2 and then sort by ptid datetime. You would then merge in the patient info from dataset #1 using a m:1 merge. I was thinking about whether to put assessments and medicine in different variables (datetime_assess, datetime_medicine, etc) but there may be different numbers of assessments from medicine dispensed, and the order doesn't have to go assessment followed by dispensing medicine (as seen with patient #2). But in this format one can also easily determine number of minutes between pain being measured and medicine dispensed, as well as the average pain threshold one has to reach before medicine is dispensed. So for patient #2, observation #4 (11am), the patient reported pain==1 and no medicine was dispensed. However, at 2pm, pain==4 and medicine was dispensed.

NOTE: I didn't know how nausea was scored, so I put it in as a 1/0 variable.

Last edited by David Benson; 30 Dec 2018, 13:59.

Comment

Amanda Wiebe

Join Date: Dec 2018

Posts: 4
#7

31 Dec 2018, 14:02

Thank you so much for the advice!

In response to Clyde:
- I'm sorry to say that I did not realize my mistake with the m:m merge until I started cleaning my data and setting it up for analysis. I did a very brief review after I completed the merge, but it was not as thorough as it should have been. I am pretty new to Stata, but have definitely learned my lesson in checking my data more carefully (and not using m:m merge of course)!

The ultimate goal of this project is to evaluate the effect of the use of Tylenol on the use of opioid medications (oxycodone, morphine) after surgery. Secondary outcomes that I am interested in are the effect of the use of Tylenol on pain control and side effects of opioid medications (like nausea).

I was planning to analyze the data as continuous variables over time from surgery. I created a variable similar to "time_diff" that is the time from the end of surgery until the event occurred.

The times in dataset 2 and dataset 3 do not correlate.

Thinking about it, I think I do want all the variables together in one dataset and organizing the data by time elapsed from surgery so that each observation is an "event" and some events are pain asessments and others are medication administration, etc. makes the most sense to me.

I think the option of appending dataset 3 to dataset 2 and then sorting by ptid datetime and merging in dataset 1 with m:1 should work. I have not used append before, but it makes perfect sense.

Thank you again for the suggestions.
1 like
Comment

Announcement