m:m merging - clarification

Thushani Guruge

Join Date: Jan 2022

Posts: 7
#1

m:m merging - clarification

24 Apr 2022, 23:30

Hi there,

I have 2 datasets; one shows the data of patients who were diagnosed with cancers and the other one shows the details of medications they use. Both datasets have multiple rows for one patient (ex: multiple cancers for 1 patient in the cancer data and multiple medicines for 1 patient in the medication data). I want to merge these 2 datasets to find out how is the medication use of patients with cancers. As this will be a m:m merging I’m not sure whether the usual stata code will appropriately work out.

Below is a part of 2 datasets that I want to get merged. Highly appreciate if you can help me to find a suitable command to merge them properly.

Thank you in advance!
Thushani

Medication dataset
input str11 ID str1 gender long medication_code float(frequency daily_dose)
"1" "M" 70238 1 1
"1" "M" 70238 1 1
"1" "M" 70238 1 1
"2" "F" 70238 1 1
"2" "F" 67117 1 1
"2" "F" 67117 1 1
"3" "M" 67117 1 1
"3" "M" 70238 1 1
"4" "F" 74121 1 1
"4" "F" 67265 1 1
"4" "F" 67265 1 1
"5" "M" 70238 2 2
"5" "M" 70238 2 2
"5" "M" 70238 2 2
"5" "M" 70238 2 2

Cancer dataset
input str11 ID str1 gender int age str4 site int morph
"1" "M" 79 "C160" 8070
"2" "F" 74 "C20" 8140
"2" "F" 74 "C187" 8140
"2" "F" 74 "C250" 8140
"3" "M" 75 "C187" 8140
"3" "M" 75 "C250" 8000
"4" "F" 85 "C259" 8000
"4" "F" 85 "C187" 8140
"4" "F" 85 "C250" 8000
"5" "M" 78 "C187" 8140
"5" "M" 78 "C187" 8140
"5" "M" 78 "C221" 8160
"5" "M" 78 "C20" 8140
"5" "M" 78 "C20" 8140
Tags: None
FernandoRios

Join Date: Apr 2014

Posts: 2412
#2

25 Apr 2022, 05:31

Hi Thushani,
First of all. I can say with almost certainty, m:m is NOT what you want to do.
The next option would be joinby. However, be aware that this will merge records using an n x m style. For example/ Consider your patient ID=2.
You have 3 records in medication dataset, and 3 on cancer dataset. The joinby command will generate 9 records(3x3) for this patient. The question is, is that what you want?

Seems to me that a better approach would be to reshape one of your datasets (say medication), and then use an 1:m or m:1 merge.
HTH
1 like
Comment
daniel klein

Join Date: Mar 2014

Posts: 3806
#3

25 Apr 2022, 06:24

Originally posted by Thushani Guruge View Post

Medication dataset
input str11 ID str1 gender long medication_code float(frequency daily_dose)
"1" "M" 70238 1 1
"1" "M" 70238 1 1
"1" "M" 70238 1 1
"2" "F" 70238 1 1
"2" "F" 67117 1 1
"2" "F" 67117 1 1
"3" "M" 67117 1 1
"3" "M" 70238 1 1
"4" "F" 74121 1 1
"4" "F" 67265 1 1
"4" "F" 67265 1 1
"5" "M" 70238 2 2
"5" "M" 70238 2 2
"5" "M" 70238 2 2
"5" "M" 70238 2 2

Cancer dataset
input str11 ID str1 gender int age str4 site int morph
"1" "M" 79 "C160" 8070
"2" "F" 74 "C20" 8140
"2" "F" 74 "C187" 8140
"2" "F" 74 "C250" 8140
"3" "M" 75 "C187" 8140
"3" "M" 75 "C250" 8000
"4" "F" 85 "C259" 8000
"4" "F" 85 "C187" 8140
"4" "F" 85 "C250" 8000
"5" "M" 78 "C187" 8140
"5" "M" 78 "C187" 8140
"5" "M" 78 "C221" 8160
"5" "M" 78 "C20" 8140
"5" "M" 78 "C20" 8140

First, I agree with Fernando: you never want an m:m merge.

Second, in the examples you show, you have duplicates. Except for ID 3, in the Medication datasets each ID repates the very same information between two to four times. If that is an honest representation of your real data, then you should either collapse the datasets, adding frequency weights for each observation, or drop the duplicate records altogether as they do not provide any additional information and are, thus, completely useless for any analyses. If the duplicate observations do provide additional information (i.e., if they are not really duplicates), then there must be a way to differentiate between those observations in the data. Either way, once you figure out how to uniquely identify the observations in each dataset (and you should figure this out before doing anything else), you can safely use a 1:1, 1:m, or m:1 merge.

Last edited by daniel klein; 25 Apr 2022, 06:26.
1 like
Comment
Thushani Guruge

Join Date: Jan 2022

Posts: 7
#4

25 Apr 2022, 17:48

Thank you, Fernando and Daniel. Your suggestions are much appreciated!

Actually, I have missed the variable "dispensing date" in the Medication dataset. That leads to showing the data as duplicates. Below shows the updated example.

I would like to know whether I can still collapse the dataset by adding frequency weights for each observation with the new variable? My dataset has nearly 90,000+ obs and my primary research question is to assess the drug exposure of cancer patients.

I may need to rearrange my data to facilitate the merging command but I'm not certain how to do it. I would highly appreciate if you can give any insight on this further!

input str11 ID str1 gender long medication_code str10 date_dispensed float(frequency daily_dose)
"1" "M" 70238 "05/03/2006" 1 1
"1" "M" 70238 "05/08/2006" 1 1
"1" "M" 70238 "12/12/2006" 1 1
"2" "F" 70238 "01/08/2007" 1 1
"2" "F" 67117 "05/09/2007" 1 1
"2" "F" 67117 "05/11/2007" 1 1
"3" "M" 67117 "10/11/2007" 1 1
"3" "M" 70238 "05/01/2008" 1 1
"4" "F" 74121 "15/03/2008" 1 1
"4" "F" 67265 "26/06/2008" 1 1
"4" "F" 67265 "01/11/2008" 1 1
"5" "M" 70238 "10/06/2010" 2 2
"5" "M" 70238 "05/11/2010" 2 2
"5" "M" 70238 "19/01/2011" 2 2
"5" "M" 70238 "27/02/2011" 2 2

Last edited by Thushani Guruge; 25 Apr 2022, 17:52.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#5

25 Apr 2022, 17:51

The success of merge (thank God, even though I don't believe in a God) doesn't depend on how we've sorted our dataset.
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2412
#6

25 Apr 2022, 18:58

Funny enough Jared
older versions of stata did required your data to be sorted for a successful merge
1 like
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

25 Apr 2022, 19:14

FernandoRios what a dastardly system of organization.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2372
#8

25 Apr 2022, 20:18

Originally posted by Jared Greathouse View Post

The success of merge (thank God, even though I don't believe in a God) doesn't depend on how we've sorted our dataset.

This is also an illusion as there is an implicit sort that happens behind the scenes if the data are not appropriately sorted already.
1 like
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#9

25 Apr 2022, 20:36

I figured. I've never looked at the ado code, but if it were my problem to write a 1:1 merge code from scratch, I imagine we'd have to sort it under the hood so the user doesn't have to Leonardo Guizzetti
Comment

Announcement

m:m merging - clarification

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment