Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • m:m merging - clarification

    Hi there,

    I have 2 datasets; one shows the data of patients who were diagnosed with cancers and the other one shows the details of medications they use. Both datasets have multiple rows for one patient (ex: multiple cancers for 1 patient in the cancer data and multiple medicines for 1 patient in the medication data). I want to merge these 2 datasets to find out how is the medication use of patients with cancers. As this will be a m:m merging I’m not sure whether the usual stata code will appropriately work out.

    Below is a part of 2 datasets that I want to get merged. Highly appreciate if you can help me to find a suitable command to merge them properly.

    Thank you in advance!
    Thushani

    Medication dataset
    input str11 ID str1 gender long medication_code float(frequency daily_dose)
    "1" "M" 70238 1 1
    "1" "M" 70238 1 1
    "1" "M" 70238 1 1
    "2" "F" 70238 1 1
    "2" "F" 67117 1 1
    "2" "F" 67117 1 1
    "3" "M" 67117 1 1
    "3" "M" 70238 1 1
    "4" "F" 74121 1 1
    "4" "F" 67265 1 1
    "4" "F" 67265 1 1
    "5" "M" 70238 2 2
    "5" "M" 70238 2 2
    "5" "M" 70238 2 2
    "5" "M" 70238 2 2

    Cancer dataset
    input str11 ID str1 gender int age str4 site int morph
    "1" "M" 79 "C160" 8070
    "2" "F" 74 "C20" 8140
    "2" "F" 74 "C187" 8140
    "2" "F" 74 "C250" 8140
    "3" "M" 75 "C187" 8140
    "3" "M" 75 "C250" 8000
    "4" "F" 85 "C259" 8000
    "4" "F" 85 "C187" 8140
    "4" "F" 85 "C250" 8000
    "5" "M" 78 "C187" 8140
    "5" "M" 78 "C187" 8140
    "5" "M" 78 "C221" 8160
    "5" "M" 78 "C20" 8140
    "5" "M" 78 "C20" 8140

  • #2
    Hi Thushani,
    First of all. I can say with almost certainty, m:m is NOT what you want to do.
    The next option would be joinby. However, be aware that this will merge records using an n x m style. For example/ Consider your patient ID=2.
    You have 3 records in medication dataset, and 3 on cancer dataset. The joinby command will generate 9 records(3x3) for this patient. The question is, is that what you want?

    Seems to me that a better approach would be to reshape one of your datasets (say medication), and then use an 1:m or m:1 merge.
    HTH


    Comment


    • #3
      Originally posted by Thushani Guruge View Post

      Medication dataset
      input str11 ID str1 gender long medication_code float(frequency daily_dose)
      "1" "M" 70238 1 1
      "1" "M" 70238 1 1
      "1" "M" 70238 1 1
      "2" "F" 70238 1 1
      "2" "F" 67117 1 1
      "2" "F" 67117 1 1
      "3" "M" 67117 1 1
      "3" "M" 70238 1 1
      "4" "F" 74121 1 1
      "4" "F" 67265 1 1
      "4" "F" 67265 1 1
      "5" "M" 70238 2 2
      "5" "M" 70238 2 2
      "5" "M" 70238 2 2
      "5" "M" 70238 2 2

      Cancer dataset
      input str11 ID str1 gender int age str4 site int morph
      "1" "M" 79 "C160" 8070
      "2" "F" 74 "C20" 8140
      "2" "F" 74 "C187" 8140
      "2" "F" 74 "C250" 8140
      "3" "M" 75 "C187" 8140
      "3" "M" 75 "C250" 8000
      "4" "F" 85 "C259" 8000
      "4" "F" 85 "C187" 8140
      "4" "F" 85 "C250" 8000
      "5" "M" 78 "C187" 8140
      "5" "M" 78 "C187" 8140
      "5" "M" 78 "C221" 8160
      "5" "M" 78 "C20" 8140
      "5" "M" 78 "C20" 8140
      First, I agree with Fernando: you never want an m:m merge.

      Second, in the examples you show, you have duplicates. Except for ID 3, in the Medication datasets each ID repates the very same information between two to four times. If that is an honest representation of your real data, then you should either collapse the datasets, adding frequency weights for each observation, or drop the duplicate records altogether as they do not provide any additional information and are, thus, completely useless for any analyses. If the duplicate observations do provide additional information (i.e., if they are not really duplicates), then there must be a way to differentiate between those observations in the data. Either way, once you figure out how to uniquely identify the observations in each dataset (and you should figure this out before doing anything else), you can safely use a 1:1, 1:m, or m:1 merge.
      Last edited by daniel klein; 25 Apr 2022, 06:26.

      Comment


      • #4
        Thank you, Fernando and Daniel. Your suggestions are much appreciated!

        Actually, I have missed the variable "dispensing date" in the Medication dataset. That leads to showing the data as duplicates. Below shows the updated example.

        I would like to know whether I can still collapse the dataset by adding frequency weights for each observation with the new variable? My dataset has nearly 90,000+ obs and my primary research question is to assess the drug exposure of cancer patients.

        I may need to rearrange my data to facilitate the merging command but I'm not certain how to do it. I would highly appreciate if you can give any insight on this further!

        input str11 ID str1 gender long medication_code str10 date_dispensed float(frequency daily_dose)
        "1" "M" 70238 "05/03/2006" 1 1
        "1" "M" 70238 "05/08/2006" 1 1
        "1" "M" 70238 "12/12/2006" 1 1
        "2" "F" 70238 "01/08/2007" 1 1
        "2" "F" 67117 "05/09/2007" 1 1
        "2" "F" 67117 "05/11/2007" 1 1
        "3" "M" 67117 "10/11/2007" 1 1
        "3" "M" 70238 "05/01/2008" 1 1
        "4" "F" 74121 "15/03/2008" 1 1
        "4" "F" 67265 "26/06/2008" 1 1
        "4" "F" 67265 "01/11/2008" 1 1
        "5" "M" 70238 "10/06/2010" 2 2
        "5" "M" 70238 "05/11/2010" 2 2
        "5" "M" 70238 "19/01/2011" 2 2
        "5" "M" 70238 "27/02/2011" 2 2
        Last edited by Thushani Guruge; 25 Apr 2022, 17:52.

        Comment


        • #5
          The success of merge (thank God, even though I don't believe in a God) doesn't depend on how we've sorted our dataset.

          Comment


          • #6
            Funny enough Jared
            older versions of stata did required your data to be sorted for a successful merge

            Comment


            • #7
              FernandoRios what a dastardly system of organization.

              Comment


              • #8
                Originally posted by Jared Greathouse View Post
                The success of merge (thank God, even though I don't believe in a God) doesn't depend on how we've sorted our dataset.
                This is also an illusion as there is an implicit sort that happens behind the scenes if the data are not appropriately sorted already.

                Comment


                • #9
                  I figured. I've never looked at the ado code, but if it were my problem to write a 1:1 merge code from scratch, I imagine we'd have to sort it under the hood so the user doesn't have to Leonardo Guizzetti

                  Comment

                  Working...
                  X