Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge with update

    Hello,

    I am trying to use the merge command with the update option. I would like to keep the following observations:

    1 master observation appeared in master only

    3 match observation appeared in both

    4 match_update observation appeared in both, missing values updated

    I also want to keep, but NOT to update the missing values from the following observations:

    5 match_conflict observation appeared in both, conflicting nonmissing values

    I understand that if I specify:

    . merge m:1 varlist using file1, keepusing(varlist1) update keep(master match match_update)

    the observations for _m = 5 are dropped.

    If I specify instead:

    . merge m:1 varlist using file1, keepusing(varlist1) update keep(master match match_update match_conflict)

    then the missing values are updated for the observations with conflicting non-missing values.

    Is there a way to keep observations with _m = 5 but without updating the missing values?

    Thanks a lot,

    Best

    Mariela

  • #2
    If I specify instead:

    . merge m:1 varlist using file1, keepusing(varlist1) update keep(master match match_update match_conflict)

    then the missing values are updated for the observations with conflicting non-missing values.
    No, that's not correct. In fact, this command will do exactly what you want: it will update missing values, leave conflicting values as they were in the master data, and retain both the updated and non-updated conflicting values (along with perfect matches and original master observations).

    The only circumstance under which -merge- will change conflicting values is if you specify both the -update- and -replace- options.

    Comment


    • #3
      Dear Clyde,

      Thanks for your reply. Just to confirm, I understand that the command does not update the non-missing values when _m = 5 (unless I also specify replace). But does it preserve the missing values of the master file or they are updated when _m = 5 and the using dataset has non-missing values? (I want to preserve the original missing values when _m == 5)
      Thanks again

      Comment


      • #4
        The whole point of -update- is to fill in the master data set's missing values from the using data set. It would make no sense for it to preserve the missing values of the master. If you want to preserve the missing value of the master data set you should specify neither update nor replace. Then all of the data in the master, missing or not, is preserved.

        With update but not replace specified you get:

        master has missing value + using has missing value: merged result has missing value
        master has non-missing value + using has missing value: merged result preserves master's non-missing value
        master has missing value + using has non-missing value: merged result has using's non-missing value
        master has non-missing value + using has same non-missing value: merged result has common non-missing value
        master has non_missing value + using has different non-missing value: merged result has mater's non-missing value.

        Comment


        • #5
          Dear Clyde,

          Thanks for the detailed answer. Yes, I understand your point and confirm what I get with:

          master has missing value + using has non-missing value: merged result has using's non-missing value

          But I want to update the missing values when _m = 4, not when _m=5. If I don't specify neither update nor replace, then I cannot update the missing values for _m = 4. At least for what I am doing, I take the presence of conflicting nonmissing values as an indicator that the observations from the master and using are not the same. The variables used for the merge are not perfect (have typos, etc), so I am doing a merge in multiple stages, where I start with the best matching variables and then move to the ones that give a less perfect matching.
          I think it could be convenient to add the update option only for _m = 4.

          Thanks, best

          Comment


          • #6
            But I want to update the missing values when _m = 4 , not when _m=5.
            That is what happens when you do -merge ..., update- without -replace-. And for _merge == 5 observations the master data are left unchanged. Please re-read what I wrote in #4.

            Comment


            • #7
              Dear Clyde,

              I reread #4 and I think what I need falls in the left-hand side of this expression:

              master has missing value + using has non-missing value: merged result has using's non-missing value

              But I need this:

              master has missing value + using has non-missing value: merged result has using's non-missing value if _m = 4
              master has missing value + using has non-missing value: merged result preserves master's non-missing and missing values if _m = 5

              I actually created an example in Stata and confirmed that.

              Master file:

              id | var1 | var2 | var3
              1 | 5 | 6 | .
              2 | 7 | . | 9
              3 | 88 | 1 | 2

              Using file:
              id | var1 |var2 | var3
              1 | 5 | 7 | 2
              2 | 7 | 3 | 9
              3 |88 | 1 | 2

              If I do the merge using:

              . merge 1:1 id using usingfile, update

              I get this:

              id | var1 | var2 | var3 | _merge
              1 | 5 | 6 | 2 | nonmissing conflict (5)
              2 | 7 | 3 | 9 | missing updated (4)
              3 | 88 | 1 | 2 | matched (3)

              And what I want is this:

              id | var1 | var2 | var3 | _merge
              1 | 5 | 6 | . | nonmissing conflict (5)
              2 | 7 | 3 | 9 | missing updated (4)
              3 | 88 | 1 | 2 | matched (3)

              Because for me the observation with id 1 is not the same in the using and master files.
              Last edited by Mariela Dal Borgo; 23 Aug 2018, 13:02.

              Comment


              • #8
                You can't get that from a single -merge- operation; the command just doesn't work that way. It couldn't. The first observation provides a non-missing conflict (5) for variable 2, and a missing-updated (4) for variable 3. You want a different command that looks at all of the variables and prioritizes reporting non-missing conflict(5) over anything else. Stata doesn't have that.

                You can approximate it by merging in just a single variable at a time. Which, although it might be tedious to review the output, is actually more useful because you will discover both aspects of mismatch for observation 1 and you will also know exactly which variable mismatches in which way.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte(id var1 var2 var3)
                1  5 6 .
                2  7 . 9
                3 88 1 2
                end
                list
                tempfile master
                save `master'
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte(id var1 var2 var3)
                1  5 7 2
                2  7 3 9
                3 88 1 2
                end
                list
                tempfile usingfile
                save `usingfile'
                
                use `master', clear
                foreach v of varlist var* {
                    merge 1:1 id using `usingfile', keepusing(`v') update
                    list
                    drop _merge
                }
                In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

                When asking for help with code, always show example data. When showing example data, always use -dataex-.

                Comment


                • #9
                  Dear Clyde,

                  Thanks again for your answer. I should have started with the example. I'll try to follow your suggestion, I have very large data sets and many variables involved in the merge.
                  And thanks for the command, very useful, I was struggling to input the tables.
                  Best wishes

                  Comment

                  Working...
                  X