Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge error (ID matched but merged data sometimes erroneous)

    Hello Everybody,

    This is my second time posting, so I’m going to try to add additional information to see if it helps. I have an existing dataset of 5,597 families that participated in a treatment program, and I noticed that some of their termination dates were missing. Thus, I received an updated dataset in Excel to fill in the missing termination dates, which I converted to a Stata dataset using Stattransfer version 14.

    The first time I tried to merge on the additional dataset, it did not recognize the identifier although it was a numeric variable. Thus I used the following code, which helped me match the identifiers:

    tostring fpid, replace format (%07.0f)
    encode fpid, g (fpid2)
    rename fpid fpid_original
    rename fpid2 fpid
    sort fpid
    save

    Below is the dataex from this new dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str7 fpid_original long(FP_REFERRAL_START FP_REFERRAL_TERM FP_REFERRAL_END fpid)
    "1087671" 19361 19509 19541  1
    "1087672" 19361 19395 19541  2
    "1087674" 19361 19416 19541  3
    "1087698" 19360 19524 19539  4
    "1087703" 19360 19477 19540  5
    "1087704" 19366 19473 19534  6
    "1087722" 19365 19438 19545  7
    "1087756" 19361 19446 19541  8
    "1087759" 19361 19386 19541  9
    "1087760" 19361 19375 19541 10
    "1087762" 19361 19477 19541 11
    "1087765" 19366 19414 19540 12
    "1087799" 19360 19526 19540 13
    "1087813" 19366 19730 19730 14
    "1087814" 19366 19502 19546 15
    "1087818" 19365 19754 19788 16
    "1087822" 19365 19563 19607 17
    "1087823" 19365 19458 19545 18
    "1087837" 19361 19607 19633 19
    "1087839" 19372 19446 19542 20
    end
    format %tdD_m_Y FP_REFERRAL_START
    format %tdD_m_Y FP_REFERRAL_TERM
    format %tdD_m_Y FP_REFERRAL_END
    label values fpid fpid2
    label def fpid2 1 "1087671", modify
    label def fpid2 2 "1087672", modify
    label def fpid2 3 "1087674", modify
    label def fpid2 4 "1087698", modify
    label def fpid2 5 "1087703", modify
    label def fpid2 6 "1087704", modify
    label def fpid2 7 "1087722", modify
    label def fpid2 8 "1087756", modify
    label def fpid2 9 "1087759", modify
    label def fpid2 10 "1087760", modify
    label def fpid2 11 "1087762", modify
    label def fpid2 12 "1087765", modify
    label def fpid2 13 "1087799", modify
    label def fpid2 14 "1087813", modify
    label def fpid2 15 "1087814", modify
    label def fpid2 16 "1087818", modify
    label def fpid2 17 "1087822", modify
    label def fpid2 18 "1087823", modify
    label def fpid2 19 "1087837", modify
    label def fpid2 20 "1087839", modify


    I successfully merged this dataset by the fpid to the new one, and the merged dataset is below. However, I am finding a curious error when I double-check the data. Some of the merged dates are incorrect i.e. the start dates do not match and should match every time. The two variables that should match are fpstart & FP_REFERRAL_START. I pasted the dataex below and the erroneous variables are in bold font; I am using Stata 15.1.


    Below is the dataex from this merged dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(fpstart term_date fpid FP_REFERRAL_START FP_REFERRAL_TERM)
    19361 19509  1 19361 19509
    19361 19395  2 19361 19395
    19361 19416  3 19361 19416
    19360 19524  4 19360 19524
    19360 19477  5 19360 19477
    19365 19438  6 19366 19473
    19361 19477  7 19365 19438
    19360 19526  8 19361 19446
    19366 19730  9 19361 19386
    19366 19502 10 19361 19375
    19365 19754 11 19361 19477
    19365 19563 12 19366 19414
    19365 19458 13 19360 19526
    19361 19607 14 19366 19730
    19372 19446 15 19366 19502
    19360 19477 16 19365 19754
    19372 19558 17 19365 19563
    19372 19537 18 19365 19458
    19362 19705 19 19361 19607
    19372 19409 20 19372 19446
    end
    format %tdD_m_Y fpstart
    format %tdD_m_Y term_date
    format %tdD_m_Y FP_REFERRAL_START
    format %tdD_m_Y FP_REFERRAL_TERM
    label values fpid fpid
    label def fpid 1 "1087671", modify
    label def fpid 2 "1087672", modify
    label def fpid 3 "1087674", modify
    label def fpid 4 "1087698", modify
    label def fpid 5 "1087703", modify
    label def fpid 6 "1087722", modify
    label def fpid 7 "1087762", modify
    label def fpid 8 "1087799", modify
    label def fpid 9 "1087813", modify
    label def fpid 10 "1087814", modify
    label def fpid 11 "1087818", modify
    label def fpid 12 "1087822", modify
    label def fpid 13 "1087823", modify
    label def fpid 14 "1087837", modify
    label def fpid 15 "1087839", modify
    label def fpid 16 "1087843", modify
    label def fpid 17 "1087844", modify
    label def fpid 18 "1087846", modify
    label def fpid 19 "1087847", modify
    label def fpid 20 "1087848", modify

    I would appreciate any help in figuring out why these start dates do not match. To see the error, you will have to compare FP_REFERRAL_START in the old dataset and the merged one. I’m wondering if it had something to do with the initial code I used to convert the fpid from string back to numeric. Thanks!

    Last edited by James Simon; 15 Apr 2020, 04:40. Reason: I added one additional detail about how to identify the error in the last paragraph.

  • #2
    encode is not the right command to use here. What it does is to order distinct values of your identifier and then assign them a value starting from 1 for the lowest value in each dataset. If both datasets do not contain the same distinct values of the identifier, there won't be a 1:1 mapping of values assigned by encode and the identifiers across the datasets. You should instead use

    Code:
    destring fpid, replace
    and merge numeric fpid in one dataset to numeric fpid in the other. If both are strings to begin with, you don't need to convert to numeric as you can merge using two string identifiers. Do read the documentation on both commands.

    Code:
    help encode
    help destring

    Comment


    • #3
      Thank you! I'll try it a bit later today when I get a calm moment :-).

      Comment


      • #4
        I tried the first set of code below, and I have the same problem as before, i.e., although the fpid in both datasets were originally numeric, they don't seem to recognize one another. None of the fpid variables between the datasets are matched. This is why I tried converting to string and then back to numeric to see if this would help.


        tostring fpid, replace format (%07.0f)
        destring fpid,
        save "C:\Users\james\Documents\Jumpdrive documents\James\Files to Merge\New data to merge2.dta"

        **I then open the master dataset and use the following code:


        merge 1:1 fpid using "C:\Users\james\Documents\Jumpdrive documents\James\Files to Merge\New data to merge2.dta"


        Result # of obs.
        -----------------------------------------
        not matched 11,194
        from master 5,597 (_merge==1)
        from using 5,597 (_merge==2)

        matched 0 (_merge==3)
        -----------------------------------------


        To expand on the original problem about the numeric fpid variables not recognizing one another, I'm pasting the format from the using and master using dataset in case it helps. If you try to merge these two datasets without modifying the fpid, it doesn’t work despite the fact that the fpid is numeric in both. Why? Again, I would appreciate any help.

        Using:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long(fpid FP_REFERRAL_START FP_REFERRAL_TERM)
        1087671 19361 19509
        1087672 19361 19395
        1087674 19361 19416
        1087698 19360 19524
        1087703 19360 19477
        1087704 19366 19473
        1087722 19365 19438
        1087756 19361 19446
        1087759 19361 19386
        1087760 19361 19375
        1087762 19361 19477
        1087765 19366 19414
        1087799 19360 19526
        1087813 19366 19730
        1087814 19366 19502
        1087818 19365 19754
        1087822 19365 19563
        1087823 19365 19458
        1087837 19361 19607
        1087839 19372 19446
        end
        format %tdD_m_Y FP_REFERRAL_START
        format %tdD_m_Y FP_REFERRAL_TERM
        Master:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long(fpstart term_date fpid)
        19361 19509  1
        19361 19395  2
        19361 19416  3
        19360 19524  4
        19360 19477  5
        19365 19438  6
        19361 19477  7
        19360 19526  8
        19366 19730  9
        19366 19502 10
        19365 19754 11
        19365 19563 12
        19365 19458 13
        19361 19607 14
        19372 19446 15
        19360 19477 16
        19372 19558 17
        19372 19537 18
        19362 19705 19
        19372 19409 20
        end
        format %tdD_m_Y fpstart
        format %tdD_m_Y term_date
        label values fpid fpid
        label def fpid 1 "1087671", modify
        label def fpid 2 "1087672", modify
        label def fpid 3 "1087674", modify
        label def fpid 4 "1087698", modify
        label def fpid 5 "1087703", modify
        label def fpid 6 "1087722", modify
        label def fpid 7 "1087762", modify
        label def fpid 8 "1087799", modify
        label def fpid 9 "1087813", modify
        label def fpid 10 "1087814", modify
        label def fpid 11 "1087818", modify
        label def fpid 12 "1087822", modify
        label def fpid 13 "1087823", modify
        label def fpid 14 "1087837", modify
        label def fpid 15 "1087839", modify
        label def fpid 16 "1087843", modify
        label def fpid 17 "1087844", modify
        label def fpid 18 "1087846", modify
        label def fpid 19 "1087847", modify
        label def fpid 20 "1087848", modify

        Comment


        • #5
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long(fpid FP_REFERRAL_START FP_REFERRAL_TERM)
          1087671 19361 19509
          1087672 19361 19395
          1087674 19361 19416
          1087698 19360 19524
          1087703 19360 19477
          1087704 19366 19473
          1087722 19365 19438
          1087756 19361 19446
          1087759 19361 19386
          1087760 19361 19375
          1087762 19361 19477
          1087765 19366 19414
          1087799 19360 19526
          1087813 19366 19730
          1087814 19366 19502
          1087818 19365 19754
          1087822 19365 19563
          1087823 19365 19458
          1087837 19361 19607
          1087839 19372 19446
          end
          format %tdD_m_Y FP_REFERRAL_START
          format %tdD_m_Y FP_REFERRAL_TERM
          tempfile using
          save `using'
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long(fpstart term_date fpid)
          19361 19509  1
          19361 19395  2
          19361 19416  3
          19360 19524  4
          19360 19477  5
          19365 19438  6
          19361 19477  7
          19360 19526  8
          19366 19730  9
          19366 19502 10
          19365 19754 11
          19365 19563 12
          19365 19458 13
          19361 19607 14
          19372 19446 15
          19360 19477 16
          19372 19558 17
          19372 19537 18
          19362 19705 19
          19372 19409 20
          end
          format %tdD_m_Y fpstart
          format %tdD_m_Y term_date
          label values fpid fpid
          label def fpid 1 "1087671", modify
          label def fpid 2 "1087672", modify
          label def fpid 3 "1087674", modify
          label def fpid 4 "1087698", modify
          label def fpid 5 "1087703", modify
          label def fpid 6 "1087722", modify
          label def fpid 7 "1087762", modify
          label def fpid 8 "1087799", modify
          label def fpid 9 "1087813", modify
          label def fpid 10 "1087814", modify
          label def fpid 11 "1087818", modify
          label def fpid 12 "1087822", modify
          label def fpid 13 "1087823", modify
          label def fpid 14 "1087837", modify
          label def fpid 15 "1087839", modify
          label def fpid 16 "1087843", modify
          label def fpid 17 "1087844", modify
          label def fpid 18 "1087846", modify
          label def fpid 19 "1087847", modify
          label def fpid 20 "1087848", modify
          decode fpid, gen(fpid2)
          destring fpid2, replace
          drop fpid
          rename fpid2 fpid
          merge 1:1 fpid using `using'
          Res.:

          Code:
          . 
          . merge 1:1 fpid using `using'
          
              Result                           # of obs.
              -----------------------------------------
              not matched                            10
                  from master                         5  (_merge==1)
                  from using                          5  (_merge==2)
          
              matched                                15  (_merge==3)
              -----------------------------------------

          Comment


          • #6
            Thank you! I was going nuts trying to figure this out and was close to manually updating a few hundred missing dates myself :-).

            It worked once I used the decode and destring code in the master dataset and not in the using dataset because it gave me a value label error. What I still don't understand is why the two fpids variables didn't just communicate with one another if they were both numeric variables. Why didn't they just recognize one another? Anyways, thanks again!

            Comment


            • #7
              Sometimes it has to do with precision, but I am not sure if this is the reason in your case. Consider the following example:


              Code:
              clear
              set obs 1
              generate var1 = 200.000002 in 1
              gen var2= "x" in 1
              tempfile one
              save `one'
              clear
              set obs 1
              generate var1 = 200.00001 in 1
              gen var3= "y" in 1
              merge 1:1 var1 using `one'
              list
              Res.

              Code:
              . merge 1:1 var1 using `one'
              
                  Result                           # of obs.
                  -----------------------------------------
                  not matched                             2
                      from master                         1  (_merge==1)
                      from using                          1  (_merge==2)
              
                  matched                                 0  (_merge==3)
                  -----------------------------------------
              
              . list
              
                   +--------------------------------------+
                   | var1   var3   var2            _merge |
                   |--------------------------------------|
                1. |  200      y          master only (1) |
                2. |  200             x    using only (2) |
                   +--------------------------------------+
              Both values appear as 200 (given your display format), but they are not precisely 200, so we are unable to match them. See

              Code:
              help precision

              Comment


              • #8
                Interesting. This is helpful, so I will keep it in mind moving forward. I appreciate the time you took to assist me. Thanks again!

                Comment

                Working...
                X