Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Strange behavior with merge

    I’m trying to understand this bizarre behavior I’m encountering with merge. Unfortunately, the files are too large to attach here, and an excerpt won’t really help because when I restrict the data to subsets (explained below), everything works as intended.

    Here’s an example, when I contract the identifier and merge on the test_2 dataset, all but one area_fips merge.

    Code:
    use "test_1.dta", clear
     
    . contract area_fips
     
    . merge 1:1 area_fips using  "test_2.dta"
     
        Result                           # of obs.
        -----------------------------------------
        not matched                         3,277
            from master                     3,276  (_merge==1)
            from using                          1  (_merge==2)
    
        matched                             1,449  (_merge==3)
        -----------------------------------------
    However, without restricting the dataset I get the following result, with virtually nothing merging in.

    Code:
     
    . use "test_2.dta", replace
     
    . describe
     
    Contains data from test_2.dta
      obs:         1,450                          
     vars:             2                          2 Mar 2022 08:23
    --------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------------------------------------------------
    area_fips       str5    %9s                   County FIPS code
    cbsa            long    %10.0g              
    --------------------------------------------------------------------------------------------------------------------------
    Sorted by:
     
    . isid area_fips
     
    .
    . merge 1:m area_fips using  "test_1.dta"
     
        Result                           # of obs.
        -----------------------------------------
        not matched                   107,624,840
            from master                     1,448  (_merge==1)
            from using                107,623,392  (_merge==2)
     
       matched                                 2  (_merge==3)
        -----------------------------------------
     
    .
    . tab _merge if area_fips == "01001"
     
                     _merge |      Freq.     Percent        Cum.
    ------------------------+-----------------------------------
            master only (1) |          1        0.00        0.00
             using only (2) |     26,816      100.00      100.00
    ------------------------+-----------------------------------
                      Total |     26,817      100.00
    Notice above how area_fips == "01001" shows no matches. However, restricting the dataset beforehand, everything works as intended

    Code:
     
    use "test_1.dta", clear
     
    . describe
     
    Contains data from test_1.dta
      obs:   107,623,394                        
     vars:             2                          2 Mar 2022 08:27
    --------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------------------------------------------------
    area_fips       str5    %9s                  
    industry_code   int     %10.0g              
    --------------------------------------------------------------------------------------------------------------------------
    Sorted by: area_fips
     
    . keep if area_fips == "01001"
    (107,596,578 observations deleted)
     
    .
    . merge m:1 area_fips using "test_2.dta"
     
        Result                           # of obs.
        -----------------------------------------
        not matched                         1,449
            from master                         0  (_merge==1)
            from using                      1,449  (_merge==2)
     
       matched                            26,816  (_merge==3)
        -----------------------------------------
    Is this a bug? If not, can someone explain this behavior?

    Code:
    . about
     
    Stata/SE 16.1 for Mac (Intel 64-bit)
    Revision 07 Dec 2021
    Copyright 1985-2019 StataCorp LLC
     
    Total physical memory: 64.00 GB
    Last edited by Justin Niakamal; 03 Mar 2022, 17:42.

  • #2
    I really can’t follow what the problem might be without seeing the data. Find a minimal set that reproduces the problem and report back. I fail to believe this is a bug in merge and likely some idiosyncrasy of your dataset.

    Comment


    • #3
      Thanks, Leonardo. As shown and explained above, the issue occurs with the full dataset, not with subsets. Here's the inefficient work around I used


      Code:
      use  "test_2.dta", clear
      isid area_fips
      levelsof area_fips, local(fips)
      tempfile temp
      save `temp'
      
      foreach z in `fips' {
           use "test_1.dta" if area_fips == "`z'"
           merge m:1 area_fips using `temp', keep(using matched)
           tempfile _`z'
           save `_`z''
      }
      clear
      foreach z in `fips' {
          append using `_`z''
      }
      
      compress
      save "Merged.dta", replace
      
      
      tab _merge
      
         Matching result from |
                        merge |      Freq.     Percent        Cum.
      ------------------------+-----------------------------------
               Using only (2) |  2,101,051        3.95        3.95
                  Matched (3) | 51,034,396       96.05      100.00
      ------------------------+-----------------------------------
                        Total | 53,135,447      100.00
      Here's the data

      Test_1.dta
      Code:
      clear
      input str5 area_fips int industry_code
      "38073" 4842
      "38073" 2361
      "38073" 1111
      "38073" 5182
      "38073" 3273
      "38073" 7224
      "38073" 4451
      "38073" 4239
      "38073" 4531
      "38073" 2381
      end
      Test_2.dta

      Code:
      clear
      input str5 area_fips long cbsa
      "01001" 33860
      "01003" 19300
      "01007" 13820
      "01009" 13820
      "01015" 11500
      "01017" 46740
      "01021" 13820
      "01031" 21460
      "01033" 22520
      "01037" 45180
      end

      Comment


      • #4
        Given the large size of the files, you can do your workaround noticeably faster with -runby-. It also has the virtue that the code is much cleaner and more transparent.

        Code:
        clear
        input str5 area_fips int industry_code
        "38073" 4842
        "38073" 2361
        "38073" 1111
        "38073" 5182
        "38073" 3273
        "38073" 7224
        "38073" 4451
        "38073" 4239
        "38073" 4531
        "38073" 2381
        end
        save test_1, replace
        
        clear
        input str5 area_fips long cbsa
        "01001" 33860
        "01003" 19300
        "01007" 13820
        "01009" 13820
        "01015" 11500
        "01017" 46740
        "01021" 13820
        "01031" 21460
        "01033" 22520
        "01037" 45180
        end
        save test_2, replace
        
        capture program drop one_fips
        program define one_fips
            merge m:1 area_fips using test_2, keep(match using)
            exit
        end
        
        use test_1, clear
        one_fips, by(area_fips) status
        -runby- is written by Robert Picard and me, and is available from SSC.

        Of course, in the example data shown, there are no matches, so the results with the demo are not very interesting or persuasive.
        Last edited by Clyde Schechter; 03 Mar 2022, 18:39.

        Comment


        • #5
          Thanks, Clyde. Yes, I'm a fan of runby and I've got a lot of utility out of it! I'm still puzzled as to why the difference in results. A colleague of mine also tested this on Stata 17 and got the same results.

          One edit, I believe you meant

          Code:
          runby one_fips, by(area_fips) status
          Edit:

          Results from runby, which was a substantially faster approach.

          Code:
          . runby one_fips, by(area_fips) status
          
            elapsed ----------- by-groups ----------    ------- observations ------       time
               time      count     errors    no-data        processed         saved  remaining
          ------------------------------------------------------------------------------------
           00:00:04         10          0          0          305,700       192,578   00:22:24
           00:00:05         38          0          0        1,002,168       648,712   00:08:35
          (now reporting every 5 seconds)
           00:00:10        219          0          0        4,826,702     2,777,489   00:03:31
           00:00:15        339          0          0        8,883,896     5,775,995   00:02:46
           00:00:20        457          0          0       12,708,164     8,839,203   00:02:29
           00:00:25        624          0          0       16,494,484    11,364,730   00:02:18
           00:00:30        779          0          0       20,444,166    13,908,936   00:02:08
           00:00:35        921          0          0       24,363,704    16,478,264   00:01:59
           00:00:40      1,110          0          0       28,217,418    18,175,489   00:01:52
           00:00:45      1,281          0          0       32,130,965    20,754,863   00:01:46
           00:00:50      1,406          0          0       36,295,469    23,526,873   00:01:38
           00:00:55      1,559          0          0       40,397,847    25,748,780   00:01:32
          (now reporting every 15 seconds)
           00:01:10      2,044          0          0       51,683,977    31,925,299   00:01:16
           00:01:25      2,456          0          0       63,284,191    39,972,246   00:00:60
           00:01:40      2,901          0          0       74,688,655    47,817,740   00:00:44
           00:01:55      3,357          0          0       85,925,362    55,102,221   00:00:29
           00:02:10      4,046          0          0       98,825,151    56,899,647   00:00:12
           00:02:23      4,725          0          0      107,623,394    57,884,197   00:00:00
          
          --------------------------------------
          Number of by-groups    =         4,725
          by-groups with errors  =             0
          by-groups with no data =             0
          Observations processed =   107,623,394
          Observations saved     =    57,884,197
          --------------------------------------
          
          .
          
          
          . tab _merge
          
             Matching result from |
                            merge |      Freq.     Percent        Cum.
          ------------------------+-----------------------------------
                   Using only (2) |  6,849,801       11.83       11.83
                      Matched (3) | 51,034,396       88.17      100.00
          ------------------------+-----------------------------------
                            Total | 57,884,197      100.00
          Last edited by Justin Niakamal; 03 Mar 2022, 18:58.

          Comment


          • #6
            For speeding things up, you may also try -joinby- (SSC). I echo Clyde's comments about your example data.

            A couple things with checking. Check the string-valued FIPS code for anomalous characters, using say -chartab- (SSC). Or, convert the code the number it represents and then see if the problem persists.

            Comment


            • #7
              Yes, I did use chartab to try and diagnose the problem. I also thought to destring the data, but there are some FIPs with "C","US", etc. Here's example data with the same FIPS contained in both datasets. Edit: isn't joinby part of official Stata?

              Code:
              // test_1
              clear
              input str5 area_fips int industry_code
              "01001" 1111
              "01001" 2373
              "01001" 5617
              "01001" 5312
              "01001" 5415
              "01001" 5419
              "01001" 4461
              "01001" 3353
              "01001" 3152
              "01001" 2383
              "01001" 3371
              "01001" 8141
              "01001" 4483
              "01001" 5231
              "01001" 1111
              "01001" 3273
              "01001" 6223
              "01001" 6216
              "01001" 1153
              "01001" 4512
              end
              
              // test_2
              clear
              input str5 area_fips long cbsa
              "01001" 33860
              "01003" 19300
              "01007" 13820
              "01009" 13820
              "01015" 11500
              "01017" 46740
              "01021" 13820
              "01031" 21460
              "01033" 22520
              "01037" 45180
              "01043" 18980
              "01045" 37120
              "01051" 33860
              "01055" 23460
              "01061" 20020
              "01065" 46220
              "01067" 20020
              "01069" 20020
              "01071" 42460
              "01073" 13820
              end

              Comment


              • #8
                Justin Niakamal Yes, of course you are right. I meant -runby one_fips, by(area_fips) status-. That was actually a mousing error on my part. When I copied that line from my do-editor, I didn't highlight the leftmost part of the line. Glad you picked it up and realized what I meant.

                Comment


                • #9
                  Originally posted by Justin Niakamal View Post
                  Yes, I did use chartab to try and diagnose the problem. I also thought to destring the data, but there are some FIPs with "C","US", etc. Here's example data with the same FIPS contained in both datasets. Edit: isn't joinby part of official Stata?
                  Oops, you are correct about -joinby-. Thanks for posting the data example.

                  Comment


                  • #10
                    I cannot reproduce your result with your data example.

                    Code:
                    clear
                    input str5 area_fips int industry_code
                    "01001" 1111
                    "01001" 2373
                    "01001" 5617
                    "01001" 5312
                    "01001" 5415
                    "01001" 5419
                    "01001" 4461
                    "01001" 3353
                    "01001" 3152
                    "01001" 2383
                    "01001" 3371
                    "01001" 8141
                    "01001" 4483
                    "01001" 5231
                    "01001" 1111
                    "01001" 3273
                    "01001" 6223
                    "01001" 6216
                    "01001" 1153
                    "01001" 4512
                    end
                    
                    tempfile test1
                    save `test1'
                    
                    
                    clear
                    input str5 area_fips long cbsa
                    "01001" 33860
                    "01003" 19300
                    "01007" 13820
                    "01009" 13820
                    "01015" 11500
                    "01017" 46740
                    "01021" 13820
                    "01031" 21460
                    "01033" 22520
                    "01037" 45180
                    "01043" 18980
                    "01045" 37120
                    "01051" 33860
                    "01055" 23460
                    "01061" 20020
                    "01065" 46220
                    "01067" 20020
                    "01069" 20020
                    "01071" 42460
                    "01073" 13820
                    end
                    
                    
                    merge 1:m area_fips using  `test1'
                    Res.:

                    Code:
                    . merge 1:m area_fips using  `test1'
                    
                        Result                           # of obs.
                        -----------------------------------------
                        not matched                            19
                            from master                        19  (_merge==1)
                            from using                          0  (_merge==2)
                    
                        matched                                20  (_merge==3)
                        -----------------------------------------
                    If this also happens with the data example that you have posted, perhaps update/ re-install Stata.

                    Comment


                    • #11
                      I'm confused.... why not just break your merging variable into two numerical IDs, where 01 001 is.... some random county in Alabama, if I recall correctly.


                      Having it in string format means the merge will be subject to all manner of strange idiosyncrasies of the string, but just merging on 01 001 will never have that problem. Stata will always know what 1 1 means, but having it as a string is sure to be a nightmare, especially when you work with different FIPS datasets.

                      Comment


                      • #12
                        Andrew Musau that's exactly the point and why I said I didn't want to share a data example in #1 because the problem isn't reproducible on smaller datasets. As shown and explained in my first post, the issue occurs with the full dataset, but if I break the dataset apart I don't have the same issues. I've tried this on two machines and two versions of Stata. One of them is completely up to date and the information on the other is shown in #1. I'm trying to understand why this occurs. See excerpt below

                        Jared Greathouse I mentioned in another post that they aren't numeric variables, some FIPS have "C", "US", etc. The issue isn't 'how do I get around the merge issue', but why is this occurring on the full dataset but not the subsets. The data do merge when subsetted

                        Code:
                        // FULL DATASET
                        . tab _merge if area_fips == "01001"
                         
                                         _merge |      Freq.     Percent        Cum.
                        ------------------------+-----------------------------------
                            master only (1) |          1        0.00        0.00
                                 using only (2) |     26,816      100.00      100.00
                        ------------------------+-----------------------------------
                                          Total |     26,817      100.00
                        
                        
                        // RESTRICTED DATASET
                        . keep if area_fips == "01001"
                        (107,596,578 observations deleted)
                         
                        .
                        . merge m:1 area_fips using "test_2.dta"
                         
                            Result                           # of obs.
                            -----------------------------------------
                            not matched                         1,449
                                from master                         0  (_merge==1)
                                from using                      1,449  (_merge==2)
                         
                        matched                            26,816  (_merge==3)
                            -----------------------------------------
                        For good measure, the results of running #10

                        Code:
                        . merge 1:m area_fips using  `test1'
                        
                            Result                           # of obs.
                            -----------------------------------------
                            not matched                            19
                                from master                        19  (_merge==1)
                                from using                          0  (_merge==2)
                        
                            matched                                20  (_merge==3)
                            -----------------------------------------
                        
                        .
                        end of do-file
                        Last edited by Justin Niakamal; 04 Mar 2022, 07:43.

                        Comment


                        • #13
                          I don't think that the issue is size. It is something unique to your dataset. Consider:

                          Code:
                          clear
                          input str5 area_fips int industry_code
                          "01001" 1111
                          "01001" 2373
                          "01001" 5617
                          "01001" 5312
                          "01001" 5415
                          "01001" 5419
                          "01001" 4461
                          "01001" 3353
                          "01001" 3152
                          "01001" 2383
                          "01001" 3371
                          "01001" 8141
                          "01001" 4483
                          "01001" 5231
                          "01001" 1111
                          "01001" 3273
                          "01001" 6223
                          "01001" 6216
                          "01001" 1153
                          "01001" 4512
                          end
                          
                          expand 10000000
                          tempfile test1
                          save `test1'
                          
                          
                          
                          
                          clear
                          input str5 area_fips long cbsa
                          "01001" 33860
                          "01003" 19300
                          "01007" 13820
                          "01009" 13820
                          "01015" 11500
                          "01017" 46740
                          "01021" 13820
                          "01031" 21460
                          "01033" 22520
                          "01037" 45180
                          "01043" 18980
                          "01045" 37120
                          "01051" 33860
                          "01055" 23460
                          "01061" 20020
                          "01065" 46220
                          "01067" 20020
                          "01069" 20020
                          "01071" 42460
                          "01073" 13820
                          end
                          
                          
                          merge 1:m area_fips using  `test1'
                          Res.:

                          Code:
                          . expand 10000000
                          (199,999,980 observations created)
                          
                          
                          . merge 1:m area_fips using  `test1'
                          
                              Result                           # of obs.
                              -----------------------------------------
                              not matched                            19
                                  from master                        19  (_merge==1)
                                  from using                          0  (_merge==2)
                          
                              matched                       200,000,000  (_merge==3)
                              -----------------------------------------
                          
                          .

                          Comment


                          • #14
                            Any idea of what that could be? If it's something unique to the dataset, why does restricting/subsetting the data beforehand produce the intended results?

                            Comment


                            • #15
                              Since the minimal data variables need for this problem (FIPS codes plus one other variable) do not appear to be confidential, and you are reluctant to share that data publicly, I think it best if you contact Stata Technical Services and attach those datasets to reproduce the problem. From what you've shared and described, I doubt we will have a more productive discussion.

                              Comment

                              Working...
                              X