Strange behavior with merge

Justin Niakamal

Join Date: Aug 2017

Posts: 755
#16

04 Mar 2022, 08:09

To my knowledge, nobody asked for the datasets themselves, just excerpts. If anyone would like the datasets, just message me.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#17

04 Mar 2022, 08:15

I agree that this seems to point to a problem with the merge command.

I am losing track of the various permutations, but it appears to me that you have tried three approaches. Briefly:

Code:

// success
use test_1
contract
merge using test_2

Code:

// failure
use test_2
merge using test_1

Code:

// success
use test_1
keep "01001"
merge using test_2

where test_1 is the large dataset and test_2 is the smaller one.

What these have in common is that the two successes start by reducing the number of observations in test_1 in memory and then merge test_2 onto the result. The failure starts with test_2 in memory and then merges the full test_1 onto it.

What we don't know is what happens if you start with the full test_1 in memory and then merge on test_2. At least, I don't think we've seen that; perhaps you have.

Code:

use test_1
merge using test_2

Justin Niakamal

Join Date: Aug 2017
Posts: 755

#18

04 Mar 2022, 08:21

Thanks, William. Same result:

Code:

. clear all

. 
. use "test_1.dta"

. merge m:1 area_fips using "test_2.dta"

    Result                           # of obs.
    -----------------------------------------
    not matched                   107,624,840
        from master               107,623,392  (_merge==1)
        from using                      1,448  (_merge==2)

    matched                                 2  (_merge==3)
    -----------------------------------------


Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 333
#19

04 Mar 2022, 09:40

Try the following on the full dataset to see if it makes a difference:

Code:

clear all
use "test_1.dta"
gen `c(obs_t)' id = _n
sort id
drop id
save temp.dta, replace
clear all
use temp.dta
desc
merge m:1 area_fips using "test_2.dta"

Last edited by Hua Peng (StataCorp); 04 Mar 2022, 09:43.

Justin Niakamal

Join Date: Aug 2017
Posts: 755

#20

04 Mar 2022, 09:56

Thanks, Hua. Yes, that makes a difference.

Code:

. clear all

. use "test_1.dta"  

. gen `c(obs_t)' id = _n  

. sort id

. drop id

. save temp.dta, replace
(note: file temp.dta not found)
file temp.dta saved

. clear all

. use temp.dta

. desc

Contains data from temp.dta
  obs:   107,623,394                          
 vars:             2                          4 Mar 2022 10:52
------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------
area_fips       str5    %9s                   
industry_code   int     %10.0g                
------------------------------------------------------------------------------------------------------------------------
Sorted by: 

. merge m:1 area_fips using "test_2.dta"

    Result                           # of obs.
    -----------------------------------------
    not matched                    56,588,999
        from master                56,588,998  (_merge==1)
        from using                          1  (_merge==2)

    matched                        51,034,396  (_merge==3)
    -----------------------------------------

. 
. 
. exit


Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 333
#21

04 Mar 2022, 10:04

Do you know how test_1.dta was created? Was it created in Stata, or by some third-party software?

The issue here is that the dataset was incorrectly marked as sorted by the variable "area_fips" (according to the output of -desc-) when it actually is not.
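
For anyone who wants to check this on their own copy, here is a minimal sketch of one way to compare the sort marker with the actual order of the observations. It assumes the file name test_1.dta and the key area_fips from the posts above; the macro name sortvars is only illustrative, and this is not necessarily how the problem was diagnosed here.

```stata
* Sketch only: file and variable names are taken from the posts above.
use "test_1.dta", clear

* Which variables does the dataset claim to be sorted by?
* (sortvars is a hypothetical macro name chosen for this example)
local sortvars : sortedby
display "Marked as sorted by: `sortvars'"

* Count observations whose key is smaller than the key of the previous observation.
* A nonzero count means the data are not actually in area_fips order.
count if _n > 1 & area_fips < area_fips[_n-1]
display "Out-of-order observations: " r(N)
```

If the marker lists area_fips but the count is nonzero, the marker is wrong. As far as I know, -sort area_fips- on its own may then do nothing, because Stata skips sorting data it believes are already in that order; that would be consistent with the generate, sort, drop sequence in #19 making a difference.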
Justin Niakamal

Join Date: Aug 2017

Posts: 755
#22

04 Mar 2022, 10:10

A colleague of mine sent me both files after he encountered the merge issue. I assume he imported a .csv file and saved the result as a .dta file, but I’ll try to confirm with him and report back here.
Justin Niakamal

Join Date: Aug 2017

Posts: 755
#23

04 Mar 2022, 19:43

Hua Peng (StataCorp), as suspected, the data was a .csv file that was imported into Stata and then saved as a .dta file.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 333
#24

04 Mar 2022, 20:11

Justin Niakamal, thanks for the information.