
  • #16
    To my knowledge, nobody asked for the datasets themselves, just excerpts. If anyone would like the datasets, just message me.



    • #17
      I agree that this seems to point to a problem with the merge command.

      I am losing track of the various permutations, but it appears to me that you have tried three approaches, briefly:
      Code:
      // success
      use test_1
      contract
      merge using test_2
      Code:
      // failure
      use test_2
      merge using test_1
      Code:
      // success
      use test_1
      keep "01001"
      merge using test_2
      where test_1 is the large dataset and test_2 is the smaller one.

      What these have in common is that the two successes start by reducing the number of observations in test_1 in memory and then merging on test_2. The failure starts with test_2 in memory and then merges on the full test_1.

      What we don't know is what happens if you start with the full test_1 in memory and then merge on test_2. At least, I don't think we've seen that; perhaps you have.
      Code:
      use test_1
      merge using test_2



      • #18
        Thanks, William. Same result

        Code:
        . clear all
        
        . use "test_1.dta"
        
        . merge m:1 area_fips using "test_2.dta"
        
            Result                           # of obs.
            -----------------------------------------
            not matched                   107,624,840
                from master               107,623,392  (_merge==1)
                from using                      1,448  (_merge==2)
        
            matched                                 2  (_merge==3)
            -----------------------------------------



        • #19
          Try the following on the full dataset to see if it makes a difference:

          Code:
          clear all
          use "test_1.dta"  
          gen `c(obs_t)' id = _n  
          sort id
          drop id
          save temp.dta, replace
          clear all
          use temp.dta
          desc
          merge m:1 area_fips using "test_2.dta"
          Last edited by Hua Peng (StataCorp); 04 Mar 2022, 09:43.



          • #20
            Thanks, Hua. Yes, that makes a difference.

            Code:
            . clear all
            
            . use "test_1.dta"  
            
            . gen `c(obs_t)' id = _n  
            
            . sort id
            
            . drop id
            
            . save temp.dta, replace
            (note: file temp.dta not found)
            file temp.dta saved
            
            . clear all
            
            . use temp.dta
            
            . desc
            
            Contains data from temp.dta
              obs:   107,623,394                          
             vars:             2                          4 Mar 2022 10:52
            ------------------------------------------------------------------------------------------------------------------------
                          storage   display    value
            variable name   type    format     label      variable label
            ------------------------------------------------------------------------------------------------------------------------
            area_fips       str5    %9s                   
            industry_code   int     %10.0g                
            ------------------------------------------------------------------------------------------------------------------------
            Sorted by: 
            
            . merge m:1 area_fips using "test_2.dta"
            
                Result                           # of obs.
                -----------------------------------------
                not matched                    56,588,999
                    from master                56,588,998  (_merge==1)
                    from using                          1  (_merge==2)
            
                matched                        51,034,396  (_merge==3)
                -----------------------------------------
            
            . 
            . 
            . exit



            • #21
              Do you know how test_1.dta was created? Was it created in Stata, or with some third-party software?

              The issue here is that the dataset was incorrectly marked as sorted on "area_fips" (per the output of -desc-) when it is actually not sorted.
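
              A minimal sketch of one way to check for this condition, assuming the file and variable names used in this thread. It compares the sort order recorded in the dataset with the actual order of the observations; the local macro name "flagged" is only illustrative.

              Code:
              use "test_1.dta", clear
              // sort order Stata believes the data are in (the flag stored with the file)
              local flagged : sortedby
              display "Marked as sorted by: `flagged'"
              // count adjacent observations that violate ascending order of area_fips
              count if _n > 1 & area_fips < area_fips[_n-1]
              // a nonzero count while area_fips is listed above means the sort flag is wrong

              If the count is nonzero, the workaround in #19 (sorting on a temporary id variable and then dropping it) gave a correct merge in #20.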



              • #22
                A colleague of mine sent me both files after he was encountering the merge issue. I assume he imported a .csv and saved the results in .dta, but I’ll try to confirm with him and report back here.



                • #23
                  Hua Peng (StataCorp), as suspected, the data came from a .csv file that was imported into Stata and then saved as a .dta.



                  • #24
                    Justin Niakamal, thanks for the information.
