Merging database with unique identifier

sandeep kaur

Join Date: Jul 2022

Posts: 60
#1

Merging database with unique identifier

15 Feb 2023, 13:52

I am trying to merge two databases with one unique identifier with 9 numeric digits. But on merging as one to one key variable it's showing error.

For both files I transferred the data from excel sheet to stata in string format to avoid the exponent number. And in stata I encoded string to numeric.
For both files its formatted as long %9.0g

I rechecked unique identifier in both dataset contains 9 digits.

Last edited by sandeep kaur; 15 Feb 2023, 14:18.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

15 Feb 2023, 14:13

And in stata I encoded string to numeric.

That was a mistake. Unless the two data sets contain exactly the same set of unique identifiers, this will result in the -encoded- values being different in the two data sets, and when you then -merge- you will get mismatched pairs.

Why don't you just leave the identifiers as string variables and then merge on that? Unless the problem is that one of the data sets has leading zeroes and the other doesn't, or something like that, that will be the safest way to go. If you need a numeric id variable as well, then after they are already merged, it is safe to -encode- them (or use -egen, group()-).

For the future, it is a lot easier to resolve problems when you actually show example data (for both data sets). Even the best description in words invariably falls short in some ways. We have a command, -dataex-, that makes it utterly simple to post example data here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#3

15 Feb 2023, 14:18

**Correction
For both files formatted as long %9.0g
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#4

15 Feb 2023, 14:31

Hi Prof Clyde Schechter

I tried merging as string variable first but was not successful and there are no leading zeroes in either of dataset.

I will install dataex
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#5

15 Feb 2023, 14:52

When you post back, please explain what "not successful" means. Did Stata crash? Or hang? Did it give you an error message; if so what was it? Did it do the merge but produce results that were not what you wanted? If so, show the results and, unless it is blatantly obvious, explain how the results differ from what you wanted.

Are you sure you don't already have -dataex- installed? As I pointed out in #2, unless you are running an old version that has not been completely updated, or a truly archaic version of Stata (fully updated or not), you already have it.

Last edited by Clyde Schechter; 15 Feb 2023, 14:55.
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#6

15 Feb 2023, 21:39

Hi Prof Clyde Schechter

one dataset has more observation's than the other but both have common identifier which is ID.
a) I was trying to match using one to one key variable option and two datasets did not merge at all and was showing error r495. Result was mismatched pairs
b) Then I tried many to many on key variable option and datasets merged completely without any error. Result was matched pairs.

I had to convert string format to numeric as matching/merging did not happen with string.

Thanks for providing information on dataex- I already had it on my software but never used it earlier. It's confidential dataset, was not sure how to create fake data giving insight into problem.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#7

15 Feb 2023, 22:48

Then I tried many to many on key variable option and datasets merged completely without any error.

I'll bet it didn't. m:m merges produce data salad and match observations that have nothing to do with each other. The only time an m:m merge produces correct results is if a 1:m or m:1 merge would run, or in some other extremely unusual circumstances that almost never arise in real life. You probably have a garbage data set at this point. You should never use m:m merging. It even says that in the Stata documentation!

Go back and do it over. Try

Code:

use data_set_with_more_observations, clear merge m:1 common_identifier using data_set_with_fewer_observations

If that doesn't work, try 1:m instead of m:1 and see if that works. If neither of these works, then you have data sets that cannot be merged, or at least cannot be merged without some other variable(s) as part of the -merge- key. Or, there may be errors in the data sets that need to be fixed to make them mergeable. If they cannot be merged at all, then there may be other ways to combine them that make sense.

If the above does not solve your problem, when you post back, show the exact code and error messages or other output that Stata gives you. Also show example data from both data sets (using -dataex-) that reproduces the problem you are having. Also make sure that the examples you show from the data sets produce the same error message(s) that you get using your whole data sets, and also that they contain some observations that should pair with each other. All of that information will be needed to provide further help.

But, please, do not move forward with the data set you created with the m:m merge. If it really is correct, then either the 1:m or m:1 merge will run, and give you the same correct results, and you can be confident you are ready to proceed. But if neither of those works, it is definitely wrong, and somewhere down the line, the problems with it, which are not apparent to you now, will surface. You'd better hope that happens before somebody is harmed by relying on your incorrect analyses.

And, please, never use -merge m:m- again. If you think you need to use it, then either your data are not what you think they are, or you need to use some command other than -merge-. And be thankful to Stata for alerting you to the problem through error messages.

Last edited by Clyde Schechter; 15 Feb 2023, 22:52.
2 likes
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#8

16 Feb 2023, 10:23

Thanks Prof for bringing that to my attention.

I tried with 1:m and m:1. Files merged in both ways. Stata result was that all observation matched without giving any error message.
But I checked all merged files and found many observations haven't correctly match even though stata mentioned matched.

I guess at this point as I don't see any error codes won't be able to produce example by dataex. I will review original data file as there is only one identifier with which I can merge files. But if that does not work either have to find another way.

Regards
Sandeep
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

16 Feb 2023, 10:55

We know from post #1 that you started with string variables and initially used encode to convert them to numeric.

And in stata I encoded string to numeric.

And in post #2 you were told

That was a mistake. Unless the two data sets contain exactly the same set of unique identifiers, this will result in the -encoded- values being different in the two data sets, and when you then -merge- you will get mismatched pairs.

In post #6 you told us

I had to convert string format to numeric as matching/merging did not happen with string.

Did you again use encode to convert them? Because it is still a mistake.

In post #8 you told us

But I checked all merged files and found many observations haven't correctly match even though stata mentioned matched.

I am guessing that the incorrectly encoded IDs did match. Or else that you misinterpreted the output of the merge command.

You very much need to use dataex to provide example data from each of your two datasets. Let us suppose that your string common identifier is IDstr and your numeric common identifier is IDnum (you should substitute the actual variable names for these in the command below)

Code:

sort IDstr dataex IDstr IDnum in 1/50

and paste both sets of output into your next post, that would go a long way to helping us tell you what went wrong and what you should do to correctly merge your datasets.
2 likes
Comment

sandeep kaur

Join Date: Jul 2022
Posts: 60

#10

17 Feb 2023, 13:27

DATASET 1
. dataex ID ID_n Var1 var2 in 1/10

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 ID long ID_n str1(Var1 var2)
"111111110" 111111110 "0" "0"
"111111110" 111111110 "0" "0"
"111111111" 111111111 "1" "0"
"111111112" 111111112 "1" "0"
"111111113" 111111113 "0" "0"
"111111114" 111111114 "1" "1"
"111111116" 111111116 "0" "1"
"111111117" 111111117 "1" "1"
"111111118" 111111118 "0" "0"
"111111119" 111111119 "1" "0"
end

------------------ copy up to and including the previous line ------------------

Listed 10 out of 10 observations

DATASET 2
. dataex ID ID_n Var1 var2 var3 in 1/17

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 ID long ID_n str1(Var1 var2 var3)
"111111111" 111111111 "1" "0" "1"
"111111112" 111111112 "1" "0" "1"
"111111110" 111111110 "0" "0" "1"
"111111113" 111111113 "0" "0" "1"
"111111114" 111111114 "1" "1" "1"
"111111116" 111111116 "0" "1" "2"
"111111117" 111111117 "1" "1" "1"
"111111118" 111111118 "0" "0" "2"
"111111119" 111111119 "1" "0" "2"
"111111110" 111111110 "0" "0" "2"
"111111122" 111111122 "1" "0" "2"
"111111121" 111111121 "1" "0" "1"
"111111100" 111111100 "0" "0" "2"
"111111123" 111111123 "0" "0" "1"
"111111124" 111111124 "1" "1" "1"
"111111125" 111111125 "0" "1" "2"
"111111128" 111111128 "1" "1" "2"
end

------------------ copy up to and including the previous line ------------------

Listed 17 out of 17 observations

Comment

sandeep kaur

Join Date: Jul 2022

Posts: 60
#11

17 Feb 2023, 13:33

Thanks Prof's

It's helpful- critical to learn merging data correctly. For future references

1) ) How to expand all the digits in numeric format. I see whenever there are 6 or more digits it's condensed into exponent form under numeric format. Under string format all digits are visible. That's what it lead to confusion, hence errors

2) What is the best way to convert string with number to numeric format?
3) How encoding effects the string format?

4) Is there a way to confirm if the observations have been correctly matched? for example distinct command

Thanks for all your help.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#12

17 Feb 2023, 14:19

1) How to expand all the digits in numeric format. I see whenever there are 6 or more digits it's condensed into exponent form under numeric format.

Using your first dataset with 10 observations for this example, we see that assigning a different numeric format to a numeric variable changes how it is displayed.

Code:

. // show that ID_n has a "general" numeric format
. describe ID_n

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------
ID_n            long    %12.0g                

. tab ID_n

       ID_n |      Freq.     Percent        Cum.
------------+-----------------------------------
   1.11e+08 |          2       20.00       20.00
   1.11e+08 |          1       10.00       30.00
   1.11e+08 |          1       10.00       40.00
   1.11e+08 |          1       10.00       50.00
   1.11e+08 |          1       10.00       60.00
   1.11e+08 |          1       10.00       70.00
   1.11e+08 |          1       10.00       80.00
   1.11e+08 |          1       10.00       90.00
   1.11e+08 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00

. // change the format to a "fixed" numer format
. format %9.0f ID_n

. describe ID_n

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------
ID_n            long    %9.0f                 

. tab ID_n

       ID_n |      Freq.     Percent        Cum.
------------+-----------------------------------
  111111110 |          2       20.00       20.00
  111111111 |          1       10.00       30.00
  111111112 |          1       10.00       40.00
  111111113 |          1       10.00       50.00
  111111114 |          1       10.00       60.00
  111111116 |          1       10.00       70.00
  111111117 |          1       10.00       80.00
  111111118 |          1       10.00       90.00
  111111119 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00

See the output of help format for more details.

2) What is the best way to convert string with number to numeric format?

To get the warning out of the way first, do not use the encode command, which is designed for assigning numerical codes to non-numeric strings like "France", "Germany", "United States". The output of help encode instructs us

Code:

Do not use encode if varname contains numbers that merely happen to be stored as strings;
instead, use generate newvar = real(varname) or destring; see real() or [D] destring.

With that said, my preference is using destring because it takes care to create the new variable in with the appropriate characteristics.

Code:

 input str20 (a b c)

                        a                     b                     c
  1. "9" "999999999" "999999999999"
  2. end

. destring a b c, replace
a: all characters numeric; replaced as byte
b: all characters numeric; replaced as long
c: all characters numeric; replaced as double

3) How encoding effects the string format?

Consider the following example.

Code:

. destring string, generate(right)
string: all characters numeric; right generated as byte

. encode string, generate(wrong)

. // wrong looks correct
. list, clean 

       string   right   wrong  
  1.        1       1       1  
  2.        2       2       2  
  3.        3       3       3  
  4.       11      11      11  
  5.       12      12      12  
  6.       13      13      13  

. // but the actual values encoded are incorrect
. label list wrong
wrong:
           1 1
           2 11
           3 12
           4 13
           5 2
           6 3

. list, clean nolabel

       string   right   wrong  
  1.        1       1       1  
  2.        2       2       5  
  3.        3       3       6  
  4.       11      11       2  
  5.       12      12       3  
  6.       13      13       4

Is there a way to confirm if the observations have been correctly matched?

Be sure you use the correct command -

merge 1:1 if the identifier is distinct in both datasets
merge 1:m if the identifier is only distinct in the primary dataset
merge m:1 if the identifier is only distinct in the using dataset
joinby if identifier is not distinct in either dataset.

Your example datasets fall into the last category.

Code:

. describe using `dataA'

Contains data                                 
 Observations:            10                  17 Feb 2023 16:16
    Variables:             3                  
------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------
ID              str9    %9s                   
var1A           str1    %9s                   
var2A           str1    %9s                   
------------------------------------------------------------------------------------------------
Sorted by: 

. describe using `dataB'

Contains data                                 
 Observations:            17                  17 Feb 2023 16:16
    Variables:             4                  
------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------
ID              str9    %9s                   
var1B           str1    %9s                   
var2B           str1    %9s                   
var3B           str1    %9s                   
------------------------------------------------------------------------------------------------
Sorted by: 

. use `dataA', clear

. joinby ID using `dataB', unmatched(both)

. list, clean noobs

           ID   var1A   var2A                          _merge   var1B   var2B   var3B  
    111111110       0       0   both in master and using data       0       0       2  
    111111110       0       0   both in master and using data       0       0       1  
    111111110       0       0   both in master and using data       0       0       1  
    111111110       0       0   both in master and using data       0       0       2  
    111111111       1       0   both in master and using data       1       0       1  
    111111112       1       0   both in master and using data       1       0       1  
    111111113       0       0   both in master and using data       0       0       1  
    111111114       1       1   both in master and using data       1       1       1  
    111111116       0       1   both in master and using data       0       1       2  
    111111117       1       1   both in master and using data       1       1       1  
    111111118       0       0   both in master and using data       0       0       2  
    111111119       1       0   both in master and using data       1       0       2  
    111111100                              only in using data       0       0       2  
    111111125                              only in using data       0       1       2  
    111111121                              only in using data       1       0       1  
    111111124                              only in using data       1       1       1  
    111111123                              only in using data       0       0       1  
    111111122                              only in using data       1       0       2  
    111111128                              only in using data       1       1       2  

.

Comment

Daniel Shin

Join Date: Mar 2020
Posts: 146

#13

17 Feb 2023, 14:27

I recently had an issue with merging datasets using match variables that had really high values (integers above 10m+), and I wonder if you're having similar issues. You must check the integrity of the data types, especially if you're converting string numbers (your ID variable) to numeric (your ID_n variable). If you don't specify the data type, you will end up with loss of information. In your data examples, it seems you specified the data type as long integers, but Stata does not use long as a default.

Code:

. gen newID=real(ID)

. gen long newID2=real(ID)

. format newID* %12.0g

. list

     +-------------------------------------------------------------+
     |        ID        ID_n   Var1   var2       newID      newID2 |
     |-------------------------------------------------------------|
  1. | 111111110   111111110      0      0   111111112   111111110 |
  2. | 111111110   111111110      0      0   111111112   111111110 |
  3. | 111111111   111111111      1      0   111111112   111111111 |
  4. | 111111112   111111112      1      0   111111112   111111112 |
  5. | 111111113   111111113      0      0   111111112   111111113 |
     |-------------------------------------------------------------|
  6. | 111111114   111111114      1      1   111111112   111111114 |
  7. | 111111116   111111116      0      1   111111120   111111116 |
  8. | 111111117   111111117      1      1   111111120   111111117 |
  9. | 111111118   111111118      0      0   111111120   111111118 |
 10. | 111111119   111111119      1      0   111111120   111111119 |
     +-------------------------------------------------------------+

One way to check if your ID variables are okay, run the command duplicates report ID in a dataset that contains only unique IDs. If you have duplicates, you must revisit your merge variables.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#14

17 Feb 2023, 15:01

Both William Lisowski and Daniel Shin have given you excellent advice.

That said, as I suggested might be the case in #7, your data is messed up. In data set 1, you have two observations with ID = "111111110". These two observations are in exact agreement on all variables. So this can be solved simply enough by eliminating one of them, although I would not recommend doing that. It usually is a sign of bad data management when a data set contains two identical observations. If the existence of these duplicates is, in fact, due to an error made when this data set was created, there may be other errors as well. So I would do a thorough review of the data management steps that led to this data set and fix any errors you find along the way. For example, one of the two observations with this ID 111111110 might be supposed to be for some other ID that is altogether absent from the data set!

The situation in data set 2 is even worse. You again have two observations with ID = "111111110". But this time, they contradict each other for the value of var3. So definitely one of these observations is wrong, or perhaps even both are. This definitely mandates a complete and thorough review of the creation of this data set and correction of all the errors.

Once you solve those problems, it looks like you will be able to use -merge 1:1 ID- to put these together. And it will work just fine with the string variable ID. The numeric version of ID, which, as was pointed out can be hazardous and inaccurate when we are dealing with 9 digit numbers, is not needed here. If you have some other need for a numeric ID variable later on, you can create it with -destring- (which will take care to give you the data storage type needed to keep all the digits correct), or you can also create a different numeric variable that is in 1-1 correspondence with ID but does not entail using 9 digit numbers by using -egen, group()-. But do that after the data sets are -merge-d, not before.
2 likes
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#15

17 Feb 2023, 15:07

Be sure you use the correct command -
merge 1:1 if the identifier is distinct in both datasets

merge 1:m if the identifier is only distinct in the primary dataset

merge m:1 if the identifier is only distinct in the using dataset

joinby if identifier is not distinct in either dataset.

Your example datasets fall into the last category.
[CODE]

Thanks a lot prof William Lisowski for detailed explanation. To have more clarity:

1) What does distinct mean? ID var is present in both datasets. Does it mean all ID's are different and no duplicates?

2) How can data can inserted in command?
. use `dataA', clear . joinby ID using `dataB', unmatched(both)
Sandeep

Last edited by sandeep kaur; 17 Feb 2023, 15:15.
Comment

Announcement