Merging database with unique identifier

Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#16

17 Feb 2023, 15:14

Your example datasets fall into the last category.

Maybe, maybe not. I think it is more likely that the example datasets fall into none of these categories because they are just wrong.

If there are, indeed, supposed to be two different observations with ID = "111111110" in each data set, then, yes, the data are correct and -joinby- is appropriate. In that case, the results will have both observations from each data set paired with both of those observations in the other. In other words, there will be 2x2 = 4 observations with ID = "111111110" in the resulting dataset. If that is what is intended, go ahead and use -joinby-. It's legal, but it's uncommon. And given that all the other values of ID occur only once, I think it is more likely that we are looking at bad data. Only O.P. can tell for sure.
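To make the 2x2 = 4 arithmetic concrete, here is a minimal sketch using made-up data. The variables a and b and their values are assumptions for illustration only, not O.P.'s data. Run it as a single do-file, because the tempfile names are local macros.

```stata
* Toy master dataset: ID "111111110" appears twice (hypothetical values)
clear
input str9 ID byte a
"111111110" 1
"111111110" 2
"222222222" 3
end
tempfile dataA
save `dataA'

* Toy using dataset: the same ID also appears twice
clear
input str9 ID byte b
"111111110" 10
"111111110" 20
"222222222" 30
end
tempfile dataB
save `dataB'

* joinby forms every pairwise combination within each value of ID
use `dataA', clear
joinby ID using `dataB', unmatched(both)

* Two observations on each side give 2 x 2 = 4 joined observations
count if ID == "111111110"
assert r(N) == 4
list, sepby(ID)
```

Running -duplicates report ID- on each dataset before joining is a quick way to see whether any repeated IDs exist in the first place.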
sandeep kaur

Join Date: Jul 2022

Posts: 60
#17

17 Feb 2023, 15:21

Thanks everyone.

There were a few duplicates in the dataset of 20,000 IDs. I ended up cleaning it, and there is now only one observation for each ID.

The identifier variable (ID) is distinct in both datasets, and there should not be any duplicates. A 1:1 merge works as well.

I appreciate all of your help. It cleared up many doubts about dealing with large datasets and long ID numbers.
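For anyone following along, one way to confirm that an identifier is distinct before a 1:1 merge is sketched below. The toy data and the variables a and b are assumptions for illustration. Run it as a single do-file.

```stata
* Toy dataset A with one observation per ID (hypothetical values)
clear
input str9 ID byte a
"111111110" 1
"222222222" 2
end
isid ID                 // exits with an error if ID is not unique
tempfile dataA
save `dataA'

* Toy dataset B, also one observation per ID
clear
input str9 ID byte b
"111111110" 10
"333333333" 30
end
duplicates report ID    // tabulates how many copies of each ID exist
isid ID

* merge 1:1 itself stops with an error if ID is duplicated on either side
merge 1:1 ID using `dataA'
tabulate _merge
```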

Regards
Sandeep
William Lisowski

Join Date: Dec 2014

Posts: 10150
#18

17 Feb 2023, 17:00

1) What does distinct mean? ID var is present in both datasets. Does it mean all ID's are different and no duplicates?

Distinct means that any two observations in the dataset have different value(s) for their identifier variable(s). (If, instead of the single ID you have, you are matching on ID and year, the IDs can be the same as long as the years are different, and the years can be the same as long as the IDs are different.) As Clyde pointed out, that is not the case in either of your example datasets.
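As a small illustration of that definition, the sketch below uses made-up data in which ID alone is not distinct but the combination of ID and year is. The variable year and all values are assumptions for illustration.

```stata
* Toy panel: ID repeats across years (hypothetical values)
clear
input str9 ID int year
"111111110" 2020
"111111110" 2021
"222222222" 2020
end

* The pair (ID, year) is distinct, so this passes silently
isid ID year

* ID alone is not distinct, so isid returns a nonzero error code
capture isid ID
local rc = _rc
display "return code from isid ID: `rc'"
assert `rc' != 0

* Show how many observations share each value of ID
duplicates report ID
```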

2) How can data can inserted in command?

Code:

. use `dataA', clear
. joinby ID using `dataB', unmatched(both)

I do not understand what this asks. Perhaps I have confused you by putting your two example datasets into temporary files ("tempfile"s) which are referred to by the local macros dataA and dataB. Reading the output of

Code:

help joinby

should make the syntax of the joinby command clearer.
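If the question was how to run the same commands on your own saved datasets rather than on temporary files, a sketch might look like the following. The file names dataA.dta and dataB.dta, the variables a and b, and all values are hypothetical. The first two blocks exist only so that the example runs on its own, and they write two small files to the current working directory.

```stata
* Create two small example files on disk (hypothetical names and values)
clear
input str9 ID byte a
"111111110" 1
"222222222" 2
end
save "dataA.dta", replace

clear
input str9 ID byte b
"111111110" 10
"333333333" 30
end
save "dataB.dta", replace

* Same commands as above, with file names in place of the local macros
use "dataA.dta", clear
joinby ID using "dataB.dta", unmatched(both)
list, sepby(ID)
```

With real data, only the last three commands are needed, with the file names replaced by the paths to your own datasets.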
sandeep kaur

Join Date: Jul 2022

Posts: 60
#19

17 Feb 2023, 17:21

Thanks, William Lisowski. It totally makes sense to me now.