Question about merging

Soumya Upadhyay

Join Date: Oct 2016

Posts: 43
#1

Question about merging

04 Jul 2018, 18:36

I have a dataset that contains information about hospitals from years 2011, 2012, 2013, 2014, and 2015. I need to merge it with a dataset with hospital characteristics which comes in a separate dataset for each of the above years. The unique ID is hospital ID. But there are duplicates both in the master and using dataset. These duplicates cannot be removed because sometimes few hospitals have the same ID (because they belong to the same system). How can I accomplish merging these datasets? Please help.
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#2

04 Jul 2018, 18:53

if you really have duplicates in both files, you probably want -joinby-; see

Code:

help joinby
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

04 Jul 2018, 18:54

This is unfortunately not a simple question of Stata merge command syntax. The problem is, you have not identified a "unique ID" for each hospital. You have something that is called a unique ID, but it doesn't identify hospitals uniquely - it identifies in some cases hospital systems, but not the hospitals within them.

The first thing you must do is decide how you would match the observations if you were doing it by hand. Is there a secondary identifier that distinguishes between the different hospitals in the same system? Or some other way of telling them apart, at least within the same year? Or perhaps you want to collapse your data, combining all the data for hospitals in the same system into one observation for each year, adding or averaging or in some other way choosing how to combine each variable.

There's nothing magic that Stata's merge can do for you. Instead, you must decide how you want to solve your problem, and then, if it's not clear to you, we can advise you on the Stata syntax to accomplish that.

Added in edit: I think I disagree with Rich's suggestion; that is, while joinby may be ultimately be the tool that is needed, you first need to decide what you want to do, and I think from your description, you want to match observations that represent the same hospital. The question remains, how can you tell when a pair of observations, one from each dataset, represent the same hospital?

Last edited by William Lisowski; 04 Jul 2018, 18:57.
Comment
Soumya Upadhyay

Join Date: Oct 2016

Posts: 43
#4

07 Jul 2018, 18:46

William,

Thank you for your response. My master and using datasets have different hospitals that have duplicates. As I said, these are not really duplicates because one hospital system with one unique ID has several hospitals with the same ID. It turns out that the same system has hospitals that are present in master dataset but not in the other and vice versa. This is possibly because each dataset is based on a survey, and each of the surveys were taken by different hospitals in the same system.

I separated out each year- 2011, 2012, 2013, 2014, and 2015. Then I merged datasets for each year. Then I appended them. To merge, I had to use m:m.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

07 Jul 2018, 19:58

To merge, I had to use m:m.

Then what you have created is just an unusable jumble of data; throw it in the garbage. Whatever you do, don't analyze it. -merge m:m- is not the solution to your problem. (It isn't the solution to any problem that actually arises in real life--I have been using Stata since 1994 and have only once encountered a situation where -merge m:m- would produce correct results. Even in that case, there was a better way to get the same result.)

Re-read the advice you have been given in #3 and #2 (though I am inclined to think that #3 is closer to what you need than #2.)
Comment
Soumya Upadhyay

Join Date: Oct 2016

Posts: 43
#6

13 Jul 2018, 00:24

I did throw that dataset in the garbage. I cross checked my duplicates between the two (master and using ) datasets. I removed the duplicates from the using dataset because I didn't need them. Then I used m:1 to do my mergers. Finally, I appended the years.
Now, I need to show a lag effect of a variable in 2011 ( variable name=diversity training) on a variable in 2012 and 2013 (variable name=quality). How could I do that?
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#7

13 Jul 2018, 01:00

Soumya:
from your description, I'm not clear with your next research steps.
Lagging and leads are feasible via -help tsvarlist-.
I would advise you to provide further details about the statistical strategy of your analysis (a regression on a unique wave of data? A panel data regression? Else?).

Kind regards,
Carlo
(Stata 19.0)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

13 Jul 2018, 12:52

I need to show a lag effect of a variable in 2011 ( variable name=diversity training) on a variable in 2012 and 2013 (variable name=quality). How could I do that?

Carlo is quite right; you don't provide enough information for a confident and specific answer. But you will probably want to do something along these lines:

Code:

xtset ID year one_of_the_xt_regression_commands quality L1.diversity_training /*perhaps other variables*/, /*?fe or ?re, ?other options*/

Read -help xt- and the -xt- chapter of the [XT] volume of the PDF documentation installed with your Stata that is linked in the -help xt- file.
1 like
Comment

Announcement

Question about merging

Comment

Comment

Comment

Comment

Comment

Comment

Comment