Pulling together data from multiple tables with multiple entries for each subject

Adi Noiman

Join Date: Feb 2018

Posts: 8
#1

Pulling together data from multiple tables with multiple entries for each subject

03 Dec 2019, 13:00

I have multiple tables that contain multiple records for the same population. I would like to create one data file to describe the population demographics such as sex, race, marital status, etc. but merging one to many or many to one will not work since each subject has multiple records in each table. Is this an instance where merging many to many makes sense? Alternatively, is there code I can use to indicate that I would like the first available record of a variable (e.g. race) in one table to be added to my main table for each subject? Thanks in advance!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29925
#2

03 Dec 2019, 15:57

No, this is definitely not an instance where merging many to many makes sense.

We need to clarify what's going on in these data sets. Why are there mujltiple observations per person in a demographics file? Do you have serial observations over time? If so, then what you probably want here is a series of 1:1 merges with person identifier and time as the merge key variables. If you do not have multiple observations over time on people, then it is hard to see why there would be more than one observation per person in a file of demographic data: the extra observations are, at best, redundant, and probably indicative of poor data management in the creation of that file (so you should not trust the data in that file and should go back and verify that it has correct information). Once you've done that, you can just drop the duplicate observations per person and merge 1:m.
Comment
Adi Noiman

Join Date: Feb 2018

Posts: 8
#3

03 Dec 2019, 20:34

Hi Clyde, allow me to clarify. Each table represents a different registry of electronic health data (e.g. pharmacy data, inpatient care, etc.). Every time a subject has an encounter with the health system, a record is created, but not always in each registry since it depends on the nature of the visit. I am trying to reduce the amount of missing demographic information by pulling from all tables. So for example, the "race" variable may be missing for a subject in one table, but might exist in a different table from a different visit. Perhaps what I need to do is identify the first available record for each variable of interest across the tables and join these into one table. However, I'm not quite sure about the code to make this happen.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29925
#4

03 Dec 2019, 20:54

With this description, I would say that this doesn't sound like a task for -merge- at all. This sounds like a task for -append-. I think you want to go through each data set and drop all of the variables other than the ones you are trying to compile together. Rename the variables so that the same thing is called by the same variable name in each data set. Then append them all together. Now you have everything in one place. You can sort the data chronologically within subject and then -collapse (firstnm)…, by(subject_identifier)- to get the first non-missing value for each subject.

That said, I would be a bit cautious. It is fine if the information is missing in some of the databases and present in the others, as long as it is consistent where it is not missing. My experience working with distributed clinical databases and pulling them together is that inconsistencies are very frequent. The same patient will appear in 4 of the data set, listed as a male in 1, a female in 2, and missing in the last. Race is even worse: the majority of patients will have contradictory race designations in different data sets. Date of birth, surprisingly, is usually less of a problem, perhaps because here in the US they have to put the patient's date of birth on the billing forms in order to get paid. Anyway, I think you should check your data for these kinds of contradictions and then put some thought into how you want to resolve them. Using the first non-missing entry is probably the simplest, but it may not be the best. Sometimes it is better to develop more complicated rules for choosing among contradictory designations of an attribute.
Comment
Adi Noiman

Join Date: Feb 2018

Posts: 8
#5

03 Dec 2019, 21:13

Thanks for your additional thoughts on this! We are already encountering the issue of race changing over time for the same subject. One thought was to go with a frequency approach such that whichever race is reported most often over time for a particular subject is the one that is kept. In any case, thanks again for your consideration!
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#6

08 Jul 2022, 05:02

I have a very similar problem where I have not found the solution yet.

In my data (example from other post) I also have multiple entries for the same population ("entity") but also for different years ("year").

1. With "egen tag" and "egen ndistinct" I can find out how many different "firms" each combination of "entity" and "year" (group?) has.
( I will have to do this not only for "firm" but also for "value" and other variables)

2. Now I would like to know the frequencies for each value in "ndistinct" over all the combinations of "entity" and "year".
(To get an idea of the variation within different variables and think about how to merge them in the most sensible way).

3. Last I need to make sure there is only one observation per combinations of "entity" and "year" and therefore combine (merge, collapse, append...?) the values of other variables. If there is a most frequent value, choose that one, if there is not, choose the highest or first, or in some other cases the mean or sum (to be defined for each variable).

I really struggle to find a way how to do that. I hope my example is clear.

Original data

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end sort entity year list, sepby(entity year)

1. Question

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end egen tag = tag(firm entity year) egen ndistinct = total(tag), by(entity year) sort entity year list, sepby(entity year)

2. Question -> This is the table I would like to obtain

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte ndistinct freq 1 3 2 1 3 1 4 1 end sort ndistinct freq list, sepby(ndistinct freq)

3. Question?

See also:
https://www.statalist.org/forums/for...f-similar-type
https://www.statalist.org/forums/for...iven-condition
https://www.statalist.org/forums/for...cases-by-group

Last edited by rezza cav; 08 Jul 2022, 05:54.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29925
#7

08 Jul 2022, 09:55

Code:

collapse (first) ndistinct, by(entity year) contract ndistinct list, noobs clean
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#8

11 Jul 2022, 01:07

Dear Clyde

Thank you for your feedback. I didn't mean to spam the list, I'm sorry. The various threads all have similar questions to mine, but could only provide part of the answer I'm looking for, so I didn't know where it would fit best.

Now for your answer, the code fits exactly to answer my second question, thank you. I now know that more than two different companies per "entity" and "year" is relatively rare.

However, I now want to shrink the data set (Question 3) to one observation per group (combination of "entity" and "year") with the following condition: Choose most frequent value for "firm" within a group (group 1-5 in my example). If there isn't a value that occurs most often, then select company "A".

The cleaned data should then look like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm 1 2010 "A" 1 2011 "B" 1 2012 "B" 2 2010 "A" 2 2011 "A" 2 2012 "A" end sort entity year list, sepby(entity year)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29925
#9

11 Jul 2022, 08:52

The example data you show doesn't really provide a good substrate for developing and testing this code as you have only one observation for each entity-year combination. So I have modified your data, and also added some observations to create situations where the code can actually do something.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm 1 2010 "A" 1 2010 "B" 1 2010 "B" 2 2011 "A" 2 2011 "A" 2 2011 "A" 3 2010 "A" 3 2010 "B" 3 2010 "B" 3 2010 "A" end sort entity year list, sepby(entity year) by entity year firm, sort: gen freq = _N by entity year freq (firm), sort: gen n_firms = sum(firm != firm[_n-1]) by entity year (freq firm): replace firm = cond(n_firms[_N] == 1, firm[_N], "A")

If there isn't a value that occurs most often, then select company "A".

I have to say that I don't quite understand this condition: it implies that company "A" is a part of every entity in every year. I would have thought that in the event that there is no single modal firm within an entity-year that one would choose some way to break the tie(s) based on some other attributes of the firms such as their revenues or number of employees or something like that. But you asked for "A", so I gave you "A".
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#10

12 Jul 2022, 00:00

Dear Clyde

Thank you for your suggestion. I am getting closer but am not quite there yet, I guess I didn't specify my problem well enough:

The original data is still the one from the example above, with 14 observations (see table 1 below). Now I would like to obtain a "cleaned" dataset with only 6 observations, one per combination of "entity" and "year" (see table 2 below), stating for each the most frequent value for the variable "firm".

Your suggestion to choose some other way to break the tie(s) if there is no firm occuring most often, for example by choosing the firm with the highest value in "value", would be the next step to refine the whole process.

So here is the original data (Table 1):

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm float value 1 2010 "A" 15 1 2010 "A" 8 1 2010 "B" 12 1 2011 "B" 25 1 2012 "B" 8 2 2010 "A" 7 2 2011 "A" 5 2 2011 "A" 12 2 2011 "C" 13 2 2012 "A" 19 2 2012 "B" 25 2 2011 "B" 14 2 2012 "C" 18 2 2012 "D" 16 end sort entity year list, sepby(entity year)

And this is the result I would like to get (Table 2)

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm 1 2010 "A" 1 2011 "B" 1 2012 "B" 2 2010 "A" 2 2011 "A" 2 2012 "A" end sort entity year list, sepby(entity year)

Last edited by rezza cav; 12 Jul 2022, 00:09.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10043

#11

12 Jul 2022, 02:17

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte entity int year str1 firm float value
1 2010 "A" 15
1 2010 "A"  8
1 2010 "B" 12
1 2011 "B" 25
1 2012 "B"  8
2 2010 "A"  7
2 2011 "A"  5
2 2011 "A" 12
2 2011 "C" 13
2 2012 "A" 19
2 2012 "B" 25
2 2011 "B" 14
2 2012 "C" 18
2 2012 "D" 16
end

contract entity year firm
gsort entity year firm -_freq
by entity year: keep if _n==1

Res.:

Code:

. l, sep(0)

    
entity   year   firm   _freq    
    
1.       1   2010      A       2    
2.       1   2011      B       1    
3.       1   2012      B       1    
4.       2   2010      A       1    
5.       2   2011      A       2    
6.       2   2012      A       1

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29925
#12

12 Jul 2022, 09:26

The method shown in #11 does not implement the condition that in the event of a tie for most frequent firm, the tie be resolved in favor of the firm with the highest value. Rather, it will break the tie at random, and irreproducibly on reruns of the same data. The example data does not actually contain an instance where this part of the problem comes into play, so the results shown in #11 are correct for that example, but will not be in other situations.

The following is a complete solution:

Code:

by entity year firm, sort: gen freq = _N gsort entity year -freq -value by entity year: keep if _n == 1
1 like
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10043
#13

12 Jul 2022, 09:34

Correct. As we are sorting by the name of the firm in #11, the tie is broken in favor of the firm appearing first in alphabetical order. That was my reading of the data in #10, but it appears that I misunderstood what "most frequent value" means.

Last edited by Andrew Musau; 12 Jul 2022, 09:40.
Comment
rezza cav

Join Date: Jul 2022

Posts: 13
#14

13 Jul 2022, 07:45

This works great, thank you both!

I am struggling to combine some of the Ideas though, as there are more variables, where different conditions for breakting the ties apply. Let's say I want to keep only one observation per combination of entity and year and choose the most frequent value in all other variables. If there is no most frequent value, then choose :
- for "firm": first value within group
- for "turno" (turnover): highest value within group
- for "dom" (domestic): 1 over 0

This may not make much sense in the example but it does in my data.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte entity int year str1 firm turno dom 1 2010 "A" 15 0 1 2010 "B" 8 0 1 2010 "B" 12 0 2 2011 "A" 25 0 2 2011 "A" 8 1 2 2011 "A" 7 0 3 2010 "A" 12 1 3 2010 "B" 5 1 3 2010 "B" 12 1 3 2010 "A" 8 1 end sort entity year list, sepby(entity year)
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10043
#15

13 Jul 2022, 08:36

In the gsort, you add a minus sign to all the other variables, but the ones appearing first in order take priority.

Code:

gsort entity year -freq -valuex1 -valuex2 -valuex3

Here, the priority will be first assigned to the value of x1, if there is a tie, the value of x2 and so on.
Comment

Announcement

Pulling together data from multiple tables with multiple entries for each subject

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment