Populate unique value for a variable based on unique value of another variable

Cody Hitchcock

Join Date: Aug 2020

Posts: 3
#1

Populate unique value for a variable based on unique value of another variable

22 Aug 2020, 09:25

Hi,

I have a matched dataset of ~150k patients, with each matched pair consisting of one case and one control. I used ccmatch; cases and controls were matched on patient_id, a string variable unique to each patient. Each matched pair has a unique value for the variable match.

Now I have appended a new file to this dataset. Content from the new file has many (but not all) of the patient_ids in the existing dataset, plus some additional patient_ids (who I will want to discard). Since matching was not done on the new file, the match variable value is missing for all those patient_ids. I want to populate match for the newly-appended patient_ids who are in the existing dataset, i.e., for the matched cases and controls. After that, I will drop the excess patients who came from the new file (i.e., those patients not matched in the original dataset). The excess patients should be easy to identify since at that point, the non-cases and non-controls should all have missing match values and should be the only patients who have missing match values.

My question: how can I populate a unique value for one variable based on the unique value of another variable? Specifically, how can I populate missing match values, based on existing match and patient_id values? I am thinking this may start with a replacement of match if match==., based on patient_id, but am unsure exactly how to write this out.

An example, using match value 5565, is below. First is the case, then the control. In the newly-appended data, patient_ids 12345 and 67890 may be present, but match (and match_id) would be missing.
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#2

22 Aug 2020, 12:01

Because you haven't offered us a data example (See the StataList FAQ on this point, which new participants are asked to read. See also the point that images of data aren't generally very helpful here.), the chances of giving you the answer you need are substantially lower than they would otherwise be. With that caveat, I believe you can get what you want with:

Code:

// Check that observations that share the same patient_id and which have a nonmissing // match value do have the same match value. bysort patient_id (match): gen OK = (match[1] = match[_N]) tabulate OK if !missing(match) // Assign match values to the new observations based on the ones // possessed by the existing observations. bysort patient_id (match): replace match = match[1]

The preceding does this: It sorts the data by patient_id and match. Because missing values sort last (high) in Stata, this will order the data set so that observations sharing a patient_id will be sorted together, with observations missing on match being sorted to the end of each such set. Within each set of observations that share a patient_id, it will then assign the lowest (i.e., nonmissing if available) value of match to all observations that share the same patient_id. The workability of this.

I haven't checked the preceding because I didn't take the effort to create a data example of my own to resemble what I think yours is like.

Again presuming I'm right about what you want, a less tricky and more standard way to accomplish the preceding would have been to use the -merge- command to put the match values onto the "new" observations before combining the two data sets. If -merge- is not a familiar command to you, you'll want to start reading about it.
1 like
Comment
Cody Hitchcock

Join Date: Aug 2020

Posts: 3
#3

22 Aug 2020, 13:18

Hi Mike,

Apologies for the lack of example. I did end up being able to work off your code. When I tried the first line, I received an error message:

equation [1] not found
r(303);

But I ended up using bysort patient_id (match): gen OK = match[1]

...which successfully created the variable OK based on the value of the variable match for all observations.

As for merge--that was actually my initial attempt. But in this case, the initial master dataset had up to multiple rows for each patient_id (each row corresponding to a lab result), and the using dataset also had up to multiple rows for each patient_id (each row corresponding to a comorbidity). So I believe that would have required a m:m merge, which I understand is frowned upon.

Many thanks,
Cody
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#4

22 Aug 2020, 14:26

For Cody and any future reader here: My first line had a typo, with "=" where "==" was needed. This line should not be "fixed" the way Cody did. All that's needed is:

Code:

bysort patient_id (match): gen OK = (match[1] == match[_N])

Cody's substitution

Code:

bysort patient_id (match): gen OK = match[1]

is syntactically OK, but it won't do the job here, as it doesn't check whether there is only one nonmissing value of "match" for each patient_id, which is what my OK was supposed to help do by implicitly checking all observations, 1 to _N, that shared a common patient_id

Regarding Cody's point about a merge: An m:m merge is not needed. I'd do something that's conceptually like this:

Code:

// Only a single copy of each distinct patient_id along with its match value is needed: use originalfile.dta bysort patient_id: keep in 1 keep patient_id match save id_match_map.dta // clear use YourNewFile.dta // put the match variable onto your file to be appended. // and keep only the observations in that new file. merge m:1 id_match_map.dta, keep(1 3) // Combine with all the data in your original file. append using originalfile.dta
Comment

Announcement

Populate unique value for a variable based on unique value of another variable

Comment

Comment

Comment