Matching male and female patients with same/similar lab values

Joliene Post

Join Date: May 2019

Posts: 7
#1

05 May 2019, 06:22

Hi! I'm looking into a sex-related sub question of my bachelor's thesis about heart failure and oncomarkers. Even though male/female values do not seem significantly different at first glance, I would like to create the following:

I want to match a female patient to a male patient who have the same/similar age and same/similar lab values (creatinine and nt-proBNP). I would like to see their oncomarker values.

Thank you in advance!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17678
#2

05 May 2019, 06:57

Joliene:
welcome to this forum.
Without further details about your data, my advice is to take a look at -help propensity-.

Kind regards,
Carlo
(Stata 19.0)
Joliene Post

Join Date: May 2019

Posts: 7
#3

05 May 2019, 07:56

Thank you! I've looked into propensity score matching and I have gotten further. However, I can only use one 'treatment independent' variable, while I want to use 2-3.

there is 1 propensity score less than 1.00e-05
treatment overlap assumption has been violated; use the osample() option to identify the observations

Is there anything I can do about this?
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#4

05 May 2019, 11:14

Propensity score matching is a particular approach to matching, in which you match treated with untreated subjects based on having an aggregate similarity on all variables that are predictive of being treated. So it makes no sense to speak of propensity score matching with multiple treatments. The whole idea of propensity matching is centered around a particular treatment, and its desirable statistical properties stem from that fact. You can, for each treatment, do a new propensity score match, but you can't carry the propensity score match from one treatment over to an analysis of another treatment.

The kind of matching you asked about in #1 is a different approach to matching: you selected the matching variables a priori and not on the basis of their value as predictors of who is and isn't treated. If you want help with this approach to matching, please post back with the additional information needed:

1. How close in age do they have to be to count as similar?
2. How close in nt-proBNP do they have to be to count as similar?
3. How close in creatinine do they have to be to count as similar?

Be warned that the narrower the range you accept for matching, the harder it will be to find matches. If you make your match criteria too stringent, you will have large numbers of females for whom there is no matching male and vice versa. On the other hand, if you make the matching criteria too loose, then the matching loses its power to refine your analysis by reducing the effects of these nuisance variables.
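
To make that trade-off concrete, here is a minimal sketch on made-up toy data (the five patients and all their values are invented for illustration). It uses official Stata's -cross- to form every female-male pair, which is only sensible for a tiny example, and counts how many females have at least one admissible male as the age window widens. The creatinine window of 20 and the age windows of 1, 5, and 12 are arbitrary assumptions, not recommendations.

```stata
// Toy data: all values are made up for illustration
clear
input long patient_id byte sex double(age creat)
1 1 60 80
2 1 50 70
3 0 61 81
4 0 58 95
5 0 40 72
end

// Save the males with _U suffixes so they can sit beside the females
preserve
keep if sex == 0
drop sex
rename (patient_id age creat) (patient_id_U age_U creat_U)
tempfile males
save `males'
restore

// Keep the females and pair each one with every male (toy-sized data only)
keep if sex == 1
cross using `males'

// Assumed creatinine window, held fixed for this illustration
local creat_window 20
foreach age_window in 1 5 12 {
    // A pair is admissible if both differences fall inside the windows
    gen byte ok = abs(age - age_U) < `age_window' & abs(creat - creat_U) < `creat_window'
    // Flag females with at least one admissible male
    egen byte any_ok = max(ok), by(patient_id)
    egen byte first = tag(patient_id)
    quietly count if first & any_ok
    display "age window +/- `age_window': " r(N) " of 2 females matched"
    drop ok any_ok first
}
```

With these toy values the number of matched females rises from 0 to 1 to 2 as the age window widens, which is the pattern to expect in real data as well: wider windows, more matches, less similarity within pairs.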

Also provide an example of your data using the -dataex- command. In choosing the sample to show with -dataex-, please be sure to include some males and some females, and, in particular, some that satisfy the matching criteria and some that do not.

If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

Joliene Post

Join Date: May 2019
Posts: 7

05 May 2019, 13:29

I have never used dataex before. Is this what you mean?

Code:

 * Example generated by -dataex-. To install:	ssc	install dataex clear input double CA125UmL float age double(creat	sex	ntprocobas) 49.36  63.45517      48 1  1686 21.26  47.81656  141.44 0  2851 .  62.77071  123.76 0 18853 66.7  62.95688     107 0 16778 5.29   53.4538   79.56 0 360.8 3.03  50.27515    78.4 0 406.8 69.31  48.62149    77.4 0 125.6 .  48.16427      51 0 340.9 11.43  50.04244     122 0 10612 7.71   58.5462  87.516 0   360 288.4  63.48255     150 0 23014 19.38  62.07529   117.6 1 962.5 39.1  49.62902    88.4 0  1705 11.22  47.47707      77 0 117.8 45.98 33.054073     100 0  4924 16.9  61.04312 104.312 0 257.5 15.33  51.34291   79.56 0 957.6 693.8  61.84805     600 0 35000 12.63  63.19781 106.964 0  1055 .  40.89254   97.24 0 194.2 9.54  40.60233    88.4 0  2711 .  42.59001     159 0  4991 373.5 64.057495     125 0  5116 65.07 37.659138  106.08 0  2233 470.7  57.74127     213 0  5134 52.11  55.95346      79 1  2394 103.8  64.12868   79.56 0  1073 13.83   57.4976    78.6 0  1101 11.18  53.97673   97.24 0  1499 320.6    42.705  106.08 0 16440 98.41  51.47707 169.728 0  1722 .   54.3655  81.328 0     . 9.61  58.29158    78.2 1  2938 27.78  63.12115     121 0  3590 6.97 64.257355 104.312 0  1524 21.81  54.99795      74 0  2819 10.82  62.77892      87 0  1401 160.1  59.32101     110 0  2837 5.92  51.64682    88.4 0     . 7.2  61.75496      79 0 394.5 .  53.89733   70.72 0     . 15.06  62.72964     141 0  3567 .  56.44353      53 0  5339 104.9  63.66325   97.24 0  1246 8.63  45.69473      56 0 475.3 219.2  56.91992  141.44 0  3506 49.53  58.04791     105 0  4038 84.35  63.68241      56 0  5143 7.61  61.60164 146.744 1  2575 6.76   54.4668      61 1 295.7 8.92   57.4319      87 1  1150 175.7  43.50445      83 0  2603 138.7  47.21424   107.9 0  8781 289.1   60.1013     279 0 29639 23.61  63.56468      93 0     . 13.17  59.87132      79 0  2578 88.48  58.05339      75 0  4957 17.29  58.69678 112.268 0 848.7 19.42  51.44422    91.3 0 84.89 .  
59.46886  123.76 1     . 384.2  53.14716   97.24 0  1253 36.72  55.10746      62 0 331.2 9.55   62.5462      94 1  4665 39.26  53.87269      64 1     . 570.6   57.0705      91 0  2397 173.2    57.577  85.748 0  1481 24.27  51.25257      43 1  2985 .  51.84942  91.936 0     . .  48.19986      92 0  2202 64.49  54.46407     144 0  2686 18.4  55.53456      81 0     . 359.6  40.96646     152 0  4142 162.7  62.35729  65.416 0  1208 .  63.49076      88 0  3896 1366  63.56194  159.12 0  2601 7.82  56.34771     120 0 184.1 10.94  59.26899      54 0   175 12.49  55.38672    88.4 0 41.58 10.36  58.95414     113 0  1053 12.62  62.83368      93 0 186.8 .  64.08214      95 1     . .  52.69268   79.56 0     . 17.23  63.23066      75 0  5396 11.83   61.7358  77.792 0 12755 6.11  62.93497 144.092 0 533.5 242.9  59.03354      76 1  1365 93.83 64.054756      97 0  9616 19.21  57.63997  95.472 0 699.2 28.24  35.14579   154.7 0  4121 14.04  63.54004 934.388 0 11987 15.72  56.55852      72 1  1832 .  55.24435       . 0     . 260.8  52.06571   79.56 1  1947 139.2  40.99932      84 0  1077 9.29  54.67214    95.3 0 297.7 .  58.89391      96 1  2637 13.71  43.61123      80 1 244.6 19.53  62.87474      71 0  1649 431.8 26.743326 109.616 0  4881 279  50.81451      72 1  5414 end label values sex sex label def sex 0 "male", modify label def sex 1 "female", modify


Joliene Post

Join Date: May 2019

Posts: 7
#6

05 May 2019, 13:35

Thanks for your reply by the way! I am not sure whether I can statistically answer your questions regarding the range of similarity, but let's say:
1. same age is +/- 2 years
2. nt-probnp is +/- 20.0
3. creat is +/- 2.00

Clyde Schechter

Join Date: Apr 2014
Posts: 29964

05 May 2019, 14:21

Thank you. Somehow your -dataex- output got mangled, with everything coming out on one line. I was able to parse it manually and run it. The code below includes the -dataex- fixed up.

There is no statistical answer to the questions I posed earlier. They represent pragmatic judgments. On the one hand the windows should be narrow enough that people within those windows are similar in a clinically meaningful sense. On the other hand, narrow windows mean fewer possible matches and more people ending up with no admissible match.

As it turns out, there are no permissible matches in your data example with the windows you give. Just to illustrate how the code works, I have revised the age window to +/- 10 years, and the creat window to +/- 50. Those are probably too wide to call people in those windows clinically similar. I imagine that in your full data set, you will be able to use somewhat narrower windows than that. But it is likely that the windows you propose in #6 are too strict to match an adequate number of patients in your real data. Anyway, the code segregates those window definitions into three lines of code near the middle, so it is easy enough to make the changes in just those lines and experiment to see if you can find "the sweet spot" where you get a reasonable number of matches and the windows are small enough to make clinical sense.

To run this code you will need to install the -rangejoin- command, written by Robert Picard. It is available from SSC. To use -rangejoin- you also need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

In #4, I neglected to ask you to include the patient identifier variable. Such a variable is needed here. So I've just created an arbitrary one early in the code. You presumably have such a variable already. So you can delete that line of the code, and then replace all references to variable patient_id with the name of your actual patient identifier variable.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double CA125UmL float age double(creat sex ntprocobas)
49.36  63.45517      48 1  1686
21.26  47.81656  141.44 0  2851
    .  62.77071  123.76 0 18853
 66.7  62.95688     107 0 16778
 5.29   53.4538   79.56 0 360.8
 3.03  50.27515    78.4 0 406.8
69.31  48.62149    77.4 0 125.6
    .  48.16427      51 0 340.9
11.43  50.04244     122 0 10612
 7.71   58.5462  87.516 0   360
288.4  63.48255     150 0 23014
19.38  62.07529   117.6 1 962.5
 39.1  49.62902    88.4 0  1705
11.22  47.47707      77 0 117.8
45.98 33.054073     100 0  4924
 16.9  61.04312 104.312 0 257.5
15.33  51.34291   79.56 0 957.6
693.8  61.84805     600 0 35000
12.63  63.19781 106.964 0  1055
    .  40.89254   97.24 0 194.2
 9.54  40.60233    88.4 0  2711
    .  42.59001     159 0  4991
373.5 64.057495     125 0  5116
65.07 37.659138  106.08 0  2233
470.7  57.74127     213 0  5134
52.11  55.95346      79 1  2394
103.8  64.12868   79.56 0  1073
13.83   57.4976    78.6 0  1101
11.18  53.97673   97.24 0  1499
320.6    42.705  106.08 0 16440
98.41  51.47707 169.728 0  1722
    .   54.3655  81.328 0     .
 9.61  58.29158    78.2 1  2938
27.78  63.12115     121 0  3590
 6.97 64.257355 104.312 0  1524
21.81  54.99795      74 0  2819
10.82  62.77892      87 0  1401
160.1  59.32101     110 0  2837
 5.92  51.64682    88.4 0     .
  7.2  61.75496      79 0 394.5
    .  53.89733   70.72 0     .
15.06  62.72964     141 0  3567
    .  56.44353      53 0  5339
104.9  63.66325   97.24 0  1246
 8.63  45.69473      56 0 475.3
219.2  56.91992  141.44 0  3506
49.53  58.04791     105 0  4038
84.35  63.68241      56 0  5143
 7.61  61.60164 146.744 1  2575
 6.76   54.4668      61 1 295.7
 8.92   57.4319      87 1  1150
175.7  43.50445      83 0  2603
138.7  47.21424   107.9 0  8781
289.1   60.1013     279 0 29639
23.61  63.56468      93 0     .
13.17  59.87132      79 0  2578
88.48  58.05339      75 0  4957
17.29  58.69678 112.268 0 848.7
19.42  51.44422    91.3 0 84.89
    .  59.46886  123.76 1     .
36.72  55.10746      62 0 331.2
 9.55   62.5462      94 1  4665
39.26  53.87269      64 1     .
570.6   57.0705      91 0  2397
173.2    57.577  85.748 0  1481
24.27  51.25257      43 1  2985
    .  51.84942  91.936 0     .
    .  48.19986      92 0  2202
64.49  54.46407     144 0  2686
 18.4  55.53456      81 0     .
359.6  40.96646     152 0  4142
162.7  62.35729  65.416 0  1208
    .  63.49076      88 0  3896
 1366  63.56194  159.12 0  2601
 7.82  56.34771     120 0 184.1
10.94  59.26899      54 0   175
12.49  55.38672    88.4 0 41.58
10.36  58.95414     113 0  1053
12.62  62.83368      93 0 186.8
    .  64.08214      95 1     .
    .  52.69268   79.56 0     .
17.23  63.23066      75 0  5396
11.83   61.7358  77.792 0 12755
 6.11  62.93497 144.092 0 533.5
242.9  59.03354      76 1  1365
93.83 64.054756      97 0  9616
19.21  57.63997  95.472 0 699.2
28.24  35.14579   154.7 0  4121
14.04  63.54004 934.388 0 11987
15.72  56.55852      72 1  1832
    .  55.24435       . 0     .
260.8  52.06571   79.56 1  1947
139.2  40.99932      84 0  1077
 9.29  54.67214    95.3 0 297.7
    .  58.89391      96 1  2637
13.71  43.61123      80 1 244.6
19.53  62.87474      71 0  1649
431.8 26.743326 109.616 0  4881
  279  50.81451      72 1  5414
end
label values sex sex
label def sex 0 "male", modify
label def sex 1 "female", modify

gen long patient_id = _n    // SKIP THIS IF YOU ALREADY HAVE A patient_id VARIABLE

//    MAKE A FILE OF JUST MALES
preserve
tempfile males
keep if sex == 0
save `males'

//    AND NOW GET JUST THE FEMALES
restore
keep if sex == 1

//    DEFINE WINDOW RADII
//    YOU CAN CHANGE THE DEFINITIONS HERE 
local age_window 10
local ntprocobas_window 20
local creat_window 50

//    WILL FIRST JOIN RESTRICTING AGE
rangejoin age -`age_window' `age_window' using `males'
keep if !missing(patient_id_U)
//    NOW RESTRICT ON NTPROCOBAS
keep if abs(ntprocobas-ntprocobas_U) < `ntprocobas_window'
//    NOW RESTRICT ON CREATININE
keep if abs(creat-creat_U) < `creat_window'

//    ASSUMING YOU WISH TO MATCH JUST ONE MALE TO EACH FEMALE
set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle = runiform()
by patient_id (shuffle), sort: keep if _n == 1

By the way, looking at the distribution of the ntprocobas variable, which is very skew, you might consider log-transforming it for purposes of matching. (Or, equivalently, for this variable base the similarity criterion on a ratio rather than a difference in the untransformed variable.)
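
As a minimal sketch of that equivalence, the toy example below checks that a window on the difference of logs gives the same answer as a window on the ratio. The four value pairs, the variable names nt_f and nt_m, and the tolerance of a factor of 1.2 are all assumptions made up for illustration.

```stata
// Toy female/male nt-proBNP pairs: values are made up for illustration
clear
input double(nt_f nt_m)
1000 1150
1000 1300
 200  170
 200  150
end

// Assumed tolerance: the two values must be within a factor of 1.2
local ratio_window 1.2

// Criterion on the log scale: difference of logs inside ln(1.2)
gen byte ok_log = abs(ln(nt_f) - ln(nt_m)) < ln(`ratio_window')

// Same criterion on the raw scale: larger/smaller ratio below 1.2
gen byte ok_ratio = max(nt_f/nt_m, nt_m/nt_f) < `ratio_window'

// The two definitions agree observation by observation
assert ok_log == ok_ratio
list
```

In the matching code above, this would amount to replacing the ntprocobas line with something like keep if abs(ln(ntprocobas) - ln(ntprocobas_U)) < ln(1.2), again with 1.2 as a placeholder to be chosen on clinical grounds.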


Joliene Post

Join Date: May 2019

Posts: 7
#8

05 May 2019, 14:56

Thank you! I will log-transform nt-proBNP. I already have a patient_id variable.

I get the following error:
rangejoin age -`age_window' `age_window' using `males'
extra argument after keyvar low high: 10

Clyde Schechter

Join Date: Apr 2014
Posts: 29964

05 May 2019, 16:40

I can't reproduce the problem you are having. The code runs without difficulty on my setup:

Code:

. gen long patient_id = _n    // SKIP THIS IF YOU ALREADY HAVE A patient_id VARIABLE

. 
. //    MAKE A FILE OF JUST MALES
. preserve

. tempfile males

. keep if sex == 0
(18 observations deleted)

. save `males'
file C:\Users\CLYDES~1\AppData\Local\Temp\ST_1ed0_000002.tmp saved

. 
. //    AND NOW GET JUST THE FEMALES
. restore

. keep if sex == 1
(81 observations deleted)

. 
. //    DEFINE WINDOW RADII
. //    YOU CAN CHANGE THE DEFINITIONS HERE 
. local age_window 10

. local ntprocobas_window 20

. local creat_window 50

. 
. //    WILL FIRST JOIN RESTRICTING AGE
. rangejoin age -`age_window' `age_window' using `males'
  (using rangestat version 1.1.1)

. keep if !missing(patient_id_U)
(0 observations deleted)

. //    NOW RESTRICT ON NTPROCOBAS
. keep if abs(ntprocobas-ntprocobas_U) < `ntprocobas_window'
(1,038 observations deleted)

. //    NOW RESTRICT ON CREATININE
. keep if abs(creat-creat_U) < `creat_window'
(1 observation deleted)

. 
. //    ASSUMING YOU WISH TO MATCH JUST ONE MALE TO EACH FEMALE
. set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED

. gen double shuffle = runiform()

. by patient_id (shuffle), sort: keep if _n == 1
(0 observations deleted)

. 
end of do-file

I don't know what to tell you. Are you sure this is exactly what you ran? Did you copy/paste that directly from your do-file here? I can reproduce that error message (including the specific number 10) if I put a space between the - and the first `age_window'. Did you have an extra space there?
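
A small sketch of why a stray space matters, on the assumption (suggested by the error message, not checked against the -rangejoin- source) that the command splits its arguments on blanks. The macro names good and bad are just for illustration.

```stata
local age_window 10

// As intended: keyvar, low, high
local good age -`age_window' `age_window'

// With a space after the minus sign, "-" becomes a token of its own
local bad age - `age_window' `age_window'

display "good: " wordcount("`good'") " tokens"
display "bad:  " wordcount("`bad'") " tokens"
```

The second version has four blank-separated pieces instead of three, so the final 10 is left over, which is consistent with "extra argument after keyvar low high: 10".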


Joliene Post

Join Date: May 2019

Posts: 7
#10

06 May 2019, 02:38

Hi Clyde,
As I was using Citrix to access Stata, I wasn't able to copy/paste and I indeed put a space in between. The next problem I encounter is a memory one.

Code:

rangejoin age -`age_window' `age_window' using `males'
  (using rangestat version 1.1.1)
op. sys. refuses to provide memory

A whole different problem...
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#11

06 May 2019, 10:45

That's a more difficult problem to solve.

Apparently you're working with a very large data set (or your computer has very little memory). Given the nature of the matching you're trying to do, there isn't a natural way to split the data set into smaller segments and then match each of them separately and then put the results back together at the end. (That would work if your data set encompassed many different diagnoses and you were matching on the diagnosis: then you could do one diagnosis at a time.)

So there are a few things that may help. First, shut down all other applications on your computer before you run this: they compete with Stata for memory. Resist the temptation to browse the web while you're waiting for this to run: that, too, will take up memory resources that Stata may need.

Also, assuming your real data set contains more variables than just patient id, sex, creat, age, and ntprocobas, drop all of the other variables, drop observations with any missing values on these variables, then run -compress- and then try again.

Code:

// TRIM DOWN THE DATA SET
keep patient_id sex creat age ntprocobas
drop if missing(patient_id, sex, creat, age, ntprocobas)
compress

That might shrink the data set enough for you to get through the matching. Then once you've got the matched pairs, you can bring back the other variables by -merge-ing.
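
A minimal sketch of that -merge- step on toy data follows. The patient values are made up, and the tempfiles full and full_males stand in for your real full data set (assumed to have one observation per patient_id). The male side is merged m:1 because, with the matching code above, the same male can be matched to more than one female.

```stata
// Toy "full" data set: values are made up for illustration
clear
input long patient_id byte sex double CA125UmL
1 1 49.4
2 1 19.4
3 0 21.3
4 0 66.7
end
tempfile full
save `full'

// A copy of the males keyed on patient_id_U, with _U suffixes
keep if sex == 0
drop sex
rename (patient_id CA125UmL) (patient_id_U CA125UmL_U)
tempfile full_males
save `full_males'

// Toy matched pairs, in the shape the matching code leaves behind
clear
input long(patient_id patient_id_U)
1 3
2 4
end

// Bring back the female's variables, then the matched male's
merge 1:1 patient_id using `full', keep(match) nogenerate
merge m:1 patient_id_U using `full_males', keep(match) nogenerate

list patient_id CA125UmL patient_id_U CA125UmL_U
```

Each row then shows a female's oncomarker value next to that of her matched male.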

If that doesn't work, you might benefit by changing the order in which the conditions for the three match variables are imposed. I just chose to do age first (with -rangejoin-) arbitrarily: I didn't anticipate you would be up against memory limits. But if you can figure out which of the matching criteria is most difficult to satisfy (i.e. eliminates the most potential matches) and do that one first, if it's sufficiently stringent, that might bring the memory burden down to what your computer can handle. The first match must always be done with -rangejoin-, and then the other two variables are handled with -keep if- commands. It is during and immediately after the -rangejoin- that the memory requirements are greatest: they are roughly proportional to the square of the size of your data set. Once you make it through -rangejoin- the memory requirements only shrink from there.
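
To get a rough feel for the scale before running -rangejoin-, one option is to compute the worst case, in which every female pairs with every male. The sketch below does that on a made-up five-observation data set; it assumes sex is coded 0 = male and 1 = female, as in the example data, and the actual number of pairs kept by the join will be smaller, depending on the window.

```stata
// Toy data: sex coded 0 = male, 1 = female, as in the example data
clear
input byte sex
1
1
0
0
0
end

quietly count if sex == 1
local n_f = r(N)
quietly count if sex == 0
local n_m = r(N)

// Worst case: every female is paired with every male
display "females: `n_f'  males: `n_m'  worst-case pairs: " `n_f' * `n_m'
```

On real data, multiplying that worst-case count by the number of bytes per joined observation gives a crude upper bound on what the join could need.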

If doing all of the above still leaves you with inadequate memory, then I think you will have to find somebody who has a computer with more memory and Stata loaded on it who will let you run it there.
Joliene Post

Join Date: May 2019

Posts: 7
#12

08 May 2019, 01:48

It all worked! I used a different computer and trimmed down the dataset. It is really big, indeed.
Thank you so much. Now that the code you gave me has run, how can I visualize the results?
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#13

08 May 2019, 13:14

I think you need to be a little more specific about just what results you want to visualize. You now have a matched-pair data set, and there are many things you might wish to look at.