How best to deal with data of lab. test results?

Dawn Tok

Join Date: Sep 2014

Posts: 16
#1

16 Sep 2014, 21:11

For example, hba1c test results are in numeric form. However, once a result falls below the normal range (4.4-6.4), it is reported simply as <4.3. Because of this, the variable is stored as string instead of numeric.

Any advice on how best to deal with such data?
Richard Williams

Join Date: Apr 2014

Posts: 4940
#2

16 Sep 2014, 21:31

This is very unclear to me. A listing of some data, code and output could help. How is hba1c coded when it is not below the normal range? What do you want to do with the variable -- is it an independent variable or a dependent variable?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#3

16 Sep 2014, 21:38

Just noticed this is a duplicate post. Richard Williams' response is pretty similar to my response to the other post. If the original poster responds, it would be best to respond to only one of these threads and close out the other.
Joseph Coveney

Join Date: Apr 2014

Posts: 4373
#4

16 Sep 2014, 22:43

Originally posted by Mytok

. . . data is stored as string instead of numeric.

Any advice on how best to deal with such data?

Apparently, your major concern is to populate a numeric variable with those string values that represent numeric values. For that, you can try something like the code below.

Code:

version 13.1

clear *
set more off

input str5 hba1c
"7.6"
"6.4"
"5.1"
"4.4"
"<4.3"
end

*
* Begin here
*
quietly generate double hba1c_n = real(hba1c)
format hba1c_n %3.1f

quietly replace hba1c_n = .l if trim(hba1c) == "<4.3"
label define HbA1c .l "<4.3"
label values hba1c_n HbA1c

// Based upon the purpose of the clinical laboratory test, I assume
// that the following two lines are unnecessary.
quietly replace hba1c_n = .u if trim(hba1c) == ">6.5"
label define HbA1c .u ">6.5", add

list, noobs

exit

As to how best to deal with the situation, as both Richard and Clyde have mentioned, it depends upon the purpose of your activity. You might not need to do much more than what's above in the do-file, or you might take advantage of one or more of Stata's estimation commands for this kind of data. Or you might end up needing to go back to the source and retrieve the actually measured values for those below-normal-range values that weren't reported.
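
Before converting, it may be worth checking which strings would not convert at all, so that codes other than "<4.3" are not silently turned into missing values. The following is only a sketch: the toy data follow the example above, and the extra value "n/a" and the variable name nonnumeric are invented for illustration.

```stata
clear
input str5 hba1c
"7.6"
"6.4"
"<4.3"
"n/a"
end

* real() returns missing (.) for any string that is not a valid number
generate byte nonnumeric = missing(real(hba1c))

* Distinct values that would need explicit handling before conversion
tabulate hba1c if nonnumeric

* Number of such observations ("<4.3" and the hypothetical "n/a")
count if nonnumeric
```

Any value that shows up in this table other than the expected "<4.3" would need its own rule (an extended missing value, a correction, or a query back to the lab).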
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#5

17 Sep 2014, 00:27

Mytok (please, as per the FAQ, re-register with your full name, surname included. Just click on the Contact us button and follow the instructions):
- if hba1c is your "censored from below" continuous dependent variable to be regressed on a set of predictors, you may want to take a look at - help tobit - and the related entry in the Stata 13.1 .pdf manual (especially Example 1).
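
A minimal sketch of what such a tobit model could look like, assuming the "<4.3" results have been stored at the censoring limit 4.3. The data are simulated, and the predictor age and the variable hba1c_true are hypothetical names that do not come from this thread.

```stata
clear
set seed 12345
set obs 200

* Hypothetical predictor and "true" HbA1c (simulated; not from the thread)
generate double age = 40 + 30*runiform()
generate double hba1c_true = 3.5 + 0.03*age + rnormal(0, 0.5)

* Mimic the lab report: results below 4.3 are only known to be "<4.3",
* so they are stored at the censoring limit itself
generate double hba1c_n = max(hba1c_true, 4.3)

* ll(4.3) declares 4.3 as the left-censoring limit
tobit hba1c_n age, ll(4.3)
```

The outcome is stored as double on purpose: in a float variable 4.3 is held with a small rounding error, which could keep the stored value from matching the limit given in ll().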

Kind regards,
Carlo

(StataNow 18.5)

Svend Juul

Join Date: Apr 2014
Posts: 515
#6

17 Sep 2014, 05:49

I agree that the optimal solution depends on the purpose. Here I suggest a solution where the low hba1c measurements get a not too unrealistic value (kind of simple imputation), which may be better than giving them a missing value:

Code:

 clear
input str5 hba1c
"7.6"
"6.4"
"5.1"
"4.4"
"<4.3"
end

destring hba1c , generate(hba1c_n) ignore("<")
recode hba1c_n (4.3=4)
label define hba1c_n 4 "<4.3"
label values hba1c_n hba1c_n

. list, nolabel
     +-----------------+
     | hba1c   hba1c_n |
     |-----------------|
  1. |   7.6       7.6 |
  2. |   6.4       6.4 |
  3. |   5.1       5.1 |
  4. |   4.4       4.4 |
  5. |  <4.3         4 |
     +-----------------+

. list
     +-----------------+
     | hba1c   hba1c_n |
     |-----------------|
  1. |   7.6       7.6 |
  2. |   6.4       6.4 |
  3. |   5.1       5.1 |
  4. |   4.4       4.4 |
  5. |  <4.3      <4.3 |
     +-----------------+

Rich Goldstein

Join Date: Mar 2014

Posts: 4438
#7

17 Sep 2014, 07:20

Yes, replacing such results with values below the limit of detection is in general what you want. However, I strongly recommend that you do a sensitivity analysis by repeating the analysis using different replacement values. You could even do this within a "missing data" framework by using MI (with a model that restricts the imputed values to be no higher than your limit).
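
A sensitivity analysis of this kind could be sketched as a loop over candidate replacement values. The toy data extend the earlier example, the replacement values (2.15, i.e. half the limit, 3, 4 and 4.3) are arbitrary choices for illustration, and the mean stands in for whatever analysis is actually of interest.

```stata
clear
input str5 hba1c
"7.6"
"6.4"
"5.1"
"4.4"
"<4.3"
"<4.3"
end

generate double hba1c_n = real(hba1c)
generate byte below = trim(hba1c) == "<4.3"

* Repeat a simple analysis (here, the mean) for several replacement values
foreach v of numlist 2.15 3 4 4.3 {
    quietly replace hba1c_n = `v' if below
    quietly summarize hba1c_n
    display "replacement = " %4.2f `v' "   mean hba1c = " %5.3f r(mean)
}
```

If the conclusions change noticeably across replacement values, that is a sign that a method which models the censoring explicitly may be preferable to simple substitution.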
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#8

17 Sep 2014, 19:52

A better procedure for handling observations below the limit of detection is to treat them as left-censored observations and analyze them with survival data programs. There are two approaches in Stata. The first is to use the ordinary Kaplan-Meier estimate on a reversed time scale. The second is to use Patrick Royston's stpm module, downloadable from SSC; it models left-censored data as a special case of interval censoring. See also Gillespie et al. (2010). I illustrate both below.

Now a personal note: Long-time Statalist etiquette, discussed in the FAQ, has been to register with full real names. This practice has promoted professionalism and friendship on the list, and you can see that it is followed by every responder to your question so far. I urge you to re-register with your real name to enjoy the full benefits of being on Statalist. Just use the Contact Us button on the bottom right of the page.

Code:

/* If below =1, the real observation fell below the limit
   of detection indicated by the corresponding value of x */
clear
input x below
0.5 1
1 0
1 0
2.3 1
3 0
4 0
5 0
5.4 1
6 0
7 0
11 0
12 0
end
label var below " Below LOD"
sum x
local xmax = r(max)

/* Reverse Values: make them positive starting at 1 */
gen rx = -x + `xmax' + 1
stset rx, fail(below=0) /* Note */
stsum

/* Get quantiles for original x */
di "v50 = "-r(p50) + `xmax' +1
di "v25 = "-r(p75) + `xmax' +1
di "v75 = "-r(p25) + `xmax' +1

/* Generate Cumulative distribution function for x
   This is found from the survival curve for rx */
sts gen cif1 = s
label var cif "Cumulative Distribution"

/* Compare to stpm */
stset x
gen leftv = _t
replace leftv =0 if below
replace _d=0 if below
stpm , left(leftv) scale(hazard) df(3)
predict cif2, failure
scatter cif1 cif2 x, sort(x) c(l l)

Reference: Gillespie, Brenda W, Qixuan Chen, Heidi Reichert, Alfred Franzblau, Elizabeth Hedgeman, James Lepkowski, Peter Adriaens, Avery Demond, William Luksemburg, and David H Garabrant. 2010. Estimating population distributions when some data are below a limit of detection by using a reverse Kaplan-Meier estimator. Epidemiology 21, no. 4 (Supplement): S64-S70.

Last edited by Steve Samuels; 17 Sep 2014, 20:26.

Steve Samuels
Statistical Consulting

Stata 14.2
Svend Juul

Join Date: Apr 2014

Posts: 515
#9

18 Sep 2014, 03:32

Steve's suggestion is probably optimal for describing a distribution when information about some of the observations is imperfect (however, predict does not allow the failure option). But I don't see how to use it for analyzing hba1c as an outcome or as a predictor of some outcome.

For hba1c (a measure of long-term regulation of blood glucose in diabetes) the low values are considered "normal", and hba1c as a predictor could be categorized with these values as the reference category. For hba1c as an outcome the solution is less obvious, I think. I would tend to use the method I suggested in post #6.
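
As a rough sketch of the categorization idea for hba1c as a predictor: the cut point 6.4 is taken from the normal range quoted in the original question and is only an assumption here, as are the variable name hba1c_cat and the two-category coding.

```stata
clear
input str5 hba1c
"7.6"
"6.4"
"5.1"
"4.4"
"<4.3"
end

generate double hba1c_n = real(hba1c)

* Category 0 (reference): "<4.3" and values up to the assumed cut point 6.4
generate byte hba1c_cat = 0 if trim(hba1c) == "<4.3" | (hba1c_n <= 6.4 & !missing(hba1c_n))

* Category 1: values above the assumed cut point
replace hba1c_cat = 1 if hba1c_n > 6.4 & !missing(hba1c_n)

label define hba1c_cat 0 "<=6.4 (incl. <4.3)" 1 ">6.4"
label values hba1c_cat hba1c_cat
tabulate hba1c_cat
```

With factor-variable notation (i.hba1c_cat) the lowest level, here 0, is the default base category, so the "<4.3" results end up in the reference group without needing a numeric value of their own.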

Svend
Md Bayzidur Rahman

Join Date: Jul 2018

Posts: 5
#10

11 Jul 2018, 18:23

Dear Steve,
I found your following post very useful.
I have a few questions, though:
Do you replace the censored observations from the variable cif1?

I can’t run the command “predict cif2, failure” after stpm; it doesn’t allow the failure option. When I run it without the failure option, it generates a flat variable with value -2.777787 for each observation. Am I doing something wrong?

Do you have any update on this topic?

Bayzid
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#11

11 Jul 2018, 19:54

I'm sorry to say that I have nothing more to add to this topic. I don't know why the predict statement isn't working for you and Svend. The code above worked fine for me just now in Stata 14.2, and "failure" is listed as a possible statistic for predict in the Help for stpm.

When stpm predicts for the values <1, it is extrapolating from the area of known data. Extrapolation like this is always dangerous, so I agree with the suggestions of trying different methods.

Steve Samuels
Statistical Consulting

Stata 14.2