Tobit Regression With Partially Censored Data

Will Bellamy

Join Date: Jan 2018

Posts: 6
#1

Tobit Regression With Partially Censored Data

06 Feb 2018, 13:49

Dear Statalist,
I am using Stata 13 on MacOS High Sierra. I have a dataset of 11,090 radon measurements conducted in 6,525 homes over a 25-year period, and my intention is to use a form of regression to estimate the reduction in household radon levels achieved by radon-resistant new construction (RRNC) techniques, which are present in some homes and not in others. In my dataset, RRNC is represented as a dichotomous variable, with 1 indicating that an observation occurred in a home with RRNC and 0 indicating that an observation occurred in a home without RRNC. Additionally, I have used categorical variables to indicate whether or not the tests occurred in the lowest level of the home, in an apartment, and during the summer, as well as the geological unit on which the tested home is located.
The unit of measurement for my dependent variable, household radon level, is picocuries per liter (pCi/L) and the radon measurements in my dataset range from 0.1 pCi/L to 1037 pCi/L and take on an approximately lognormal distribution. For this reason, I am using the natural log of the radon measurements as my dependent variable.
However, my analysis is complicated by a cutoff at 0.5 pCi/L that exists for most, but not all of my observations, depending on whether or not the test device used to record the radon measurement was analyzed in a laboratory.
Among tests analyzed in a lab there are no measurements less than 0.5 pCi/L while among tests not analyzed in a lab there is a substantial spike at 0.5 pCi/L but 115 measurements (~1% of the dataset) with reported measurements less than 0.5 pCi/L. In effect I appear to have partially censored data. This difference between radon measurements analyzed in a lab and those that were not also exists in the overall dataset of ~2 million radon measurements from which my data was drawn.
My initial approach was to use tobit with the natural log of 0.5 (-.69314718) pCi/L as the value of the limit for left-censoring or censoring from below to address this issue, but I am unsure as to whether tobit is appropriate without completely censored data and I am having difficulty finding instances of this in the literature. Is my current approach to tobit appropriate in this instance or do I need to modify it or even take a different approach altogether? I have included histograms of my data, my current commands for tobit, and a sample from my data (altered due to IRB requirements) below. Thank you.

tobit lnaverage_measure_value RRNC Lowest_Level Apartment Summer Epler_Formation Felsic_to_mafic_gneiss Franklin_Marble Hardyston_Formation Hornblende_gneiss Jacksonburg_Formation Leithsville_Formation Martinsburg_Formation Rickenbach_Formation Unknown_Formation, ll(-.69314718)

test_id house_id measurement unit RRNC year
49925514 3680 18 pCi/L 0 2004
43502316 4340 9 pCi/L 0 2002
86108415 8727 7 pCi/L 1 1996
76108415 8727 4 pCi/L 1 1996
62573215 6976 5 pCi/L 0 2001
52092800 6984 1 pCi/L 0 1998
32092800 6984 0.5 pCi/L 0 1998
20168132 2985 0.5 pCi/L 1 2011
47979800 2986 0.5 pCi/L 0 2004
57979800 2986 0.5 pCi/L 0 2004
80781206 2987 6 pCi/L 0 2001
90402602 2992 2 pCi/L 0 1993
16515919 2994 3 pCi/L 0 2010
24050423 2995 0.6 pCi/L 0 2006
30412813 2997 4 pCi/L 0 2010
60633801 3999 1 pCi/L 0 2010
58617622 3999 2 pCi/L 0 2004
65555623 4001 1 pCi/L 1 2008
37882016 4001 0.5 pCi/L 1 2000
59905916 2002 3.5 pCi/L 0 2003
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

06 Feb 2018, 15:12

I think -tobit- is fine for what you want to do, but not quite the way you have specified it. By specifying -ll(-.69314718)- you are telling -tobit- that any observation for which the dependent variable is <= -.69314718 is a censored observation. But according to your problem description, only some of those are censored and the ones from a lab are actual values. So if I understand that correctly, you need to do the -tobit- differently. -tobit- allows you to express the -ll()- option as the name of a variable. That variable gives the lower limit at which each observation would be censored. So I would create a new variable which takes on the value -.69314718 for those observations which are censored, and some very large negative value (much lower than anything that actually appears in the data) for those observations which are not censored, and then specify that variable in the -ll()- option. That way -tobit- will know that a value of the measurement that is < -.69314718 but is real because it comes from a lab should be treated as an uncensored observation.
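A minimal sketch of the per-observation limit described above, run on made-up data so that it is self-contained. Everything here is an assumption for illustration: the indicator floor_applies (1 if a reading is subject to the 0.5 pCi/L floor), the variable names lnradon and lower_limit, the value -1e6, and the simulated coefficients. It also assumes a -tobit- that accepts a variable name in -ll()-. I believe that arrived in releases later than Stata 13, so check -help tobit- on your own installation before relying on it.

```stata
* Hypothetical toy data: all names and values are made up for illustration
clear
set seed 12345
set obs 500
generate byte RRNC = runiform() < 0.3
* Assumed indicator: 1 if the reading is subject to the 0.5 pCi/L floor
generate byte floor_applies = runiform() < 0.8
generate double lnradon = 0.5 - 0.4*RRNC + rnormal()
* Impose the floor only where it applies
replace lnradon = ln(0.5) if floor_applies & lnradon < ln(0.5)

* Per-observation lower limit: ln(0.5) where the floor applies, otherwise
* a value far below anything in the data, so the reading counts as uncensored
generate double lower_limit = cond(floor_applies, ln(0.5), -1e6)

* Assumes this version of tobit accepts a variable name in ll()
tobit lnradon RRNC, ll(lower_limit)
```

In the real data, lnradon would be lnaverage_measure_value and the full covariate list from #1 would replace the single RRNC regressor.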

Another alternative to consider is the -intreg- command. -intreg- is a generalization of tobit. You don't actually need the generality it provides for this application, but I find its way of expressing censored and non-censored observations more intuitive. YMMV.
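For readers who want to see why the two commands agree here, the log likelihood that both maximize under mixed censoring can be sketched as follows. The notation is mine, not from the thread: y_i is the log radon reading, x_i the covariates, c_i the censoring limit for observation i (ln 0.5 in this application), and \phi and \Phi the standard normal density and distribution function.

```latex
\[
\ln L(\beta,\sigma)
  = \sum_{i \,\in\, \text{uncensored}}
      \ln\!\left[\frac{1}{\sigma}\,
      \phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]
  + \sum_{i \,\in\, \text{left-censored}}
      \ln \Phi\!\left(\frac{c_i - x_i'\beta}{\sigma}\right)
\]
```

A reading below the limit that is a real measurement belongs in the first sum. Marking it as censored would move it to the second sum, and that is the misclassification the per-observation limit avoids.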
Will Bellamy

Join Date: Jan 2018

Posts: 6
#3

27 Feb 2018, 10:00

Dear Dr. Schechter,
Thank you for your response. When I used intreg, I was able to obtain a model that accomplished my goal of treating only measurements of 0.5 pCi/L as censored, but I am trying to determine if my approach was appropriate. For the radon measurements of exactly 0.5 pCi/L, I set the lower limit (llevel) at 0.001 pCi/L and the upper limit (ulevel) at 0.5 pCi/L to reflect a range of possible values for these censored measurements with the censoring from below. For all other measurements, I set both the lower and upper limits at the original value, so if the measurement was 10 pCi/L, both the lower and upper limits are 10 pCi/L. I log transformed the lower (lnllevel) and upper limit (lnulevel) variables and then ran intreg, which produced almost identical results to my original tobit model, except that only measurements of 0.5 pCi/L and not measurements below this value were treated as censored. From your perspective, is my use of intreg appropriate in this context? Thank you. I have pasted my commands below.

* Interval bounds on the original pCi/L scale: point values by default
generate llevel = average_measure_value
generate ulevel = average_measure_value
* Readings of exactly 0.5 pCi/L are treated as censored within [0.001, 0.5]
replace llevel = 0.001 if average_measure_value == 0.5
* Log-transform both bounds for use as the two dependent variables of intreg
generate lnllevel = ln(llevel)
generate lnulevel = ln(ulevel)
intreg lnllevel lnulevel RRNC Lowest_Level Apartment Summer Epler_Formation Felsic_to_mafic_gneiss Franklin_Marble Hardyston_Formation Hornblende_gneiss Jacksonburg_Formation Leithsville_Formation Martinsburg_Formation Rickenbach_Formation Unknown_Formation
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#4

27 Feb 2018, 10:22

This looks about right. The one thing I question is the lower limit of 0.001 for the censored values. Is it really the case that the true value cannot be less than that? In particular, that it cannot be zero? Color me skeptical. I suspect you were motivated to do this, in part, because you cannot take the log of zero and you planned to transform everything to logs. But, in fact, this is not really a problem for -intreg-. If you were to set llevel = 0 for the censored observations, lnllevel would become missing for those observations. -intreg- would then interpret the observation as being left censored: the actual value of the measure is between . (interpreted in this case as negative infinity) and lnulevel (= log(0.5)). And that would be the correct modeling. So unless that 0.001 limit is real, I would re-do this using llevel = 0 for the censored observations. (That said, I doubt the results will change very much.)
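A minimal sketch of that change, again on made-up data so that it runs on its own. The indicator floor_applies, the simulated values, and the single-regressor model are assumptions for illustration. The bound variable names follow #3.

```stata
* Hypothetical toy data on the original pCi/L scale (made up for illustration)
clear
set seed 2018
set obs 500
generate byte RRNC = runiform() < 0.3
* Assumed indicator: 1 if the reading is subject to the 0.5 pCi/L floor
generate byte floor_applies = runiform() < 0.8
generate double average_measure_value = exp(0.5 - 0.4*RRNC + rnormal())
* Where the floor applies, readings below 0.5 are reported as exactly 0.5
replace average_measure_value = 0.5 if floor_applies & average_measure_value < 0.5

* Bounds on the original scale; censored readings get a lower bound of 0
generate double llevel = cond(average_measure_value == 0.5, 0, average_measure_value)
generate double ulevel = average_measure_value

* ln(0) is missing in Stata, so censored rows get a missing lower bound
generate double lnllevel = ln(llevel)
generate double lnulevel = ln(ulevel)

* A missing lower bound with a nonmissing upper bound is read as left-censored
intreg lnllevel lnulevel RRNC
```

Readings below 0.5 pCi/L that are real measurements keep equal lower and upper bounds, so -intreg- treats them as uncensored point observations.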

Finally, I am not commenting on your variable selection in the -intreg- command as I have no information about or expertise in these matters. I'll take it for granted that you know what you're doing there.