Dear Statalist,
I am using Stata 13 on macOS High Sierra. I have a dataset of 11,090 radon measurements conducted in 6,525 homes over a 25-year period, and my intention is to use a form of regress to estimate the reduction in household radon levels achieved by radon-resistant new construction (RRNC) techniques, which are present in some homes and not in others. In my dataset, RRNC is a dichotomous variable, with 1 indicating that an observation occurred in a home with RRNC and 0 indicating that it occurred in a home without RRNC. Additionally, I have used categorical variables to indicate whether the test occurred in the lowest level of the home, in an apartment, and during the summer, as well as the geological unit on which the tested home is located.
The unit of measurement for my dependent variable, household radon level, is picocuries per liter (pCi/L). The radon measurements in my dataset range from 0.1 pCi/L to 1037 pCi/L and are approximately lognormally distributed. For this reason, I am using the natural log of the radon measurements as my dependent variable.
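A minimal sketch of that transformation, assuming the raw reading is stored in a variable named measurement as in the data sample in this post. The name lnaverage_measure_value matches the tobit command in this post, but how that variable was actually created is not shown, so this is an assumption.

```stata
* Assumes the raw radon reading (pCi/L) is stored in measurement
generate double lnaverage_measure_value = ln(measurement)

* Check the transformed range: ln(0.1) is about -2.30 and ln(1037) is about 6.94
summarize lnaverage_measure_value, detail
```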
However, my analysis is complicated by a cutoff at 0.5 pCi/L that exists for most, but not all, of my observations, depending on whether the test device used to record the radon measurement was analyzed in a laboratory.
Among tests analyzed in a lab, there are no measurements below 0.5 pCi/L. Among tests not analyzed in a lab, there is a substantial spike at 0.5 pCi/L, but there are also 115 measurements (~1% of the dataset) reported below 0.5 pCi/L. In effect, I appear to have partially censored data. This difference between radon measurements analyzed in a lab and those that were not also exists in the overall dataset of ~2 million radon measurements from which my data were drawn.
To address this issue, my initial approach was to use tobit with the natural log of 0.5 pCi/L (-.69314718) as the left-censoring limit (censoring from below). However, I am unsure whether tobit is appropriate when the data are not completely censored, and I am having difficulty finding instances of this in the literature. Is my current approach to tobit appropriate in this instance, or do I need to modify it or take a different approach altogether? I have included histograms of my data, my current commands for tobit, and a sample from my data (altered due to IRB requirements) below. Thank you.
tobit lnaverage_measure_value RRNC Lowest_Level Apartment Summer Epler_Formation Felsic_to_mafic_gneiss Franklin_Marble Hardyston_Formation Hornblende_gneiss Jacksonburg_Formation Leithsville_Formation Martinsburg_Formation Rickenbach_Formation Unknown_Formation, ll(-.69314718)
test_id house_id measurement unit RRNC year
49925514 3680 18 pCi/L 0 2004
43502316 4340 9 pCi/L 0 2002
86108415 8727 7 pCi/L 1 1996
76108415 8727 4 pCi/L 1 1996
62573215 6976 5 pCi/L 0 2001
52092800 6984 1 pCi/L 0 1998
32092800 6984 0.5 pCi/L 0 1998
20168132 2985 0.5 pCi/L 1 2011
47979800 2986 0.5 pCi/L 0 2004
57979800 2986 0.5 pCi/L 0 2004
80781206 2987 6 pCi/L 0 2001
90402602 2992 2 pCi/L 0 1993
16515919 2994 3 pCi/L 0 2010
24050423 2995 0.6 pCi/L 0 2006
30412813 2997 4 pCi/L 0 2010
60633801 3999 1 pCi/L 0 2010
58617622 3999 2 pCi/L 0 2004
65555623 4001 1 pCi/L 1 2008
37882016 4001 0.5 pCi/L 1 2000
59905916 2002 3.5 pCi/L 0 2003
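
For context on the tobit command above: as I understand it, tobit with ll(#) treats every observation whose dependent variable is at or below # as left-censored, so the 115 readings below 0.5 pCi/L would be handled as censored at the limit rather than as observed values. A minimal sketch for checking the limit and counting observations on either side of it follows. The variable measurement is taken from the sample above; lab_analyzed is a hypothetical 0/1 indicator for laboratory analysis that does not appear in my sample.

```stata
* Sketch only: measurement is the raw reading (pCi/L) from the sample above;
* lab_analyzed is a hypothetical 0/1 indicator, not a variable shown in this post

* Confirm the value passed to ll()
display %12.8f ln(0.5)

* Classify each reading relative to the 0.5 pCi/L cutoff
generate byte cutoff_group = cond(measurement < 0.5, 1, cond(measurement == 0.5, 2, 3)) if !missing(measurement)
label define cutoff_lbl 1 "below 0.5" 2 "at 0.5" 3 "above 0.5"
label values cutoff_group cutoff_lbl

* Cross-tabulate against lab status to see where the spike and the sub-0.5 readings fall
tabulate cutoff_group lab_analyzed
```

The cross-tabulation should show an empty "below 0.5" cell for lab-analyzed tests if the cutoff applies to them as described.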

