Hello all,
I originally posted this question in response to an old thread, but I thought it would be better to post it as a new one.
My question pertains to the National Health Interview Survey (NHIS), a large, publicly available US health survey (https://www.cdc.gov/nchs/nhis/nhis_2...ta_release.htm).
A great deal of income information is missing in the main dataset. However, NHIS makes imputed income available in the form of five separate downloadable datasets (m=1, m=2, ... m=5). Notably, there is no "original" income variable: the imputed datasets appear to rely on exact reported income information that is not available in the original dataset. I have a few questions about how these data can be used in Stata, and I'd like to know whether the following steps I've taken sound reasonable.
1-) First, I re-created the original income variable. I did this using a variable in the imputed datasets that indicates whether income for a given individual was imputed or reported. If it was reported, I set the original income variable equal to income in the first imputed file (a reported value should be the same across all five imputations); if it was imputed, I left it missing.
2-) Next, I renamed the income variable in each imputed dataset as income1, income2, income3, ... income5, one for each of the five imputations. I then merged my original file and the five imputed files, producing a single "wide" format dataset that includes "income" (the re-created original variable, with missing values) and "income1", "income2", "income3", ... "income5", representing the five imputations.
3-) I did the same thing for a second variable in the imputed datasets that is derived from income: income as a percentage of the federal poverty level, which I'll call "povertyratio."
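To make steps 1 to 3 concrete, here is a rough sketch of what I did. All file and variable names are placeholders I made up for this post (incimp1.dta to incimp5.dta, main.dta, id, faminc, povrat, impflag), not the actual NHIS names, and I am assuming one record per id in every file.

```stata
* Placeholder names throughout: incimp1.dta ... incimp5.dta (imputed files),
* main.dta (original file), id (person identifier), faminc (imputed income),
* povrat (imputed poverty ratio), impflag (1 = imputed, 0 = reported).

forvalues m = 1/5 {
    use incimp`m', clear
    * Steps 2-3: suffix the imputed variables with the imputation number
    rename faminc income`m'
    rename povrat povertyratio`m'
    if `m' == 1 {
        * Step 1: keep reported values only; imputed cases stay missing
        generate income = income1 if impflag == 0
        generate povertyratio = povertyratio1 if impflag == 0
        keep id income povertyratio income1 povertyratio1
    }
    else {
        keep id income`m' povertyratio`m'
    }
    save imp_m`m', replace
}

* Merge the original file with the five prepared files into one wide dataset
use main, clear
forvalues m = 1/5 {
    merge 1:1 id using imp_m`m', nogenerate
}
```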
4-) I wanted to classify all observations by povertyratio, and for the sake of simplicity I'll just say I wanted a variable to indicate whether each individual was poor or not poor. So basically something like:
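generate poor = .
// Stata treats missing as larger than any number, so exclude it explicitly
replace poor = 0 if povertyratio >= 1 & !missing(povertyratio)
replace poor = 1 if povertyratio < 1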
I did this for the original dataset and for each imputation, thus producing six variables: "poor" (m=0), "poor1" (m=1) ... "poor5" (m=5).
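In practice I did this in a loop, roughly as sketched below. It assumes the wide variables are named povertyratio and povertyratio1 to povertyratio5 as in step 3, and the "if !missing()" condition leaves poor missing wherever the poverty ratio is missing rather than coding it as 0.

```stata
* m=0: classification based on the re-created original variable
generate byte poor = povertyratio < 1 if !missing(povertyratio)

* m=1 ... m=5: the same classification within each imputation
forvalues m = 1/5 {
    generate byte poor`m' = povertyratio`m' < 1 if !missing(povertyratio`m')
}
```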
5-) I then used "mi import" and imported the dataset as multiply imputed "wide format" data. I registered "poor" as a passive variable, and "income" and "povertyratio" as imputed variables.
My questions are twofold:
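The import command was along these lines. This is a sketch using the variable names from the steps above; the drop option removes the suffixed originals once mi has created its own copies.

```stata
* Declare the wide dataset as mi data: income and povertyratio are imputed,
* poor is passive (a function of an imputed variable)
mi import wide, ///
    imputed(income=income1 income2 income3 income4 income5 ///
            povertyratio=povertyratio1 povertyratio2 povertyratio3 povertyratio4 povertyratio5) ///
    passive(poor=poor1 poor2 poor3 poor4 poor5) drop

* Check that the style, M = 5, and the registered variables look right
mi describe
```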
A-) Is the above approach reasonable/sound?
B-) Once I do steps 1-5, is it appropriate to use my new passive variable, "poor," like I would any other variable? In regressions, can I use it in an interaction term, for instance? Can I use it to define subpopulations for regressions or other procedures? I should add that these are complex survey data.
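To be specific about question B, these are the kinds of commands I have in mind. The design variables (psu, stratum, wt), the outcome y, and the covariate age are placeholder names, not the actual NHIS variables. I can write these commands, but I do not know whether they are statistically appropriate, in particular the subpopulation version, where membership can differ across imputations.

```stata
* Placeholder names: psu, stratum, wt (design variables), y (outcome), age
mi svyset psu [pweight = wt], strata(stratum)

* Passive variable used in an interaction term
mi estimate: svy: logit y i.poor##c.age

* Passive variable used to define a subpopulation
mi estimate: svy, subpop(poor): logit y c.age
```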
I hope this has been clear. I would be extremely appreciative if anyone can offer any words of advice.
Best,
Adam