  • Analyzing multiply imputed data in Stata

    Hello all,

    I posted this question in response to an old thread but I thought I should probably just post it as a new one.

    My question pertains to the National Health Interview Survey, a large US publicly-available health survey (https://www.cdc.gov/nchs/nhis/nhis_2...ta_release.htm).

    There is a great deal of missing income information in the main dataset. However, NHIS makes imputed income available in the form of five separate downloadable datasets (m = 1, m = 2, ... m = 5). Notably, there is no "original" variable for income: the imputed datasets appear to rely on exact reported income information that is not available in the original dataset. I have a few questions about how these data can be used in Stata, and I'd be curious whether the following steps I've taken sound reasonable.

    1-) First, I re-created the original income variable. I did this using a variable in the imputed datasets that indicates whether income for a given individual was imputed or reported. If it was reported, I set the original income variable equal to income in the first imputed file (a reported value should be identical across all five imputations); otherwise I left it missing.

    2-) Next, I renamed the income variable in each imputed dataset as income1, income2, income3, ... income5, one for each of the five imputations. I then merged my original file and the five imputed files, producing a single "wide" format dataset that includes "income" (the re-created original variable, with missing values) and "income1", "income2", "income3", ... "income5", representing the five imputations.

    3-) I did the same for another variable in the imputed datasets that is derived from income: income as a percentage of the federal poverty level, which I'll call "povertyratio."

    4-) I wanted to classify all observations by povertyratio, and for the sake of simplicity I'll just say I wanted a variable to indicate whether each individual was poor or not poor. So basically something like:
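    * Stata treats missing as larger than any number, so exclude it explicitly
    generate poor = .
    replace poor = 0 if povertyratio >= 1 & !missing(povertyratio)
    replace poor = 1 if povertyratio < 1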

    I did this for the original dataset and for each imputation, thus producing six variables: "poor" (m=0), "poor1" (m=1) ... "poor5" (m=5).

    5-) I then used "mi import" to import the dataset as multiply imputed "wide format" data. I registered "poor" as a passive variable, and "income" and "povertyratio" as imputed variables.
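
    A minimal sketch of what step 5 might look like, assuming the merged wide file is in memory and uses the variable names from steps 1-4 (povertyratio1 ... povertyratio5 are assumed by analogy with income1 ... income5). This is an illustration, not necessarily the exact command used:

    ```stata
    * Sketch only: variable names follow the description in steps 1-4.
    * imputed() and passive() map each m=0 variable to its five imputed copies.
    mi import wide, ///
        imputed(income = income1 income2 income3 income4 income5 ///
                povertyratio = povertyratio1 povertyratio2 povertyratio3 ///
                               povertyratio4 povertyratio5) ///
        passive(poor = poor1 poor2 poor3 poor4 poor5) ///
        drop

    * Check that M = 5 and that the variables are registered as intended
    mi describe
    ```

    An alternative to building poor1 ... poor5 by hand would be to import only the imputed variables and then create the passive variable with something like "mi passive: generate byte poor = povertyratio < 1 if !missing(povertyratio)".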

    My questions are twofold:
    A-) Is the above approach reasonable/sound?
    B-) Once I have done steps 1-5, is it appropriate to use my new passive variable, "poor," like I would any other variable? In regressions, can I use it in an interaction term, for instance? Can I use it to define subpopulations for regressions or other procedures? I should add that it is complex survey data.

    I hope this has been clear. I would be extremely appreciative if anyone can offer any words of advice.

    Best,

    Adam


  • #2
    With regard to B, passive imputation tends to be looked down on. See pp. 10-11 of

    https://www3.nd.edu/~rwilliam/xsoc73994/MD02.pdf
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 18.5 MP (2 processor)

    WWW: https://www3.nd.edu/~rwilliam


    • #3
      Thanks so much Richard, I just had a look at this. But does it matter that only one of the two terms in the interaction is imputed, versus the example you cite on pages 10-11 where both are? Basically, let's say I want a binary variable that designates each individual in the data set as poor vs. not poor, relying on imputed income so I don't lose 25% of my data or whatever. And then I want to use that binary variable in interactions with a binary variable that reflects pre- and post-Affordable Care Act implementation. Or I just want to look at sample means for various other variables based on whether an individual is poor or not poor. Does that make sense?


      • #4
        My understanding is far from complete, but here goes. I hope Richard will correct any egregious misunderstanding.

        First, from the link you provided in post #1 we have

        Multiple imputation is a technique that allows analysts to incorporate the extra variability due to imputation into their analyses. This is accomplished by analyzing each of the five completed data sets separately using methods and software that are appropriate for survey data, and then combining the estimates and standard errors using the combining rules described in Section 2.2 and Appendix A of the document available via the Technical Documentation link below. The extra variability due to imputation cannot be incorporated by simply analyzing a single completed data set as if the imputed values were true values. Moreover, analysts should not create a single completed data set using the average of the five sets of imputed values. Examples of correct data analyses using SAS-callable SUDAAN and SAS-callable IVEware are provided in Section 4 of the document available via the Technical Documentation link below; the document also provides information on the procedures used to create the imputation

        Also, from the Technical Documentation linked to from that page, we have

        Suppose that the primary interest is in estimating a scalar population quantity, such as a mean, a proportion, or a regression coefficient. The analysis of the M completed data sets resulting from multiple imputation proceeds as follows:
        • Analyze each of the M completed data sets separately using a suitable software package designed for complete data (for example, SUDAAN or Stata).
        • Extract the point estimate and the estimated standard error from each analysis.
        • Combine the point estimates and the estimated standard errors to arrive at a single point estimate, its estimated standard error, and the associated confidence interval or significance test.

        This seems to contradict the approach you propose.
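
        For reference, the combining step in the quoted list is commonly written in the following form (Rubin's rules). The notation is an assumption for illustration and is not copied from the NHIS document: \hat{Q}_m is the point estimate and U_m the estimated variance (squared standard error) from completed data set m, with M = 5 here.

        ```latex
        \[
        \bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
        \bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
        B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m - \bar{Q}\bigr)^2
        \]
        \[
        T = \bar{U} + \Bigl(1 + \frac{1}{M}\Bigr) B, \qquad
        \mathrm{SE}(\bar{Q}) = \sqrt{T}
        \]
        ```

        Here \bar{Q} is the combined point estimate, \bar{U} the within-imputation variance, B the between-imputation variance, and T the total variance.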

        For an overview of Stata techniques for handling multiply imputed data, start with the documentation in the Stata Multiple-Imputation Reference Manual PDF included with your Stata installation and accessible through Stata's Help menu, and look particularly at the discussion around the mi import command.

        The bottom line is that analyzing multiply imputed data in a statistically rigorous fashion is going to take more effort than you planned on.


        • #5
          Thanks very much for your response. To be clear, and I probably wasn't, I am handling the multiply imputed data in that fashion. I'm just assembling it in a "wide" format (i.e. imputed values are different variables, as opposed to different observations), importing it into Stata's multiple-imputation suite with "mi import," and then performing all analyses using "mi estimate". So I believe that is consistent with the documentation and is the appropriate way to analyze multiply imputed data. My main concern was more with issue (B): the proper handling of those variables.
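
          A rough sketch of that workflow, assuming the data have already been imported with "mi import wide". The design variables psu, strata and wt, the outcome variable outcome, and the pre/post indicator post are hypothetical placeholder names, not actual NHIS variable names:

          ```stata
          * Sketch only: psu, strata, wt, outcome and post are placeholder names.
          * Declare the complex survey design for mi data
          mi svyset psu [pweight = wt], strata(strata)

          * Fit the model in each of the five completed data sets and combine
          * the results; poor takes its imputation-specific values in each fit
          mi estimate: svy: regress outcome i.poor##i.post
          ```

          Whether a passive variable such as poor is appropriate in an interaction or as a subpopulation indicator is the open question (B); the sketch shows only the mechanics.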


          • #6
            Adam Gaffney I wonder whether you were able to get the imputed income files to work - and what code you used...
            Thank you!


            • #7
              I stumbled on this old thread while looking for something related. I am not sure if anyone is still interested but, in any case, to answer the original question: if you get the NHIS data from IPUMS, rather than from the CDC website, the imputed income variables have already been merged together. So you just need a single download (after selecting all the relevant variables) rather than merging 5 different datasets. A good place to start is this page:

              https://nhis.ipums.org/nhis-action/v...conomic_income

              Then click on a variable and read its documentation.

              If one wants some variable that is in the NHIS but that is not available in IPUMS (there are few such variables), then there are instructions on how to merge the IPUMS data with the original NHIS data from the CDC website. See here:

              https://nhis.ipums.org/nhis/userNotes_links.shtml


              • #8
                Tommaso Tempesti I appreciate your contribution to this thread. Do you (or does anyone else) have specific Stata resources (with sample code if possible) for using the five multiply imputed income variables in the NHIS data from IPUMS? Also, do you know if there is anything else that I need to do when analyzing multiply imputed data pooled across multiple years? Thank you.


                • #9
                  I have the same question as Esther Lamidi. The NHIS provides five imputed income measures, but it is not clear how to use those in Stata. Any help is appreciated!
