  • imputing missing values

    Hello,

    I'm using Stata 12.1. My dataset has the variables net income and gross income. Net income has missing values (coded ".a"). I would like to replace these missing values using the relationship between gross and net income. I am using the mi impute command. My questions are:
    1) Which mi impute method should I use (regress, pmm, truncreg)?
    2) How do I decide which iteration of the imputation process I should use?

    I hope someone can help me.

  • #2
    Clarification: My objective is to have a variable netincome_imp that contains all non-missing values of netincome and generated values for the cases that have ".a" in netincome.


    • #3
      My objective is to have a variable netincome_imp that contains all non-missing values of netincome and generated values for the cases that have ".a" in netincome.
      Do not do that. Just follow Stata's mi approach: mi set your dataset, mi register your net income variable as imputed, and mi impute the missing values.

      Before you do this, you need to recode the observations with the extended missing value .a to the system missing value (.), because extended missing values are treated as "hard missing" and are not imputed (a feature that I very much like about mi).
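The steps described above could look roughly like the following. This is only a sketch: the variable names netincome and grossincome follow the thread, while the wide style, the number of imputations in add(), and the knn() value are assumptions chosen for illustration.

```stata
* Sketch only: variable names follow the thread; style and option values are assumptions

* recode the extended missing value .a to system missing so it can be imputed
replace netincome = . if netincome == .a

* declare the mi structure and register the variables
mi set wide
mi register imputed netincome
mi register regular grossincome

* predictive mean matching; add(20) and knn(5) are illustrative choices
mi impute pmm netincome grossincome, add(20) knn(5) rseed(12345)
```

Any other extended missing codes left in netincome would still count as hard missing and stay unimputed.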

      1) Which mi impute option do I use? (regress, pmm, truncreg)
      The model you choose is more or less a matter of choice, but I would be interested in any empirical evidence on which method works best under which circumstances and why. Personally, I prefer using pmm for continuous variables.

      2) How do I decide which iteration of the imputation process I should use?
      If you are referring to the missing-value pattern in your data: as far as I understand, you have missing values on only one variable and one fully observed variable. Therefore your pattern is monotone. But you do not need to worry about this too much, as Stata is smart enough to apply the appropriate method.

      Best
      Daniel


      • #4
        Thank you for your answers.
        I have imputed the missing values in netincome with the command
        mi impute pmm netincome grossincome, add(1) rseed(12345)
        Is this the best way to achieve my objective? Assuming that the values in the original variable were missing at random, can I use the new variable _1_netincome_imp instead of netincome in further analyses of income?


        • #5
          I discovered the command ipolate
          ipolate netincome grossincome, gen(netincome_new)
          This procedure yields results that seem to make more sense. The downside is that the command also generates calculated values for the cases that do not have missing values. Hence the desired new netincome variable has to be built from netincome (for non-missing cases) and netincome_new (for cases with missing values).
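Combining the two variables as described could be sketched as follows. The name netincome_imp is taken from the objective stated earlier in the thread; the rest is an illustrative assumption, not a recommendation of interpolation over mi.

```stata
* Sketch only: linear interpolation of netincome on grossincome
ipolate netincome grossincome, generate(netincome_new)

* keep the observed value where it exists, use the interpolated value otherwise
* (missing() is true for . as well as for extended codes such as .a)
generate netincome_imp = cond(missing(netincome), netincome_new, netincome)
```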

          My question: Which method do you consider more appropriate for the purpose I described?


          • #6
            MI is generally preferable. Its statistical properties are known, the standard errors are correct, and it can handle cases where, for example, 2003 was a normal year, 2004 saw dramatic changes leading to a higher value on a certain variable (but that variable was missing), and 2005 was a normal year again. Ipolate simply draws a straight line between the neighboring observed points. With ipolate, you really are simply "making up data" you don't have, whereas with MI, your parameter estimates and standard errors take the uncertainty into account.

            That said, in small doses, ipolate can be handy if you can make an argument for a linear trend. Standard errors would still be too small, though, because the interpolated values are treated as if they had been observed.


            • #7
              The core idea behind multiple imputation is to create more than one imputed value. That way the procedure reflects our uncertainty about the "true" values. By adding just one imputation (or using interpolation) we treat the imputed values as fixed/known. As Ben mentioned, this might suffice if we are interested in point estimates only, but the estimated standard errors will be wrong.
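The standard pooling formulas (Rubin's rules) show why a single imputation is not enough. The notation below is introduced here for illustration and does not come from the thread: M is the number of imputations, and \hat{Q}_m and \hat{U}_m are the point estimate and its squared standard error from the m-th completed dataset.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
W = \frac{1}{M}\sum_{m=1}^{M}\hat{U}_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m - \bar{Q}\bigr)^2, \qquad
T = W + \Bigl(1 + \frac{1}{M}\Bigr)B
```

The pooled standard error is \sqrt{T}. With M = 1 the between-imputation variance B cannot be estimated, so the extra uncertainty from the missing values is left out of the standard error.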

              Best
              Daniel


              • #8
                I will use the mi command. However, I need to have a single value for each case, because the variable will be used in further calculations. Do you think I should stick with one imputation, or should I use 10 imputations and calculate the mean of the ten imputed variables?
                By the way: I forgot to mention that my data is cross-sectional, it is not a time series.


                • #9
                  However I need to have a single value for each case, because the variable will be used in further calculations.
                  No, you most likely do not need a single value to do further calculations. If that were the case, the whole mi estimate suite would be pointless.

                  I suggest you read (or re-read) mi intro and follow the advice given there. If you have further questions, it would probably help to describe in more detail exactly which calculations you intend to do.
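As a rough illustration of working without a single imputed value: derived variables can be created within each imputed dataset and the estimates pooled afterwards. This is only a sketch. It assumes the data are already mi set with several imputations of netincome, and the derived variable deductions is a hypothetical name used purely as an example.

```stata
* Sketch only: assumes the data are mi set and netincome has been imputed
* several times (for example with add(20))

* create a derived variable separately in each imputed dataset;
* "deductions" is a hypothetical name used for illustration
mi passive: generate deductions = grossincome - netincome

* estimate on every imputed dataset and pool the results
mi estimate: mean netincome
mi estimate: mean deductions
```

Averaging the imputed values into one variable first would discard the variation between imputations that mi estimate uses for the standard errors.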

                  Best
                  Daniel


                  • #10
                    Thank you for your advice.
                    I am using the 2012 wave of the German Socio-Economic Panel, and I am trying to compare different poverty concepts. I have now decided to use the generated employment income variables that come with the dataset instead of doing a multiple imputation myself. I will address the uncertainty of the imputation procedures that the authors used by running my calculations/simulations twice: once with the generated variables and once with only the cases that have no missing values in the original data.
