  • Missing values while using fixed effects model



    I have household-level panel data from two time periods.

    I found out that a good portion of my households have missing values in some variables that I need to control for, in the second time period.

    In this case, given the fact that the fixed effects model focuses on the within variation, will it be flawed/pointless to only keep the first year values of those households (shaped 'long' in Stata) that have missing values in the second time period? Or, should I exclude those households altogether?

    Thanks!

  • #2
    There are no truly good solutions to the problem of missing data. One seeks to find the least bad approach for particular situations.

    You observe quite correctly that carrying forward the preceding non-missing value will be particularly problematic for estimating within variation and will bias coefficients towards 0.
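    A related point, on the other reading of the question, where the second-period rows are simply left out so that those households appear only once in the long data. With a single observation, a household's mean equals its only value, so its demeaned values are all zero and it adds nothing to the within estimator. For the slope estimates, keeping the first-year rows and excluding the households should therefore give the same answer. The sketch below checks this in Python rather than Stata, purely to make the arithmetic visible; the data and the helper `within_slope` are hypothetical and use one regressor only.

```python
from collections import defaultdict

def within_slope(rows):
    """Fixed-effects (within) slope for a single regressor.

    rows: iterable of (household_id, x, y) tuples in long format.
    """
    groups = defaultdict(list)
    for hh, x, y in rows:
        groups[hh].append((x, y))

    num = 0.0
    den = 0.0
    for obs in groups.values():
        # Demean x and y within each household.
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den

# Hypothetical data: households 1-3 are observed in both periods.
complete = [
    (1, 1.0, 3.1), (1, 2.0, 5.0),
    (2, 4.0, 9.2), (2, 3.0, 7.4),
    (3, 0.5, 1.0), (3, 2.5, 5.3),
]
# Households 4 and 5 have usable data in the first period only.
first_period_only = [(4, 6.0, 20.0), (5, 1.5, -4.0)]

slope_excluding = within_slope(complete)
slope_keeping = within_slope(complete + first_period_only)
print(slope_excluding, slope_keeping)
```

    The two slopes coincide because the single-observation households contribute zero to both the numerator and the denominator. This says nothing about standard errors or reported sample sizes, which can differ across software.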

    Excluding households with missing observations may be equally problematic: it depends on why the missing values are missing. If the process that causes missing data is completely uninformative (i.e. entirely independent of the values you would have seen had the data not been missing), then omitting such households introduces no bias, and the only damage is a smaller sample. But in many contexts the missingness is (partly) systematic, and the complete cases are a biased subset of the households. So you need a good understanding of why the missing values are missing before you can decide whether this approach is suitable.
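    The difference between the two situations can be seen in a small simulation. This is only a sketch: the 40% and 80% missingness rates, the 0.5 cutoff, and the seed are arbitrary assumptions, and it looks at a simple mean rather than a regression coefficient, where the consequences depend on how the missingness relates to the outcome and the regressors.

```python
import random

random.seed(12345)  # arbitrary seed, for reproducibility only

# Hypothetical second-period values for 20,000 households (true mean 0).
values = [random.gauss(0.0, 1.0) for _ in range(20000)]

def mean(xs):
    return sum(xs) / len(xs)

# Case 1: uninformative missingness. Every value has the same 40% chance
# of being missing, regardless of what the value is.
kept_uninformative = [v for v in values if random.random() > 0.4]

# Case 2: informative missingness. Values above 0.5 go missing 80% of the
# time; values at or below 0.5 are always observed.
kept_informative = [
    v for v in values if not (v > 0.5 and random.random() < 0.8)
]

full_mean = mean(values)
mean_uninformative = mean(kept_uninformative)
mean_informative = mean(kept_informative)

print("all households:          ", round(full_mean, 3))
print("uninformative, complete: ", round(mean_uninformative, 3))
print("informative, complete:   ", round(mean_informative, 3))
```

    In the first case the complete cases are just a smaller random sample, so their mean stays close to the full-sample mean. In the second case the large values are under-represented and the complete-case mean is pulled well below it.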

    Another possibility is that the missing values can be predicted in an unbiased way from the non-missing values of the same variables and other variables. There are no statistical tests that can determine whether this assumption holds in your data: once again, it depends on understanding the process that generated the missing data in the first place, and also on understanding the extent to which the variables are related to each other. With longitudinal data this often works well, with missing values of a given variable being predictable in an unbiased way from preceding and following values of the same variable and from values of other variables. In this case, multiple imputation will be a useful approach. Stata's multiple imputation command includes a number of approaches to imputation and is usable with most (but not all) regression models. The main drawbacks of this approach, when it is applicable, are that it is complicated to implement, is computationally intensive, and does not support many post-estimation commands.
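    To make concrete what multiple imputation does after the model has been fitted to each imputed dataset, here is a minimal sketch of the standard combination rules (Rubin's rules) for a single coefficient. The function name `pool_rubin` and the five estimates and variances are hypothetical, made up purely for illustration; in practice the software does this pooling for you.

```python
import math

def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed datasets (Rubin's rules).

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    within = sum(variances) / m                     # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return q_bar, math.sqrt(total), within, between

# Hypothetical results from m = 5 imputed datasets.
estimates = [0.50, 0.62, 0.44, 0.55, 0.49]
variances = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled, pooled_se, within, between = pool_rubin(estimates, variances)
print("pooled estimate:", round(pooled, 3))
print("pooled SE:      ", round(pooled_se, 4))
print("SE ignoring between-imputation variance:", round(math.sqrt(within), 4))
```

    The between-imputation term reflects the uncertainty about the missing values themselves. A single filled-in dataset has no such term, so its standard errors treat the imputed values as if they had been observed.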

    In earlier times, linear or other interpolation was often used to (singly) impute the missing values. But single imputation of any kind has serious limitations and results in biased estimation. It really should be used only in situations where there is good reason to believe that the imputed values created in this way are actually very accurate proxies for the unobserved values themselves. Good reason for such belief is seldom available.

    A fairly deep introduction to the general issues of missing data is from Paul Allison: https://pdfs.semanticscholar.org/58d...c218e126e4.pdf



    • #3
      Samyam:
      as an aside to Clyde's as always excellent reply, you may want to take a look at https://missingdata.lshtm.ac.uk/.
      Kind regards,
      Carlo
      (Stata 19.0)
