  • Complete data analysis/missing values

    Hello,
    I have a study sample in which each variable has less than 10% missing values, except for one variable in which missing values exceed 20%. The sample is very large (7012 individuals). I want to do an analysis without the cases that have missing values, i.e. reduce the overall sample size from 7012 to 5318, but I'm not sure how to do that.
    Thank you

  • #2
    So, if the variables in question are w, x, y, and z you can do this:
    Code:
    drop if missing(w, x, y, z)
    If this list of variables is, for practical purposes, too long to write out but can be expressed with wildcards, then you can do a loop over those variables, dropping any observation with a missing value.
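
    A minimal sketch of such a loop, assuming the variables of interest happen to match the wildcards q* and bp* (both are made-up names for illustration; substitute whatever varlist fits your data):
    Code:
    * q* and bp* are hypothetical wildcard patterns, not names from your data set
    foreach v of varlist q* bp* {
        * missing() works for both numeric and string variables
        drop if missing(`v')
    }
    Since drop cannot be undone, it is safer to run this on a copy of the data, or inside preserve/restore, so the full data set stays available.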

    All of that said, this is usually not a good way to handle missing data. And 10% missing is not small; 20% is very much not small. The issue is not so much that the sample size goes down, but that the resulting sample is usually biased as a result. Unless you are sure that the missingness of the data is a completely random event that is independent of the values of all of these variables, you will almost certainly be introducing bias into your analysis by dropping cases with missing values.
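
    Before dropping anything, it may help to look at how much is missing and in which combinations. A rough sketch, again using w, x, y, and z as stand-ins for your variables and assuming they are numeric:
    Code:
    * number of missing values per variable
    misstable summarize w x y z
    * which combinations of variables are missing together
    misstable patterns w x y z
    * how many observations a complete-case analysis would lose
    count if missing(w, x, y, z)
    This only describes the missingness; it cannot tell you whether the data are missing completely at random.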


    • #3
      So do you suggest I run each test separately, variable by variable? That would give a different number of observations for each analysis, but would remove the bias. I'm using Mantel-Haenszel and linear regression to test associations etc., and these will exclude missing data automatically.


      • #4
        Well, depending on your research goals and the way in which the data came to have missing values, that might be better. Missingness is a very complicated topic, and in most circumstances there are no truly good solutions to it. Even with an in-depth understanding of your situation, we would most likely be searching for some "least bad" solution to the problem.

        You are correct in pointing out that Mantel-Haenszel and linear regression automatically exclude observations with missing values on any involved variables. And it would be less damaging to just let them run their course on the full data set than to eliminate all observations with any missing values (including missing values on variables that aren't involved in a particular analysis). But even that is sometimes not a good solution.
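
        To see which observations a given command actually used, something along these lines should work. It is only a sketch: treating y as the outcome and x and z as predictors is an assumption made for illustration, and the variable name used is made up.
        Code:
        * regress silently omits observations with a missing value on y, x, or z
        regress y x z
        * number of observations in the estimation sample
        display e(N)
        * flag the observations that were used (hypothetical variable name)
        generate byte used = e(sample)
        tabulate used
        Nothing is dropped from the data set this way, so each analysis keeps every observation that is complete on the variables it involves.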

        May I recommend http://www.statisticalhorizons.com/w...ap-Allison.pdf for a good overview of the various approaches to missing data and some pros and cons.
