Missing values in a logit model

Ana Guerrero

Join Date: Oct 2022

Posts: 12
#1

15 Jan 2024, 12:26

Hi, I'm Ana.

I'm working with a logit model and have a question about missing values in my data.
The data come from a survey with around 300 thousand observations. If some variables have missing values, should I drop those observations, or should I keep them and code the missing values as 0 when I construct my dummy variables? Or is there some special treatment I should consider for missing values?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

15 Jan 2024, 13:45

This is actually a rather complicated question.

First, you should understand what Stata does with missing values when you run commands. In most situations, observations that contain a missing value on any variable that is mentioned in the command are excluded from the calculations. This is what is known as "complete cases" analysis.
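
To make the listwise deletion described above concrete, here is a minimal sketch in Python (not Stata code). The variable names and values are made up for illustration, and `complete_cases` is a hypothetical helper, not a Stata or library function. It keeps only the rows that have no missing value on any variable named in the "command".

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"employed": 1, "age": 34, "income": 52000},
    {"employed": 0, "age": None, "income": 18000},
    {"employed": 1, "age": 51, "income": None},
    {"employed": 0, "age": 27, "income": 21000},
]

def complete_cases(data, variables):
    """Keep only rows with no missing value on any of the named variables."""
    return [r for r in data if all(r[v] is not None for v in variables)]

# A model that mentions all three variables uses only the fully observed rows.
used = complete_cases(rows, ["employed", "age", "income"])
print(len(rows), "rows in the data,", len(used), "rows used in estimation")
```

Note that the estimation sample depends on which variables the command mentions: a model that leaves out `income` would keep one more row here.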

When complete cases analysis does not fulfill your needs, dealing with the missing values requires understanding how those values came to be missing in the first place. One thing that can be said is that picking an arbitrary value like 0 to substitute for the missing values will almost always be a terrible way of doing it. Unless the missing values arose because the process that created the data specifically omitted values when they are, in reality, zero, imputing zero for missing values will produce a grotesquely distorted data set that will produce erroneous, but possibly plausible looking results. That's really pretty close to the worst case scenario.
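
A small simulation may help show why substituting 0 is so damaging for a dummy variable. This is only a sketch in Python with invented numbers: a hypothetical yes/no item with a true "yes" rate of about 60%, and about 30% of answers removed at random. Coding the missing answers as 0 silently turns them into "no" answers.

```python
import random

random.seed(1)
n = 10000

# Hypothetical true answers to a yes/no item (1 = yes), about 60% yes.
true_answer = [1 if random.random() < 0.6 else 0 for _ in range(n)]

# Remove about 30% of the answers at random; None marks a missing answer.
observed = [a if random.random() > 0.3 else None for a in true_answer]

nonmissing = [a for a in observed if a is not None]
zero_filled = [0 if a is None else a for a in observed]

prop_true = sum(true_answer) / n
prop_cc = sum(nonmissing) / len(nonmissing)
prop_zero = sum(zero_filled) / n

print(f"true share of yes:               {prop_true:.3f}")
print(f"share of yes, missing dropped:   {prop_cc:.3f}")
print(f"share of yes, missing coded 0:   {prop_zero:.3f}")
```

In this setup the zero-filled share comes out far below the true share, because every missing answer has been counted as a "no". The resulting number still looks like a perfectly plausible proportion, which is what makes the error hard to spot.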

The best solution to the problem of missing data, and really the only truly good one, is to go out and find the values themselves. But that is rarely practical--and in a large scale survey such as your data, almost surely impossible. Another relatively happy situation is when the missing values arise from a completely random process: that is, the missingness is completely independent of what the true values are--this is known as missingness completely at random (MCAR). In this case, a complete cases analysis is an analysis of a random subset of the true data and will produce unbiased results. The reason this is only a "relatively happy" situation is that it is impossible to verify that the MCAR mechanism is actually what happened just from looking at the data that you have in hand. So it becomes, instead, an assumption that one makes based on what you know about how the data were gathered and the subsequent management of that data leading up to the data set you have. It can be plausible, for example, if missingness results from responses having not been received by a predetermined cutoff date, or a particular batch of laboratory specimens got lost in transit to the lab and were never processed--those are all entirely exogenous causes of missing data and the resulting data are MCAR.

Again, in your situation, MCAR is unlikely to hold. People sometimes decline to respond to particular questions because the applicable response for them is awkward or embarrassing to disclose. Consequently, missingness is likely to be greater for some values of the true response than others. However, it may be that the missing values can be unbiasedly predicted from responses to other items in the same survey. Where this is the case (or, more technically, where the missingness is independent of the true value of the response conditional on the non-missing data values--this is known as missingness at random, MAR), then there are two overall procedures that can reduce the bias that a complete cases analysis would provide: full-information maximum likelihood, and multiple imputation. Stata offers full information maximum likelihood in its structural equations modeling commands, but not, as far as I know, elsewhere. Stata has a suite of commands for multiple imputation that can be used with most (but not all) of its estimation commands. The drawback here is that the availability of post-estimation commands is pretty limited after using multiple imputation. Also, just as with MCAR, the validity of the MAR assumption cannot be determined by any analysis of the data in hand--it is an act of faith based on your understanding of how the data set came to be. (Do not be fooled by programs that purport to test the MAR assumption. Many of them are just plain wrong, based on misunderstandings of what MAR is. And others are valid only if you make other assumptions about the distributions of the true data--those assumptions being just as unverifiable as MAR itself.)
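
The MAR idea can be illustrated with a toy simulation. This is a sketch in Python under invented assumptions, not multiple imputation or full-information maximum likelihood: there is one always-observed binary variable `g`, the outcome `y` is more often 1 when `g` is 1, and `y` is more often missing when `g` is 1. Missingness depends only on `g`, so the data are MAR by construction. The complete cases mean of `y` is biased, while an estimate that conditions on `g` (the mean of `y` within each level of `g`, weighted by the full-sample share of that level) is not.

```python
import random

random.seed(2024)
n = 20000

rows = []
for _ in range(n):
    g = random.random() < 0.5                               # always observed
    y = 1 if random.random() < (0.7 if g else 0.2) else 0   # depends on g
    missing = random.random() < (0.6 if g else 0.1)         # depends only on g
    rows.append((int(g), y, missing))

# Known here only because the data are simulated.
true_mean = sum(y for _, y, _ in rows) / n

# Complete cases: drop every row where y is missing.
seen = [(g, y) for g, y, m in rows if not m]
cc_mean = sum(y for _, y in seen) / len(seen)

# Condition on g: observed mean of y within each level of g,
# weighted by that level's share of the full sample.
adjusted = 0.0
for level in (0, 1):
    share = sum(1 for g, _, _ in rows if g == level) / n
    ys = [y for g, y in seen if g == level]
    adjusted += share * sum(ys) / len(ys)

print(f"true mean of y:        {true_mean:.3f}")
print(f"complete cases mean:   {cc_mean:.3f}")
print(f"mean conditional on g: {adjusted:.3f}")
```

The correction works only because the simulation was built so that missingness depends on nothing but `g`. With real data that is exactly the assumption that cannot be checked.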

As you can see, this is a complicated topic. You may find https://statisticalhorizons.com/wp-c...aterials-1.pdf helpful in exploring it further.
daniel klein

Join Date: Mar 2014

Posts: 3850
#3

15 Jan 2024, 14:59

Originally posted by Clyde Schechter

Also, just as with MCAR, the validity of the MAR assumption cannot be determined by any analysis of the data in hand-- [....]

Not to undermine Clyde's valuable advice here, but you can test for MCAR and MAR. You cannot test either against MNAR (missing not at random), which would arguably be the much more interesting test; but you can test MCAR against MAR for all variables in the data.

Edit: Perhaps the above is still misleading. I should have written: You can test MCAR against MAR (or MNAR). That is, there are tests that can reject the null hypothesis of MCAR. Whether the data are MAR or MNAR cannot be tested.
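
As a rough illustration of what rejecting the null of MCAR can look like, here is a sketch in Python on simulated data. It is not Little's MCAR test or any Stata command, just a simple check in the same spirit: compare a fully observed binary variable `g` between records where another variable is missing and records where it is observed. Under MCAR the two groups should look alike. The function names and probabilities are made up for illustration.

```python
import math
import random

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def z_for_missingness(p_miss_g0, p_miss_g1, n=5000, seed=7):
    """Simulate g and a missingness flag, then compare the share of g == 1
    between records with a missing value and records without one."""
    rng = random.Random(seed)
    g1_missing = n_missing = g1_observed = n_observed = 0
    for _ in range(n):
        g = rng.random() < 0.5
        miss = rng.random() < (p_miss_g1 if g else p_miss_g0)
        if miss:
            n_missing += 1
            g1_missing += g
        else:
            n_observed += 1
            g1_observed += g
    return two_proportion_z(g1_missing, n_missing, g1_observed, n_observed)

z_mcar = z_for_missingness(0.3, 0.3)   # missingness ignores g
z_mar = z_for_missingness(0.1, 0.6)    # missingness depends on g

print(f"z when missingness ignores g:    {z_mcar:.2f}")
print(f"z when missingness depends on g: {z_mar:.2f}")
```

A large z statistic is evidence against MCAR. A small one does not establish MCAR, and neither result says anything about whether the data are MAR or MNAR, since that depends on the values that were never observed.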

Last edited by daniel klein; 15 Jan 2024, 15:18.