What to do with Not applicable, Don't know, Refusal (spontaneous) responses in survey data for regression

Archie Millar

Join Date: Mar 2022

Posts: 5
#1

01 Apr 2022, 11:23

Hello,

In the survey data I am using for my regression, most of the variables have one or more of the following responses: Not applicable, Don't know, Refusal, which are coded with out-of-range numeric values.
E.g. one of my variables, self-reported job satisfaction, is coded 1=very satisfied; 2=satisfied; 3=not very satisfied; 4=not at all satisfied; 8=DK/no opinion (spontaneous); 9=Refusal (spontaneous).

My question is: how should I deal with such observations?
Should I leave them as is, so that they are included in the regression? Should I drop them from my dataset? Or does it depend on how large a proportion of the variable's observations such responses make up?

Thanks.
Rich Goldstein

Join Date: Mar 2014

Posts: 4412
#2

01 Apr 2022, 12:12

what it depends on is what your research question is and how these relate to that question - since you have told us nothing about that, there is no way to give good advice without writing a text
Clyde Schechter

Join Date: Apr 2014

Posts: 29809
#3

01 Apr 2022, 12:17

Leaving them as is and including them in the regression is about the worst possible thing you could do. In effect, that would say that somebody who expressed no opinion or said he or she didn't know (coded 8) is twice as dissatisfied as somebody who said he or she is not at all satisfied (coded 4)! Clearly that is nonsense.

What you should do is replace those values with Stata missing values. If you wish to specifically maintain the distinction between the DK/no opinion and Refusal categories, you can do something like this:

Code:

mvdecode list_of_applicable_variables, mv(8 = .d \ 9 = .r)

That will replace the values 8 and 9 in those variables with Stata's "extended" missing values .d and .r, respectively. In all calculations, not just regressions, Stata will then understand that these values are excluded. (Note, it doesn't have to be specifically .d and .r; I chose those because they have mnemonic value. Stata has 26 "extended" missing values, .a through .z, and you can use whichever ones you like for this.)

Now, depending on your situation, you may or may not have any need to maintain the difference between those two categories, and it may be simpler to just lump them together as "missing." In that case, you can simply run

Code:

mvdecode list_of_applicable_variables, mv(8 9)

and both 8 and 9 will be replaced by Stata's "system" missing value (which shows up in listings as a period).
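
A minimal sketch of the recoding described above, run on a made-up six-observation dataset. The variable name jobsat and the value label name jobsat_lbl are assumptions for illustration, not names from the original data.

Code:

* toy data: one observation per response code (hypothetical)
clear
input jobsat
1
2
3
4
8
9
end

* recode 8 and 9 to extended missing values, keeping them distinguishable
mvdecode jobsat, mv(8 = .d \ 9 = .r)

* extended missing values can carry value labels of their own
label define jobsat_lbl 1 "very satisfied" 2 "satisfied" 3 "not very satisfied" 4 "not at all satisfied" .d "DK/no opinion" .r "Refusal"
label values jobsat jobsat_lbl

* the missing option lists .d and .r as separate rows
tabulate jobsat, missing

* summarize uses only the four nonmissing responses
summarize jobsat

With the missing option, tabulate still shows how many DK and Refusal responses there were, while summarize and estimation commands leave them out.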
Rich Goldstein

Join Date: Mar 2014

Posts: 4412
#4

01 Apr 2022, 13:01

note that Clyde Schechter and I have made a different assumption here - Clyde's response assumes you will enter that variable as a quantitative variable while I assumed possible entry as a categorical variable; note also that if you follow Clyde's advice and you have numerous missing values you will need to do something about this (e.g., multiple imputation)

Last edited by Rich Goldstein; 01 Apr 2022, 13:03.
Clyde Schechter

Join Date: Apr 2014

Posts: 29809
#5

01 Apr 2022, 13:20

First, something weird happened here on Statalist. I was not aware of Rich Goldstein's response in #2 when I wrote what is now #3. But strangely, after I posted that reply, it showed up here as #2 and there was still no sign of Rich Goldstein's first response on this thread, even though it is timestamped earlier than mine! Anyway, this writing is the first I have seen what he wrote here.

And he is right. I did assume you intend to treat the 1-4 response scale as a quantitative (ordinal or interval-level) variable. If you are going to treat these variables as categorical, then you might well preserve the coding as 8 and 9. You might even choose to do that for some analyses, but convert them to missing values for others, or to handle Refused one way and Don't know another way. It requires some thought.

Consider the "Refused" response group. In one sense, they are a distinct group from those who responded somewhere on the 1 through 4 scale. In another sense, however, you might reason that they must have had some level of satisfaction that fits somewhere along that 1 to 4 scale; they're just withholding that information. From that perspective, this "Refused" group is actually a mixture of people from each of the 1 through 4 response options. By including it as a separate category, you may be biasing the estimates associated with the 1 through 4 categories themselves. (This same argument would not be so readily applicable to the Don't Know category: if we take them at their word that they don't know, then they really aren't a mixture of 1 through 4 people who are just withholding the information.)

My point is, it's complicated!
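
A rough sketch of keeping both treatments side by side, on simulated data. The outcome y, the copy jobsat_q, and the planted counts of 8s and 9s are all made up for illustration; nothing here reflects the original poster's data or model.

Code:

* simulated data (hypothetical): jobsat takes the values 1-4, 8 and 9
clear
set seed 12345
set obs 200
generate jobsat = ceil(4 * runiform())
replace jobsat = 8 in 1/10
replace jobsat = 9 in 11/25
generate y = rnormal()

* treatment 1: categorical, with 8 and 9 kept as their own levels
regress y i.jobsat

* treatment 2: a quantitative copy in which 8 and 9 become extended missing
clonevar jobsat_q = jobsat
mvdecode jobsat_q, mv(8 = .d \ 9 = .r)
regress y c.jobsat_q

Working on a copy leaves the original 8/9 coding intact, so Refused and Don't know can still be handled differently in later analyses.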
Archie Millar

Join Date: Mar 2022

Posts: 5
#6

03 Apr 2022, 12:20

Thank you very much for your responses. Actually, you are both right: that particular variable is the ordinal dependent variable, but I will be using some continuous explanatory variables with the same issue as well.
So, from what you have both said, I think replacing with missing values is a good way to go, and my dataset is sufficiently large not to worry about many missing values.
For some of my variables it may be fruitful to include that information for refusals too, so I will have a think.

Thanks again!
Rich Goldstein

Join Date: Mar 2014

Posts: 4412
#7

03 Apr 2022, 12:35

based on the lit., it is the percentage of missing values that brings danger, not the number; I know of 2 guidelines: 3% and 5% - i.e., if in your analysis you lose more than, say, 5% of the observations to missing data, then you may be in dangerous territory
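
One way to check that percentage, sketched on a toy dataset. The variable names jobsat, income and age, and the planted missing values, are hypothetical; the scalar pct_lost is just a convenience name.

Code:

* toy data with missing values planted in known rows (hypothetical)
clear
set obs 100
generate jobsat = mod(_n, 4) + 1
generate double income = 20000 + 100 * _n
generate age = 20 + mod(_n, 40)
replace jobsat = .d in 1/3
replace income = .r in 3/10

* per-variable counts of system and extended missing values
misstable summarize jobsat income age

* share of observations a regression on these variables would drop
egen byte nmiss = rowmiss(jobsat income age)
quietly count if nmiss > 0
scalar pct_lost = 100 * r(N) / _N
display "Percent of observations lost to listwise deletion: " %5.2f pct_lost

The figure that matters is the share of observations with a missing value on any variable in the model, which can be well above the share missing on any single variable.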
Archie Millar

Join Date: Mar 2022

Posts: 5
#8

06 Apr 2022, 11:19

Originally posted by Rich Goldstein:

based on the lit., it is the percentage of missing values that brings danger, not the number; I know of 2 guidelines: 3% and 5% - i.e., if in your analysis you lose more than, say, 5% of the observations to missing data, then you may be in dangerous territory

Ok, so for example my income variable (continuous) includes 88888888.0=DK (spontaneous); 99999999.0=Refusal (spontaneous). Those two codes account for 3.88% and 11.43% of the observations on that variable, respectively. So under the 5% guideline, does that mean I should not just convert these to missing values and leave it at that, but instead use either single imputation (e.g. the mean value) or multiple imputation?

Thanks.
Rich Goldstein

Join Date: Mar 2014

Posts: 4412
#9

06 Apr 2022, 12:17

the potential problem with this much missing data is bias so, yes, if you can believe that the data are missing at random (or even completely at random but that seems unlikely), you can use multiple imputation; I would not use single imputation as the variance/standard deviation will be under-estimated
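
A hedged sketch of what multiple imputation could look like in Stata, on simulated data. The variables y, age and female, the imputation model, and the choice of 20 imputations are all assumptions made for illustration; they are not a recommendation for the actual data.

Code:

* simulated data (hypothetical); income is stored as double because a float
* cannot hold 99999999 exactly
clear
set seed 2022
set obs 500
generate age = 20 + floor(45 * runiform())
generate byte female = runiform() < 0.5
generate double income = 20000 + 500 * age + rnormal(0, 5000)
generate y = 1 + 0.00005 * income + 0.01 * age + rnormal()
replace income = 88888888 in 1/20
replace income = 99999999 in 21/80

* mi imputes only system missing (.); extended missing values such as .d and .r
* are treated as "hard" missing and left alone, so recode to . here
mvdecode income, mv(88888888 99999999)

* declare the mi structure and the variable to be imputed
mi set wide
mi register imputed income

* impute income from the outcome and covariates, then pool the estimates
mi impute regress income y age i.female, add(20) rseed(2022)
mi estimate: regress y income age i.female

In real data it is worth checking the storage type of the income variable before recoding, since a code like 99999999 may not have been stored exactly.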