Compute mean and mode for missing data

Dave Velthoff

Join Date: Feb 2018

Posts: 4
#1


27 Feb 2018, 01:25

Hello,

I'm somewhat new to Stata and have been looking for days for an appropriate solution. I hope you can help me with my question.

My dataset contains some missing data. For illustrative purposes, I provide three (example!) variables and their type:

1. Age: Continuous variable (integer)
2. Level of education: Ordinal variable (1: low, 2: intermediary, 3: high)
3. Gender: Nominal variable / dummy (1: male, 2: female)

For Age, I want to compute the sample mean (excluding missing values from the computation) and assign that mean only to the missing values.
For Level of education, I want to compute the mode (the value with the highest frequency) of the sample and assign it to the missing values.
For Gender, I likewise want to compute the mode of the sample and assign it to the missing values.

Furthermore, what is the best way to deal with multiple modes? Given that the ordinal and nominal variables have categorical values, taking the average of two modes is not going to work.

Thank you!

Best, Dave
Nick Cox

Join Date: Mar 2014

Posts: 35433
#2

27 Feb 2018, 01:53

The command egen has functions for mean and mode which you can apply groupwise.

There is no easy answer when there are ties for mode. But the function just mentioned has various options that might appeal.
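
As a minimal sketch of the egen functions just mentioned, run on made-up data: the variable names (age, educ, gender) and the values are assumptions for illustration only. The minmode and maxmode options of mode() pick the lowest or highest of several tied modes; without such an option, a tie yields missing.

```stata
clear
input age educ gender
24 1 1
42 2 2
35 2 2
.  1 .
51 . 1
end

* Mean of the nonmissing ages, stored as a constant in every observation
egen age_mean = mean(age)

* educ has two modes here (1 and 2); minmode picks the lowest of them
egen educ_mode = mode(educ), minmode

* gender is also tied (1 and 2); maxmode picks the highest mode
egen gender_mode = mode(gender), maxmode

list
```

Which tie-breaking rule makes sense, if any, is a substantive choice rather than a technical one.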

This is a marker for all those discussions that might point out that these methods are widely considered long past their sell-by date for imputation. For example, assume two categories. Then assigning the mode to missing values will just bias estimation of the probability of the more frequent category unless you're certain that the missings all belong to the modal category. Concretely, imagine 5 females, 3 males, 2 missing. After imputation we are estimating pr(female) as 0.7 rather than 0.625. Naturally, this is a fairly cheap criticism and anything more elaborate is also much harder work and not white magic in any case.
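
The arithmetic behind that example can be checked directly with display, using only the counts given above (5 females, 3 males, 2 missing):

```stata
* Share of females among the 8 observed values
display 5/(5 + 3)

* Share of females after the 2 missings are set to the modal category
display (5 + 2)/(5 + 3 + 2)
```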
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#3

27 Feb 2018, 01:57

You are better off not replacing the missing values than replacing them by the mean/median/mode. The latter will tend to make things worse. Imagine a scatterplot for a bivariate regression, and where your "imputed" values end up in that scatterplot. If you really want to deal with the missing values you can look at help mi, but the default of ignoring cases with missing values tends to be fairly robust (compared to the alternatives). So my recommendation would generally be that, unless you are an expert, you are better off leaving the missing values alone and just focusing on the data you do have.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35433
#4

27 Feb 2018, 06:11

Indeed. There is software that gives up if missing values are met anywhere and not explicitly excluded, but Stata's general convention is to ignore missing data unless you specify otherwise (and sometimes even then).
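
A quick way to see that convention at work, sketched with the auto dataset shipped with Stata (chosen here only because its rep78 variable has some missing values; it has nothing to do with the original poster's data):

```stata
sysuse auto, clear

* How many observations have rep78 missing?
count if missing(rep78)

* summarize silently uses only the nonmissing values of rep78
summarize rep78

* regress drops every observation with a missing value in any model variable
regress price mpg rep78

* Estimation sample size versus number of observations in memory
display e(N) " of " _N " observations used"
```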
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

27 Feb 2018, 06:50

For continuous variable age

Code:

egen age_new = mean(age)
replace age_new = age if age!=.

and similar logic for the ordinal variable.

The syntax you used computes the mean of age limited to just those observations where age is missing, so for those observations age_new will be missing, and for observations where age is not missing, nothing is computed, so for those observations, age_new will also be missing.
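
To spell out the "similar logic" for the ordinal variable, here is a small sketch on made-up data. The names educ and educ_new and all values are assumptions for illustration; the toy data have a single mode, so no tie-breaking option is needed.

```stata
clear
input age educ
24 1
42 2
35 2
.  3
51 .
end

* Continuous variable: mean of all nonmissing ages, no if qualifier on egen
egen age_new = mean(age)
replace age_new = age if !missing(age)

* Ordinal variable: same pattern with the mode (2 is the only mode here)
egen educ_new = mode(educ)
replace educ_new = educ if !missing(educ)

list
```

The originals age and educ are left untouched, so it stays visible which values were observed and which were filled in.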
Nick Cox

Join Date: Mar 2014

Posts: 35433
#6

27 Feb 2018, 06:54

William is right, but to make it concrete, imagine ages 24, 42, 124 and two missings. You're asking what's the mean of the missings and Stata can only return missing as a result.
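
That example can be reproduced as follows. The if missing(age) qualifier is only a guess at the kind of syntax under discussion, since the original command is not shown in this thread; the new variable names are made up.

```stata
clear
input age
24
42
124
.
.
end

* Restricted to the missings: there is nothing to average, so the result
* is missing, and observations outside the if condition get missing too
egen mean_of_missings = mean(age) if missing(age)

* No restriction: egen ignores the missings and averages 24, 42 and 124
egen mean_of_observed = mean(age)

list
```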
john ferejohn

Join Date: Feb 2018

Posts: 2
#7

27 Feb 2018, 07:11

I am having trouble setting up a panel data set. I am using the European Social Survey Cumulative File, for those countries that had surveys in all 7 rounds. The data are categorized by country (numerical) and essround (numerical), and when I tabulate these variables they look right (the table looks like a balanced panel).
But if I do xtset country it says it is unbalanced. Why?
And if I try xtset country essround I get error 451, repeated time values within panel.

I followed the fix posted by Nick Cox, but when I get to the remedy (duplicates tag....) it basically classifies the whole dataset as duplicates.
I am stuck.
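
One possible reading of these symptoms, offered only as a guess without seeing the data: if each row is an individual respondent, then every combination of country and essround occurs many times, which is what error 451 complains about. The sketch below uses made-up data; trust is a hypothetical outcome variable, and collapsing to one row per country and round is just one option that may or may not suit the analysis.

```stata
clear
input country essround trust
1 1 5
1 1 7
1 2 6
2 1 4
2 2 3
2 2 8
end

* How often does each country-round combination occur?
duplicates report country essround

* xtset country essround would fail here with r(451)

* One row per country and round, then declare the panel structure
collapse (mean) trust, by(country essround)
xtset country essround
```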
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#8

27 Feb 2018, 07:46

John:
welcome to this forum.
Please repost your query following the FAQ advice (mainly: do not attach your query to an existing thread on a totally different subject; provide an excerpt/example of your data via -dataex-; post what you typed and what Stata gave you back within CODE delimiters). Thanks.
With a bit of guess-work I would say that your panel dataset has missing values.
Please note that Stata can easily handle both balanced and unbalanced panel datasets.

Kind regards,
Carlo
(Stata 19.0)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#9

27 Feb 2018, 07:49

Dave:
in addition to the excellent advice already provided, you may want to take a look at https://missingdata.lshtm.ac.uk/, which debunks the black magic behind the tragically oversold naive methods of dealing with missing values (such as replacing missing data with the mean of the observed ones).

Kind regards,
Carlo
(Stata 19.0)
john ferejohn

Join Date: Feb 2018

Posts: 2
#10

27 Feb 2018, 17:32

Carlo,

Thanks. I am not sure where to post the query. But this is not really about missing data. I cannot actually get the xtset command to work without errors, and none of Nick's fixes work. I just happened to mention that when I set up a one-way panel (no time series), Stata reported that the panel was unbalanced. I know that Stata can handle such situations once I get in the door, but I cannot get in the door! Is that clearer?

john
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

27 Feb 2018, 18:06

john ferejohn

While you are reading this answer, scroll to the top of the page above the first post, locate the word General, and click on it. In the page that opens, you will see a button labelled "+ New Topic". That is how you create a new topic as Carlo advised.

Before posting, you should also follow Carlo's advice and review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.