Replacing missing values by sample average to retain sample size

Jean Ndenzako

Join Date: Mar 2014

Posts: 23
#1


25 Feb 2017, 12:48

I intend to replace missing values by the sample average in order to retain the sample size.
Is this an acceptable approach, and if so, how can I proceed (I need help with the code)?

Thank you for your usual assistance and knowledge sharing.

Jean
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

25 Feb 2017, 13:32

Depending on what you are doing it might be a minimally acceptable approach, but usually not. In general, any single-imputation procedure has the problem of resulting in a variable in which the variance is biased downward. This tends to result in overestimating the associations of that variable with other variable outcomes. In addition, filling in missing values in this way creates the illusion of having a full sample, but not the reality.
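To make the point about the variance concrete, here is a minimal sketch in Python (not Stata), using only the standard library. The ages are made-up numbers and `mean_impute` is a hypothetical helper written for this illustration. It only shows that filling gaps with the observed mean leaves the mean unchanged while shrinking the sample variance.

```python
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed_values = [v for v in values if v is not None]
    fill = statistics.fmean(observed_values)
    return [fill if v is None else v for v in values]


# Hypothetical ages; None marks a missing response.
ages = [22, 35, None, 41, 58, None, 29, None, 47, 63]

observed = [v for v in ages if v is not None]
imputed = mean_impute(ages)

# Sample variance of the observed cases only.
var_observed = statistics.variance(observed)
# Sample variance after mean filling: the filled values add nothing to the
# sum of squared deviations, but they do inflate the denominator.
var_imputed = statistics.variance(imputed)

print("n observed:", len(observed), "n after filling:", len(imputed))
print("variance (observed only):", round(var_observed, 2))
print("variance (mean-imputed): ", round(var_imputed, 2))
```

With 7 observed values out of 10, the variance after filling is (7 - 1)/(10 - 1) of the observed-case variance, so the "full" sample looks more precise than the data warrant.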

Why don't you describe your data and your research goals more fully, and perhaps somebody can suggest a suitable approach. It would also be particularly important to know how the missing values came to be missing. What process in the world caused those values to not be available? Also, how much of your data is missing?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

25 Feb 2017, 14:18

Just as a side note, Stata has the command - mi impute pmm - for imputing values by predictive mean matching. Personally, I have never used it, because - mi impute chained - and - mi impute mvn - suited my needs perfectly.

That said, should you really wish to use the mean for the imputation, I believe this method is better than simply inserting the mean value, whose main drawbacks Clyde underlined.

Comparing it with linear regression, the Stata manual states:

Predictive mean matching may be preferable to linear regression when the normality of the underlying model is suspect. Predictive mean matching (PMM) is a partially parametric method that matches the missing value to the observed value with the closest predicted mean (or linear prediction). It was introduced by Little (1988) based on Rubin’s (1986) ideas applied to statistical file matching. PMM combines the standard linear regression and the nearest-neighbor imputation approaches. It uses the normal linear regression to obtain linear predictions. It then uses the linear prediction as a distance measure to form the set of nearest neighbors (possible donors) consisting of the complete values. Finally, it randomly draws an imputed value from this set. By drawing from the observed data, PMM preserves the distribution of the observed values in the missing part of the data, which makes it more robust than the fully parametric linear regression approach.
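For readers who want to see the matching idea spelled out, the following is a rough Python sketch of the steps described in the quote, written for illustration only. It is not Stata's implementation of - mi impute pmm -: it uses a single predictor, it skips the step of drawing the regression parameters from their posterior distribution, and it produces one completed data set rather than several. The function names, the data, and the choice of `k=3` donors are all assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=3, rng=None):
    """Fill None entries of ys with values borrowed from observed donors."""
    rng = rng or random.Random()
    donors = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in donors], [y for _, y in donors])

    def predict(x):
        return intercept + slope * x

    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)  # observed values are kept as they are
            continue
        target = predict(x)
        # The k complete cases whose linear prediction is closest to the target.
        nearest = sorted(donors, key=lambda d: abs(predict(d[0]) - target))[:k]
        # Draw one donor at random and borrow its observed value.
        completed.append(rng.choice(nearest)[1])
    return completed


# Hypothetical data: income is missing for two respondents.
ages = [25, 31, 38, 44, 52, 59, 33, 47]
incomes = [21.0, 26.5, None, 35.0, 41.5, None, 27.0, 38.0]

completed = pmm_impute(ages, incomes, k=3, rng=random.Random(42))
print(completed)
```

Because every filled-in value is copied from an actual respondent, the imputed values can never fall outside the range of the observed data, which is the robustness property the manual refers to.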

Last edited by Marcos Almeida; 25 Feb 2017, 14:24.

Best regards,

Marcos
Jean Ndenzako

Join Date: Mar 2014

Posts: 23
#4

25 Feb 2017, 14:48

Thank you both, Clyde and Marcos, for your responses. I have a Labour Force Survey with a sample of 73,099 respondents to the question: "are you a member of a union?". I am interested in estimating a probit model of the determinants of union membership. Some of the independent variables, inter alia age and sector of occupation, have missing values for some individuals (possibly random non-response). From the first paragraph of Clyde's response, I gather that filling in missing values may not be a good idea, especially since I do not know how the missing values came to be missing. Sorry if I extended the discussion beyond the purpose of this forum, but my real question was about the merits of imputation of missing values as an approach to preserving the sample size, and the Stata code to do so if it is an econometrically acceptable approach.

Last edited by Jean Ndenzako; 25 Feb 2017, 14:55.
Bruce Weaver

Join Date: May 2014

Posts: 1133
#5

25 Feb 2017, 15:49

Hi Jean. Here are a few nice (and quite readable) articles that discuss the main issues around missing data, including the problems with the traditional methods of dealing with missing data.
http://onlinelibrary.wiley.com/doi/1...191.x/abstract

http://folk.ntnu.no/slyderse/medstat...006/Shafer.pdf

http://imaging.mrc-cbu.cam.ac.uk/sta...get=graham.pdf

HTH.

Last edited by sladmin; 27 Feb 2017, 08:09. Reason: administrative approval to display links

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)
Bruce Weaver

Join Date: May 2014

Posts: 1133
#6

25 Feb 2017, 15:57

Hi Jean. I attempted an earlier reply with some hyperlinks to articles inserted, but it was rejected, apparently. So I'll give you the citations to the articles instead. All of these are very readable, I think, and discuss the main issues around missing data, including the problems with traditional methods of dealing with it (like mean substitution). HTH.
Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67(4), 1012-1028.

Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147.

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

25 Feb 2017, 16:15

Hello Jean,

Missing data is quite a complex issue. The resources suggested by Bruce will surely help you.

That said, considering you stated that your

[...] real question was about the merits of imputation of missing values

here goes, well, a sort of tentative reply.

In a nutshell (really, short to a fault), missing-data mechanisms are classified into three classes: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).

In general terms, under MCAR, the results of imputation won't differ much from those of listwise deletion.

With MNAR, it is too hard to demonstrate the validity of "an acceptable approach", although alternatives exist, most of them arcane to me.

Well, that leaves only MAR for discussion. Indeed, MAR is the ideal scenario for multiple imputation!
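A small simulation may help to separate MCAR from MAR. This is only a sketch in Python (not Stata) with invented variables: `ages` is always observed, `hours` is a hypothetical outcome that depends on age, and the missingness probabilities are arbitrary choices made for the illustration. It compares the complete-case (listwise deletion) mean of `hours` with the mean in the full simulated data.

```python
import random

rng = random.Random(2017)
n = 20000

# Fully observed covariate and an outcome that depends on it (both invented).
ages = [rng.uniform(20, 60) for _ in range(n)]
hours = [10 + 0.5 * a + rng.gauss(0, 3) for a in ages]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(hours)

# MCAR: every value has the same 30% chance of being missing,
# regardless of age or of the value itself.
kept_mcar = [h for h in hours if rng.random() > 0.3]

# MAR: the chance of being missing rises with the observed age
# (0% at age 20, 80% at age 60), but not with the value itself given age.
kept_mar = [h for a, h in zip(ages, hours) if rng.random() > (a - 20) / 50]

mean_mcar = mean(kept_mcar)
mean_mar = mean(kept_mar)

print("full-data mean:          ", round(true_mean, 2))
print("complete cases under MCAR:", round(mean_mcar, 2))
print("complete cases under MAR: ", round(mean_mar, 2))
```

Under MCAR the complete cases are a random subsample, so their mean stays close to the full-data mean. Under MAR the older respondents are under-represented among the complete cases, so the unadjusted complete-case mean is pulled downward; because the missingness depends only on the observed age, a method that conditions on age has the information it needs to correct for this.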

Single-imputation, as Clyde warned, is full of pitfalls.

Finally, in case you haven't yet read the Stata manual on multiple imputation, I strongly recommend that you do so: there you will find lots of helpful information.

Hopefully that helped.

Best regards,

Marcos
Jean Ndenzako

Join Date: Mar 2014

Posts: 23
#8

26 Feb 2017, 07:14

Thank you, Bruce and Marcos. I am reading the references that you both indicated and already have a slightly better understanding of the question I raised. I will seek more guidance should the need arise.
Jean
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#9

26 Feb 2017, 07:19

Jean:
I share all the previous hints, recommendations, references and warnings about imputing the sample mean (as Clyde pointed out, at best your variance will collapse).
I would also recommend that you take a look at the following source: http://missingdata.lshtm.ac.uk/downloads/guidelines.pdf

PS: Crossed with Jean's reply.

Kind regards,
Carlo
(Stata 19.0)
Jean Ndenzako

Join Date: Mar 2014

Posts: 23
#10

26 Feb 2017, 13:41

Thanks Carlo
danishussalam

Join Date: Jul 2014

Posts: 140
#11

02 Mar 2017, 05:03

Hey Marcos,

How do I save the imputed values into my dataset? I realize that mi impute pmm or logit doesn't change the missing values in the dataset. I plan to use the imputed dataset for PSM analysis later.
daniel klein

Join Date: Mar 2014

Posts: 3850
#12

02 Mar 2017, 05:09

Answered here.

While it is understandable that you want fast answers, please do not post in old or other threads that are only loosely related to your question.

Best
Daniel