Is there a way in Stata to drop variables if they are greater than a certain percent missing in data

row2014

Join Date: Oct 2014

Posts: 5
#1


13 Nov 2014, 15:34

Hi, I am pretty new to Stata, so please forgive me if this is somewhere in the Help. I tried to find it, but haven't had any luck.

I have a dataset with over 8000 variables. I need a way to drop variables if they are missing more than x percent of the time in a dataset, or alternatively keep variables that are present more than x percent of the time. So far, I haven't seen a way to drop variables unless I drop the variable if it is ever missing. I ran MDESC in my dataset, and at least 1000 of them have no values whatsoever. I just want to drop the variables that are either 100% empty or empty more than X percent of the time.

Thanks for your help.

Rachel Owsley
Tags: None

Jorge Eduardo Perez Perez

Join Date: Mar 2014
Posts: 429

13 Nov 2014, 16:13

There may be a shorter way to do this, but this will work:

Code:

clear
* Generate some data
input x y
1 1
1 1
1 .
. .
end
* x missing 25% of the time, y missing 50% of the time
* Proportion missing at or above which we'll drop variables
glo p=0.5
* Loop over variables
foreach var of varlist * {
    count if `var'==.
    if (r(N)/_N) >= $p drop `var'    
}

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

13 Nov 2014, 16:18

Jorge Eduardo's code needs generalisation if string variables are present.

Code:

foreach var of varlist * {
    qui count if missing(`var')
    if (r(N)/_N) >= $p drop `var'
}

I wrote a dropmiss to drop variables or observations that are all missing.

Not supporting what is asked for here was a deliberate personal choice.

SJ-8-4 dm89_1 . . . . Dropping variables or observations with missing values
(help dropmiss if installed) . . . . . . . . . . . . . . . N. J. Cox
Q4/08 SJ 8(4):594
update in style and content; added a new force option

STB-60 dm89 . . . . . Dropping variables or observations with missing values
(help dropmiss if installed) . . . . . . . . . . . . . . . N. J. Cox
3/01 pp.7--8; STB Reprints Vol 10, pp.44--46
drops variables or observations with all values (optionally
any values) missing
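
A minimal sketch of the loop in this post, run on toy data that includes a string variable. The variable names, the local macro p (used here in place of the global) and the 0.5 threshold are assumptions for illustration only. It collects the names first and drops them in one step, so nothing is removed until the whole list has been checked.

```stata
clear
* Toy data: x is numeric, y is numeric, s is a string variable
input x y str1 s
1 1 "a"
1 1 ""
1 . ""
. . ""
end

* Assumed threshold: drop variables missing in this share of observations or more
local p = 0.5
local todrop

foreach var of varlist * {
    * missing() works for numeric and string variables alike
    quietly count if missing(`var')
    if r(N) / _N >= `p' {
        local todrop `todrop' `var'
    }
}

display "Dropping: `todrop'"
if "`todrop'" != "" {
    drop `todrop'
}
describe, short
```

Here x is missing in 1 of 4 observations and is kept, while y (2 of 4) and s (3 of 4) meet the threshold and are dropped.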
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

13 Nov 2014, 17:55

See also nmissing

SJ-5-4 dm67_3 . . . . . . . . . . Software update for nmissing and npresent
(help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox
Q4/05 SJ 5(4):607
now produces saved results

SJ-3-4 sg67_2 . . . . . . . . . . Software update for nmissing and npresent
(help nmissing, npresent if installed) . . . . . . . . . . N. J. Cox
Q4/03 SJ 3(4):449
updated to include support for by, options for checking
string values that contain spaces or periods, documentation
of extended missing values .a to .z, and improved output

STB-60 dm67.1 . . . . Enhancements to numbers of missing and present values
(help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox
3/01 pp.2--3; STB Reprints Vol 10, pp.7--9
updated with option for reporting on observations

STB-49 dm67 . . . . . . . . . . . . . Numbers of missing and present values
(help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox
5/99 pp.7--8; STB Reprints Vol 9, pp.26--27
commands to list the numbers of missing values and nonmissing
values in each variable in varlist
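
If nmissing is not installed, a rough report of the percent missing per variable can be sketched with official commands only. This is not a substitute for nmissing, just an illustration on assumed toy data; nothing is dropped.

```stata
clear
input x y
1 1
1 1
1 .
. .
end

* Show the percent of observations missing for each variable
foreach var of varlist * {
    quietly count if missing(`var')
    display "`var'" _col(20) %6.1f (100 * r(N) / _N) " % missing"
}
```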
Comment
row2014

Join Date: Oct 2014

Posts: 5
#5

24 Nov 2014, 23:26

Hi Nick,

Sorry for my delayed response and thank you for your reply. The code is very helpful. I looked at dropmiss, nmiss, mdesc, and mvpatterns. I haven't tried dropmiss yet, because normally I wouldn't just drop variables without checking their relationship to the target. Unfortunately, with over 8000 variables, the variable selection techniques I've tried have not worked. Should PCA or stepwise work on such a large dataset with Stata? I haven't had luck with it, but I am pretty new to Stata. I have Stata/MP 13.1 on a dual-core processor. I am loath to just drop them because some of the variables that are missing could be significant when they are present, so I could transform them into binaries. Any suggestions for how to proceed would be much appreciated.

Best Regards,

Rachel
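
The idea of turning missingness into binaries, mentioned in this post, could be sketched as follows. The toy data and the miss_ prefix are assumptions; with long variable names the prefixed name could exceed Stata's 32-character limit, so real data may need a different naming scheme.

```stata
clear
input x y
1 1
1 1
1 .
. .
end

* One 0/1 indicator per variable: 1 if the value is missing, 0 otherwise
foreach var of varlist x y {
    generate byte miss_`var' = missing(`var')
}
list
```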
Comment

row2014

Join Date: Oct 2014
Posts: 5

24 Nov 2014, 23:27

Thank you, Jorge!

Much appreciated.

Rachel

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

25 Nov 2014, 03:04

Rachel:

Please change your identifier to "Rachel Owsley" as signed in post #1. Otherwise list etiquette is to request that you use that signature always.

I don't understand quite what you are asking about whether PCA or stepwise regression will work in your case. Both techniques will ignore missing values if they exist and/or fail with "no observations" if presented with variables that are all missing. Otherwise it is usually better to choose variables according to their pertinence to a research problem; no software can do that for you.
Comment
row2014

Join Date: Oct 2014

Posts: 5
#8

25 Nov 2014, 17:25

Hi Nick, I will make those changes to the signature going forward. Thank you. As I mentioned, those procedures did not work. I assumed it was because of the number of variables and a memory problem caused by the size of the dataset. I will try them again. I am not sure I am understanding correctly. Are you saying that the entire PCA and stepwise procedure will fail, or that Stata will ignore that variable and run the other variables?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

25 Nov 2014, 17:34

In any Stata estimation command, including PCA and all the various regressions, observations are included only if they have non-missing values for every variable named in the command's variable list. As a corollary, if some variable has all missing values and you include it in the variable list, there will be no observations in the estimation sample and the command will fail. Stata does not ignore variables that are specified in the command: the only time Stata will omit a variable that is in a command's variable list is if there is collinearity. In that case, it will drop one (or more) of the variables involved.

Apart from variables that have all observations with missing values, scattered missing data can result in severe depletion of the set of observations available for the command because an observation will be dropped if any of the variables named in the variable list has a missing value.
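
A small sketch of how scattered missing values deplete the estimation sample, using assumed toy data. Each variable is missing in only one of five observations, yet only two observations have all three variables present, and those are the only ones an estimation command naming x, y and z would use.

```stata
clear
input x y z
1 1 .
2 3 5
3 . 6
. 4 7
5 6 8
end

* Number of missing values in each observation across the variables of interest
egen nmiss = rowmiss(x y z)

* Complete cases: observations with no missing value in x, y or z
count if nmiss == 0
```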
Comment
row2014

Join Date: Oct 2014

Posts: 5
#10

25 Nov 2014, 18:37

Thank you, Clyde. That clears it up for me. Very good point about scattered missing data. It's a problem in a lot of the data I deal with. I was hoping to run the mvpatterns command to find the most common combinations of variables that are missing and drop those variables, but this dataset was too large for mvpatterns and I got an error stating that. In this case, it may be better to start with the variables that are all there and, if those are not good enough, add those that are missing 5% of the time, 10% of the time, etc., until I have a set of significant variables. Out of the 8000, only 200 are never missing, and about another 100 are missing 10% of the time or less.
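
The tiered approach described here could be sketched by sorting variable names into local macros by missing share. The toy data, the macro names tier0 and tier10, and the 10% cutoff are assumptions for illustration; further tiers would follow the same pattern.

```stata
clear
input a b c d
1 1 1 .
2 2 . .
3 3 3 .
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 . 10 10
end

local tier0
local tier10

foreach var of varlist * {
    quietly count if missing(`var')
    local share = r(N) / _N
    if `share' == 0 {
        * never missing
        local tier0 `tier0' `var'
    }
    else if `share' <= 0.10 {
        * missing in at most 10% of observations
        local tier10 `tier10' `var'
    }
}

display "Never missing: `tier0'"
display "Missing 10% or less: `tier10'"
```

The macros can then be passed to later commands as variable lists, for example `tier0' alone first and `tier0' `tier10' together afterwards.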
Comment
