How to find the most non-missing cases with selected variables and observations

xu yr

Join Date: Apr 2022

Posts: 6
#1

07 Apr 2022, 03:51

My data set contains multiple variables, each of which contains missing values. I want a subset with complete data on every remaining variable. Is there a way to know which variables and observations to keep so that the subset has the most non-missing cases?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

07 Apr 2022, 03:53

See

Code:

help egen

for row functions to count missing values across variables.
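
As a minimal sketch of what that could look like, assuming the auto data shipped with Stata as a stand-in for the real data set (the variable name nmiss is just an illustration):

```stata
* Stand-in data; rep78 contains some missing values
sysuse auto, clear

* Count missing values in each observation across the numeric variables
egen nmiss = rowmiss(price-foreign)

* Distribution of the number of missing values per observation
tabulate nmiss

* Number of observations that are complete on all of these variables
count if nmiss == 0
```

The rowmiss() function counts within rows only; it says nothing by itself about which variables contribute most of the missing values.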
xu yr

Join Date: Apr 2022

Posts: 6
#3

07 Apr 2022, 04:25

Thank you for your quick reply. As I understand it, the egen function rowmiss() only gives the number of missing values in each row. I can then drop all observations with missing values to get a complete data set. However, if one variable contains many missing values, dropping that variable instead may lead to a complete data set with more cases/cells. If I want the resulting data set to have the most cases/cells, how should I get there from rowmiss()?
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#4

07 Apr 2022, 07:08

That is a very complex search problem. Think (don't actually do it, just think) about implementing this as a grid search. You can represent a candidate solution as a string of bits, where each of the first \(N\) bits indicates whether or not to include an observation, and each of the remaining \(k\) bits indicates whether or not to include a variable. So a grid search would have to go through \(2^{N + k}\) possible solutions, where \(N\) is the number of observations and \(k\) is the number of variables. That number gets ridiculously large ridiculously quickly, so forget about that.

You could try smarter algorithms, like a genetic algorithm, to search through that solution space. I don't know of such algorithms implemented in Stata, so you would probably have to implement one yourself (probably in Mata).

I would not do it. It is not worth it, because this entire exercise is pointless. You typically have variables you want or need to include, and that just determines the problem. If there are too many missing values in those variables, then the dataset is not appropriate for your problem, and it makes no sense to continue using it. If the number of missing values in those variables is manageable, then you're fine. There is just no need to find such an "optimal" set of observations and variables, as the really optimal set is just the set of variables that answer your research question and all the observations within those variables.
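
To put a number on the grid search described above, here is a small sketch. The sizes are hypothetical and chosen only for illustration; they are not the sizes of any real data set in this thread.

```stata
* Hypothetical sizes: 20 observations and 5 variables
local N = 20
local k = 5

* Number of candidate subsets a grid search would have to visit
display %15.0fc 2^(`N' + `k')

* The same count for a modest data set: 500 observations, 100 variables
display 2^(500 + 100)
```

Even the tiny first case already runs to tens of millions of candidates, and every extra observation or variable doubles the count.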

Last edited by Maarten Buis; 07 Apr 2022, 07:10.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
xu yr

Join Date: Apr 2022

Posts: 6
#5

07 Apr 2022, 07:49

Thanks for your answer. I have a multi-year data set, and I am trying to determine which years to use to get the richest data. You are right: if there is no quick way to do that, then it is not worth it.
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#6

07 Apr 2022, 07:58

That is a very different, and more manageable, problem. Just create the right table and look at it...
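
One possible version of such a table, as a sketch only: it assumes a long layout with one row per unit and year, and the toy data and the names year, x1, x2 and x3 are made up for illustration.

```stata
* Toy data in long layout; all names and values are hypothetical
clear
input year x1 x2 x3
2001 1 . 3
2001 4 5 .
2002 . . 6
2002 7 8 9
2003 1 2 3
end

* statistics(n) reports the number of non-missing values of each variable,
* here broken down by year
tabstat x1 x2 x3, by(year) statistics(n)
```

Years in which the counts are low for the variables that matter are then easy to spot by eye.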

xu yr

Join Date: Apr 2022

Posts: 6
#7

07 Apr 2022, 08:27

Haha, sorry I didn't make this clear. I have twenty years of data for each state, so there are about 100 variables, and the observations are the sales in each state in each year for 500 firms. The search framing makes a lot of sense, but it would require a lot of work. I guess I will just pick the variables and rows with fewer missing values to get a data set of acceptable size, rather than finding an optimal combination.

Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

07 Apr 2022, 10:57

missings from the Stata Journal offers some handles here.

Code:

. search dm0085, entry

Search of official help files, FAQs, Examples, and Stata Journals

SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/20   SJ 20(4):1028--1030
        sorting has been extended for missings report

SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q3/17   SJ 17(3):779
        identify() and sort options have been added

SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/15   SJ 15(4):1174--1185
        provides command, missings, as a replacement for, and extension
        of, previous commands nmissing and dropmiss
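
A short sketch of how missings might be used here, assuming it has already been installed via the search output above and using the auto data shipped with Stata as a stand-in (the variable name nmissing is just an illustration):

```stata
* Stand-in data; assumes the missings command is installed
sysuse auto, clear

* Variables with at least one missing value, with counts and percents
missings report, percent

* Observations tabulated by how many missing values they contain
missings table

* New variable holding the number of missing values in each observation
missings tag, generate(nmissing)
```

The report shows which variables are the worst offenders, and the tag variable can then be used to decide which observations to keep.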


xu yr

Join Date: Apr 2022

Posts: 6
#9

12 Apr 2022, 10:39

Thank you. It helps.