  • How to find the most non-missing cases with selected variables and observations

    My data set contains multiple variables, each of which contains missing values. I want a subset with complete data on every remaining variable. Is there a way to find out which variables and observations to keep so that the subset has the most non-missing cases?

  • #2
    See

    Code:
    help egen
    for row functions to count missing values across variables.
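
    As a minimal sketch of what that could look like: the shipped nlsw88 data and the variable list below are only stand-ins (assumptions for illustration), so substitute your own variables.

    ```stata
    * Illustration on a dataset shipped with Stata; the variable list is a stand-in
    sysuse nlsw88, clear

    * Number of missing values in each observation across the listed variables
    egen nmiss = rowmiss(grade industry occupation union hours tenure)
    tabulate nmiss

    * Number of observations that are complete on all listed variables
    count if nmiss == 0
    ```

    The companion function rownonmiss() counts non-missing values instead, if that reads more naturally for your problem.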


    • #3
      Originally posted by Nick Cox
      See help egen for row functions to count missing values across variables.
      Thank you for your quick reply. From my understanding, egen rowmiss only gives the number of missing values in each row. I can then drop all observations with missing values to get a complete data set. However, if one variable contains many missing values, dropping that variable may lead to a complete data set with more cases/cells. If I want the modified data set to have the most cases/cells, how should I get there from egen rowmiss?
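
      One rough way to explore the trade-off described above, without any claim to finding an optimal combination, is to check how many complete cases remain when each variable is left out in turn. This is only a sketch: the nlsw88 data and the variable list are stand-ins (assumptions for illustration), and nm is a scratch variable name chosen here.

      ```stata
      * Illustration on a dataset shipped with Stata; the variable list is a stand-in
      sysuse nlsw88, clear
      local vars grade industry occupation union hours tenure

      * Complete cases when all listed variables are kept
      egen nmiss = rowmiss(`vars')
      quietly count if nmiss == 0
      display "all variables kept: " r(N) " complete cases"

      * Complete cases when one variable at a time is left out
      foreach v of local vars {
          local rest : list vars - v
          egen byte nm = rowmiss(`rest')
          quietly count if nm == 0
          display "`v'" _col(15) "left out: " r(N) " complete cases"
          drop nm
      }
      ```

      This looks only one step ahead, so it can miss combinations where dropping two or more variables together pays off.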


      • #4
        That is a very complex search question. Think (don't actually do it, just think) about implementing this as a grid search. You can think of a solution as a string of bits, where the first \(N\) bits represent whether or not to include each observation, and the remaining \(k\) bits whether or not to include each variable. So to do a grid search you would have to go through \(2^{N + k}\) possible solutions, where \(N\) is the number of observations and \(k\) is the number of variables. That number gets ridiculously large ridiculously quickly. So forget about that.

        You could try smarter algorithms, like a genetic algorithm, to search through that solution space. I don't know of such algorithms implemented in Stata, so you would probably have to implement one yourself (probably in Mata). I would not do it. It is not worth it, because this entire exercise is pointless.

        You typically have variables you want/need to include, and that just determines the problem. If you have too many missing values in those variables, then the dataset is not appropriate for your problem. In that case it just makes no sense to continue using that dataset for your problem. If the number of missing values in those variables is manageable, then you're fine. There is just no need to find such an "optimal" set of observations and variables, as the really optimal set is just the set of variables that answer your research question and all the observations within those variables.
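
        To put a number on the grid search mentioned above, here is a small sketch. The values of N and k are purely illustrative assumptions, not taken from any particular dataset.

        ```stata
        * Illustrative sizes only: N observations and k variables
        local N = 500
        local k = 100

        * Number of candidate solutions in the grid search
        display "2^(N + k) = " 2^(`N' + `k')

        * The same quantity expressed as a power of ten
        display "roughly 10^" %4.0f (`N' + `k') * log10(2)
        ```

        Each additional observation or variable doubles the count, which is why exhaustive search is not a realistic option even for small datasets.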
        Last edited by Maarten Buis; 07 Apr 2022, 08:10.
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------


        • #5
          Thanks for your answer. I have a multi-year data set, and I am trying to determine which years to use to get the richest data. You are right: if there is no quick way to do that, then it is not worth it.


          • #6
            That is a very different, and more manageable, problem. Just create the right table and look at it...
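
            One possible version of such a table, as a sketch only: it assumes the data are in long format with a year variable, and it uses the nlswork example data (downloaded by webuse) and an arbitrary variable list as stand-ins for your own.

            ```stata
            * Example panel data from the Stata website; a stand-in for your own data
            webuse nlswork, clear

            * Count of non-missing observations for each variable, by year
            tabstat union tenure hours wks_work, by(year) statistics(n)
            ```

            Years in which the counts collapse for the variables you need stand out immediately. If your data are in wide format, you would need to reshape them to long first.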


            • #7
              Haha, sorry I didn't make this clear. I have twenty years of data from each state, so it is about 100 variables, and the observations are the sales in each state each year from 500 firms. The search question makes a lot of sense, but it does require a lot of work. I guess I will just pick the variables and rows with fewer missing values to get a data set of acceptable size, rather than find an optimal combination.


              • #8
                missings from the Stata Journal offers some handles here.

                Code:
                . search dm0085, entry
                
                Search of official help files, FAQs, Examples, and Stata Journals
                
                SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
                        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                        Q4/20   SJ 20(4):1028--1030
                        sorting has been extended for missings report
                
                SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
                        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                        Q3/17   SJ 17(3):779
                        identify() and sort options have been added
                
                SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
                        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                        Q4/15   SJ 15(4):1174--1185
                        provides command, missings, as a replacement for, and extension
                        of, previous commands nmissing and dropmiss
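
                Once missings is installed (by following the links in the search output above), a first look might go along these lines. This is a sketch only: the nlsw88 data are a stand-in, and nmissing is a variable name chosen here for illustration.

                ```stata
                * Illustration on a dataset shipped with Stata
                sysuse nlsw88, clear

                * Variables that have missing values, with counts and percentages
                missings report, percent

                * Distribution of the number of missing values per observation
                missings table

                * Store the number of missing values per observation in a new variable
                missings tag, generate(nmissing)
                ```

                The report shows which variables cost the most observations, and the tagged variable can be used to keep only observations below a chosen number of missing values.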


                • #9
                  Thank you. It helps.
