  • Is there a way in Stata to drop variables if they are missing in more than a certain percent of the data?

    Hi, I am pretty new to Stata, so please forgive me if this is somewhere in the Help. I tried to find it, but haven't had any luck.

    I have a dataset with over 8000 variables. I need a way to drop variables if they are missing more than x percent of the time in a dataset, or alternatively keep variables that are present more than x percent of the time. So far, I haven't seen a way to drop variables unless I drop the variable if it is ever missing. I ran MDESC in my dataset, and at least 1000 of them have no values whatsoever. I just want to drop the variables that are either 100% empty or empty more than X percent of the time.

    Thanks for your help.

    Rachel Owsley

  • #2
    There may be a shorter way to do this, but this will work:

    Code:
    clear
    * Generate some data
    input x y
    1 1
    1 1
    1 .
    . .
    end
    * x missing 25% of the time, y missing 50% of the time
    * Proportion missing at or above which we'll drop variables
    glo p=0.5
    * Loop over variables
    foreach var of varlist * {
        count if `var'==.
        if (r(N)/_N) >= $p drop `var'    
    }
    Jorge Eduardo Pérez Pérez
    www.jorgeperezperez.com


    • #3
      Jorge Eduardo's code needs generalisation if string variables are present.

      Code:
      foreach var of varlist * {
          qui count if missing(`var')
          if (r(N)/_N) >= $p drop `var'
      }
      I wrote dropmiss to drop variables or observations that are all missing.

      Not supporting what is asked for here was a deliberate personal choice.

      SJ-8-4 dm89_1 . . . . Dropping variables or observations with missing values
      (help dropmiss if installed) . . . . . . . . . . . . . . . N. J. Cox
      Q4/08 SJ 8(4):594
      update in style and content; added a new force option

      STB-60 dm89 . . . . . Dropping variables or observations with missing values
      (help dropmiss if installed) . . . . . . . . . . . . . . . N. J. Cox
      3/01 pp.7--8; STB Reprints Vol 10, pp.44--46
      drops variables or observations with all values (optionally
      any values) missing
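The threshold loop above can also be written without a global macro. What follows is a minimal sketch, not the code of either poster: the toy data and the local macro names `threshold` and `todrop` are assumptions made for illustration. It uses missing() so that string variables and extended missing values (.a to .z) are counted, and it collects the flagged names so that drop runs once after the loop.

```stata
* Sketch only: toy data; threshold and todrop are illustrative names
clear
input x y str1 s
1 1 "a"
1 1 ""
1 . ""
. . ""
end

* Drop a variable if its proportion of missing values is >= threshold
local threshold = 0.5
* Start with an empty list of variables to drop
local todrop

foreach var of varlist * {
    * missing() handles numeric and string variables alike
    quietly count if missing(`var')
    if r(N)/_N >= `threshold' {
        local todrop `todrop' `var'
    }
}

* Drop once, and only if something was flagged
if "`todrop'" != "" {
    drop `todrop'
}
describe, simple
```

Setting the threshold to 1 would restrict the drop to variables that are entirely missing.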


      • #4
        See also nmissing

        SJ-5-4 dm67_3 . . . . . . . . . . Software update for nmissing and npresent
        (help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox
        Q4/05 SJ 5(4):607
        now produces saved results

        SJ-3-4 sg67_2 . . . . . . . . . . Software update for nmissing and npresent
        (help nmissing, npresent if installed) . . . . . . . . . . N. J. Cox
        Q4/03 SJ 3(4):449
        updated to include support for by, options for checking
        string values that contain spaces or periods, documentation
        of extended missing values .a to .z, and improved output

        STB-60 dm67.1 . . . . Enhancements to numbers of missing and present values
        (help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox
        3/01 pp.2--3; STB Reprints Vol 10, pp.7--9
        updated with option for reporting on observations

        STB-49 dm67 . . . . . . . . . . . . . Numbers of missing and present values
        (help nmissing if installed) . . . . . . . . . . . . . . . N. J. Cox
        5/99 pp.7--8; STB Reprints Vol 9, pp.26--27
        commands to list the numbers of missing values and nonmissing
        values in each variable in varlist


        • #5
          Originally posted by Nick Cox

          Hi Nick,

          Sorry for my delayed response and thank you for your reply. The code is very helpful. I looked at dropmiss, nmissing, mdesc, and mvpatterns. I haven't tried dropmiss yet, because normally I wouldn't just drop variables without checking their relationship to the target. Unfortunately, with over 8000 variables, the variable selection techniques I've tried have not worked. Should PCA or stepwise work on such a large dataset with Stata? I haven't had luck with it, but I am pretty new to Stata. I have Stata/MP 13.1 on a dual-core processor. I am loath to just drop them, because some of the variables that are often missing could be significant when they are present, so I could transform them into binaries. Any suggestions for how to proceed would be much appreciated.

          Best Regards,

          Rachel
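The idea of turning missingness into binaries can be sketched as below. This is only an illustration on toy data; the `miss_` prefix is a hypothetical naming choice, and with long variable names the prefixed name could exceed Stata's 32-character limit for variable names.

```stata
* Sketch only: toy data; the miss_ prefix is an illustrative choice
clear
input x y
1 1
1 1
1 .
. .
end

* One 0/1 indicator per variable: 1 if the value is missing, 0 otherwise
foreach var of varlist x y {
    generate byte miss_`var' = missing(`var')
}
list
```

The indicators themselves are never missing, so they can enter a model even when the original variable cannot.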


          • #6
            Originally posted by Jorge Eduardo Perez Perez

            Thank you, Jorge!

            Much appreciated.

            Rachel


            • #7
              Rachel:

              Please change your identifier to "Rachel Owsley" as signed in post #1. Otherwise list etiquette is to request that you use that signature always.

              I don't understand quite what you are asking about whether PCA or stepwise regression will work in your case. Both techniques will ignore missing values if they exist and/or fail with "no observations" if presented with variables that are all missing. Otherwise it is usually better to choose variables according to their pertinence to a research problem; no software can do that for you.


              • #8
                Hi Nick, I will make those changes to the signature going forward. Thank you. As I mentioned, those procedures did not work. I assumed it was because of the number of variables and a memory problem due to the size of the data set. I will try them again. I am not sure I am understanding correctly: are you saying that the entire PCA or stepwise procedure will fail, or that Stata will ignore that variable and run the other variables?


                • #9
                  In any Stata estimation command, including PCA and all the various regressions, observations are included only if they have non-missing values for every variable named in the command's variable list. As a corollary, if some variable has all missing values and you include it in the variable list, there will be no observations in the estimation sample and the command will fail. Stata does not ignore variables that are specified in the command: the only time Stata will omit a variable that is in a command's variable list is if there is collinearity. In that case, it will drop one (or more) of the variables involved.

                  Apart from variables that have all observations with missing values, scattered missing data can result in severe depletion of the set of observations available for the command because an observation will be dropped if any of the variables named in the variable list has a missing value.
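The depletion described above can be checked before running an estimation command. The following is a minimal sketch on toy data; the variable name `nmis` is an assumption for illustration. In this example each variable is missing in only one of five observations, yet only two observations are complete on all three.

```stata
* Sketch only: toy data; nmis is an illustrative name
clear
input x y z
1 1 1
1 . 1
. 1 1
1 1 .
1 1 1
end

* Number of missing values per observation across the candidate variables
egen nmis = rowmiss(x y z)

* Observations with no missing values are the ones an estimation command would use
quietly count if nmis == 0
display "complete cases: " r(N) " of " _N
```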


                  • #10
                    Originally posted by Clyde Schechter

                    Thank you, Clyde. That clears it up for me. Very good point about scattered missing data. It's a problem in a lot of the data I deal with. I was hoping to run the mvpatterns command to find the most common combinations of missing variables and drop those variables, but this dataset was too large for mvpatterns and I got an error stating that. In this case, it may be better to start with the variables that are never missing and, if those are not good enough, add those that are missing 5% of the time, 10% of the time, etc., until I have a set of significant variables. Out of the 8000, only 200 are never missing, and about another 100 are missing 10% of the time or less.
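The stepwise plan described above (start with variables that are never missing, then widen the cutoff) could be sketched as follows. This is an assumption-laden illustration: the toy data, the cutoff values, and the macro names `cutoff` and `kept` are all hypothetical.

```stata
* Sketch only: toy data; cutoff and kept are illustrative names
clear
input a b c d
1 1 1 .
2 2 . .
3 3 3 .
4 . 4 .
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
end

* For each cutoff, list the variables whose proportion missing is <= cutoff
foreach cutoff in 0 0.10 0.20 {
    local kept
    foreach var of varlist * {
        quietly count if missing(`var')
        if r(N)/_N <= `cutoff' {
            local kept `kept' `var'
        }
    }
    display "missing <= " 100*`cutoff' "%: `kept'"
}
```

Each displayed list could then be passed as the variable list to a later command; because of the listwise deletion discussed in #9, it is worth checking the number of complete observations for each list as well.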
