Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Selecting all variales with <n unique categories

    Hi,
    I have a huge dataset and I want to keep only those with <10 unique values. Is it possible to do? I tried a few methods using ds and distinct (ssc), but not getting any result.

    Thanks for your suggestions.

  • #2
    The version of distinct on SSC is no longer current. Please cite it as coming from the Stata Journal. The original 2008 paper makes the case at length that unique is a poor term, and indeed that distinct is a better one. In essence, unique still carries as a main if not leading sense that something occurs once only.

    Regardless, distinct will show the variables concerned, but does not return the list of their names. Perhaps that should be returned as a saved result in some future version, but for now why not just write your own loop?

    I excluded missing values in both instances. The code needs revision if missings are to be counted too.

    Code:
    webuse nlswork, clear 
    
    tempvar work 
    
    quietly foreach v of var * { 
        egen `work' = group(`v') if !missing(`v')
        su `work', meanonly 
        if r(max) <= 9 local wanted `wanted' `v'
        drop `work'
    }
    
    di "`wanted'"
    
    race msp nev_mar collgrad not_smsa c_city south union
    
    distinct, max(9)
    
    ---------------------------------
              |     total   distinct
    ----------+----------------------
         race |     28534          3
          msp |     28518          2
      nev_mar |     28518          2
     collgrad |     28534          2
     not_smsa |     28526          2
       c_city |     28526          2
        south |     28526          2
        union |     19238          2
    ---------------------------------

    Comment


    • #3
      For a large dataset (whatever that means), egen followed by summarize might be slower than tabuate

      Code:
      foreach var of varlist _all {
          
          capture tabulate `var'  if !mi(`var')
          if (_rc == 134) continue  // too many values; thus, more than 9
          else if ( _rc ) error _rc // unexpected error
          
          if (r(r) < 10) local wanted `wanted' `var'
          
      }

      Comment


      • #4
        Originally posted by Nick Cox View Post
        The version of distinct on SSC is no longer current. Please cite it as coming from the Stata Journal. The original 2008 paper makes the case at length that unique is a poor term, and indeed that distinct is a better one. In essence, unique still carries as a main if not leading sense that something occurs once only.

        Regardless, distinct will show the variables concerned, but does not return the list of their names. Perhaps that should be returned as a saved result in some future version, but for now why not just write your own loop?

        I excluded missing values in both instances. The code needs revision if missings are to be counted too.

        Code:
        webuse nlswork, clear
        
        tempvar work
        
        quietly foreach v of var * {
        egen `work' = group(`v') if !missing(`v')
        su `work', meanonly
        if r(max) <= 9 local wanted `wanted' `v'
        drop `work'
        }
        
        di "`wanted'"
        
        race msp nev_mar collgrad not_smsa c_city south union
        
        distinct, max(9)
        
        ---------------------------------
        | total distinct
        ----------+----------------------
        race | 28534 3
        msp | 28518 2
        nev_mar | 28518 2
        collgrad | 28534 2
        not_smsa | 28526 2
        c_city | 28526 2
        south | 28526 2
        union | 19238 2
        ---------------------------------
        Thanks so much, Nick! I want to be able to keep or drop the variables that matches the conditions, so I am stuck again at this step:

        Code:
        keep "`wanted'"

        Comment


        • #5
          [QUOTE=daniel klein;n1744543]For a large dataset (whatever that means), egen followed by summarize might be slower than tabuate

          Code:
          foreach var of varlist _all {
          
          capture tabulate `var' if !mi(`var')
          if (_rc == 134) continue // too many values; thus, more than 9
          else if ( _rc ) error _rc // unexpected error
          
          if (r(r) < 10) local wanted `wanted' `var'
          
          }
          [/QUOTE


          Thank you, Daniel! I was able to access the wanted list from local:

          foreach var of local wanted {
          tabulate `var'
          }

          Comment


          • #6
            No reason to be stuck. The


            Code:
            keep `wanted' 
            is what you want. keep expects that it might see a list of variable names, but the "" impede that interpretation.

            Comment

            Working...
            X