
  • How to see if a variable is categorical?

    Hi
    I often have to do a large amount of exploratory analysis.
    So I would like to identify categorical variables in my code (Stata or Mata) in order to set up the correct analysis for each variable.
    One way could be to check whether a value label is attached to the variable.

    But I can't find any examples of that.

    So I looked into the code behind codebook and have defined:

    Code:
    capture program drop variabel_has_a_label_value
    program define variabel_has_a_label_value, rclass
        /*Returns r(has_a_label_value) = 1 if the variable passed as argument "variabel_name" has a value label attached.
        Otherwise it returns 0.
        
        Examples:
        . variabel_has_a_label_value age
        . return list
        scalars:
          r(has_a_label_value) =  0
    
        . variabel_has_a_label_value sex
        . return list
        scalars:
          r(has_a_label_value) =  1
        */
        args variabel_name
        local label_name :value label `variabel_name'    
        if ("`label_name'" != ""){
            return scalar has_a_label_value = 1
        }
        else{
            return scalar has_a_label_value = 0
        }
    end
    But is this the best way? And am I missing something crucial? (I'm fairly new to Stata)

    Kind regards

    nhb

  • #2
    tabulate and tab1 store r(r), which is the number of rows (that is, the number of distinct values):
    Code:
    . sysuse auto.dta
    
    . quietly tab1 rep78
    . display r(r)
    5
    
    . quietly tab1 rep78 , missing
    . display r(r)
    6
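
    A minimal sketch of applying the same idea to several variables at once, assuming the auto data and an arbitrary choice of three variables. It only reports the count of distinct nonmissing values; it does not by itself decide whether a variable is categorical.

    ```stata
    sysuse auto, clear
    foreach v of varlist rep78 foreign price {
        * tabulate leaves the number of rows (distinct values) in r(r)
        quietly tabulate `v'
        display "`v': " r(r) " distinct nonmissing values"
    }
    ```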



    • #3
      But this way I can only see the number of distinct values or, if I combine it with _N, whether the number of distinct values is lower than the total number of observations.
      A continuous variable might have duplicate values, and what then?

      I would be interested in doing e.g. chi-square tests on categorical variables and regression on continuous variables.
      Kind regards

      nhb



      • #4
        If you just want to determine whether something is categorical or not, please consider the following example:

        Code:
        sysuse auto
        tab price
        tab foreign
        You clearly see that foreign is categorical and price should be continuous.

        Please correct me if I did not understand your question correctly.

        PS: I am not aware of a method to do this automatically, though. The problem is just as you stated: if a value occurs multiple times in your observations, the variable can be either categorical or continuous. It is i.m.h.o. a matter of context and interpretation. I think you should classify the variables manually as continuous or not.
        Last edited by bsc.j.j.w; 06 Aug 2014, 03:12.



        • #5
          Well, I don't sympathize much with the idea of automatically testing "everything". But anyway: after compress, you can make a list of byte variables (integers from -127 to 100). These may be (but need not be) categorical (possibly meaning: being numerical but not representing a quantity).

          Code:
          webuse lbw.dta, clear
          compress
          local bytelist ""
          foreach V of varlist _all {
             local type : type `V'
             if "`type'"=="byte" {
                local bytelist `bytelist' `V'
             }
          }
          display "`bytelist'"
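
          As a possible shortcut, the same list can be sketched with ds and its has(type ...) option, which leaves the matching names in r(varlist). This assumes the same lbw data as above; as before, byte storage is only a hint that a variable might be categorical.

          ```stata
          webuse lbw.dta, clear
          compress
          * list variables stored as byte; the names are returned in r(varlist)
          ds, has(type byte)
          local bytelist `r(varlist)'
          display "`bytelist'"
          ```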



          • #6
            Code:
            ds, has(vallabel)
            returns the names of variables with a value label. findname (SJ) was written to do what ds does, and more, and with a better syntax.

            There is an intersecting question of what is a categorical variable. I note merely that Stata has no such notion. If anyone comes up with a different working definition of categorical variables, there will usually be a Stata way to find them. As Svend notes, there are ways of finding the number of distinct values (with the idea tacit here that categorical variables will often have few such values).
            Last edited by Nick Cox; 06 Aug 2014, 03:56.
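
            To illustrate how one working definition could be turned into code, here is a rough sketch that treats a variable as a candidate if it has a value label or has few distinct values. The cutoff maxlevels and the macro names labelled, fewvalues and candidates are hypothetical choices for this example, not anything Stata defines.

            ```stata
            sysuse auto, clear
            local maxlevels 10                  // hypothetical cutoff, adjust to taste

            * variables with a value label attached
            quietly ds, has(vallabel)
            local labelled `r(varlist)'

            * numeric variables with at most `maxlevels' distinct nonmissing values
            local fewvalues
            quietly ds, has(type numeric)
            foreach v of varlist `r(varlist)' {
                quietly tabulate `v'
                if r(r) <= `maxlevels' {
                    local fewvalues `fewvalues' `v'
                }
            }

            * union of the two lists
            local candidates : list labelled | fewvalues
            display "candidates: `candidates'"
            ```

            A measured variable with few distinct values would also be flagged by such a rule, so the result is a list to review, not a classification.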



            • #7
              If you simply want to test whether a certain variable has a value label (as opposed to Nick's solution, which uses -ds- or -findname- to check which variables in the data set have value labels attached), you can use the extended macro function value label:
              Code:
              sysuse auto , clear
              foreach var in foreign price {
                  if (!missing(`"`: value label `var''"')) {
                      display "variable `var' has a value label"
                      *... do stuff for variables with value label here
                  }
                  else {
                      display "variable `var' does not have a value label"
                      *... do stuff for variables without value label here
                  }
              }
              See the help for extended macro functions (help extended_fcn) for details.

              Regards
              Bela



              • #8
                Daniel's code may help programmers, but if you are interested in a particular variable, describe will tell you if it has a value label. Also,

                Code:
                 
                ds, has(vallabel)
                can be applied to one or more variable names.
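
                For instance, a small sketch restricted to two variables of the auto data (the choice of foreign and price and the macro names are just for illustration). The complementary not() option picks out the variables without a value label.

                ```stata
                sysuse auto, clear
                * among foreign and price, which have a value label?
                ds foreign price, has(vallabel)
                local labelled `r(varlist)'
                * and which do not?
                ds foreign price, not(vallabel)
                local unlabelled `r(varlist)'
                display "labelled: `labelled'"
                display "unlabelled: `unlabelled'"
                ```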



                • #9
                  Hi All
                  Thank you for your replies.

                  Nick has the key point: Stata has no notion of categorical variables, although the concept is widely used in statistics.
                  So the next best thing for me is data discipline and defining categorical variables as those variables with value labels.
                  It is good to see that Stata does handle this with -ds-.
                  However, for my purposes (which is programming) I like Daniel's solution best:
                  Code:
                  !missing(`"`: value label `var''"')

                  As for Svend, I agree to a certain point.
                  But I do not think I get any wiser just by doing the same bunch of tests manually.
                  By setting out the outputs systematically in tables you might find patterns you would otherwise not see.
                  Also, I like to present the results of my analysis as close as possible to the final tables in the articles.
                  This is relatively easy to do with e.g. the Excel tools in Stata.
                  Then one just has to copy/paste the table into the article and make some minor alterations.
                  Furthermore, the tables in articles are well documented by a do-file this way, and easily reproduced if necessary.


                  Kind regards

                  nhb
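
                  A minimal sketch of how the value-label test could drive the choice of analysis described in this thread (chi-square tests for categorical variables, regression for continuous ones). The variables crossed or regressed against (rep78 and mpg from the auto data) are placeholders chosen only to make the example run.

                  ```stata
                  sysuse auto, clear
                  foreach var of varlist foreign price {
                      if (!missing(`"`: value label `var''"')) {
                          * value label attached: treat as categorical, cross-tabulate with a chi-square test
                          tabulate `var' rep78, chi2
                      }
                      else {
                          * no value label: treat as continuous, fit a linear regression
                          regress `var' mpg
                      }
                  }
                  ```

                  With this convention, the classification is only as good as the labelling discipline in the data set: a categorical variable without a value label ends up in the regression branch.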

