  • Checking whether a variable is continuous or categorical

    Hello Statalisters,

    I wish to obtain an enhanced version of the dataset provided by the - describe, replace - command. In this enhanced dataset, I want to include whether the variable is continuous or categorical.

    I know these adjectives mean nothing to Stata, which is why I made a set of assumptions to define whether a variable is continuous or categorical.

    A variable should be continuous if and only if:
    - It has more than 20 categories
    - It is not a string
    - There is no constant step between its categories, whatever the size of the step

    So now, suppose I have the following dataset:

    Code:
    sysuse auto, clear
    describe, replace
    I would like one more binary variable called "cat_var", equal to 1 if the corresponding variable in the original dataset meets the 3 conditions above.

    I have no idea whether what I'm asking is feasible or not, but in any case, I'd appreciate any lead on this matter! If you have other suggestions of criteria to better identify continuous / categorical variables (even if it will always be more or less imperfect), please feel free to share.
    Last edited by Adam Sadi; 15 Sep 2023, 06:24.

  • #2
    Adam:
    while the first two requirements seem manageable (if a variable is stored as a string, any comparison with numerical values fails with a type mismatch):
    Code:
    . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
    (1978 automobile data)
    
    . g cat_var=1 if rep78<20
    (5 missing values generated)
    
    . g cat_var2=1 if make <20
    type mismatch
    r(109);
    
    .
    the third point sounds a bit more complicated to me, as you might have a constant scale in a continuous variable (e.g., height measured in cm).
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Carlo : Thank you for your insights!

      As for my third point, sorry for being unclear. To give more details, I'm working on data obtained from surveys in which the respondents must declare numerical values within a certain range (expressed in euros). There are about 10,000 respondents.

      What I mean by "no constant scale between categories of a variable" is that the difference between consecutive values of a variable must not always be the same, i.e., for the variable height, that not every possible height is represented in my dataset.

      This could be a risky definition, as you mention. However, I'm working with money data expressed in euros, so it's unlikely that every possible value between 0 and 999,999,999 is represented in my dataset. This could therefore be a good way to separate the continuous variables from the categorical ones, whose values are arbitrary and ordered on a constant scale (1, 2, 3, 4).

      In fact, while writing this paragraph, I'm thinking that an additional criterion to identify categorical variables would be "no category should be greater than 2000", given that the only continuous variables I'm working with are years and money.
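
      A minimal sketch of that extra screen, assuming the threshold of 2000 mentioned above and using auto.dta only as a stand-in for the survey data:

      ```stata
      sysuse auto, clear

      foreach var of varlist _all {
          // strings are skipped; only numeric variables have a maximum
          capture confirm numeric variable `var'
          if (_rc) continue

          // summarize ignores missing values, so r(max) is the largest observed value
          summarize `var', meanonly
          if (r(max) > 2000) display as res "`var'" as txt " exceeds 2000, so not categorical under this rule"
      }
      ```

      On auto.dta this flags price and weight; it is a rough first screen, not a reliable classifier.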

      I don't know if I make sense or not but thank you anyways for this lead!
      Last edited by Adam Sadi; 15 Sep 2023, 07:24.


      • #4
        I do not want to comment on the criteria much. Sooner or later you will misclassify some variables. One obvious criterion for continuous variables might be non-integer values. Relying on what has and what has not been observed in a specific sample is probably not a good idea. Also, if you already know all the variables, why use a data-driven approach in the first place?
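
        The non-integer criterion could be checked along these lines. This is only a sketch on auto.dta; it treats any nonmissing value with a fractional part as evidence of a continuous variable:

        ```stata
        sysuse auto, clear

        foreach var of varlist _all {
            // strings cannot be continuous under the criteria in #1
            capture confirm numeric variable `var'
            if (_rc) continue

            // assert sets a nonzero _rc if any nonmissing value has a fractional part
            capture assert `var' == floor(`var') if !missing(`var')
            if (_rc) display as res "`var'" as txt " has non-integer values"
        }
        ```

        On auto.dta this reports headroom and gear_ratio. Integer-valued variables such as price are not caught, so this is a sufficient rather than a necessary condition.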

        Anyway, this should implement your original criteria:*

        Code:
        program is_continuous , sortpreserve
            
            version 17
            
            syntax varname
            
            confirm numeric variable `varlist'
            
            capture tabulate `varlist'
            if (_rc != 134) { // r(134) means too many values to tabulate, so more than 20
                
                if (r(r) <= 20) { // the criterion is more than 20 distinct values
                    display as err "too few distinct values"
                    exit 7
                }
                
            }
            
            tempvar delta
            
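            // sort so that neighbouring observations hold neighbouring values
            sort `varlist'
            // difference to the previous value; missing for the first observation
            quietly generate double `delta' = `varlist'-`varlist'[_n-1]
            // zero differences (ties) are excluded; missing differences are ignored
            summarize `delta' if `delta' , meanonly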
            
            if (r(min) == r(max)) {
                display as err "constant delta"
                exit 7
            }
            
        end
        Running this on auto.dta, we get:

        Code:
        sysuse auto 
        
        foreach var of varlist _all {
            
            display as res "`var'"
            capture noisily is_continuous `var'
            
        }
        Code:
        make
        'make' found where numeric variable expected
        price
        mpg
        rep78
        too few distinct values
        headroom
        too few distinct values
        trunk
        too few distinct values
        weight
        length
        turn
        too few distinct values
        displacement
        gear_ratio
        foreign
        too few distinct values

        * Note that you switch from defining a continuous variable to defining a categorical variable. What happens if a variable has values larger than 2,000 but those values are equally spaced, such as years from 2001 to 2023?
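
        To get back to the dataset requested in #1, the return code of is_continuous could be collected for each variable and merged onto the result of describe, replace. The following is a sketch that assumes the program above has already been defined; the names `flags`, `results`, `ok`, and `continuous` are made up for illustration:

        ```stata
        sysuse auto, clear

        tempname flags
        tempfile results

        // one row per variable: its name and a 0/1 flag
        postfile `flags' str32 name byte continuous using `results'

        foreach var of varlist _all {
            // is_continuous exits with a nonzero return code when a criterion fails
            capture is_continuous `var'
            local ok = (_rc == 0)
            post `flags' ("`var'") (`ok')
        }

        postclose `flags'

        // describe, replace swaps the data in memory for one row per variable
        describe, replace clear
        merge 1:1 name using `results', nogenerate
        list name type continuous
        ```

        The flag is 1 where all three criteria for a continuous variable hold, so a "cat_var" in the sense of a categorical indicator would be its complement.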


        • #5
          Daniel : Thank you very much for this code!

          My ultimate purpose was outside the scope of this thread. I want to compare several datasets over time, in particular changes in categories for certain variables over time.

          Therefore I wanted to identify continuous variables in order to remove them and focus only on categorical variables.

          * Note that you switch from defining a continuous variable to defining a categorical variable. What happens if a variable has values larger than 2,000 but those values are equally spaced, such as years from 2001 to 2023?
          There is no instance of such a case, because all the year variables I use have a minimum of 20 categories.

          Of course there will be mistakes I will correct later on, but this is a first screening.


          • #6
            This tells you if the increment is always 1.
            Code:
            sysuse auto, clear
            foreach var in price mpg rep78 headroom trunk {
                egen base_`var' = min(`var')
                g spread_`var' = (`var' - (base_`var'-1))/`var'
            }
            summ spread*
            Added: this doesn't work if a value is 0 or if negative values are used to indicate missing. You might just add if `var'>0. This will exclude the 0, but you'll still get the 1 increment. You might also think about restricting the top end with an if condition in case there is a value that sticks out.

            a catvar has a min == max
            Last edited by George Ford; 15 Sep 2023, 15:26.


            • #7
              Originally posted by George Ford
              This tells you if the increment is always 1.
              Code:
              (code omitted)
              egen base_`var' = min(`var')
              g spread_`var' = (`var' - (base_`var'-1))/`var'
              (code omitted)
              summ spread*
              The code fails whenever the minimum is not 1; here is an example:

              Code:
              . clear
              
              . set obs 5
              Number of observations (_N) was 0, now 5.
              
              . generate foo = _n+42
              
              . list
              
                   +-----+
                   | foo |
                   |-----|
                1. |  43 |
                2. |  44 |
                3. |  45 |
                4. |  46 |
                5. |  47 |
                   +-----+
              
              . egen base_foo = min(foo)
              
              . g spread_foo = (foo - (base_foo-1))/foo
              
              . summ spread
              
                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                spread_foo |          5    .0657433    .0328605   .0232558    .106383

              I have shown in #4 how to check for constant differences (although the code might not be very efficient). If you cared about the size of the constant difference, you could simply assert that r(min) and r(max) equal the size, e.g., 1.
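
              As a sketch of that last suggestion, using the example data from this post and a plain variable named delta (a name chosen here for illustration):

              ```stata
              clear
              set obs 5
              generate foo = _n + 42

              sort foo
              // difference to the previous value; missing for the first observation
              generate double delta = foo - foo[_n-1]
              // zero differences (ties) are excluded, as in #4
              summarize delta if delta, meanonly

              // passes silently only if every nonzero difference equals 1
              assert r(min) == 1 & r(max) == 1
              ```

              Unlike the spread approach, this does not depend on the minimum of the variable.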
