Checking whether a variable is continuous or categorical

Adam Sadi

Join Date: Jul 2022

Posts: 68
#1


15 Sep 2023, 06:14

Hello Statalisters,

I would like to obtain an enhanced version of the dataset produced by the -describe, replace- command. In this enhanced dataset, I want to record whether each variable is continuous or categorical.

I know these adjectives mean nothing to Stata, which is why I made a set of assumptions to define whether a variable is continuous or categorical.

A variable should be continuous if and only if:
- It has more than 20 categories
- It is not a string
- There is no constant scale between its categories, no matter the scale

So now, suppose I have the following dataset:

Code:

sysuse auto, clear
describe, replace

I would like one more binary variable called "cat_var" equal to 1 if the corresponding pre-describe variable satisfies the three conditions above.

I have no idea whether what I'm asking is feasible, but in any case I'd appreciate any lead on this matter! If you have other suggestions of criteria to better identify continuous / categorical variables (even if they will always be more or less imperfect), please feel free to share.

Last edited by Adam Sadi; 15 Sep 2023, 06:24.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

15 Sep 2023, 06:50

Adam:
while the first two requirements seem manageable (if a variable is in -string- format, any comparison with numerical values will fail):

Code:

. use "C:\Program Files\Stata17\ado\base\a\auto.dta"
(1978 automobile data)

. g cat_var=1 if rep78<20
(5 missing values generated)

. g cat_var2=1 if make <20
type mismatch
r(109);

.

the third point sounds a bit more complicated to me, as you might have a constant scale in a continuous variable (e.g., height measured in cm).

Kind regards,
Carlo
(Stata 19.0)
Adam Sadi

Join Date: Jul 2022

Posts: 68
#3

15 Sep 2023, 07:20

Carlo : Thank you for your insights!

As for my third point, sorry for being unclear. To give more details, I'm working on survey data in which the respondents must declare numerical values within a certain range (expressed in euros). There are about 10,000 respondents.

What I mean by "no constant scale between categories of a variable" is that the difference between consecutive distinct values of a variable must not be the same throughout; e.g., for the variable height, not every possible height is represented in my dataset.

This could be a risky definition, as you mention. However, I'm working with money data expressed in euros, so I think it's unlikely that every possible value between 0 and 999,999,999 euros is represented in my dataset. This could therefore be a good way to distinguish the continuous variables from the categorical ones, whose values are arbitrary and ordered on a constant scale (1, 2, 3, 4).

In fact, while writing this paragraph, I'm thinking that an additional criterion to identify categorical variables would be "no category should be greater than 2000", given that the only continuous variables I'm working with are years and money.

I don't know if I make sense or not but thank you anyways for this lead!
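
As a rough sketch of the extra criterion above, the maximum of each numeric variable could be compared with 2000. The threshold comes from the post; running it on auto.dta is only an assumption for illustration. The check on r(N) is there because a missing r(max) would otherwise count as larger than 2000.

Code:

sysuse auto, clear

foreach var of varlist _all {
    
    // skip string variables
    capture confirm numeric variable `var'
    if (_rc) continue
    
    // meanonly is enough to get r(N) and r(max)
    summarize `var' , meanonly
    if (r(N) > 0 & r(max) > 2000) {
        display as res "`var'" as txt ": maximum above 2000"
    }
    
}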

Last edited by Adam Sadi; 15 Sep 2023, 07:24.

daniel klein

Join Date: Mar 2014
Posts: 3849
#4

15 Sep 2023, 07:42

I do not want to comment on the criteria much. Sooner or later you will misclassify some variables. One obvious criterion for continuous variables might be non-integer values. Relying on what has and what has not been observed in a specific sample is probably not a good idea. Also, if you already know all the variables, why use a data-driven approach in the first place?
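
The non-integer criterion could be checked along these lines. This is only a sketch: it assumes that missing values should be ignored and uses auto.dta as an example. In auto.dta it should flag headroom and gear_ratio, which hold fractional values.

Code:

sysuse auto, clear

foreach var of varlist _all {
    
    // skip string variables
    capture confirm numeric variable `var'
    if (_rc) continue
    
    // assert exits with r(9) if any nonmissing value is not an integer
    capture assert `var' == floor(`var') if !missing(`var')
    if (_rc == 9) {
        display as res "`var'" as txt " has non-integer values"
    }
    
}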

Anyway, this should implement your original criteria:*

Code:

program is_continuous , sortpreserve
    
    version 17
    
    syntax varname
    
    confirm numeric variable `varlist'
    
    capture tabulate `varlist'
    if (_rc != 134) { // r(134) means too many values to tabulate, so more than 20
        
        if (r(r) <= 20) { // the criterion is "more than 20", so exactly 20 fails
            display as err "too few distinct values"
            exit 7
        }
        
    }
    
    tempvar delta
    
    sort `varlist'
    quietly generate double `delta' = `varlist'-`varlist'[_n-1]
    summarize `delta' if `delta' , meanonly
    
    if (r(min) == r(max)) {
        display as err "constant delta"
        exit 7
    }
    
end

Running this on auto.dta, we get:

Code:

sysuse auto 

foreach var of varlist _all {
    
    display as res "`var'"
    capture noisily is_continuous `var'
    
}

Code:

make
'make' found where numeric variable expected
price
mpg
rep78
too few distinct values
headroom
too few distinct values
trunk
too few distinct values
weight
length
turn
too few distinct values
displacement
gear_ratio
foreign
too few distinct values

* Note that you switch from defining a continuous variable to defining a categorical variable. What happens if a variable has values larger than 2,000 but those values are equally spaced, such as years from 2001 to 2023?
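
A possible way to get from the program above to the enhanced -describe, replace- dataset asked for in #1 is sketched below. It assumes is_continuous has been defined as above; the variable name continuous and the local contvars are made up for the illustration.

Code:

sysuse auto, clear

// collect the names of the variables that pass all checks
local contvars
foreach var of varlist _all {
    
    capture is_continuous `var'
    if (_rc == 0) local contvars `contvars' `var'
    
}

// replace the data in memory with the describe dataset
describe , replace clear

// 1 if the described variable passed the checks, 0 otherwise
generate byte continuous = 0
foreach var of local contvars {
    
    quietly replace continuous = 1 if name == "`var'"
    
}

list name type continuous , noobs

String variables end up with 0 here because is_continuous exits with an error for them, so they are treated like any other variable that fails a check.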


Adam Sadi

Join Date: Jul 2022

Posts: 68
#5

15 Sep 2023, 08:06

Daniel : Thank you very much for this code!

My ultimate purpose was outside the scope of this thread. I want to compare several datasets over time, in particular changes in categories for certain variables over time.

Therefore I wanted to identify continuous variables in order to remove them and focus only on categorical variables.

* Note that you switch from defining a continuous variable to defining a categorical variable. What happens if a variable has values larger than 2,000 but those values are equally spaced, such as years from 2001 to 2023?

There is no such case, because all the year variables I use have a minimum of 20 categories.

Of course there will be mistakes I will correct later on, but this is a first screening.
George Ford

Join Date: Aug 2014

Posts: 3148
#6

15 Sep 2023, 15:13

This tells you if the increment is always 1.

Code:

sysuse auto, clear
foreach var in price mpg rep78 headroom trunk {
    egen base_`var' = min(`var')
    g spread_`var' = (`var' - (base_`var'-1))/`var'
}
summ spread*

Added: this doesn't work if the variable contains 0 or if there are negative values indicating missing. You might just add if `var'>0. This will exclude the 0, but you'll still get the increment of 1. You might also think about adding an if condition on the top end somehow, in case there is a value that sticks out.

a catvar has a min == max

Last edited by George Ford; 15 Sep 2023, 15:26.

daniel klein

Join Date: Mar 2014
Posts: 3849
#7

15 Sep 2023, 16:16

Originally posted by George Ford

This tells you if the increment is always 1.

Code:

(code omitted)
egen base_`var' = min(`var')
g spread_`var' = (`var' - (base_`var'-1))/`var'
(code omitted)
summ spread*

The code fails whenever the minimum is not 1; here is an example:

Code:

. clear

. set obs 5
Number of observations (_N) was 0, now 5.

. generate foo = _n+42

. list

     +-----+
     | foo |
     |-----|
  1. |  43 |
  2. |  44 |
  3. |  45 |
  4. |  46 |
  5. |  47 |
     +-----+

. egen base_foo = min(foo)

. g spread_foo = (foo - (base_foo-1))/foo

. summ spread

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  spread_foo |          5    .0657433    .0328605   .0232558    .106383

I have shown in #4 how to check for constant differences (although the code might not be very efficient). If you cared about the size of the constant difference, you could simply assert that r(min) and r(max) equal the size, e.g., 1.
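
As a minimal sketch of that last remark, reusing the foo example from this post: the variable name delta is chosen only for illustration, and the assumption is that the constant difference of interest is 1.

Code:

clear
set obs 5
generate foo = _n+42

// differences between consecutive sorted values; the first one is missing
sort foo
generate double delta = foo-foo[_n-1]

// ignore ties (delta == 0) and the missing first difference
summarize delta if delta != 0 & !missing(delta) , meanonly

// passes only if every nonzero difference equals 1
assert r(min) == 1 & r(max) == 1

For foo = 43, ..., 47 every difference is 1, so the assert goes through even though the minimum is not 1.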
