Annoyingly coded ordinal independent variables

Richard Williams

Join Date: Apr 2014

Posts: 4946
#1

12 Mar 2016, 07:05

Surveys often include questions with options like "daily," "once a week," "a few times a month," ... "once a year," "never," or something like that. I understand why Qs are worded that way, but I find them annoying to deal with. The coding clearly isn't continuous or even roughly continuous. But I hate to just break the variable up into a bunch of dummies -- you get a lot of variables that way and you lose the fact that the categories are ordered.

What I often suggest doing is treating the variable as categorical, then treating it as continuous, and then testing whether the continuous version fits significantly worse, i.e. whether it is OK to treat the variable as continuous.
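
A minimal sketch of one way to run such a check in a single model, assuming a binary outcome. The nhanes2f example data and the variables diabetes and agegrp are used only for illustration. The linear term and the dummies go into the same model, Stata should drop one dummy as collinear, and a joint Wald test of the remaining dummies asks whether the categories depart from even spacing. It is worth confirming in the output that exactly one dummy was omitted and that the test has (number of categories - 2) degrees of freedom.

```stata
* Illustrative data only: nhanes2f is a Stata example dataset,
* and agegrp is an ordinal variable with 6 categories
webuse nhanes2f, clear

* Linear term first, then the dummies; one dummy should be omitted as collinear
logit diabetes c.agegrp i.agegrp, nolog

* Joint Wald test of the dummies that remain, i.e. of departure from linearity
testparm i.agegrp
```

A likelihood-ratio comparison of two separately fitted models (c.agegrp versus i.agegrp) asks the same question and will usually give a similar answer.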

I am curious what other people do. I suspect a lot of times people just treat the variable as continuous. But are there other guidelines or suggestions on how to proceed?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Nick Cox

Join Date: Mar 2014

Posts: 35438
#2

12 Mar 2016, 07:10

What would treating such a variable as "continuous" mean?
Richard Williams

Join Date: Apr 2014

Posts: 4946
#3

12 Mar 2016, 07:18

Just treat the variable as though it were continuous, i.e. as though the categories were evenly spaced. That often can work OK if, say, the categories range from strongly disagree to strongly agree. But it is highly dubious with odd spacing like this.

Nick Cox

Join Date: Mar 2014

Posts: 35438
#4

12 Mar 2016, 07:34

Continuous doesn't mean discrete and ordered.... It's been a while since your last physics lesson, and longer since my last, but something like height, temperature or pressure is in my idea of a continuous variable.

Your examples are not even unambiguously ordered: "once a week" could be less than "a few times a month" for several interpretations of "few".

Last edited by Nick Cox; 12 Mar 2016, 07:40.
ben earnhart

Join Date: May 2014

Posts: 1027
#5

12 Mar 2016, 08:09

When a scale presents something like that, coding the midpoint of each category as a frequency per day, e.g.:

"daily"=1
"once a week"=1/7
"a few times a month"=(1/7)*(3/4)
"once a year"=1/365.25
"never"=0

is a common, if not necessarily correct, way.
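
A minimal sketch of that recoding, assuming the survey item is stored as a numeric variable coded 1 to 5 in the order listed above. The names freq, freqlbl and perday are hypothetical, and the five-row dataset is typed in only so the example runs on its own.

```stata
* Hypothetical toy data: one row per response category
clear
input byte freq
1
2
3
4
5
end
label define freqlbl 1 "daily" 2 "once a week" 3 "a few times a month" ///
    4 "once a year" 5 "never"
label values freq freqlbl

* Map each category to an approximate number of occurrences per day
generate double perday = .
replace perday = 1             if freq == 1
replace perday = 1/7           if freq == 2
replace perday = (1/7)*(3/4)   if freq == 3
replace perday = 1/365.25      if freq == 4
replace perday = 0             if freq == 5

list freq perday, noobs
```

The new variable can then enter a model as c.perday. Because the values span several orders of magnitude, the daily category will dominate the scale, which is one reason this coding is not necessarily correct.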
Richard Williams

Join Date: Apr 2014

Posts: 4946
#6

12 Mar 2016, 08:36

Sure, it isn't continuous (and the Qs are probably worded a bit better than my phrasing from memory). But the question is, how bad is it to treat it as continuous? This article argues that variations in spacing often don't matter that much:

http://support.sas.com/resources/pap...9/248-2009.pdf

Cordula Kiel

Join Date: Feb 2016

Posts: 45
#7

12 Mar 2016, 08:52

Hello,

I also have a question related to the same topic, as I am currently wondering how to correctly include my ordinally scaled independent variables in a discrete choice model.
I have already completed the survey and, as I am a beginner, I did not properly think about how to deal with the variables afterwards.
For example, I have a variable for the educational level, where it is clear which level is "higher" than the other, so I would expect this to be ordinally scaled and not nominally scaled.
Is it possible in some way to use this variable without converting it into several dummy variables?

Another example: I asked for the household size, but with the categories 1, 2, 3, 4 and "5 or more" household members, which now seems quite stupid to me because I don't know how to deal with that "5 or more" problem, as this is no longer a ratio or interval scale.

Has anybody any idea how to deal with that correctly while avoiding a lot of dummy variables?

Thanks a lot in advance!

(And sorry if it is not okay to jump on that thread with my own question, I just saw it and thought it might be better to add it here than to start a separate topic.)
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#8

12 Mar 2016, 10:08

Good questions from Richard and Cordula. Much of it depends on what is the norm in your field. In political science, our surveys often contain response options like those described by Richard. We typically treat them as continuous/interval level data (no dummies), but it is always good to check if there are significant variations from a linear effect by running with dummies.

The problem with the type of data that Cordula describes is that there are typically one or two very extreme values (people who respond with 25 household members). What do you do with those cases? Typically, they are grouped with lower values ("or more"), or respondents are not given the opportunity to give extreme responses (given the "or more" option in the survey). This is usually done b/c of the fear that outliers might unduly affect the relationship.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#9

12 Mar 2016, 11:08

One way of getting a single effect for such an ordinal variable without imposing an arbitrary scale would be to use sheaf coefficients, which estimate the scale such that it maximizes the linear effect of the scaled variable. This is implemented in Stata in the sheafcoef package available from SSC.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Cordula Kiel

Join Date: Feb 2016

Posts: 45
#10

12 Mar 2016, 13:42

Thanks for that hint! =)
Richard Williams

Join Date: Apr 2014

Posts: 4946
#11

13 Mar 2016, 11:12

Thanks for the comments and suggestions. Ben, I have often done what you suggest, although a final open-ended category that can run off to infinity can be a pain. Maarten, I will try sheafcoef. Right now I can't find it but maybe SSC is down for maintenance.

Even though the intervals seem to differ in their length, it wouldn't surprise me if such Qs often behave like a Likert scale that runs from strongly agree to strongly disagree. For one thing, I suspect many people only have a rough feel for the true value; and I suspect the difference between doing something 15 times a year and 17 times a year doesn't matter much. So, the categories are more like an intensity measure, rather than a precise measure of the activity in question.

William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

13 Mar 2016, 11:47

While search sheafcoef returns no hits, ssc install sheafcoef will work. I think the SSC index may have been munged today; I had similar problems on either this package or a different package earlier today.
Richard Williams

Join Date: Apr 2014

Posts: 4946
#13

13 Mar 2016, 12:14

Thanks for the tip, William, it works. It looks to me like the program predates factor variables, so we have to compute dummies and interactions the old-fashioned way?


Richard Williams

Join Date: Apr 2014
Posts: 4946

#14

03 May 2016, 07:39

I wanted to revive this thread because I finally got around to trying Maarten's sheafcoef idea. Hopefully he can tell me if this is a good example of what he had in mind! In this example, the LR test indicates that the ordinal variable agegrp should not be treated as continuous (although BIC favors the continuous specification, which suggests that it wouldn't be so bad to do so). I therefore use Maarten's sheafcoef command, which still lets me estimate a single effect for the underlying latent age variable while not requiring that the observed agegrp variable be considered continuous. As far as I know the sheafcoef command does not work with factor variables, so we have to compute dummies on our own. Just based on this one example, the sheafcoef approach strikes me as being rather appealing, at least in instances where treating the ordinal variable as continuous is highly questionable.

Code:

. webuse nhanes2f, clear

. quietly logit diabetes c.agegrp, nolog

. est store m1

. quietly logit diabetes i.agegrp, nolog

. est store m2

. lrtest m1 m2, stats

Likelihood-ratio test                                 LR chi2(4)  =     10.19
(Assumption: m1 nested in m2)                         Prob > chi2 =    0.0374

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
-------------+---------------------------------------------------------------
          m1 |     10,335 -1999.067  -1835.578       2    3675.155   3689.642
          m2 |     10,335 -1999.067  -1830.484       6    3672.967   3716.427
-----------------------------------------------------------------------------
               Note: N=Obs used in calculating BIC; see [R] BIC note.

. * Sheaf coefficients for agegrp
. quietly tab agegrp, gen(xage)

. logit diabetes xage2 xage3 xage4 xage5 xage6, nolog

Logistic regression                             Number of obs     =     10,335
                                                LR chi2(5)        =     337.17
                                                Prob > chi2       =     0.0000
Log likelihood = -1830.4836                     Pseudo R2         =     0.0843

------------------------------------------------------------------------------
    diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       xage2 |   .7021745   .3396247     2.07   0.039     .0365223    1.367827
       xage3 |   1.660128   .3028614     5.48   0.000      1.06653    2.253725
       xage4 |   2.207308   .2860264     7.72   0.000     1.646706    2.767909
       xage5 |    2.63842   .2677401     9.85   0.000     2.113659     3.16318
       xage6 |   2.971236   .2779455    10.69   0.000     2.426472    3.515999
       _cons |  -5.034786   .2590377   -19.44   0.000     -5.54249   -4.527081
------------------------------------------------------------------------------

. sheafcoef, latent(age: xage2 xage3 xage4 xage5 xage6)
------------------------------------------------------------------------------
    diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
main         |
         age |   1.106507   .0915181    12.09   0.000     .9271344    1.285879
       _cons |  -5.034786   .2590377   -19.44   0.000     -5.54249   -4.527081
-------------+----------------------------------------------------------------
on_age       |
       xage2 |   .6345868   .2841502     2.23   0.026     .0776627    1.191511
       xage3 |   1.500333   .1910889     7.85   0.000     1.125805     1.87486
       xage4 |   1.994844   .1405728    14.19   0.000     1.719326    2.270362
       xage5 |   2.384459   .0891692    26.74   0.000     2.209691    2.559227
       xage6 |    2.68524   .1076525    24.94   0.000     2.474245    2.896235
------------------------------------------------------------------------------
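
One way to read the two tables together, offered as a sketch: in this output each original logit coefficient appears to equal the main effect of age multiplied by the corresponding on_age coefficient. The check below only reuses numbers copied from the output above, so agreement is up to rounding.

```stata
* Numbers copied from the logit and sheafcoef output above
display 1.106507 * .6345868   // compare with the xage2 logit coefficient, .7021745
display 1.106507 * 2.68524    // compare with the xage6 logit coefficient, 2.971236
```

So the on_age block gives the estimated spacing of the age groups on the latent scale, and the single age coefficient carries the size of the effect.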


Maarten Buis

Join Date: Mar 2014

Posts: 3426
#15

03 May 2016, 07:53

Yes, that is how I would use sheafcoef. You could add the eform option to interpret the effect of age as an odds ratio. The latent variable is standardized, so you would look at the effect of a standard deviation change in age.
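
A minimal sketch of that suggestion, assuming sheafcoef is installed from SSC and that eform is passed as an option to sheafcoef, as described above. The setup lines repeat the earlier example so the block runs on its own.

```stata
* Requires the user-written package: ssc install sheafcoef
webuse nhanes2f, clear
quietly tab agegrp, gen(xage)
quietly logit diabetes xage2 xage3 xage4 xage5 xage6

* Same latent variable as before, with the effect shown in exponentiated form
sheafcoef, latent(age: xage2 xage3 xage4 xage5 xage6) eform

* By hand, from the age coefficient reported earlier in the thread
display exp(1.106507)
```

The hand calculation gives an odds ratio of roughly 3, i.e. a one standard deviation increase in the latent age variable is associated with about three times the odds of diabetes in this example.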
