  • Annoyingly coded ordinal independent variables

    Surveys often include questions with options like "daily", "once a week", "a few times a month", ..., "once a year", "never". Or something like that. I understand why Qs are worded that way, but I find them annoying to deal with. The coding clearly isn't continuous or even roughly continuous. But I hate to just break the variable up into a bunch of dummies -- you get a lot of variables that way and you lose the fact that the categories are ordered.

    What I often suggest is treating the variable as categorical, then treating it as continuous, and then testing whether it is OK to treat it as continuous.

    I am curious what other people do. I suspect a lot of times people just treat the variable as continuous. But are there other guidelines or suggestions on how to proceed?
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    WWW: https://www3.nd.edu/~rwilliam

  • #2
    What would treating such a variable as "continuous" mean?


    • #3
      Just treat the variable as though it was continuous, i.e. categories are evenly spaced. That often can work ok if, say, the categories range from strongly disagree to strongly agree. But it is highly dubious with odd spacing like this.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      WWW: https://www3.nd.edu/~rwilliam


      • #4
        Continuous doesn't mean discrete and ordered.... It's been a while since your last physics lesson, and longer since my last, but something like height, temperature or pressure is in my idea of a continuous variable.

        Your examples are not even unambiguously ordered: "once a week" could be less than "a few times a month" for several interpretations of "few".
        Last edited by Nick Cox; 12 Mar 2016, 07:40.


        • #5
          When a scale presents something like that, coding the midpoint of each category as a frequency per day, e.g.:

          "daily"=1
          "once a week"=1/7
          "a few times a month"=(1/7)*(3/4)
          "once a year"=1/365.25
          "never"=0

          is a common, if not necessarily correct, way.
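
          A minimal Stata sketch of that recoding, assuming a hypothetical variable freq coded 1 (daily) through 5 (never). The variable names and the integer coding are made up for illustration; the values are the per-day frequencies listed above.

          ```stata
          * Hypothetical data: freq is coded 1 = daily ... 5 = never
          clear
          input byte freq
          1
          2
          3
          4
          5
          end

          * Replace each category code with its assumed frequency per day
          generate double freq_rate = .
          replace freq_rate = 1            if freq == 1
          replace freq_rate = 1/7          if freq == 2
          replace freq_rate = (1/7)*(3/4)  if freq == 3
          replace freq_rate = 1/365.25     if freq == 4
          replace freq_rate = 0            if freq == 5

          list freq freq_rate
          ```

          freq_rate could then enter a model as c.freq_rate, though whether those spacings are the right ones is a separate question.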


          • #6
            Sure, it isn't continuous (and the Qs are probably worded a bit better than the phrasing I did from memory). But the question is, how bad is it to treat it as continuous? This article argues that variations in spacing often don't matter that much:

            http://support.sas.com/resources/pap...9/248-2009.pdf
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            Stata Version: 17.0 MP (2 processor)

            WWW: https://www3.nd.edu/~rwilliam


            • #7
              Hello,

              I also have a question related to the same topic, as I am currently wondering how to correctly include my ordinally scaled independent variables in a discrete choice model.
              I have already completed the survey and, as I am a beginner, I did not properly think about how to deal with the variables afterwards.
              For example, I have a variable for the educational level, where it is clear which level is "higher" than the other, so I would expect this to be ordinally scaled and not nominally scaled.
              Is it possible in some way to use this variable without converting it into several dummy variables?

              Another example: I asked for the household size, but with categories 1, 2, 3, 4 and "5 or more" household members, which now seems quite stupid to me because I don't know how to deal with that "5 or more" problem, as this is no longer a ratio or interval scale.

              Has anybody any idea how to deal with that correctly while avoiding a lot of dummy variables?

              Thanks a lot in advance!

              (And sorry if it is not okay to jump on that thread with my own question, I just saw it and thought it might be better to add it here than to start a separate topic.)


              • #8
                Good questions from Richard and Cordula. Much of it depends on what is the norm in your field. In political science, our surveys often contain response options like those described by Richard. We typically treat them as continuous/interval level data (no dummies), but it is always good to check if there are significant variations from a linear effect by running with dummies.
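
                One way to sketch that check in Stata: enter the variable both as a linear term and as indicators, then test the indicators that remain. The nhanes2f example data, the diabetes outcome, and agegrp are used purely for illustration and are an assumption, not the surveys discussed here.

                ```stata
                webuse nhanes2f, clear

                * Linear term plus indicators: Stata omits one extra indicator
                * because c.agegrp is collinear with the full set of dummies
                logit diabetes c.agegrp i.agegrp, nolog

                * Joint Wald test of the remaining indicators, i.e. of
                * departures from a linear effect of agegrp
                testparm i.agegrp
                ```

                A small p-value from testparm suggests that the evenly spaced, linear coding is too restrictive.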

                The problem with the type of data that Cordula describes is that there are typically one or two very extreme values (people who respond with 25 household members). What do you do with those cases? Typically, they are grouped with lower values ("or more"), or respondents are not given the opportunity to give extreme responses (given the "or more" option in the survey). This is usually done b/c of the fear that outliers might unduly affect the relationship.
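
                If household size had been recorded as an exact count, the grouping described here could be sketched as below. hhsize is a hypothetical variable, the data are made up, and the cap of 5 is simply the cutoff from Cordula's question.

                ```stata
                * Hypothetical data: exact household counts with one extreme value
                clear
                input int hhsize
                1
                3
                5
                8
                25
                .
                end

                * Top-code at 5; the if qualifier keeps missing values missing,
                * because min() ignores missing arguments and would return 5
                generate byte hhsize_top = min(hhsize, 5) if !missing(hhsize)

                label define hh5 5 "5 or more"
                label values hhsize_top hh5
                tabulate hhsize_top, missing
                ```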

                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1


                • #9
                  One way of getting a single effect for such an ordinal variable without imposing an arbitrary scale would be to use sheaf coefficients, which estimate the scale such that it maximizes the linear effect of the scaled variable. This is implemented in Stata in the sheafcoef package available from SSC.
                  ---------------------------------
                  Maarten L. Buis
                  University of Konstanz
                  Department of history and sociology
                  box 40
                  78457 Konstanz
                  Germany
                  http://www.maartenbuis.nl
                  ---------------------------------

                  Comment


                  • #10
                    Thanks for that hint! =)


                    • #11
                      Thanks for the comments and suggestions. Ben, I have often done what you suggest, although a final open-ended category that can run off to infinity can be a pain. Maarten, I will try sheafcoef. Right now I can't find it but maybe SSC is down for maintenance.

                      Even though the intervals seem to differ in their length, it wouldn't surprise me if such Qs often behave like a Likert scale that runs from strongly agree to strongly disagree. For one thing, I suspect many people only have a rough feel for the true value; and I suspect the difference between doing something 15 times a year and 17 times a year doesn't matter much. So, the categories are more like an intensity measure than a precise measure of the activity in question.
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      Stata Version: 17.0 MP (2 processor)

                      WWW: https://www3.nd.edu/~rwilliam


                      • #12
                        While search sheafcoef returns no hits, ssc install sheafcoef will work. I think the SSC index may have been munged today; I had similar problems on either this package or a different package earlier today.


                        • #13
                          Thanks for the tip, William; it works. It looks to me like the program is pre-factor variables, so we have to compute dummies and interactions the old-fashioned way?
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          Stata Version: 17.0 MP (2 processor)

                          WWW: https://www3.nd.edu/~rwilliam


                          • #14
                            I wanted to revive this thread because I finally got around to trying Maarten's sheafcoef idea. Hopefully he can tell me if this is a good example of what he had in mind! In this example, LR tests show that the ordinal variable agegrp should not be treated as continuous (although a BIC comparison suggests that it wouldn't be so bad to do so). I therefore use Maarten's sheafcoef command, which still lets me estimate a single effect for the underlying latent age variable while not requiring that the observed agegrp variable be considered continuous. As far as I know, the sheafcoef command does not work with factor variables, so we have to compute dummies on our own. Just based on this one example, the sheafcoef approach strikes me as being rather appealing, at least in instances where treating the ordinal variable as continuous is highly questionable.

                            Code:
                            . webuse nhanes2f, clear
                            
                            . quietly logit diabetes c.agegrp, nolog
                            
                            . est store m1
                            
                            . quietly logit diabetes i.agegrp, nolog
                            
                            . est store m2
                            
                            . lrtest m1 m2, stats
                            
                            Likelihood-ratio test                                 LR chi2(4)  =     10.19
                            (Assumption: m1 nested in m2)                         Prob > chi2 =    0.0374
                            
                            Akaike's information criterion and Bayesian information criterion
                            
                            -----------------------------------------------------------------------------
                                   Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
                            -------------+---------------------------------------------------------------
                                      m1 |     10,335 -1999.067  -1835.578       2    3675.155   3689.642
                                      m2 |     10,335 -1999.067  -1830.484       6    3672.967   3716.427
                            -----------------------------------------------------------------------------
                                           Note: N=Obs used in calculating BIC; see [R] BIC note.
                            
                            . * Sheaf coefficients for agegrp
                            . quietly tab agegrp, gen(xage)
                            
                            . logit diabetes xage2 xage3 xage4 xage5 xage6, nolog
                            
                            Logistic regression                             Number of obs     =     10,335
                                                                            LR chi2(5)        =     337.17
                                                                            Prob > chi2       =     0.0000
                            Log likelihood = -1830.4836                     Pseudo R2         =     0.0843
                            
                            ------------------------------------------------------------------------------
                                diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                                   xage2 |   .7021745   .3396247     2.07   0.039     .0365223    1.367827
                                   xage3 |   1.660128   .3028614     5.48   0.000      1.06653    2.253725
                                   xage4 |   2.207308   .2860264     7.72   0.000     1.646706    2.767909
                                   xage5 |    2.63842   .2677401     9.85   0.000     2.113659     3.16318
                                   xage6 |   2.971236   .2779455    10.69   0.000     2.426472    3.515999
                                   _cons |  -5.034786   .2590377   -19.44   0.000     -5.54249   -4.527081
                            ------------------------------------------------------------------------------
                            
                            . sheafcoef, latent(age: xage2 xage3 xage4 xage5 xage6)
                            ------------------------------------------------------------------------------
                                diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                            main         |
                                     age |   1.106507   .0915181    12.09   0.000     .9271344    1.285879
                                   _cons |  -5.034786   .2590377   -19.44   0.000     -5.54249   -4.527081
                            -------------+----------------------------------------------------------------
                            on_age       |
                                   xage2 |   .6345868   .2841502     2.23   0.026     .0776627    1.191511
                                   xage3 |   1.500333   .1910889     7.85   0.000     1.125805     1.87486
                                   xage4 |   1.994844   .1405728    14.19   0.000     1.719326    2.270362
                                   xage5 |   2.384459   .0891692    26.74   0.000     2.209691    2.559227
                                   xage6 |    2.68524   .1076525    24.94   0.000     2.474245    2.896235
                            ------------------------------------------------------------------------------
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            Stata Version: 17.0 MP (2 processor)

                            WWW: https://www3.nd.edu/~rwilliam


                            • #15
                              Yes, that is how I would use sheafcoef. You could add the eform option to interpret the effect of age as an odds ratio. The latent variable is standardized, so you would look at the effect of a standard deviation change in age.
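
                              A minimal sketch of that suggestion, reusing the indicator variables from #14. The only change is the eform option named above; its placement after the latent() option is an assumption, and output is omitted.

                              ```stata
                              webuse nhanes2f, clear
                              quietly tabulate agegrp, generate(xage)
                              quietly logit diabetes xage2 xage3 xage4 xage5 xage6, nolog

                              * Requires the user-written package: ssc install sheafcoef
                              * eform asks for exponentiated effects (odds ratios after logit)
                              sheafcoef, latent(age: xage2 xage3 xage4 xage5 xage6) eform
                              ```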
                              ---------------------------------
                              Maarten L. Buis
                              University of Konstanz
                              Department of history and sociology
                              box 40
                              78457 Konstanz
                              Germany
                              http://www.maartenbuis.nl
                              ---------------------------------
