  • When to treat an ordinal independent variable as continuous?

    Hello Statalist community,

    I'd like to ask for your advice on whether it is reasonable to treat a specific independent variable in my model as continuous rather than ordinal. Arguing from a purely theoretical perspective, I’d say that the variable in question – gorigin (5 ordered groups of social origin) – should be treated as an ordered categorical variable. However, when I compare a first model that treats gorigin as continuous with a second model that uses factor-variable notation to treat gorigin as ordinal, the likelihood-ratio test computed afterwards indicates that the second model (i.gorigin) does not fit my data better than the first (c.gorigin).

    The test I did relies heavily on a technical paper written by Richard Williams from the University of Notre Dame – https://www3.nd.edu/~rwilliam/stats3...ndependent.pdf . What I did is essentially the same, except that I applied Richard Williams's approach to my own dataset. As in Richard's example, my data show no significant difference between the two models.

    The example he used can be easily replicated using the following code:
    Code:
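    webuse nhanes2f, clear
    * health entered as continuous: one slope, equal steps between categories
    logit diabetes c.health, nolog
    est store m1
    * health entered with factor-variable notation: one coefficient per non-base category
    logit diabetes i.health, nolog
    est store m2
    * likelihood-ratio test of m1 (nested) against m2, with AIC and BIC
    lrtest m1 m2, stats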
    The stupid question I'm asking is whether or not I should stay with my original plan to use groups of social origin as an ordinal variable or argue in favour of my data and treat it as continuous? I'm concerned about the correctness of my approach. Could some reviewer tell me that it is a mistake to treat variable gorigin as ordinal when the test I did clearly showed that I can go either way? Isn't it a more conservative approach when I say that I'm not treating an ordered categorical variable as continuous?

    Some input from your side would be highly appreciated! I'm a bit afraid of making a mistake here, even though I think my approach is theoretically correct.

    Thanks
    Patrick



  • #2
    I'm sorry to say that your attempt to link to Richard Williams's lecture notes somehow failed, so I am not able to determine what you are following. Did you copy and paste that link from another post in Statalist? Better to open the link in your browser and copy and paste the address from your browser's toolbar.

    Since you started with Richard Williams's work, which is among the best, I'm reluctant to criticize what you've done without being able to see the guidance you were following.

    I do want to say that the factor variable notation
    Code:
    i.health
    causes health to be treated as a categorical, rather than continuous, variable. It does nothing to cause it to be treated as an ordered categorical variable.

    In the case of the example you show, it works out as if ordering had been part of the model, because the estimated coefficients grow more negative as the value of health increases, as the relevant part of the output shows.
    Code:
        diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          health |
           fair  |  -.7493387   .1262017    -5.94   0.000    -.9966895   -.5019878
        average  |  -1.567205   .1302544   -12.03   0.000    -1.822499   -1.311911
           good  |  -2.554012   .1780615   -14.34   0.000    -2.903006   -2.205018
      excellent  |  -3.116457   .2262238   -13.78   0.000    -3.559848   -2.673067
    But nothing requires that. Consider this example.
    Code:
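    sysuse auto, clear
    * keep only cars with a recorded repair rating
    drop if rep78==.
    regress price c.rep78
    est store m1
    * one coefficient per repair-record category (base: rep78 == 1)
    regress price i.rep78
    est store m2
    lrtest m1 m2, stats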
    The corresponding output is
    Code:
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           rep78 |
              2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
              3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
              4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
              5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
    As to which treatment of health is correct - continuous or categorical - you might read Richard Williams's work on the margins command and consider the following code.
    Code:
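    webuse nhanes2f, clear
    logit diabetes c.health, nolog
    * predicted probabilities at each value of health under the linear specification
    margins, at(health=(1 2 3 4 5))
    logit diabetes i.health, nolog
    * predicted probabilities for each health category
    margins health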
    From the two margins commands, you'll see that the probabilities are very similar, so it has little effect which version you choose. But treating it as continuous imposes the assumption that a one-unit increase in health has the same effect on logit(p) across the range of health. This is an assumption I would not care to make, even though your results render it plausible.
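    One way to look at that equal-step assumption directly, offered as a sketch and not taken from the linked notes: with poor health (health==1) as the base category of i.health, linearity in logit(p) implies that the coefficients on categories 3, 4, and 5 are 2, 3, and 4 times the coefficient on category 2. A Wald test of those three restrictions is the counterpart of the likelihood-ratio test in post #1. The coding 1=poor through 5=excellent is assumed from the labels in the output above.
    Code:
    webuse nhanes2f, clear
    logit diabetes i.health, nolog
    * H0: effects of categories 3, 4, 5 are 2, 3, 4 times the effect of category 2 (base = 1)
    test (2*_b[2.health] = _b[3.health]) (3*_b[2.health] = _b[4.health]) (4*_b[2.health] = _b[5.health])
    A small p-value would count against equal spacing; a large one only says the data do not contradict it.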
    Last edited by William Lisowski; 22 Sep 2018, 12:50.



    • #3
      As you know, there is considerable furor in the statistical literature today about the disappointingly low extent to which scientific studies are reproducible. One contributing factor, a major one in my view, is that investigators often explore multiple analytic approaches to their data along the way, and then publish the one whose results they are most pleased with. So, I think that if you had an original plan, unless the data prove to be compellingly inconsistent with that model or recent science developments expose the original plan as utterly unreasonable, even silly, you should stick with the original plan.

      As William Lisowski nicely points out, in the example you give, the two models give substantially the same predicted probabilities and both models exhibit good fit to the data. So there is nothing to suggest that your original model is glaringly inconsistent with the data. I don't know what your actual subject matter is and whether there is any new science that would make your original modeling choice look foolish, but my prior about that is that the probability is very low. So it seems to me you should stick with the original plan.

      Finally, in those circumstances where I do make data-driven changes to the modeling plan (or where I am doing an exploratory study in an attempt to identify a good model to use for later confirmatory work), I do not rely on likelihood-ratio or other such test statistics to make those choices. Such test statistics, in large samples, will sometimes reject a model simply because the distributional assumptions underlying the test statistic itself are not met, even though the model is just fine. Also, a test statistic comparing two models might choose one model over the other, or say they are equivalent, when both models are just plain lousy. Moreover, these test statistics are usually answering the wrong question anyway. The value of a model is usually based on the correctness of the predictions it makes, and likelihood-ratio tests, at best, address that in an indirect way, using a metric that is not well-aligned with pragmatic considerations.
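
      A minimal sketch of comparing the two specifications on their predictions instead of on a test statistic, using the nhanes2f example from post #1. The variable names p_cont, p_cat, and gap are made up for illustration, and agreement between the two models in the estimation sample says nothing about whether either one predicts well.
      Code:
      webuse nhanes2f, clear
      logit diabetes c.health, nolog
      predict double p_cont if e(sample)
      logit diabetes i.health, nolog
      predict double p_cat if e(sample)
      * mean predicted probability per health category under each model
      tabstat p_cont p_cat, by(health) statistics(mean)
      * absolute disagreement between the two sets of predictions
      generate double gap = abs(p_cont - p_cat)
      summarize gap
      Because health is the only predictor, the i.health predictions reproduce the observed share of diabetes in each category, so the tabstat table also shows how far the linear version sits from the raw proportions.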



      • #4
        @ William,

        Thank you for your input and for the advice not to copy and paste a link directly from another Statalist post. Here’s the link again; this time I opened it in my browser and copied the address from there. https://www3.nd.edu/~rwilliam/stats3...ndependent.pdf

        Regarding your comment about the i. prefix: Of course, you are absolutely right that i.health does not by itself make my variable ordered categorical. However, as you correctly pointed out, the i. prefix specifies health as a categorical variable and, where the data support it, lets the estimated coefficients show a somewhat ordered pattern. What is important for me, though, is what you mentioned at the end of your post. Treating health as a categorical variable does not impose the assumption that a one-unit increase in health has the same effect across the range of health. In my case, with a variable that represents five different groups of social origin, I think that’s the most reasonable approach, even though my data tell me that I could go either way.


        @ Clyde,

        Thank you for your thoughts about data-driven changes to the modelling plan. I absolutely agree that, with regard to transparency and traceability, it is best to stick to the theoretical framework rather than build theories that match the data. Your post gave me further support to stick to my original plan. Thank you!




        • #5
          Like many/most of my handouts, I make few or no claims about having an original thought in this one. I am mostly citing others. But, I do think this quote from David J. Pasta is pretty provocative. He basically says all the concerns about ordinal variables being equally spaced should also be concerns with continuous variables:

          One concern often expressed is that “we don't know that the ordinal categories are equally spaced.” That is true enough – we don't. But we also don't “know” that the relationship between continuous variables is linear, which means we don't “know” that a one-unit change in a continuous variable has the same effect no matter whether it is a change between two relatively low values or a change between two relatively high values. In fact, when it's phrased that way -- rather than “is the relationship linear?” -- I find a lot more uncertainty in my colleagues. It turns out that it doesn't matter that much in practice – the results are remarkably insensitive to the spacing of an ordinal variable except in the most extreme cases. It does, however, matter more when you consider the products of ordinal variables.

          I am squarely in the camp that says “everything is linear to a first approximation” and therefore I am very cheerful about treating ordinal variables as continuous. Deviations from linearity can be important and should be considered once you have the basics of the model established, but it is very rare for an ordinal variable to be an important predictor and have it not be important when considered as a continuous variable. That would mean that the linear component of the relationship is negligible but the non-linear component is substantial. It is easy to create artificial examples of this situation, but they are very, very rare in practice.
          There are many things we value in research, and parsimony is one of them. If little is lost by treating an ordinal variable as continuous, life is simpler and more understandable if we treat it as continuous. If you don't like that, why do you feel comfortable with treating a continuous variable as having linear effects? Maybe you should have a lot of spline functions. There are all sorts of things you could do, but it could get incredibly complicated once you start doing that. And, the more complicated you make the model, the more likely it is that you are capitalizing on idiosyncratic/chance features of the sample.

          In short, I think Pasta is saying that treating ordinal vars as continuous may be shaky, but it is usually not too bad, and it is no worse than a lot of other things we routinely do without giving them much thought.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            On a semi-related note, this piece, "Modeling continuous response variables using ordinal regression" (co-authored by Frank Harrell and others) sort of blows my mind:

            https://onlinelibrary.wiley.com/doi/....1002/sim.7433

            I have always argued against collapsing continuous dependent variables to ordinal, because it throws away information. But these authors recommend it! Here is the abstract:

            We study the application of a widely used ordinal regression model, the cumulative probability model (CPM), for continuous outcomes. Such models are attractive for the analysis of continuous response variables because they are invariant to any monotonic transformation of the outcome and because they directly model the cumulative distribution function from which summaries such as expectations and quantiles can easily be derived. Such models can also readily handle mixed type distributions. We describe the motivation, estimation, inference, model assumptions, and diagnostics. We demonstrate that CPMs applied to continuous outcomes are semiparametric transformation models. Extensive simulations are performed to investigate the finite sample performance of these models. We find that properly specified CPMs generally have good finite sample performance with moderate sample sizes, but that bias may occur when the sample size is small. Cumulative probability models are fairly robust to minor or moderate link function misspecification in our simulations. For certain purposes, the CPMs are more efficient than other models. We illustrate their application, with model diagnostics, in a study of the treatment of HIV. CD4 cell count and viral load 6 months after the initiation of antiretroviral therapy are modeled using CPMs; both variables typically require transformations, and viral load has a large proportion of measurements below a detection limit.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology


            • #7
              I find Richard's post provocative and will keep it in mind.

              With that said, going back to post #1, I don't know what to make of "gorigin (5 ordered groups of social origin)" - I'm not sure I understand how "social origin" is defined, how it would be grouped, and how those groups would be placed into some sort of order. And that as much as anything was what I was thinking at the end of post #2.

              But for the purposes of my post #2, my intent was to point out that there is nothing about the analysis done treating the independent variable gorigin as categorical using i.gorigin that in any way enforced - as opposed to revealed - an ordered structure on the estimated coefficients in the way that the ordered logistic and probit models enforce the ordered view of the dependent variable.



              • #8
                Yes, I think William is right. You can treat an ordinal independent variable as continuous, or you can treat it as nominal/unordered. If you are clever, you might be able to enforce ordered effects, but if so I don't know how to do it.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology


                • #9
                  William Lisowski ,

                  let me try to explain how the variable gorigin is built and why I think those 5 groups can be ordered. When building gorigin I relied heavily on a working paper by Pia Blossfeld (https://www.neps-data.de/Portals/0/W.../WP_LXXIII.pdf). In essence, the paper suggests that parental education (low/medium/high), parental class (low/medium/high) and parental status (low/medium/high) should be combined in a 3x3x3 = 27 combination matrix. These 27 combinations are then collapsed into 5 different groups of social origin. Pages 11 and 12 of the paper I linked visualize these steps perfectly.

                  At least theoretically, the created variable should roughly represent the social order that exists within society. Working with the 5-group variable, I did not know whether it is reasonable to treat it as continuous. I know that the variable implies a hierarchy, but the groups are not equally spaced, neither theoretically nor in the way the variable is built (some groups include more combinations than others). However, after I found out that it does not make much of a difference whether I treat it as continuous or categorical, I posted here to ask around and see what you guys think.


                  Richard Williams ,

                  You are absolutely right. Treating variable gorigin as continuous would make things so much easier. My dependent variable is a 5-point Likert item. Having a continuous predictor would be better for marginsplots. But I thought it would be theoretically more precise to treat gorigin as (ordered) categorical and use factor notation in my model (i.gorigin).

                  By the way, thank you Richard for the paper you linked in post #6. That's very interesting to me. I'm also running an ordinal regression model using your gologit2 command.



                  • #10
                    gorigin is a lot more complicated than many ordinal variables. With Likert scales that range from Strongly Agree to Strongly Disagree, I don't feel too badly about treating them as continuous. With gorigin I might wonder whether I should combine the three variables in the first place, let alone collapse them down to a 5 point scale. Whatever you do, I think any theoretical justifications will seem more plausible if you back them up with empirical proof.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology


                    • #11
                      Richard Williams, please, you are killing me. I read all your papers about ologit, gologit2 and oglm so that I would know what I’m doing when not treating a 5-point Likert item (highly skewed, by the way) as a continuous variable, and now you are telling me that you don’t feel too bad about doing it. I should have asked you for guidance straight away. However, I have read in various papers that a single Likert item should not be treated as continuous. One of them said that at least four Likert items are needed to build a Likert scale that can be treated as quasi-continuous.



                      • #12
                        Like I said, David Pasta's argument is provocative, and I follow his comments with a quote from Long and Freese where they challenge the ordinal as continuous argument. Also, almost all of my stuff is talking about ordinal dependent variables, not independent. And, the handout shows how to test the assumption that ordinal can be treated as continuous, and discusses other ways you can handle ordinal independent variables.

                        Also, when Likert scales are combined, you usually just add them, perhaps giving some items more weight than others. In your case, you are creating this 27 item multiplicative scale. You could, say, get a score of 4 in different ways -- should they be considered the same score? I don't know what all you are doing, and it may have great justification, but it doesn't seem like your run of the mill ordinal independent variable problem. You do cite a working paper, and its argument may be great, but it isn't a citation classic yet so the coding decisions will have to be justified.
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology


                        • #13
                          Incidentally, in general, if you have a 5 item ordinal measure, using i.var instead of c.var is not going to kill you. You may waste a few degrees of freedom and it may be harder to interpret, but the models will be nested so you should still get at the truth. Of course, the more ordinal variables you have, the more unwieldy the model becomes if you are unnecessarily treating them as categorical rather than continuous.

                          I'm actually more concerned with William's point that "I don't know what to make of "gorigin (5 ordered groups of social origin)" - I'm not sure I understand how "social origin" is defined, how it would be grouped, and how those groups would be placed into some sort of order." I wonder if the scale actually will be ordinal, let alone be an ordinal variable that can be treated as continuous. Rather than try to push my readers too far, I might treat it as categorical or even just use the component parts as separate variables.
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology


                          • #14
                            Richard Williams,

                            first of all, thank you for your help and your detailed explanations! It means a lot to me that you took the time to answer my questions. I will definitely consider your input and rethink whether it is correct to combine parental education, parental class and parental social status into a single variable. I will probably run two models, one that uses gorigin as the only indicator of parents' social origin and another that includes all three items separately.
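
                            A rough sketch of that two-model comparison, with made-up variable names: y5 stands for the 5-point Likert outcome, and pedu, pclass and pstatus stand for the three parental items (each low/medium/high). ologit is used here only as a stand-in for whatever ordinal model is finally chosen. The two specifications are not nested in general, so the sketch compares information criteria and does not run lrtest; that comparison assumes both models are estimated on the same observations.
                            Code:
                            * hypothetical names: y5, pedu, pclass, pstatus
                            ologit y5 i.gorigin
                            estimates store m_gorigin
                            ologit y5 i.pedu i.pclass i.pstatus
                            estimates store m_parts
                            * AIC and BIC for both stored models, side by side
                            estimates stats m_gorigin m_parts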

                            I know that there is seldom just one way to approach a research question like mine. That’s exactly why I asked for help here: to get some high-value input that I can think about and then draw my own conclusions from. Again, thank you for your help, Richard. I will definitely look into your handouts again.



                            • #15
                              Richard Williams

                              Thanks so much for your answer. I have a similar question.


                              My DV is job satisfaction and one IV is household income; both are ordinal variables. I want to check whether household income can be treated as a continuous variable. Are the following the right commands? Are the commands after ologit the same as after a simple regression, and are the postestimation commands the same as well?

                              Code:
                              ologit job_satis i.household_income
                              contrast p.household_income, noeffects
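
                              For comparison, a hedged sketch of the same check done as a likelihood-ratio test, mirroring the logit example in post #1 with ologit swapped in. job_satis and household_income are the names used above; m_lin and m_cat are arbitrary labels. contrast works from the most recent estimation results, so it is placed right after the i. model, and it is assumed here to behave after ologit as it does after logit.
                              Code:
                              ologit job_satis c.household_income
                              estimates store m_lin
                              ologit job_satis i.household_income
                              estimates store m_cat
                              * orthogonal-polynomial contrasts: the higher-order terms test departure from linearity
                              contrast p.household_income, noeffects
                              * LR test of the linear model against the categorical one, with AIC and BIC
                              lrtest m_lin m_cat, stats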
