  • Margins after regress gives impossible predicted values

    Dear All

    I suspect the answer to this question may be that I am doing something statistically erroneous rather than a problem with my Stata code, but as a self-taught novice in both I would be grateful for any help.

    I am using margins after regress in Stata 16 to explore the interaction between a binary variable - walk - and a continuous variable - frail - upon predicted quality of life score - qol.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(walk frail) float qol
    1 6 -.166
    0 4   .62
    1 4  .796
    1 5  .487
    1 6   .03
    0 3     1
    0 4     1
    0 6  .796
    0 4     1
    1 5  .746
    1 5  .727
    0 5     1
    0 3  .264
    0 2     1
    0 3     1
    1 6  .273
    0 4  .812
    0 5     1
    1 6  .587
    1 8  .079
    end
    label values walk slow
    label def slow 0 "Not Slow", modify
    label def slow 1 "Slow walk speed", modify

    Code:
    regress qol c.frail##i.walk
    
    margins walk, at(frail=(1(1)8))
    marginsplot

    My issue is that the resulting predictions include values that sit outside the possible range for qol, which runs from -0.594 to 1 (this is a validated quality of life score for my population, despite the clunky range). Hopefully you can see in the attached Graph.png that predictions at low values of frail exceed the maximum value of 1.

    [Attached image: Graph.png]

    I saw previous posts that suggested, in different scenarios, considering truncreg or tobit, but I do not think these are valid here. My dependent variable is - I think - neither censored nor truncated: there is simply a fixed range of possible values (like a percentage in an exam), rather than any potential values having been excluded.

    In the full dataset, there are no participants with frail < 3 & walk == 1, nor for that matter are there any with frail >6 and walk == 0, as you may appreciate from:
    Code:
    tab2 walk frail
    Though a model without the interaction term does not show any worrying collinearity:
    Code:
    regress qol walk frail
    estat vif
    I therefore suspect the problem may be that the margins command is extrapolating the slow walk speed (red) line when frail < 3.

    I have no idea how to "Stata" my way out of this, or if this simply relates to a problem of model selection and/or statistical heresy.

    Once again, I would be very grateful for any insights.

    Best wishes

    Ben

  • #2
    It is indeed a problem of model selection. A linear model, by definition, will extrapolate eventually to arbitrarily large positive and negative values. If you run
    Code:
    graph twoway (scatter qol frail) (lfit qol frail), by(walk)
    you will see that in the slow walk group, the best fit line has negative slope, and if you mentally extend it to lower values of frail, it clearly "runs off the page" into qol > 1 territory. So a linear model cannot properly capture the relationship between this qol measure and frail throughout the range of values of frail.

    I would not, however, necessarily reject the linear model out of hand here. If you look at the part of the graph for the Not Slow group, you will see a bunch of qol = 1 points jammed up against the top of the graph. This suggests that the qol measure itself may not be a valid representation of the construct of QOL because it is limited by a ceiling effect (otherwise put, values are right censored at 1). It is therefore prevented from distinguishing degrees of quality of life higher than what is represented by the measurement's upper bound. It may actually be, in this case, that the linear model, as shown by -margins-, is a better model of reality than what your QOL measure is capable of addressing. I'm not asserting that it is a better model; I'm just saying it could be and you need to give the matter some thought. The fact that the measure has been previously "validated" should not influence your thinking about this much, if at all. What passes for measurement validation in the medical literature is really pathetic!
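    A rough way to put a number on the ceiling effect, sketched under the assumption that the -dataex- example from #1 is in memory. The indicator name at_ceiling is made up for illustration, and the cutoff of 1 is the stated maximum of the measure.

```stata
* Assumes the -dataex- example data from #1 are in memory
confirm variable qol frail walk

* Flag observations sitting exactly at the scale maximum of 1
* (at_ceiling is a hypothetical variable name)
generate byte at_ceiling = (qol == 1) if !missing(qol)

* Share of each walk group at the ceiling
tabulate walk at_ceiling, row
```

    If a large share of one group sits at the maximum, that is consistent with the measure being unable to separate people above that point.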

    On the other hand, it is also true that there are no values of frail less than 2 in your data, and no statistical model can ever be relied upon to extrapolate beyond the range of the data. And a linear model always, necessarily, eventually breaches the limits of any bounded outcome variable if you extrapolate far enough. So you are free to also accept the model and point out that the -margins- prediction that troubles you is actually outside the range of values of frail observed in your data. You could also just modify your -at()- option to start at 2 instead of 1, which would eliminate that point. Given that frail = 1 is not instantiated in your data, this would not represent any obfuscation or selective reporting of your results. (Arguably, including such an uninstantiated value is, itself, a misrepresentation.)
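    One way to act on this, sketched under the assumption that the -dataex- example from #1 is in memory: compute the observed range of frail within each walk group and ask -margins- for predictions only inside those ranges. The local macro names (lo0, hi0, lo1, hi1) are illustrative, and frail is assumed to take integer values as in the example.

```stata
* Assumes the -dataex- example data from #1 are in memory
confirm variable qol frail walk

* Observed range of frail within each walk group
quietly summarize frail if walk == 0
local lo0 = r(min)
local hi0 = r(max)
quietly summarize frail if walk == 1
local lo1 = r(min)
local hi1 = r(max)

regress qol c.frail##i.walk

* Predictions only where each group actually has data,
* so neither fitted line is extrapolated
margins, at(walk=0 frail=(`lo0'(1)`hi0')) at(walk=1 frail=(`lo1'(1)`hi1'))
```

    This goes a step further than starting -at()- at 2: it also avoids predicting for the slow walk group at values of frail where only the Not Slow group was observed.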

    If you conclude that your QOL measure's ceiling effect reflects the real world (that is, a score of 1 on the measure actually corresponds to a maximally attainable quality of life), then you need to move to a non-linear model that will "bend" accordingly. Perhaps rescaling the qol measure to range from 0 to 1 and then using a fractional regression might suit your needs.
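    A minimal sketch of that rescale-then-fractional-regression idea, assuming the -dataex- example from #1 is in memory and taking the scale limits of -0.594 and 1 from #1. The names qmin, qmax, qol01, p01 and qol_hat are made up for illustration, and the logit link is only one possible choice.

```stata
* Assumes the -dataex- example data from #1 are in memory
confirm variable qol frail walk

* Scale limits as stated in #1 (hypothetical scalar names)
scalar qmin = -0.594
scalar qmax = 1

* Rescale qol to the unit interval; clamp to guard against
* tiny float-precision overshoots at the limits
generate double qol01 = (qol - qmin) / (qmax - qmin)
replace qol01 = max(0, min(1, qol01)) if !missing(qol01)

* Fractional logit with the same interaction as the linear model
fracreg logit qol01 c.frail##i.walk

* Predicted conditional means on the 0-1 scale, within the
* range of frail seen in the example data
margins walk, at(frail=(2(1)8))

* Fitted values mapped back to the original qol scale
predict double p01
generate double qol_hat = qmin + (qmax - qmin) * p01
```

    Because the fitted conditional mean of a fractional logit lies strictly between 0 and 1, the back-transformed qol_hat cannot leave the interval from -0.594 to 1, which is the "bending" described above.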


    • #3
      Clyde Schechter, thank you for such a helpful response; it has helped me escape a blind alley and understand some fundamentals.

      To clarify/expand upon my original post, the sample of data I posted does not include frail scores of <2, but my full dataset does (sorry). In instances where frail was <= 2 however, there were no participants with walk == 1. I think this rather reinforces your points about the non-linearity of the data, extrapolation, and need for careful consideration.

      The QOL measure in question considers 1 to mean absence of any negative quality of life factors. I suppose the argument is that "absence of bad" is often considered to be equivalent to "perfect", which for all its faults as a concept does hold some truth for healthcare professionals.

      I do appreciate it fails to distinguish between those with normal or better than normal health though, and your point about the linear regression outperforming the outcome measure is well taken. Therefore my feeling is that the "least worst" option here is to rescale and use fractional regression, whilst acknowledging the pitfalls of the QOL measure.

      Thanks once again,

      Ben
      Last edited by Ben Anderson; 27 Mar 2022, 15:16.
