  • nonlinear relationship

    hi guys!

    this is going to be quite a silly question, but here goes: if the odds ratio is 0.99, significant at 0.001, for, say, age, is there any point in testing for a quadratic effect? it predictably came down to 1.0003 at p=0.001

    cheers

  • #2
    The statistical significance of a linear effect of a variable has essentially no bearing either way on whether there is a quadratic effect. In general, polynomials are seldom a good specification of non-linear relationships, but occasionally the science of the subject supports it, and in some situations the goal is not to understand real relationships so much as just find a model that fits well without thought to generalizability. So the answer to your question is: it depends, and it depends entirely on things that you haven't told us about.

    Finally "it predictably came down to 1.0003 at p = 0.001"--what does that mean? What does the "it" refer to and by what process did it "come down?"



    • #3
      i apologize for my bad phrasing.
      the dependent variable is occupation-education mismatch.
      the effect of age (prior to introducing the polynomial), in odds ratios, is 0.99.
      in the same specification, the effect of age with the polynomial is, in odds ratios, 0.99 (main effect) and 1.0003 (quadratic term).
      my question was whether in such a case there is a need to present the version of the model including the polynomial terms just because it is significant. in substantive terms it says nothing more than the simpler model.
      Last edited by natalia malancu; 27 Dec 2014, 15:51.



      • #4
        Well, the first part of my answer still applies. Whether you want this quadratic model depends on a) why you are doing the analysis and what you hope to accomplish with the results, and b) whether there is any scientific rationale for a quadratic model (as opposed to just linear or some other kind of non-linear specification).

        Depending on what you plan to use your model results for, you might want to see if the quadratic term meaningfully improves the calibration or discrimination of the model. And some graphical exploration of the relationship between age and occupation-education mismatch could be helpful as well. Just from an algebraic perspective, the quadratic relationship you have estimated has the graph of logit(outcome) vs. age as a parabola with its minimum at age = -ln(0.99)/(2 ln(1.0003)), which is about 16.8. So unless your age variable was already centered at some meaningful point, and assuming that your data refer only to adults, the really curved part of the parabola is outside the range of your data anyway, and this equation is telling you that there is only a slight bend away from linearity in the range of ages you are working in. Is that of any practical importance?
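        As a quick check of that vertex, assuming the reported odds ratios 0.99 (linear) and 1.0003 (quadratic) are taken at face value, you can work on the log-odds scale in Stata:

        ```stata
        * log-odds = b*age + c*age^2 with b = ln(0.99) and c = ln(1.0003);
        * the vertex of this parabola is at age = -b/(2*c)
        display -ln(0.99)/(2*ln(1.0003))
        ```

        With more decimal places in the coefficients the computed location can shift noticeably, so treat the result as a rough check only.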



        • #5
          the idea (theory-derived) was that with time the possibility of a mismatch decreases, yet there comes a point at which the effect plateaus.
          the data do refer to adults. age is not centered (unfortunately it comes in age bands, one band = 3 years, and i did not want to further complicate interpretation). could you be so kind as to further explain what you mean by the bend?



          • #6
            i'm having a bit of trouble grasping the whole idea of a bend given that the odds ratio for the quadratic term is essentially 1.



            • #7
              Clyde gave the good suggestion to look at a graph to learn more about the relationships. Assuming that age is expressed in years and that the relevant age interval is 20-60 years, a graph like this could be informative.

              Code:
              twoway (function y= 1.0 * 0.99^x * 1.0003^(x^2) , range(20 60)) ///
                 (function y= 1.5 * 0.99^x , range(20 60))
              Note the inclusion of the constants (1.0 and 1.5). With Stata 13, the constant is displayed in the output; in versions prior to that, you must use the postestimation command lincom after logistic to see it:
              Code:
              lincom _cons
              Also note that you should include more decimals in the coefficients; there is quite a bit of difference between 1.00026^3600 and 1.00034^3600.



              • #8
                3 points:
                a. age is provided in age bands (3 years each... don't know what they were thinking), so how would i tweak the code to accommodate that?
                b. i am dealing with a multinomial logit, but i guess i can look at the effect for each of the logits separately
                c. i am a bit intrigued by the example, as the introduction of the polynomial seems to completely change the direction of the relationship



                • #9
                  Re-run your model with and without the quadratic term. (Make the quadratic c.age#c.age, not age*age.) After each model, run -margins, at(age=(20(5)60))-, and then -marginsplot- to get a sense of what the two models are doing. There is some unclarity in the original post as to whether the linear coefficient of 0.99 and the quadratic coefficient of 1.0003 come from the same model, but if they do, you will see that the quadratic model does not produce a plateau within the meaningful age range. Rather, you will see a slightly curvilinear relationship between age and outcome probability. The turning point of this quadratic lies outside the adult range of your data: it occurs at about age 17. So if the purpose of the quadratic model was to look for a plateau, your data say it isn't there, and you can drop it.
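                  A sketch of that workflow in Stata, with hypothetical variable names (mismatch as the outcome, female as a stand-in for your other covariates, and outcome level 2 chosen arbitrarily):

                  ```stata
                  * linear specification
                  mlogit mismatch c.age i.female
                  margins, at(age=(20(5)60)) predict(outcome(2))
                  marginsplot

                  * quadratic specification; c.age##c.age expands to c.age and c.age#c.age
                  mlogit mismatch c.age##c.age i.female
                  margins, at(age=(20(5)60)) predict(outcome(2))
                  marginsplot
                  ```

                  Comparing the two marginsplot graphs shows directly how much (or how little) the quadratic term bends the predicted probabilities within the observed age range.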

                  With regard to three-year age groups, if you think about it, "exact" one-year age groups are themselves 365-day age groups, and one-day age groups are 24-hour age groups, etc. At some point the added precision becomes meaningless. In any case, even if one-year age precision would be more meaningful for your problem (and I imagine it might), you don't have it. You should probably attribute the middle of the three-year age range to each group. There are no other adjustments to make.
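                  For instance, if the bands happened to be coded 1, 2, 3, ... for ages 18-20, 21-23, 24-26, and so on (a purely hypothetical coding), the midpoints could be assigned as:

                  ```stata
                  * map each 3-year band code to the midpoint of its age range
                  * (band 1 -> 19, band 2 -> 22, band 3 -> 25, ...)
                  generate age_mid = 19 + 3*(ageband - 1)
                  ```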



                  • #10
                    thank you, clyde & svend! sorted all my doubts!



                    • #11
                      nevertheless, i do have one final, unrelated question. in a multinomial model with 3 outcomes, the following 2 scenarios are sometimes possible:
                      a. the linear specification is significant in both equations
                      b. when adding a squared term, both the main and squared terms are significant in one equation, while in the other neither is.
                      which of the two models should one choose?



                      • #12
                        The choice of the model should not depend solely on the statistical significance considerations you outline. Ideally, the specification of the model is determined by non-statistical science considerations. If there are no such considerations that can be brought to bear, then looking at graphical evidence of model fit (I recommend lowess plots), or comparing the abilities of the two models in terms of calibration and discrimination would be better ways of choosing a model specification.
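                        For the lowess suggestion, a minimal sketch with hypothetical names (mismatch with levels 1/2/3, age in years) would be:

                        ```stata
                        * smoothed proportion of one outcome level as a function of age;
                        * repeat for each level of the multinomial outcome
                        generate byte level2 = (mismatch == 2)
                        lowess level2 age
                        ```

                        This gives a direct visual sense of whether the relationship bends or flattens within the observed age range.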

                        Also, in using a multinomial model, you are implicitly stating that the form of the relationship between the outcome and the predictors is the same at all levels of the outcome. That may or may not be true, and modeling each level separately might be more sensible.

                        In the specific case of a linear vs. a quadratic model, there are a few special considerations.

                        1. If the purpose of a quadratic specification is to identify a relationship with a peak or trough, you need to calculate where the quadratic model says that peak or trough occurs: if it is far from the meaningful range of the variable, then the quadratic is meaningless even if it is highly "statistically significant," and it should not be used.

                        2. Bear in mind that depending on the range of the data, a linear and quadratic term in the same variable may be substantially correlated. Therefore, they can "share variance" in such a way that a test of the joint null hypothesis that both coefficients are zero can lead to strongly significant rejection even though you cannot reject the null hypothesis on either alone. In this situation, retention of both terms makes more sense.



                        • #13
                          1. by meaningful range of the data, are we talking about the range of the data in general or in the dataset? i'm asking as, in the case i have in mind, the peak occurs at the end of the range (the last value, actually) in the dataset, but in real life the variable could take larger values.
                          since the peak occurs at the end of "my" range, and since in the other equation both the main and quadratic terms are individually not significant, i'm leaning towards going for a linear specification.
                          2. you are basically saying that in the case in which one deals with multicollinearity between the linear and quadratic terms, one should not discard the quadratic term just because the joint null hypothesis that both coefficients are zero can be rejected. i'm however faced with a case in which, on their own, both the main and quadratic terms are not significant in the second equation, which i dare think is indicative of not needing that quadratic term.



                          • #14
                            1. Your model has no validity beyond the range of the data in the data set in which it was developed. Extrapolation is hazardous, particularly when the model is polynomial. If the peak occurs at the end of your range, that's nice, but it says nothing about what will happen beyond the end of your range. You can use the quadratic model in this data--but you should not attempt to apply it to other data where there are ages outside the range of your data.

                            2. You have misquoted what I've said, and I'm a little confused by this question. So let me try to say more clearly what I said in post #12.

                            2'. The linear and quadratic terms in your model may be substantially correlated. It is possible that -test linear quadratic- would yield a strongly significant result even though neither the linear nor quadratic term by itself is "statistically significant." In that situation, you should retain both the linear and quadratic terms even though neither one is "significant" on its own.
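                            In Stata syntax, that joint test might look like this (variable names assumed; in a multi-equation model such as mlogit, -test- evaluates the named coefficients across all equations):

                            ```stata
                            quietly mlogit mismatch c.age##c.age i.female
                            * joint Wald test that the linear and quadratic age coefficients are all zero
                            test age c.age#c.age
                            ```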

                            All of that said, let me emphasize once more that I think it is a bad idea to rely on significance tests and p-values in deciding whether to use a linear or quadratic model. Again, let me encourage you to make the decision based on graphical evidence or quantitative indicators of model fit.

                            Finally, I am disquieted now by the word "plateau" that you used in post #5. Quadratic functions do not reach a plateau. Rather, they reach a peak (or nadir) and then turn around. If a plateau happens to begin at the edge of your data, then a quadratic function that peaks (or reaches its nadir) there will look like a plateau and will be adequate, but if the peak or nadir occurs anywhere else, the model is grossly mis-specified for plateaus. If you are looking to detect a true plateau, other functional forms would be better, including linear splines or the logistic function. Also, in order to detect a true plateau, the range of your data must extend into the flat part of the relationship, not just cut off at the beginning of the plateau.



                            • #15
                              thank you, clyde! this is all extremely useful and gives me food for thought. going back to post #12, how do you get the measures of discrimination in stata (rank discrimination indexes)? i know how to do it in r, but it's a drag going back and forth.

