  • square term

    My square term is significant for the whole sample, but when I regress males and females separately it is insignificant for both. What does this mean?

  • #2
    This is an excellent example of one of the many good reasons why the American Statistical Association has recommended that the concept of statistical significance be abandoned. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.

    I assume that by "significant" you are referring to p < 0.05. A serious and widespread misunderstanding leads many people to read p-values as measures of effect size. While the size of an effect does have some influence on the p-value, the p-value is heavily influenced by sample size. It may well be that the coefficient of the square term is nearly the same in your male, female, and whole-sample regressions. The whole sample is necessarily larger than either the male or female sample, and it is guaranteed to be at least twice as large as one of them! So the difference in sample size is a likely cause of this confusion.
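
    To see how this plays out, here is a minimal simulation sketch (female, x, and y are invented names, not your variables) in which the curvature is exactly the same for both sexes. Because the pooled regression has twice as many observations, the standard error of the square term is roughly 1/sqrt(2) as large and its t statistic about 1.4 times larger, so with an effect of this size the pooled p-value will typically fall below 0.05 while neither subsample's does, even though all three coefficient estimates are essentially identical.

    Code:
    clear*
    set seed 12345
    set obs 1000
    gen byte female = _n > 500            // two equal-sized sexes (hypothetical data)
    gen x = rnormal()
    gen y = x + 0.05*x^2 + rnormal()      // identical modest curvature in both sexes
    
    regress y c.x##c.x if female == 0     // males only
    regress y c.x##c.x if female == 1     // females only: similar coefficient and t
    regress y c.x##c.x                    // pooled: same coefficient, larger t, smaller p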

    You don't state your reason for even looking at the "significance" of the square term in the first place, but if it is for the purpose of deciding whether or not it should be included in your model, the p-value is not a good basis for making that decision in any case. You should be considering instead whether or not the inclusion of the square term materially improves the fit of the model. It is also possible that different square-term coefficients are appropriate for males and females, so you might also want to consider a model in which the square term and the linear term of the same variable are interacted with the sex variable.
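
    If you do explore that, a minimal sketch along these lines (again with y, x, and female as placeholders for your outcome, predictor, and sex indicator) fits the fully interacted model; the -margins- plot, rather than any single p-value, is what tells you whether the estimated curvature differs between the sexes in a way that matters.

    Code:
    regress y i.female##c.x##c.x          // sex-specific linear and square terms
    lincom c.x#c.x                        // curvature for the base sex
    lincom c.x#c.x + 1.female#c.x#c.x     // curvature for the other sex
    margins female, at(x = (0(1)10))      // adjust the at() range to your data
    marginsplot                           // do the fitted curves differ materially by sex?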



    • #3
      Thanks a ton! (also a big fan)



      • #4
        Just for the record -- no economics journal that I'm aware of would publish a paper without reporting the statistical significance of key variables, whether in the form of a p-value, t statistic, or confidence interval. Are p-values abused? Sure. But we need some way of attaching precision to our estimates. Clyde, when you say "materially improving the fit," that sounds vague. How do you measure that? Presumably by a change in the R-squared. When we formalize that for a single explanatory variable, we essentially get a t statistic argument.
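
        For readers who want to see that formalization in action, here is a small sketch using Stata's shipped auto data (chosen only because everyone has it): for a single added regressor, the F statistic computed from the change in R-squared equals the square of that regressor's t statistic.

        Code:
        sysuse auto, clear
        regress price weight                  // restricted model
        scalar r2_r = e(r2)
        regress price weight mpg              // unrestricted model: adds one regressor
        scalar r2_ur = e(r2)
        * F for the single added regressor, built from the change in R-squared
        display "F from R-squared change = " (r2_ur-r2_r)/((1-r2_ur)/e(df_r))
        display "square of mpg's t stat  = " (_b[mpg]/_se[mpg])^2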

        Let me give an example. Suppose one obtains data from a randomized job training program. The point estimate is that it increases labor earnings by, say, 7.5% on average. Are you saying we should not include a standard error or confidence interval with that point estimate? If we do, that implies a p-value, so it's really the same thing. I agree that at least a confidence interval does not choose a null hypothesis, but it does choose a confidence level.

        What if I do programs in two different parts of the country? In one, the estimate is 7.5%, but I could only use n = 200 workers, so the standard error is 6%. Thus, the effect is statistically insignificant. How hard can I push the 7.5% estimate? What if in another part of the country the effect is only 4.0%, but I had n = 5,000 and the standard error is 1%? I know I have more faith in the 4% estimate. I can't simply choose 7.5% because it's bigger or because I "like it" more.
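
        For concreteness, the approximate 95% confidence intervals implied by these two scenarios (point estimate plus or minus roughly two standard errors) are:

        Code:
        display "n = 200 program:   " 7.5-2*6 "% to " 7.5+2*6 "%"
        display "n = 5,000 program: " 4.0-2*1 "% to " 4.0+2*1 "%"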

        Putting in things like quadratics complicates models. Using a p-value to decide whether to include such terms is, in my view, appropriate. That does not mean, especially in large samples, that it changes the conclusion much. If we don't use p-values for such things, how do I decide whether or not to include, say, x4? How much should the "fit" improve?

        I just want to be sure that young people know that it is still a very small share of journals in the social sciences that don't want to see statistical significance. In my view, it's important to have discussions of both the effect size and statistical significance, where we hope that there hasn't been too much p-hacking.



        • #5
          Jeff,

          I think you are setting up straw men here. I have always said that instead of looking at statistical significance or p-values we should report our estimates along with estimates of their uncertainty (which are usually standard errors or confidence intervals). That is absolutely mandatory. Does that imply a p-value? Yes, it does. And if somebody wants to report that, I suppose it does no harm, but it adds no useful information beyond what is contained in the confidence interval. And if you just report the p-value without the confidence interval or standard error, you actually lose the most important parts of the information. So the p-value is, by itself, an inadequate statistic and, at best, a harmless adjunct to a confidence interval (or standard error).

          What if I do programs in two different parts of the country? In one, the estimate is 7.5%, but I could only use n = 200 workers, so the standard error is 6%. Thus, the effect is statistically insignificant. How hard can I push the 7.5% estimate? What if in another part of the country the effect is only 4.0%, but I had n = 5,000 and the standard error is 1%? I know I have more faith in the 4% estimate. I can't simply choose 7.5% because it's bigger or because I "like it" more.

          Right, and that is entirely consistent with what I advocate. The confidence interval around the 7.5% estimate will be wide, running from about -4.5% to +19.5%. The 4% estimate comes with a confidence interval of about 2% to 6%: clearly a more precise estimate of the effect. Putting the two together, in fact, the entire CI of the 4% estimate falls within the CI of the 7.5% estimate (though the 7.5% point estimate does not fall within the CI of the 4% estimate), lending support to the view that the original estimate was somewhat defective due to its inadequate sample size: the larger sample enabled us to narrow down the uncertainty and get a better estimate. All without uttering the "s-word."

          Let me also be clear that what I am really leaning hard on here is not p-values. As above, I think they are not very helpful, but, left alone, they are harmless, at least in many situations. (And there are a few situations where I find them useful, but they don't come up often in practice.) My real gripe is with the classification of a result as "significant" or "not significant" based on applying some cutoff. The .05 cutoff is most commonly used, but I would object equally to any other. The problem is that doing this creates a fallacious aura of certainty that an effect "exists" or "does not exist," or that it is "real" or "not real." Granted, careful use of the term does not imply either of those things, but the term is so seldom used carefully that it almost doesn't matter.

          And what does it mean when used carefully? It means that the results are not likely to have been drawn from a random sample of a population in which the effect is zero. To which my answer would almost always be: so what??? That is seldom a relevant question.

          Indeed, suppose that I did another replication of your hypothetical study and came up with a central estimate of 6.5% and a 95% confidence interval of 0.1% to 12.9%. What are we to make of that? It is "statistically significant." Yet clearly we have very poor precision in our understanding of it. Moreover, if we are going to conclude that the data are consistent with the effect being as low as 0.1%, does it not make sense to ask, not whether 0.1% is zero (it isn't, of course), but whether a 0.1% effect is large enough to make any difference in the real world? So declaring the finding "statistically significant" and feeling we are done with it overlooks the fact that the data are truly consistent with trivial effects, or, on the other hand, that we may have underestimated a drastically larger effect. Unless the world is such that everyone would do the same things whether the true effect is 0.1% or 12.9%, the only fair conclusion from this study is that we haven't learned enough and need more or better data (or both).

          (If you don't feel this way about 0.1% as trivially small, think of 0.01% or 0.0001% if you like. The point is that unless there is something truly qualitatively distinctive about a 0 effect, there will be some positive epsilon where an effect of epsilon percent is too small to care about. Only if there is no such epsilon, and 0 is qualitatively different, does it make sense to care about whether or not the null hypothesis is rejected.)

          Concerning quadratic models, here is a situation where I think the p-values of the individual coefficients are of no help at all, and can be very misleading. They can glorify a minimal curvature, or they can fail to notice a parabola whose vertex is smack dab in the middle of the data. They just don't answer the right questions in this context. The change in R-squared, as you note, is a measure of the improvement in fit that arises from incorporating the quadratic term (and the interaction). There is an F-test for this, and though it is seldom done in practice, that F-test can be back-calculated to a test-based confidence interval on the change in R-squared itself. If we are going to resort to null hypothesis testing here, it is this F-test (which would be equivalent to a joint test of the linear, quadratic, and interaction terms) that we should look at: the tests of the individual coefficients are simply not helpful here.
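
          To make that equivalence concrete, here is a sketch using the auto data (price, weight, and foreign are just stand-ins): with the default non-robust standard errors, the joint Wald test of the added quadratic terms reproduces exactly the F statistic you would compute from the change in R-squared between the restricted and unrestricted models.

          Code:
          sysuse auto, clear
          regress price i.foreign##c.weight                    // restricted: group-specific linear fits
          scalar r2_r = e(r2)
          regress price i.foreign##c.weight##c.weight          // unrestricted: adds 2 quadratic terms
          scalar r2_ur = e(r2)
          test c.weight#c.weight 1.foreign#c.weight#c.weight   // joint Wald test of the added terms
          * the same F, recomputed directly from the change in R-squared (q = 2 added terms)
          display "F from R-squared change = " ((r2_ur-r2_r)/2)/((1-r2_ur)/e(df_r))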

          The harm that is done by resorting to declaring things "significant" or "not significant" is seen every day on this Forum. I respond every day to people expressing confusion about things like: how can my joint test of X1 and X2 be significant when neither X1 nor X2 alone is? Or: why is my effect of X significant in the combined sample, but not significant in either males or females separately? Or: why do I find a significant result in subset A and a non-significant result in subset B alone, but in a combined analysis with an interaction term, the interaction is not significant? They should be confused, because the predicate "significant," as commonly (mis)used and (mis)understood, simply doesn't obey the laws of ordinary logic. And when you tease it out in terms of the actual meaning of a statistically significant finding, a few moments' thought usually leads to the conclusion: OK, then, but why does anybody care about that?

          I just want to be sure that young people know that it is still a very small share of journals in the social sciences that don't want to see statistical significance.

          This is, unfortunately, true. Hopefully, over time, and if enough of us keep the pressure up, this will change. This is not the hill that people starting their careers should choose to die on before that happens. But I think it is important for more senior investigators to press assertively to improve research practices. We can afford to have our papers rejected, and we can push back at reviewers who haven't "gotten the memo."

          The misunderstanding that statistical significance causes is, I and many others believe, a major contributor to the widely discussed "reproducibility crisis" in the social and medical sciences. I think we should go back to basics: present findings with standard errors and confidence intervals. Use p-values sparingly, in those situations where they might actually add something meaningful to the discussion. And banish the s-word altogether.

          Note, by the way, that I do not bring up the issue of p-hacking. That's a separate issue. But even in its absence, "statistical significance" is a deeply problematic construct.





          • #6
            Clyde: I agree with much of what you say, but my response was largely triggered by your response to Mahnoor to focus on "fit of the model." I don't think one can fully make decisions about "fit" without resorting to a statistical test. Let me give you another scenario: suppose, as is often recommended on this site, I include squares and interactions of my key variables as a test for nonlinearity. I want to know whether this complication is justified. I can't look at, say, 10 different confidence intervals to easily come to a conclusion. But I can do a joint F (or Wald) test. If I get a large p-value, I'm pretty much done: I will not complicate my model. If I get a small p-value, I can then compute marginal effects to see how much of a difference it makes compared with the linear model. The p-value is quite useful in such cases because it provides a way of thinking about what the increase in R-squared means. That I might reject the null because of a large sample size when the nonlinearity is practically unimportant is, of course, always possible. But I can't know until I try it, right? If I put in several interactions and quadratics and get different partial effects from just the straight linear model, I want to know whether the difference is real or just due to sampling error.
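
            As a sketch of that workflow, using the auto data purely for illustration (price, weight, mpg, and foreign are not meant to stand for any particular application): one joint test of the added terms and, if it rejects, a comparison of average marginal effects against the linear benchmark.

            Code:
            sysuse auto, clear
            regress price c.weight c.mpg i.foreign                    // linear benchmark
            margins, dydx(weight mpg)                                 // average partial effects, linear model
            regress price c.weight##c.weight c.mpg##c.mpg c.weight#c.mpg i.foreign
            test c.weight#c.weight c.mpg#c.mpg c.weight#c.mpg         // one joint test of the added terms
            margins, dydx(weight mpg)                                 // do the effects change materially?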

            Overall, we are mostly in agreement. I dislike seeing asterisks attached to coefficients based on their statistical significance. I much prefer coefficients and standard errors, in which case a CI is easy to obtain. But I think the p-value can add additional information, especially if we are testing a simple model against a more complicated one. But I want to see the p-value, not just know whether it was above or below 0.05.



            • #7
              Well, Jeff, I concede that when you are judging complex additions to the model, ones that cannot be simply captured by a single statistic around which one can fit a confidence interval, then the use of a null hypothesis significance test may be the best we can do. But to leave it at that can result in failure to actually understand what is going on in the model and the data. Misleading conclusions can be drawn. Consider the following two situations:

              Situation 1:
              Code:
              clear*
              
              set obs 10000
              gen  x = 5-_n/1000
              gen ystar = invlogit(x)
              set seed 1234
              gen y = rbinomial(1, ystar)
              
              logit y x
              predict phat_logit
              label var phat_logit "Logistic Predictions"
              estat gof, group(10)
              
              probit y x
              predict phat_probit
              estat gof, group(10) table
              label var phat_probit "Probit Predictions"
              
              graph twoway line phat_* x, sort name(phat_v_x, replace)
              graph twoway scatter phat_probit phat_logit, name(phats_compared, replace) ///
                  msize(vsmall) || line phat_logit phat_logit
              Now, the true model here is the logistic one: that's how the data were generated. And the Hosmer-Lemeshow test statistic rejects the -probit- model strongly, while not rejecting the logistic. If the research question is whether the logit model or the probit model is the correct one, then this is a triumph of the test. But that is seldom a question of interest. Usually we are interested in knowing whether the model is a good fit to the data. The graphs suggest that both models do very well and hardly differ from each other. And the table of observed and expected values for the probit model confirms that the deviations are really pretty trivial. For almost any practical purpose, the probit model would be quite useful.

              I chose to demonstrate this point with logit-vs-probit because it is easy to generate illustrative data. But in real-world research we seldom know the actual parametric form of the real data generating process. In fact, the real data generating process might not even be expressible parametrically, and it might not be clear what alternative parametric models might be best suited to the data.

              So should we reject the probit model here? I would say no. But the hypothesis testing approach would say yes. I suppose if the test encouraged us to consider other parametric models, and we were successful in finding one, that would be a good thing. But how often will that be the result? And is the trivial improvement in fit worth the effort? Sometimes it will be, but often not.

              Situation 2:
              Code:
              clear*
              set seed 1234
              
              set obs 200
              gen byte group = mod(_n, 2)
              by group, sort: gen x = _n/10
              
              gen y = 2*x + 1 + rnormal(0, 2.5) if group == 1
              replace y = .2*x^2 - x + 1 + rnormal() if group == 0
              
              regress y i.group##c.x##c.x
              lincom c.x // LINEAR COEFFICIENT IN GROUP 0
              lincom c.x#c.x // QUADRATIC COEFFICIENT IN GROUP 0
              lincom c.x + 1.group#c.x // LINEAR COEFFICIENT IN GROUP 1
              lincom c.x#c.x + 1.group#c.x#c.x // QUADRATIC COEFFICIENT IN GROUP 1
              
              test c.x#c.x 1.group#c.x#c.x // TEST QUADRATIC VS PURE LINEAR MODEL
              
              // VERTEX LOCATION IN GROUP 0
              nlcom -_b[c.x]/(2*_b[c.x#c.x])
              
              //  VERTEX LOCATION IN GROUP 1
              nlcom -(_b[c.x] + _b[1.group#c.x])/(2*(_b[c.x#c.x] + _b[1.group#c.x#c.x]))
              
              margins group, at(x = (0(.2)5))
              marginsplot
              This is a situation similar to the one outlined in #1 of this thread. A joint test of the linear, quadratic, and group terms and their interactions (which, in this model, is equivalent to the F test of the model as a whole) gives p < 0.00005. And a Wald test of just the quadratic term and its interaction with group also gives p < 0.00005. So the tests are successful in that they tell us these terms improve on a purely linear model.


              But if we stop there, because we have now found "significance," we miss out on a very important fact about these data. The graph plainly shows that the two groups have markedly different patterns in the data and that the pattern for group 1 is really, for practical purposes, linear, whereas in group 0 there is a bona fide quadratic. And the final calculations showing the locations of the parabolic vertices make it clear that the location of the "turning point" for group 1 is, by our best estimate, far to the right of the data, and, by its confidence interval, almost entirely unknown. It could be pretty much anywhere, including places far outside the range of the data on either side! So this additional analysis gives a lot of insight. Yes, you can also see the "non-quadraticness" of group 1 by looking at the estimates of the linear and quadratic effects in each group, shown in the -lincom- output. But it's still true that you can get that from the estimates and confidence intervals, without even looking at the p-values. And again, it's unclear what the p-values add to that.

              My guess is that you agree with what I'm saying in this post. And my guess is that in your own work, you would not miss these points. But how many people who do data analysis, and do it in contexts where others rely on their results and interpretations, have your level of expertise and sophistication? Rather few, I think. In fact, these things are often being done by people with minimal training in statistics, and that training is often under the tutelage of people who have limited understanding themselves. The message supporting the use of statistical significance as a construct has been spread far and wide for at least a century. It has been falsely taught as the end-all and be-all of statistical analysis. I'm not worried about the ability of the proponents of statistical significance to get their message out and have it heard. At this point, the imbalance is far in the other direction, and I'm doing my part to redress it.



              • #8
                Clyde: It seems like your complaint is generally with poor training in statistics and econometrics. Your first example illustrates the difference between practical and statistical significance, something I emphasize at all levels of statistics and econometrics. As you say, there's nothing wrong with the outcome of the test -- it chooses the correct model. Frankly, I've never even looked up the Hosmer-Lemeshow test statistic because I'd essentially never formally test probit versus logit. (I wonder if one can find a paper in an economics journal that applies this test.) It's pretty well known that they often fit similarly, and especially often give very close average marginal effects. But I wonder what harm is done here: one does both, finds that they're similar, reports that, and, maybe as a curiosity, reports that the probit model is rejected and the logit is not. Of course, one should not fixate on the latter outcome. The similar fit and resulting marginal effects are most interesting. I am definitely in agreement with your conclusion that the H-L test is essentially a waste of time. But that's because of the particulars, not because of something inherently wrong with significance testing.
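
                On the point about very close average marginal effects, the comparison can be tacked directly onto the Situation 1 code above (a sketch; it assumes Clyde's data-generation lines have just been rerun):

                Code:
                logit y x
                margins, dydx(x)       // average marginal effect of x, logit
                probit y x
                margins, dydx(x)       // nearly the same average marginal effect, probit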

                If your second example was intended to be a comment on my suggestion about using significance testing to find nonlinearities, it doesn't quite do it. If I have two specific groups in mind, then I obviously will want to study them separately. I think you've done a bit of straw-man setting up here, too, by putting the quadratic and its interaction with the group into the same test. You're showing that one has to be thoughtful when applying specification tests. On that we surely agree.



                • #9
                  Since you both agree on most of this, let's just end it here.



                  • #10
                    Yes, sorry for diverting this thread into a long discussion of a theoretical issue that is fairly contentious in the statistical community.

