

  • Issues interpreting the statistical significance of models with a variable and models with the quadratic form of the same variable

    Hello everyone,

    I have a project in which I am trying to understand how the probability of someone becoming an entrepreneur in an industry is related to a series of variables. To do so, I am using a fixed-effects regression, and have constructed a few models to help interpret the results.

    One of the variables which I am interested in analyzing is the median age of the industries. I have models that include the median age and a collection of other variables, and models that include the median age and its squared term, and the same collection of other variables.

    The code itself is as follows:

    xtreg change_to_employer age_median log_nemp_median gender numb_firms higher_education i.year high_tech low_tech KIS Other, fe cluster(caem2)


    xtreg change_to_employer c.age_median##c.age_median log_nemp_median gender numb_firms higher_education i.year high_tech low_tech KIS Other , fe cluster(caem2)

    Model 1:
    xtreg change_to_empregador age_median   log_nemp_median gender numb_firms_div1000 higher_education vn_per_employee_median i.year high_tech low_tech KIS Other, fe cluster(caem2)

    Model 2:

    xtreg change_to_empregador c.age_median##c.age_median   log_nemp_median gender  numb_firms_div1000  higher_education vn_per_employee_median i.year high_tech low_tech KIS Other, fe cluster(caem2)

    What I am having trouble interpreting is that the coefficient for age_median in Model 1 is not significant, whereas the coefficients for age_median and c.age_median#c.age_median are both significant in Model 2, as the output for the two models shows:


    Fixed-effects (within) regression               Number of obs     =        889
    Group variable: caem2                           Number of groups  =         77
    R-sq:                                           Obs per group:
         within  = 0.1832                                         min =          3
         between = 0.0859                                         avg =       11.5
         overall = 0.0853                                         max =         12
                                                    F(24,76)          =       5.26
    corr(u_i, Xb)  = -0.2660                        Prob > F          =     0.0000
                                                 (Std. Err. adjusted for 77 clusters in caem2)
                             |               Robust
    Change_to_empregador_f~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  age_median |   -.001125   .0028954    -0.39   0.699    -.0068918    .0046417
             log_nemp_median |  -.0015375   .0168405    -0.09   0.927    -.0350782    .0320033
                      gender |   .0010117   .0016409     0.62   0.539    -.0022565    .0042798
          numb_firms_div1000 |  -.0086784   .0050413    -1.72   0.089     -.018719    .0013621
            higher_education |   .0009512   .0012174     0.78   0.437    -.0014735    .0033758
      vn_per_employee_median |  -.0005548    .000214    -2.59   0.011     -.000981   -.0001286


    Fixed-effects (within) regression               Number of obs     =        889
    Group variable: caem2                           Number of groups  =         77
    R-sq:                                           Obs per group:
         within  = 0.3109                                         min =          3
         between = 0.1165                                         avg =       11.5
         overall = 0.1408                                         max =         12
                                                    F(27,76)          =       7.08
    corr(u_i, Xb)  = -0.1859                        Prob > F          =     0.0000
                                                                              (Std. Err. adjusted for 77 clusters in caem2)
                                                          |               Robust
                                     change_to_empregador |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                                               age_median |    .136472   .0677635     2.01   0.048     .0015094    .2714347
                                c.age_median#c.age_median |   -.001687   .0008374    -2.01   0.047    -.0033548   -.0000191
                                          log_nemp_median |  -.0891614   .0408428    -2.18   0.032    -.1705069   -.0078159
                                                   gender |  -.0019755   .0036136    -0.55   0.586    -.0091726    .0052216
                                       numb_firms_div1000 |  -.0235788   .0107627    -2.19   0.032    -.0450145    -.002143
                                         higher_education |   .0012232   .0029379     0.42   0.678    -.0046282    .0070745
                                   vn_per_employee_median |  -.0013087   .0004205    -3.11   0.003    -.0021462   -.0004713
                                                     year |

    How is it possible that a variable is not significant by itself, but becomes significant when it is included together with its quadratic term? Can I then say that the median age of the industries has a significant impact on the probability of transition into entrepreneurship?

    Thank you very much,

    Last edited by Rui Agostinho; 25 Feb 2022, 12:30.

  • #2
    Different models give back different results: no wonder about that.
    As you did not follow the FAQ (which recommends sharing what you typed and what Stata gave you back via CODE delimiters), interested listers can only give general advice.
    In your case, going by your description (and with the caveat that words are not numbers), -age_median- has a quadratic relationship with the regressand.
    Kind regards,
    (StataNow 18.5)


    • #3

      Thank you very much for your reply. I have applied the changes you proposed to my question. Hopefully it is clearer now.

      My issue still remains though: what is the interpretation of the coefficient of age_median in Model 1 not being significant, but then, when looking at Model 2, both age_median and its quadratic term being significant? Is it as simple as age not having a linear relationship with the regressand, but having a quadratic relationship instead?

      Thank you,


      • #4
        1) the second regression code gives a higher within R-sq; hence it is better specified than the first one;
        2) in regression code #2 other coefficients change in terms of statistical significance, too;
        3) in regression code #2 both the linear and squared terms for -age_median- are only barely statistically significant.
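        To make the role of the two terms concrete, here is a sketch of what the Model 2 output in #1 implies, holding the other regressors fixed. Only the two reported coefficients on age_median are used; the symbols \hat\beta_1, \hat\beta_2 and a^* are notation introduced here, not something taken from the output.

```latex
\begin{align*}
\widehat{y} &= \dots + \hat\beta_1\,\text{age\_median} + \hat\beta_2\,\text{age\_median}^2,
  \qquad \hat\beta_1 = 0.136472,\quad \hat\beta_2 = -0.001687 \\
\frac{\partial \widehat{y}}{\partial\,\text{age\_median}}
  &= \hat\beta_1 + 2\hat\beta_2\,\text{age\_median}
   = 0.136472 - 0.003374\,\text{age\_median} \\
a^{*} &= -\frac{\hat\beta_1}{2\hat\beta_2} = \frac{0.136472}{0.003374} \approx 40.4
\end{align*}
```

        Because \hat\beta_2 is negative, this is an inverted U: the fitted outcome rises with median age up to roughly 40.4 and falls afterwards. Whether that turning point lies inside the observed range of age_median is not shown in the thread and would need checking before reading much into it.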
        Kind regards,
        (StataNow 18.5)


        • #5
          First, even for people who take the concept of statistical significance seriously (in fact, especially for them) it is important to bear in mind that the difference between statistically significant and not statistically significant is, itself, not statistically significant, nor even meaningful in any way. You should never draw any conclusions from one thing being statistically significant and another not.
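          One way to act on this, rather than comparing p-values across the two models, is to test the linear and squared terms jointly and then look at the implied marginal effect. A minimal sketch, assuming the variable names of Model 2 in #1 and that it is run from a do-file (the /// line continuation does not work interactively); the grid of ages passed to -margins- is a hypothetical choice and should be replaced with values inside the observed range of age_median:

```stata
* Model 2 from #1, re-run so that its estimates are in memory
xtreg change_to_empregador c.age_median##c.age_median log_nemp_median gender ///
    numb_firms_div1000 higher_education vn_per_employee_median i.year ///
    high_tech low_tech KIS Other, fe cluster(caem2)

* Joint test that age_median matters at all (linear and squared terms together)
test age_median c.age_median#c.age_median

* Turning point of the quadratic, with a delta-method standard error
nlcom -_b[age_median] / (2*_b[c.age_median#c.age_median])

* Marginal effect of age_median at a few illustrative ages (hypothetical grid)
margins, dydx(age_median) at(age_median = (30(5)50))
```

          The joint test addresses whether median age is related to the outcome at all, which neither individual t-test in Model 2 answers on its own.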

          That said, the quadratic relationship you are talking about is precisely the kind of situation where what you describe can and should happen. Run this code:
          set obs 101
          set seed 1234
          gen x = _n-1
          gen y = 2*(_n-50)^2 + rnormal()
          graph twoway scatter y x || lfit y x
          regress y x
          regress y c.x##c.x
          Look at the graph before you read the regression outputs. You can see that, due to the U-shaped relationship between y and x, the best-fitting straight line is more or less horizontal. Correspondingly, in the regression without the quadratic term, the coefficient of x is nearly zero relative to the scale of y. By contrast, the quadratic regression effectively captures the U-shaped relationship. The coefficient of the quadratic term is a measure of the width (or narrowness) of the U-shape, and the linear term is basically a (scaled) indication of where the vertex of the parabola lies. Since the parabola is fairly steep, it has a relatively large quadratic coefficient (relative to the scaling of x^2, which ranges from 0 to 10000).
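          The statement about the vertex can be checked directly on the simulated data. The following is a sketch that repeats the data-generating lines from the code in this post and then adds three post-estimation commands; the choice of evaluation points in -margins- is arbitrary:

```stata
* Same simulated data as in the code earlier in this post
clear
set obs 101
set seed 1234
gen x = _n-1
gen y = 2*(_n-50)^2 + rnormal()

regress y c.x##c.x

* Joint test of the linear and squared terms
test x c.x#c.x

* Vertex of the fitted parabola. Since _n = x + 1, the data are generated as
* y = 2*(x-49)^2 + noise, so the estimate should come out close to 49
nlcom -_b[x] / (2*_b[c.x#c.x])

* Slope of the fitted curve at several values of x: negative to the left of
* the vertex, positive to the right of it
margins, dydx(x) at(x = (0(25)100))
```

          The slope changes sign across the range of x, which is why a single straight line averages out to something nearly flat.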

          Added: crossed with #4.

