  • Interaction term logit regression doesn't make sense.

    Hi all,

    I have a problem with using an interaction term in my logit regression on the likelihood of merging with another company. My dependent variable is status, which is binary and equals 1 if the firm merges with a company. My main independent variable is overconfident_fraction_2, which is a continuous variable that measures the fraction of overconfident board members. I interact it with log_average_funds, which is the log of average funds in the market.

    When I add the interaction term, the coefficients and p-values of my variables change (which I understand). What I don't understand is why the coefficient no longer makes sense once the interaction term is included. A logit coefficient of > 18 is very strange. I think it is because there is very low variation in my main independent variable (many zeros). Another reason could be that my code is not correct.

    Results without interaction term:
    Code:
    . logit status overconfident_fraction_2 experience_board experience_ma log_ipo_proceeds extension_10 board_size age_average female_fracti
    > on independent_fraction sp500_return, vce(robust)
    
    Iteration 0:  Log pseudolikelihood = -356.75232  
    Iteration 1:  Log pseudolikelihood = -297.41706  
    Iteration 2:  Log pseudolikelihood = -297.06963  
    Iteration 3:  Log pseudolikelihood = -297.06916  
    Iteration 4:  Log pseudolikelihood = -297.06916  
    
    Logistic regression                                     Number of obs =    515
                                                            Wald chi2(10) =  99.13
                                                            Prob > chi2   = 0.0000
    Log pseudolikelihood = -297.06916                       Pseudo R2     = 0.1673
    
    ------------------------------------------------------------------------------------------
                             |               Robust
                      status | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------------------+----------------------------------------------------------------
    overconfident_fraction_2 |  -.4519196   .5221312    -0.87   0.387    -1.475278    .5714388
            experience_board |  -.0783938   .0594812    -1.32   0.188    -.1949748    .0381873
               experience_ma |  -.3820718   .3776104    -1.01   0.312    -1.122175     .358031
            log_ipo_proceeds |  -.4250393   .1943061    -2.19   0.029    -.8058723   -.0442064
                extension_10 |  -1.829247   .2138418    -8.55   0.000    -2.248369   -1.410125
                  board_size |  -.0094118   .0626586    -0.15   0.881    -.1322204    .1133967
                 age_average |   .0244761   .0175046     1.40   0.162    -.0098322    .0587844
             female_fraction |  -1.965913   .6856968    -2.87   0.004    -3.309854   -.6219715
        independent_fraction |  -.8229314   .8706645    -0.95   0.345    -2.529402    .8835396
                sp500_return |  -4.699021   1.935518    -2.43   0.015    -8.492567   -.9054748
                       _cons |   8.946737   3.930293     2.28   0.023     1.243504    16.64997
    ------------------------------------------------------------------------------------------
    Results with interaction term:
    Code:
    . logit status c.overconfident_fraction_2##c.log_average_funds experience_board experience_ma log_ipo_proceeds extension_10 board_size ag
    > e_average female_fraction independent_fraction sp500_return, vce(robust)
    
    Iteration 0:  Log pseudolikelihood = -356.75232  
    Iteration 1:  Log pseudolikelihood = -244.78594  
    Iteration 2:  Log pseudolikelihood = -231.96174  
    Iteration 3:  Log pseudolikelihood = -229.53523  
    Iteration 4:  Log pseudolikelihood =  -229.4964  
    Iteration 5:  Log pseudolikelihood = -229.49638  
    
    Logistic regression                                     Number of obs =    515
                                                            Wald chi2(12) = 113.33
                                                            Prob > chi2   = 0.0000
    Log pseudolikelihood = -229.49638                       Pseudo R2     = 0.3567
    
    ----------------------------------------------------------------------------------------------------------------
                                                   |               Robust
                                            status | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -----------------------------------------------+----------------------------------------------------------------
                          overconfident_fraction_2 |   18.59992    44.3388     0.42   0.675    -68.30253    105.5024
                                 log_average_funds |  -4.536145   .7628865    -5.95   0.000    -6.031375   -3.040915
                                                   |
    c.overconfident_fraction_2#c.log_average_funds |  -1.601408   3.735983    -0.43   0.668    -8.923799    5.720983
                                                   |
                                  experience_board |  -.0402631   .0675773    -0.60   0.551    -.1727122     .092186
                                     experience_ma |  -.8213843   .4725425    -1.74   0.082    -1.747551     .104782
                                  log_ipo_proceeds |  -.7607499   .2479405    -3.07   0.002    -1.246704   -.2747955
                                      extension_10 |   -1.67068   .2538525    -6.58   0.000    -2.168222   -1.173139
                                        board_size |   .0048564   .0732874     0.07   0.947    -.1387842     .148497
                                       age_average |  -.0184036   .0199891    -0.92   0.357    -.0575815    .0207742
                                   female_fraction |  -1.556685   .7843938    -1.98   0.047    -3.094069   -.0193016
                              independent_fraction |  -.7134018   .9685569    -0.74   0.461    -2.611738    1.184935
                                      sp500_return |   -3.38591   2.506107    -1.35   0.177     -8.29779    1.525969
                                             _cons |   70.77365   10.37074     6.82   0.000     50.44737    91.09993
    ----------------------------------------------------------------------------------------------------------------
    Does someone know if the regression input is correct? I tried a lot, but it doesn't make sense to me why I get these results. Thanks in advance!

    Frans.

  • #2
    I think we can agree that you have a problem. I'd guess that the problem is called over-fitting or multicollinearity. The coefficient indeed looks large, but its very large standard error is also a clue.



    • #3
      Thanks for your reaction, Nick. Do you maybe know how to address this problem? With my other independent variables I estimated the VIFs and indeed for the interaction term they are very high:

      (example for other independent variable with OLS regression)
      Code:
      . estat vif
      
          Variable |       VIF       1/VIF  
      -------------+----------------------
      overconfid~2 |    258.70    0.003865
      log_averag~s |      1.88    0.533119
                c. |
      overconfid~2#|
                c. |
      log_averag~s |    258.54    0.003868
      experience~d |      1.44    0.695820
      experience~a |      1.18    0.844960
      log_ipo_pr~s |      1.38    0.725980
        board_size |      1.16    0.865770
       age_average |      1.21    0.826020
      female_fra~n |      1.10    0.911595
      independen~n |      1.25    0.801812
      sp500_return |      1.07    0.934720
      -------------+----------------------
          Mean VIF |     48.08
      I was under the impression that multicollinearity isn't a problem with interaction terms. However, leaving the interaction term out gives more intuitive results...
      Last edited by Frans Vinken; 11 Jan 2025, 10:13.



      • #4
        I'd look at the correlation matrix of the predictors and a scatter plot of

        Code:
         
         scatter overconfident_fraction_2 log_average_funds



        • #5

          As I already mentioned, a significant number of points are clustered near the bottom of the y-axis because these boards do not exhibit overconfidence in my study.

          Code:
          . corr overconfident_fraction_2 log_average_funds
          (obs=515)
          
                       | overco~2 log_av~s
          -------------+------------------
          overconfid~2 |   1.0000
          log_averag~s |   0.0194   1.0000
          [Image: Schermafbeelding 2025-01-11 om 17.34.01.png, scatter plot of overconfident_fraction_2 against log_average_funds]



          • #6
            Thanks; that doesn't make it obvious to me what is going on. Others with more experience of this kind of modelling may be able to take this much further.



            • #7
              While this may not be the entire story, I think the graph in #5 has some relevant information. The range of log_average_funds is from about 8.5 to 12. Bear in mind that in the interaction model the coefficient of overconfident_fraction_2 represents the effect of overconfident_fraction_2 on the log odds of status when log_average_funds = 0. Zero is clearly very far outside the range of observed values of log_average_funds, and might even be, in principle, impossible under any circumstances. As such, the coefficient is your model's best estimate of an effect about which the data provide almost no information and which, perhaps, is altogether a flight of fantasy. Either way, the coefficient of overconfident_fraction_2 in that model has to be understood as an extreme extrapolation from the data. This is reflected both in its extreme value and in its very large standard error.

              I think you will get more meaningful results if you center the overconfident_fraction_2 and log_average_funds variables.
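
              An alternative to centering is to ask Stata directly for the effect at a value that actually occurs in the data. The lines below are a minimal sketch, assuming the uncentered interaction model from #1 has just been fitted; the value 11 and the grid 9 to 12 are illustrative choices inside the range seen in the graph in #5, not values taken from the data.

              ```stata
              * Assumes the uncentered -logit- with the interaction has just been run.
              * Effect of overconfident_fraction_2 on the log odds when log_average_funds = 11
              * (11 is an illustrative in-range value, not a statistic from the data).
              lincom overconfident_fraction_2 + 11*c.overconfident_fraction_2#c.log_average_funds

              * The same kind of quantity at several in-range values, on the linear-predictor scale.
              margins, dydx(overconfident_fraction_2) at(log_average_funds = (9(1)12)) predict(xb)
              ```

              Either command should give a far smaller standard error than the raw coefficient, because it no longer extrapolates to log_average_funds = 0.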



              • #8
                Thank you. I centered both my variables:
                Code:
                summ overconfident_fraction_2
                generate overconfident_centered = overconfident_fraction_2 - r(mean)
                
                summ log_average_funds
                generate log_average_funds_centered = log_average_funds - r(mean)
                Now I get these results. The coefficient seems better, but the SE is still very high. And what exactly is the rationale for also centering the overconfident variable?

                Code:
                . logit status c.overconfident_centered##c.log_average_funds_centered experience_board experience_ma log
                > _ipo_proceeds extension_10 board_size age_average female_fraction independent_fraction sp500_return, v
                > ce(robust)
                
                Iteration 0:  Log pseudolikelihood = -356.75232  
                Iteration 1:  Log pseudolikelihood = -244.78594  
                Iteration 2:  Log pseudolikelihood = -231.96174  
                Iteration 3:  Log pseudolikelihood = -229.53523  
                Iteration 4:  Log pseudolikelihood =  -229.4964  
                Iteration 5:  Log pseudolikelihood = -229.49638  
                
                Logistic regression                                     Number of obs =    515
                                                                        Wald chi2(12) = 113.33
                                                                        Prob > chi2   = 0.0000
                Log pseudolikelihood = -229.49638                       Pseudo R2     = 0.3567
                
                ----------------------------------------------------------------------------------------------
                                             |               Robust
                                      status | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
                -----------------------------+----------------------------------------------------------------
                      overconfident_centered |    .093802   1.294292     0.07   0.942    -2.442964    2.630568
                  log_average_funds_centered |  -4.863122   .6819454    -7.13   0.000    -6.199711   -3.526534
                                             |
                    c.overconfident_centered#|
                c.log_average_funds_centered |  -1.601408   3.735983    -0.43   0.668    -8.923799    5.720983
                                             |
                            experience_board |  -.0402631   .0675773    -0.60   0.551    -.1727122     .092186
                               experience_ma |  -.8213843   .4725425    -1.74   0.082    -1.747551     .104782
                            log_ipo_proceeds |  -.7607499   .2479405    -3.07   0.002    -1.246704   -.2747955
                                extension_10 |   -1.67068   .2538525    -6.58   0.000    -2.168222   -1.173139
                                  board_size |   .0048564   .0732874     0.07   0.947    -.1387842     .148497
                                 age_average |  -.0184036   .0199891    -0.92   0.357    -.0575815    .0207742
                             female_fraction |  -1.556685   .7843938    -1.98   0.047    -3.094069   -.0193016
                        independent_fraction |  -.7134018   .9685569    -0.74   0.461    -2.611738    1.184935
                                sp500_return |   -3.38591   2.506107    -1.35   0.177     -8.29779    1.525969
                                       _cons |   18.37241   4.995255     3.68   0.000     8.581889    28.16293
                ----------------------------------------------------------------------------------------------



                • #9
                  The case for centering the overconfident_fraction_2 variable is not compelling; I would say you can probably get useful results with or without it. The main thing to bear in mind is that centering or not centering changes the meaning of the coefficients (as well as their values).
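
                  To make that change in meaning concrete, here is a sketch of the algebra linking the uncentered table in #1 to the centered table in #8. The symbols are notation introduced here for illustration: x is overconfident_fraction_2, z is log_average_funds, and the bars denote their sample means. The implied means at the end are back-calculated from the two tables and are not reported anywhere in this thread.

                  ```latex
                  \begin{aligned}
                  \text{uncentered:}\quad \eta &= \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \dots \\
                  \text{centered:}\quad \eta &= \beta_0' + \beta_1'(x-\bar{x}) + \beta_2'(z-\bar{z}) + \beta_3 (x-\bar{x})(z-\bar{z}) + \dots \\
                  \beta_1' &= \beta_1 + \beta_3\bar{z}, \qquad \beta_2' = \beta_2 + \beta_3\bar{x} \\
                  0.093802 &= 18.59992 - 1.601408\,\bar{z} \;\Rightarrow\; \bar{z} \approx 11.56 \\
                  -4.863122 &= -4.536145 - 1.601408\,\bar{x} \;\Rightarrow\; \bar{x} \approx 0.204
                  \end{aligned}
                  ```

                  The interaction coefficient, the control coefficients and the log pseudolikelihood are identical in the two tables, so the two models are the same fit expressed in different coordinates.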

                  I would refrain from judging whether a coefficient in a logistic model is too large, or its standard error too large. These judgments are very difficult to make because the magnitude of a logistic coefficient depends on the scale of the explanatory variable and also the range of the outcome variable. Even very experienced statisticians find these judgments difficult to make. Instead, you are better off using the -margins- and -marginsplot- commands to see the actual expected values of the outcome at relevant values of the explanatory variables. Such a graph is far more enlightening than the logistic regression table.

                  Code:
                  * Run after the uncentered model: the at() values are on the original scales of both variables.
                  margins, at(overconfident_fraction_2 = (0(0.2)1) log_average_funds = (8.5(0.5)12))
                  marginsplot, xdimension(log_average_funds)



                  • #10
                    I think we are getting somewhere. The -margins- and -marginsplot- commands are indeed helpful. If I understand it correctly, in the end it doesn't matter whether I center the variables or not?

                    Code:
                    . logit status c.overconfident_fraction_2##c.log_average_funds experience_board experience_ma log_ipo_pr
                    > oceeds extension_10 board_size age_average female_fraction independent_fraction sp500_return, vce(robu
                    > st)
                    
                    Iteration 0:  Log pseudolikelihood = -356.75232  
                    Iteration 1:  Log pseudolikelihood = -244.78594  
                    Iteration 2:  Log pseudolikelihood = -231.96174  
                    Iteration 3:  Log pseudolikelihood = -229.53523  
                    Iteration 4:  Log pseudolikelihood =  -229.4964  
                    Iteration 5:  Log pseudolikelihood = -229.49638  
                    
                    Logistic regression                                     Number of obs =    515
                                                                            Wald chi2(12) = 113.33
                                                                            Prob > chi2   = 0.0000
                    Log pseudolikelihood = -229.49638                       Pseudo R2     = 0.3567
                    
                    ---------------------------------------------------------------------------------------------
                                                |               Robust
                                         status | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
                    ----------------------------+----------------------------------------------------------------
                       overconfident_fraction_2 |   18.59992    44.3388     0.42   0.675    -68.30253    105.5024
                              log_average_funds |  -4.536145   .7628865    -5.95   0.000    -6.031375   -3.040915
                                                |
                     c.overconfident_fraction_2#|
                            c.log_average_funds |  -1.601408   3.735983    -0.43   0.668    -8.923799    5.720983
                                                |
                               experience_board |  -.0402631   .0675773    -0.60   0.551    -.1727122     .092186
                                  experience_ma |  -.8213843   .4725425    -1.74   0.082    -1.747551     .104782
                               log_ipo_proceeds |  -.7607499   .2479405    -3.07   0.002    -1.246704   -.2747955
                                   extension_10 |   -1.67068   .2538525    -6.58   0.000    -2.168222   -1.173139
                                     board_size |   .0048564   .0732874     0.07   0.947    -.1387842     .148497
                                    age_average |  -.0184036   .0199891    -0.92   0.357    -.0575815    .0207742
                                female_fraction |  -1.556685   .7843938    -1.98   0.047    -3.094069   -.0193016
                           independent_fraction |  -.7134018   .9685569    -0.74   0.461    -2.611738    1.184935
                                   sp500_return |   -3.38591   2.506107    -1.35   0.177     -8.29779    1.525969
                                          _cons |   70.77365   10.37074     6.82   0.000     50.44737    91.09993
                    ---------------------------------------------------------------------------------------------
                    And the -marginsplot- gives this output.
                    [Image: Schermafbeelding 2025-01-11 om 20.09.17.png, -marginsplot- output for the logit model]



                    How do I interpret the results of my main independent variable and interaction term, in combination with the rest of the output (control variables)? Indeed it is difficult to make these judgements.

                    (edit: the interpretation of the other (control) variables doesn't change when including an interaction term right?)
                    Last edited by Frans Vinken; 11 Jan 2025, 13:29.



                    • #11
                      So, several things are apparent from the graphs.
                      1. The probability of status = 1 is essentially 1.0 for log average funds <= 10.5 regardless of the value of overconfident_fraction. (This also explains why the coefficient for overconfident_fraction_2 was so large in the uncentered interaction model, where that coefficient represented the effect on the log odds conditional on log average funds = 0.)
                      2. For log average funds between 10.5 and about 11.6 we see that higher overconfident_fraction values are associated with slightly higher probability of status = 1. But for log average funds > 11.6, the reverse is true: higher overconfident_fraction values are associated with slightly lower probability of status = 1.
                      3. Nevertheless, the magnitudes of the differences in status probability associated with these differences in overconfident_fraction are pretty small, both in absolute terms and relative to the confidence intervals around the individual point estimates. The largest absolute difference we see is at the far right end of the graphs, where an overconfident_fraction value of 1 is associated with about a 0.2 probability (rough visual estimate: better to look at the margins output for a more precise sense of this) of status = 1, whereas with overconfident_fraction = 0, the status probability is about 0.25 (again, rough visual estimate). As I have no understanding of the real world practical implications of the variable status, I cannot judge whether this difference between 0.2 and 0.25 probability is large enough to write home about or not--you, or a colleague who is more familiar with the real world setting of this variable, will have to make that judgment. But my sense is that the graph shows that the interaction effect is pretty small and perhaps small enough to ignore (and therefore also omit from the modeling).
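
                      The crossover near 11.6 in point 2 can be checked against the uncentered coefficient table in #10. As a sketch (the symbols are notation introduced here, with z standing for log_average_funds): the effect of overconfident_fraction_2 on the log odds at a given z is the main-effect coefficient plus z times the interaction coefficient, and because the logistic function is monotone, the ordering of the predicted probabilities flips exactly where that expression changes sign.

                      ```latex
                      \beta_1 + \beta_3 z^{*} = 0
                      \;\Rightarrow\;
                      z^{*} = -\frac{\beta_1}{\beta_3} = \frac{18.59992}{1.601408} \approx 11.61
                      ```

                      Both coefficients are estimated very imprecisely, so the location of this crossover is itself quite uncertain.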
                      "the interpretation of the other (control) variables doesn't change when including an interaction term right?"
                      That depends on what you mean by interpretation. The coefficients of the other variables that do not participate in the interaction (nor in any other interaction in the model) represent the marginal effects of those variables on the log-odds of the outcome variable, conditional on all the other variables in the model, regardless of whether the interaction is entered in the model or not. But the actual values of those coefficients can be different, depending on whether the interaction is included.



                      • #12
                        Thanks a lot for your time and this clear explanation, this helps me a lot.

                        I have one other hypothesis that uses an OLS regression instead of logit. I use the same interaction term and control variables, but now the dependent variable is the cumulative abnormal return around the announcement (in %). I get the same 'problem' with very high coefficients and SEs, probably again because most of the data in the log_average_funds variable is concentrated around 10.5:

                        Code:
                        . reg car_ann c.overconfident_fraction_2##c.log_average_funds experience_board experience_ma log_ipo_pro
                        > ceeds extension_10 board_size age_average female_fraction independent_fraction sp500_return if status
                        > == 1, vce(robust)
                        
                        Linear regression                               Number of obs     =        265
                                                                        F(12, 252)        =       1.19
                                                                        Prob > F          =     0.2940
                                                                        R-squared         =     0.0546
                                                                        Root MSE          =     .17035
                        
                        ---------------------------------------------------------------------------------------------
                                                    |               Robust
                                            car_ann | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                        ----------------------------+----------------------------------------------------------------
                           overconfident_fraction_2 |   1.047833   .8680644     1.21   0.229    -.6617526    2.757418
                                  log_average_funds |  -.0191891   .0141266    -1.36   0.176    -.0470103    .0086321
                                                    |
                         c.overconfident_fraction_2#|
                                c.log_average_funds |  -.0856844   .0734653    -1.17   0.245    -.2303685    .0589997
                                                    |
                                   experience_board |  -.0055307    .006572    -0.84   0.401    -.0184738    .0074124
                                      experience_ma |   .0257931   .0609893     0.42   0.673    -.0943207    .1459068
                                   log_ipo_proceeds |  -.0267624   .0237781    -1.13   0.261    -.0735916    .0200668
                                       extension_10 |   .0136548   .0214189     0.64   0.524     -.028528    .0558376
                                         board_size |   .0002851   .0071693     0.04   0.968    -.0138343    .0144044
                                        age_average |     .00107   .0022455     0.48   0.634    -.0033522    .0054923
                                    female_fraction |   .0495742   .1015135     0.49   0.626    -.1503488    .2494972
                               independent_fraction |    -.02684    .120283    -0.22   0.824     -.263728    .2100481
                                       sp500_return |   .2827291   .2066824     1.37   0.173    -.1243158     .689774
                                              _cons |   .6975388   .4124172     1.69   0.092    -.1146848    1.509762
                        ---------------------------------------------------------------------------------------------
                        When I use the -marginsplot- command to interpret the results, it gives the following output:
                        [Image: Schermafbeelding 2025-01-12 om 01.07.14.png, -marginsplot- output for the OLS model]




                        I still don't fully understand what the results of this graph say in an OLS setting. What does the linear prediction say exactly about the interaction effect on cumulative abnormal returns?

                        (edit: I know that the model is not significant but more for my own interpretation what these results say)
                        Last edited by Frans Vinken; 11 Jan 2025, 18:20.



                        • #13
                          Again, the interpretation by eyeball is pretty clear.

                          At low values of log_average_funds, a higher value of overconfident_fraction is associated with higher values of car_ann. This is true for values of log_average_funds up to about 12.2, at which point all the curves cross, and for values of log_average_funds above that threshold, the reverse is true: higher values of overconfident_fraction are associated with lower values of car_ann. The differences based on overconfident_fraction are rather small at the right hand end of the graph (though, again, somebody familiar with the real-world implications of car_ann has to make the judgement whether they are small enough to ignore). At the left end of the graph, things are more spread out.
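
                          The crossing point does not have to be read off the graph. In a linear model the slope in overconfident_fraction_2 is zero where log_average_funds equals minus the main-effect coefficient divided by the interaction coefficient, which from the table in #12 is 1.047833/.0856844, or about 12.23. A minimal sketch, assuming the -regress- model from #12 has just been fitted (the label crossover is a name chosen here):

                          ```stata
                          * Assumes the -regress- model with the interaction has just been run.
                          * Value of log_average_funds at which the slope in overconfident_fraction_2 is zero,
                          * reported with a delta-method standard error.
                          nlcom (crossover: -_b[overconfident_fraction_2] / _b[c.overconfident_fraction_2#c.log_average_funds])
                          ```

                          Because the interaction coefficient is imprecisely estimated, the delta-method interval for this ratio should be treated as a rough guide only.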

                          I do note, however, that the left end of the graph is for values of log_average_funds that, according to the graph in #5 don't exist--so I'm not sure why the graph was even plotted this far out. It might make sense to redo this changing the -margins- command so that only realistic values of that variable are used. If we visually chop off the part of the graph to the left of log_average_funds = 9, then the differences in overconfident_fraction even at the left (i.e. at log_average_funds = 9) aren't especially large, and are more or less the same as those on the right end of the graph.
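
                          A minimal sketch of such a rerun, assuming the -regress- model from #12 has just been fitted. The range 9 to 12 is read off the graphs in this thread rather than computed from the data, so it is worth checking against the -summarize- output first.

                          ```stata
                          * Assumes the -regress- model with the interaction has just been run.
                          * Check the range of log_average_funds actually used in the estimation sample.
                          summarize log_average_funds if e(sample)

                          * Predictions only over values that occur in the data
                          * (9 to 12 is an assumption taken from the graphs; adjust to the range shown above).
                          margins, at(overconfident_fraction_2 = (0(0.2)1) log_average_funds = (9(0.5)12))
                          marginsplot, xdimension(log_average_funds)
                          ```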

                          "I know that the model is not significant but more for my own interpretation what these results say"
                          Yes, but do you know that the "significance" of the model as a whole is irrelevant? If you are going to use statistical significance, only the significance of the variables being tested matters--the significance of other covariates ("control variables") or the model as a whole have no importance. So in your case, only the joint significance of log_average_funds, overconfident_fraction and their interaction should matter to you. The rest of it is in the printout because Stata has no way of knowing which of your variables are there to be tested and which are "controls."
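
                          A joint test of that kind can be sketched as follows, assuming the model of interest (the -logit- or the -regress- shown earlier in this thread) has just been fitted. Both lines request the same Wald test of the two main effects and the interaction.

                          ```stata
                          * Assumes the interaction model has just been run.
                          * Joint Wald test of both main effects and their interaction.
                          testparm c.overconfident_fraction_2##c.log_average_funds

                          * Equivalent, with the three coefficients spelled out.
                          test overconfident_fraction_2 log_average_funds c.overconfident_fraction_2#c.log_average_funds
                          ```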



                          • #14
                            Just to add a general remark about interactions and this approach to them.

                            If there were actually zero interaction between log_average_funds and overconfident_fraction, then the curves in the -marginsplot- would be parallel on the linear-prediction scale: the gap between any two of them would be the same at every value of log_average_funds. (They would collapse into a single visible curve only if overconfident_fraction also had no effect at all.) The meaning of interaction itself is that the curves are not parallel: the gap between them changes as log_average_funds changes. And the most interesting kinds of interaction are where the curves actually cross (as is the case here).

                            Of course, exactly zero interaction is almost never seen in real data because of noise. So we sometimes see, as here, interactions that are large enough for the eye to see, perhaps even in the interesting interaction category, but where the amount of separation of the curves is small, perhaps too small to matter at all for practical purposes. And then there is also the question of the statistical significance of the interaction--which can be judged from the p-value or confidence interval of the interaction term itself. This tells you to what extent the size of the interaction is large or small relative to what the noise in the data would produce in the absence of any systematic interaction. (By systematic interaction, I mean the hypothetical interaction that you would see between these variables if noise-free data were available.)
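
                            As a sketch of why the interaction term governs how the curves separate (the symbols are notation introduced here): on the linear-prediction scale, take two curves drawn at overconfident_fraction values a and b, with z standing for log_average_funds, the main-effect coefficient written as beta_1 and the interaction coefficient as beta_3.

                            ```latex
                            \Delta(z) = (b - a)\,(\beta_1 + \beta_3 z),
                            \qquad
                            \frac{d\Delta}{dz} = (b - a)\,\beta_3
                            ```

                            The gap between the curves changes with z only through the interaction coefficient, which is why its confidence interval is the natural place to judge the interaction.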

