  • Backward elimination - categorical variable

    Backward elimination begins with the largest model and eliminates variables one by one until we are satisfied that all remaining variables are important to the model. How should p-values be handled for a variable with more than two levels? A binary or continuous variable gives only one p-value, but a categorical variable with more than two levels gives two or more.

    1) Which p-value should be used to decide on removing a categorical variable with several levels: the highest or the lowest?
    2) Should one remove the whole categorical variable based on one p-value, or just remove the level with the highest/lowest p-value? The latter raises a problem with the number of observations: when a level is removed, does the regression drop all the observations belonging to that level?

  • #2
    Do a joint test:
    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . reg price mpg i.rep78
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(5, 63)        =      4.39
           Model |   149020603         5  29804120.7   Prob > F        =    0.0017
        Residual |   427776355        63  6790100.88   R-squared       =    0.2584
    -------------+----------------------------------   Adj R-squared   =    0.1995
           Total |   576796959        68  8482308.22   Root MSE        =    2605.8
    
    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             mpg |  -280.2615   61.57666    -4.55   0.000    -403.3126   -157.2103
                 |
           rep78 |
              2  |   877.6347   2063.285     0.43   0.672     -3245.51     5000.78
              3  |   1425.657   1905.438     0.75   0.457    -2382.057    5233.371
              4  |   1693.841   1942.669     0.87   0.387    -2188.274    5575.956
              5  |   3131.982   2041.049     1.53   0.130    -946.7282    7210.693
                 |
           _cons |   10449.99   2251.041     4.64   0.000     5951.646    14948.34
    ------------------------------------------------------------------------------
    
    . testparm i.rep78
    
     ( 1)  2.rep78 = 0
     ( 2)  3.rep78 = 0
     ( 3)  4.rep78 = 0
     ( 4)  5.rep78 = 0
    
           F(  4,    63) =    1.07
                Prob > F =    0.3780
    The joint test (F(4, 63) = 1.07, Prob > F = 0.3780) shows that rep78 does not explain a significant part of the variation in price.
    Last edited by Øyvind Snilsberg; 04 Oct 2022, 06:22.
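    A minimal sketch of how the joint test above might be read and used programmatically. It assumes the stored results r(F), r(df), r(df_r) and r(p) that -testparm- leaves behind after -regress-; the 0.20 cutoff is purely an illustrative assumption, not a recommendation made in this thread.
    Code:
    sysuse auto, clear
    regress price mpg i.rep78
    
    * one joint test of all rep78 indicators:
    * H0 is 2.rep78 = 3.rep78 = 4.rep78 = 5.rep78 = 0
    testparm i.rep78
    
    * save the stored results before another command overwrites r()
    local F    = r(F)
    local df1  = r(df)
    local df2  = r(df_r)
    local p    = r(p)
    
    * a single p-value for the whole variable, not one per level
    display "F(`df1', `df2') = " %5.2f `F' "   Prob > F = " %6.4f `p'
    
    * hypothetical rule: keep or drop rep78 as a block, never level by level
    local cutoff = 0.20   // assumed threshold, for illustration only
    if `p' > `cutoff' {
        display "rep78 is a candidate for removal as a whole"
    }
    else {
        display "rep78 stays in the model with all of its levels"
    }
    Because the variable is kept or dropped as a whole, no level is removed on its own and no observations are lost for that reason.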



    • #3
      Sandeep:
      the usual cautionary tale applies when making up the data seems the way to go: https://www.stata.com/support/faqs/s...sion-problems/
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        Carlo Lazzaro Thanks for your reply. I was wondering if you could elaborate on it. A couple of my variables are categorical (3 levels each) and have to be used in the models.



        • #5
          Sandeep:
          my previous reply was more general.
          Setting aside the technical side of the matter (whose drawbacks are well documented in the source that I pointed you to in my previous reply), when you go -stepwise- or the like, you're actually selecting a subset of your original predictors in search of those which "explain a lot" (whatever this may mean).
          But one of the most informative goals of each and every regression is to give a fair and true view of the data-generating process you're interested in, which may well include non-significant predictors (which may fail to reach statistical significance, FWIW, for different reasons that may be worth investigating).
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            @Øyvind Snilsberg Thanks for your reply. I have never used a joint test before.

            a) How should I read the test results?

            b) There are two categorical variables with 3 levels each. I have created some hypothetical results. Categorical variable 2 has both the highest and the lowest p-value, and the work variable has the second-highest p-value. To decide which variable should be eliminated:
            Should I use a joint test for categorical variable 2, or for work, which has the second-highest p-value?
            Variable                   OR     p-value
            Work                       2.04   0.48
            Categorical variable 1
              2                        0.36   0.27
              3                        0.47   0.46
            Categorical variable 2
              Light                    0.73   0.69
              Vigorous                 0.17   0.06


            • #7
              Sandeep:
              you may want to consider something along the following lines (both categorical variables do not reach statistical significance):
              Code:
              . regress price i.foreign i.rep78 if rep78<=3
              
                    Source |       SS           df       MS      Number of obs   =        40
              -------------+----------------------------------   F(3, 36)        =      0.43
                     Model |  15821221.9         3  5273740.63   Prob > F        =    0.7329
                  Residual |   441787961        36  12271887.8   R-squared       =    0.0346
              -------------+----------------------------------   Adj R-squared   =   -0.0459
                     Total |   457609183        39  11733568.8   Root MSE        =    3503.1
              
              ------------------------------------------------------------------------------
                     price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                   foreign |
                  Foreign  |  -1778.407   2131.934    -0.83   0.410     -6102.17    2545.356
                           |
                     rep78 |
                        2  |   1403.125   2769.464     0.51   0.615    -4213.608    7019.858
                        3  |   2042.574   2567.189     0.80   0.431    -3163.926    7249.074
                           |
                     _cons |     4564.5   2477.084     1.84   0.074    -459.2587    9588.259
              ------------------------------------------------------------------------------
              
              . test 0.foreign=0
              
               ( 1)  0b.foreign = 0
                     Constraint 1 dropped
              
                     F(  0,    36) =       .
                          Prob > F =         .
              
              . test 1.foreign=0, acc
              
               ( 1)  0b.foreign = 0
               ( 2)  1.foreign = 0
                     Constraint 1 dropped
              
                     F(  1,    36) =    0.70
                          Prob > F =    0.4097
              
              . test 1.rep78=0
              
               ( 1)  1b.rep78 = 0
                     Constraint 1 dropped
              
                     F(  0,    36) =       .
                          Prob > F =         .
              
              . test 2.rep78=0, acc
              
               ( 1)  1b.rep78 = 0
               ( 2)  2.rep78 = 0
                     Constraint 1 dropped
              
                     F(  1,    36) =    0.26
                          Prob > F =    0.6155
              
              . test 3.rep78=0, acc
              
               ( 1)  1b.rep78 = 0
               ( 2)  2.rep78 = 0
               ( 3)  3.rep78 = 0
                     Constraint 1 dropped
              
                     F(  2,    36) =    0.38
                          Prob > F =    0.6865
              
              .
              Kind regards,
              Carlo
              (Stata 19.0)
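
              The accumulated tests above can probably be written more compactly. A sketch, assuming the same auto data and the same model; since the hypotheses are the same, the joint test of rep78 should match the F(2, 36) statistic reported above.
              Code:
              sysuse auto, clear
              regress price i.foreign i.rep78 if rep78<=3
              
              * foreign has one non-base level, so this equals the t test on 1.foreign
              testparm i.foreign
              
              * joint test of 2.rep78 and 3.rep78 in one step (base level 1 is omitted)
              testparm i.rep78
              
              * the same hypothesis, naming the coefficients explicitly
              test 2.rep78 3.rep78
              Testing the base level on its own (0.foreign or 1.rep78) is not needed here: its coefficient is fixed at zero, which is why Stata drops that constraint in the output above.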



              • #8
                Sounds good. Thanks for your help.
