Backward elimination - categorical variable

sandeep kaur

Join Date: Jul 2022

Posts: 60
#1

Backward elimination - categorical variable

04 Oct 2022, 04:12

Backward elimination begins with the largest model and eliminates variables one by one until we are satisfied that all remaining variables are important to the model. How should p-values be chosen for a variable with more than two levels? For example, a binary or continuous variable yields only one p-value, but a variable with more than two levels yields two or more p-values.

1) Which p-value should be used to decide whether to remove a categorical variable with several levels: the highest or the lowest?
2) Should one remove the whole categorical variable based on one p-value, or remove only the level with the highest/lowest p-value? In the latter case a problem arises with the number of observations: does removing a level make the regression drop all observations belonging to that level?

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

04 Oct 2022, 06:18

Do a joint test:

Code:

. sysuse auto, clear
(1978 automobile data)

. reg price mpg i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      4.39
       Model |   149020603         5  29804120.7   Prob > F        =    0.0017
    Residual |   427776355        63  6790100.88   R-squared       =    0.2584
-------------+----------------------------------   Adj R-squared   =    0.1995
       Total |   576796959        68  8482308.22   Root MSE        =    2605.8

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -280.2615   61.57666    -4.55   0.000    -403.3126   -157.2103
             |
       rep78 |
          2  |   877.6347   2063.285     0.43   0.672     -3245.51     5000.78
          3  |   1425.657   1905.438     0.75   0.457    -2382.057    5233.371
          4  |   1693.841   1942.669     0.87   0.387    -2188.274    5575.956
          5  |   3131.982   2041.049     1.53   0.130    -946.7282    7210.693
             |
       _cons |   10449.99   2251.041     4.64   0.000     5951.646    14948.34
------------------------------------------------------------------------------

. testparm i.rep78

 ( 1)  2.rep78 = 0
 ( 2)  3.rep78 = 0
 ( 3)  4.rep78 = 0
 ( 4)  5.rep78 = 0

       F(  4,    63) =    1.07
            Prob > F =    0.3780

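The joint test does not reject the hypothesis that all four rep78 coefficients are zero (F(4, 63) = 1.07, p = 0.378), so rep78 does not explain a significant part of the variation in price.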
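
To see where the joint F statistic comes from, here is a minimal sketch (not part of the original reply) that rebuilds it by comparing the fit with and without rep78 on the same estimation sample. The scalar names rss_u, df_u, rss_r and Fstat are illustrative only. For an OLS fit with the default standard errors, the result should agree with the testparm output above.

```stata
sysuse auto, clear

* Unrestricted model: mpg plus the four rep78 indicators
regress price mpg i.rep78
* Store the residual sum of squares and the residual degrees of freedom
scalar rss_u = e(rss)
scalar df_u  = e(df_r)

* Restricted model: rep78 dropped, same estimation sample as above
regress price mpg if e(sample)
scalar rss_r = e(rss)

* F = [(RSS_r - RSS_u) / q] / [RSS_u / df_u], with q = 4 dropped coefficients
scalar Fstat = ((rss_r - rss_u) / 4) / (rss_u / df_u)
display "F(4, " df_u ") = " %5.2f Fstat
display "Prob > F   = " %6.4f Ftail(4, df_u, Fstat)
```

A large p-value here means that dropping all rep78 indicators together costs little in fit, which is why the variable is judged as a whole rather than level by level.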

Last edited by Øyvind Snilsberg; 04 Oct 2022, 06:22.

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#3

04 Oct 2022, 07:46

Sandeep:
the usual cautionary tale applies when making up the data seems the way to go: https://www.stata.com/support/faqs/s...sion-problems/

Kind regards,
Carlo
(Stata 19.0)
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#4

04 Oct 2022, 09:26

Carlo Lazzaro, thanks for your reply. I was wondering if you could elaborate on it. A couple of my variables are categorical (3 levels) and have to be used in the models.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#5

04 Oct 2022, 09:37

Sandeep:
my previous reply was more general.
Setting aside the technical side of the matter (whose drawbacks are well reported in the source that I pointed you to in my previous reply), when you go -stepwise- or the like, you're actually selecting a subset of your original predictors in search of those which "explain a lot" (whatever this may mean).
But one of the most informative goals of each and every regression is to give a fair and true view of the data-generating process you're interested in, which may well include non-significant predictors (those that do not reach statistical significance, FWIW, for different reasons that may be worth investigating).

Kind regards,
Carlo
(Stata 19.0)
Comment
sandeep kaur

Join Date: Jul 2022

Posts: 60
#6

04 Oct 2022, 09:38

@Øyvind Snilsberg Thanks for your reply. I have never used a joint test before.

a) How do I read the test results?

b) There are two categorical variables with 3 levels each. I have created hypothetical results. Categorical variable 2 has both the highest and the lowest p-value, and the work variable has the second-highest p-value. To decide which variable should be eliminated:
Should I use a joint test for categorical variable 2, or for work, which has the second-highest p-value?
Variable                    OR      p-value
Work                        2.04    0.48
Categorical var 1
  2                         0.36    0.27
  3                         0.47    0.46
Categorical variable 2
  Light                     0.73    0.69
  Vigorous                  0.17    0.06
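
For odds-ratio output like the hypothetical results above, the same kind of joint test can follow a logistic model. The sketch below is an illustration only: it assumes Stata's lbw example dataset as a stand-in (the poster's data are not shown), with the three-level variable race playing the role of a categorical predictor.

```stata
* Stand-in data: Stata's low-birthweight example (not the poster's data)
webuse lbw, clear

* race has three levels, so it contributes two odds ratios and two p-values
logistic low age smoke i.race

* One p-value for the whole variable: joint Wald test that both race terms are zero
testparm i.race

* Single-coefficient terms can be tested the same way for a like-for-like comparison
testparm smoke
testparm age
```

Under this approach each predictor is represented by one p-value (the joint one for a multi-level variable), so the individual level p-values are not what decides whether the variable stays or goes.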
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17673

04 Oct 2022, 10:30

Sandeep:
you may want to consider something along the following lines (neither categorical variable reaches statistical significance):

Code:

. regress price i.foreign i.rep78 if rep78<=3

      Source |       SS           df       MS      Number of obs   =        40
-------------+----------------------------------   F(3, 36)        =      0.43
       Model |  15821221.9         3  5273740.63   Prob > F        =    0.7329
    Residual |   441787961        36  12271887.8   R-squared       =    0.0346
-------------+----------------------------------   Adj R-squared   =   -0.0459
       Total |   457609183        39  11733568.8   Root MSE        =    3503.1

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |
    Foreign  |  -1778.407   2131.934    -0.83   0.410     -6102.17    2545.356
             |
       rep78 |
          2  |   1403.125   2769.464     0.51   0.615    -4213.608    7019.858
          3  |   2042.574   2567.189     0.80   0.431    -3163.926    7249.074
             |
       _cons |     4564.5   2477.084     1.84   0.074    -459.2587    9588.259
------------------------------------------------------------------------------

. test 0.foreign=0

 ( 1)  0b.foreign = 0
       Constraint 1 dropped

       F(  0,    36) =       .
            Prob > F =         .

. test 1.foreign=0, acc

 ( 1)  0b.foreign = 0
 ( 2)  1.foreign = 0
       Constraint 1 dropped

       F(  1,    36) =    0.70
            Prob > F =    0.4097

. test 1.rep78=0

 ( 1)  1b.rep78 = 0
       Constraint 1 dropped

       F(  0,    36) =       .
            Prob > F =         .

. test 2.rep78=0, acc

 ( 1)  1b.rep78 = 0
 ( 2)  2.rep78 = 0
       Constraint 1 dropped

       F(  1,    36) =    0.26
            Prob > F =    0.6155

. test 3.rep78=0, acc

 ( 1)  1b.rep78 = 0
 ( 2)  2.rep78 = 0
 ( 3)  3.rep78 = 0
       Constraint 1 dropped

       F(  2,    36) =    0.38
            Prob > F =    0.6865

.
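
The test ..., accumulate sequence above builds up the joint hypothesis one coefficient at a time. As a sketch of a shorter route to what should be the same joint tests (an addition, not part of the original reply), testparm can be applied to each factor variable after the same regression:

```stata
sysuse auto, clear
regress price i.foreign i.rep78 if rep78 <= 3

* Joint test of all non-base rep78 terms (2.rep78 and 3.rep78)
testparm i.rep78

* foreign has a single non-base term, so this is equivalent to its t test
testparm i.foreign
```

The base level of each factor variable is omitted from the test automatically, which avoids the "Constraint 1 dropped" notes produced when the base level is tested explicitly.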

Kind regards,
Carlo
(Stata 19.0)

Comment

sandeep kaur

Join Date: Jul 2022

Posts: 60
#8

06 Oct 2022, 15:03

Sounds good. Thanks for your help.
Comment
