Combining categories

Daria lisaholm

Join Date: Jul 2021
Posts: 18

31 Jan 2022, 04:25

I would like to combine two categories for one of my variables but I am not sure whether there is a test I could run to justify combining the categories. Here is an example of what I am trying to do:

Code:

. ta worry

 How worried are |
 you about being |
   infected with |
       COVID-19? |      Freq.     Percent        Cum.
-----------------+-----------------------------------
     Not at all  |        641       37.31       37.31
       A little  |        387       22.53       59.84
         Rather  |        203       11.82       71.65
           Very  |        487       28.35      100.00
-----------------+-----------------------------------
           Total |      1,718      100.00

. 
. xtreg WB i.worry [pw= panel_ind_wt_1_2], fe

Fixed-effects (within) regression               Number of obs     =      1,718
Group variable: Findid                          Number of groups  =        859

R-sq:                                           Obs per group:
     within  = 0.0191                                         min =          2
     between = 0.0388                                         avg =        2.0
     overall = 0.0214                                         max =          2

                                                F(3,858)          =       2.25
corr(u_i, Xb)  = 0.0312                         Prob > F          =     0.0811

                               (Std. Err. adjusted for 859 clusters in Findid)
------------------------------------------------------------------------------
             |               Robust
          WB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       worry |
  A little   |   .5341085   .3449488     1.55   0.122    -.1429337    1.211151
    Rather   |   .0941577   .3733732     0.25   0.801    -.6386741    .8269896
      Very   |    .688718   .2973494     2.32   0.021     .1051006    1.272335
             |
       _cons |  -.4072548   .1753646    -2.32   0.020    -.7514485    -.063061
-------------+----------------------------------------------------------------
     sigma_u |  1.5148842
     sigma_e |  1.8692139
         rho |  .39643111   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. 
. recode worry 3=2 4=3
(worry: 690 changes made)

. 
. ta worry

 How worried are |
 you about being |
   infected with |
       COVID-19? |      Freq.     Percent        Cum.
-----------------+-----------------------------------
     Not at all  |        641       37.31       37.31
       A little  |        590       34.34       71.65
         Rather  |        487       28.35      100.00
-----------------+-----------------------------------
           Total |      1,718      100.00

. 
. xtreg WB i.worry [pw= panel_ind_wt_1_2], fe

Fixed-effects (within) regression               Number of obs     =      1,718
Group variable: Findid                          Number of groups  =        859

R-sq:                                           Obs per group:
     within  = 0.0146                                         min =          2
     between = 0.0451                                         avg =        2.0
     overall = 0.0238                                         max =          2

                                                F(2,858)          =       2.69
corr(u_i, Xb)  = 0.0556                         Prob > F          =     0.0684

                               (Std. Err. adjusted for 859 clusters in Findid)
------------------------------------------------------------------------------
             |               Robust
          WB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       worry |
  A little   |   .3706983    .310278     1.19   0.233    -.2382945    .9796911
    Rather   |   .6793746   .2977687     2.28   0.023     .0949343    1.263815
             |
       _cons |  -.4024158   .1756929    -2.29   0.022    -.7472539   -.0575777
-------------+----------------------------------------------------------------
     sigma_u |  1.5122795
     sigma_e |  1.8724047
         rho |  .39479255   (fraction of variance due to u_i)
------------------------------------------------------------------------------

.

As categories 2 and 3 are not statistically significant, I would like to combine them; the results remain much the same after doing so. Put differently: is there a way to test whether categories 2 and 3 differ? I am not sure I am phrasing my question well.
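
One way to put this question to the data, rather than reading the two separate p-values, is a Wald test of whether the two coefficients are equal. What follows is only a sketch: it assumes worry is still the original four-level variable coded 1 to 4 (as the recode above implies), and worry3 is a hypothetical name for a collapsed copy.

```stata
* Refit the four-category model shown above (worry not yet recoded)
xtreg WB i.worry [pw = panel_ind_wt_1_2], fe

* H0: "A little" (level 2) and "Rather" (level 3) have the same coefficient
test 2.worry = 3.worry

* If collapsing is still wanted, create a new variable instead of overwriting
* worry, so that the value labels stay correct (worry3 is a hypothetical name)
recode worry (1 = 1 "Not at all") (2 3 = 2 "A little/Rather") (4 = 3 "Very"), generate(worry3)
```

A large p-value from the test only fails to show a difference; it does not establish that the two categories are equivalent, so the substantive argument still has to carry the decision.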

Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

31 Jan 2022, 04:39

You want to combine them because they're not statistically significant?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#3

31 Jan 2022, 05:24

Daria:
welcome to this forum.
Jared made a very good point: grouping categories only because they do not reach statistical significance when considered as separate predictors has no methodological justification.
In addition, I suspect that your data do not support the evidence of a panel-wise effect:
1) your -corr(u_i, Xb) = 0.0312- is dramatically low (as is the within R-sq);
2) sigma_e>sigma_u.

Kind regards,
Carlo
(StataNow 18.5)
Daria lisaholm

Join Date: Jul 2021

Posts: 18
#4

31 Jan 2022, 05:58

Thank you, Carlo and Jared. I apologise, I was not clear: my reason is not that they are statistically insignificant but mainly that, conceptually, I don’t see much difference between “rather worried” and “a little worried”. But I am guessing that does not justify it either, and that it is better to keep the categories as they are.

Carlo, the regression here excludes most of the other variables I include. However, my corr (u_i, Xb) is still small (-0.0112) and the within R-sq is 0.20 in the full model. I am new to Stata so does that mean I cannot run the fixed effects model on my data, or that the results aren’t very meaningful?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#5

31 Jan 2022, 06:27

Daria:
1) I would keep the two levels “rather worried” and “a little worried” of the -worry- categorical variable separate, as they have a non-negligible number of observations each;
2) that -corr(u_i, Xb)- is still small (-0.0112) supports the absence of evidence of a panel-wise effect; a within R-sq of 0.20 is not that encouraging, either.
Are you sure that your model is not misspecified (put differently: is the functional form of the regressand correct)? Are you sure that all the necessary predictors and interactions were included in the right-hand side of your regression equation to give a fair and true view of the data generating process you're investigating?
I would recommend that you share what you typed and what Stata gave you back when you dealt with your full model. Thanks.

Last edited by Carlo Lazzaro; 31 Jan 2022, 06:31.

Kind regards,
Carlo
(StataNow 18.5)

Daria lisaholm

Join Date: Jul 2021
Posts: 18

31 Jan 2022, 06:38

Thanks Carlo.

Here are (1) the full model and (2) the same model run on two subsamples (urban/rural areas). If I understand correctly, I can only include time-varying predictors.

Code:

xtreg WB i.worry i.worry2 i.security i.employment i.income_change [pw= panel_ind_wt_1_2], fe

Fixed-effects (within) regression               Number of obs     =      1,720
Group variable: Findid                          Number of groups  =        863

R-sq:                                           Obs per group:
     within  = 0.1189                                         min =          1
     between = 0.1263                                         avg =        2.0
     overall = 0.0983                                         max =          2

                                                F(12,862)         =       3.99
corr(u_i, Xb)  = -0.0817                        Prob > F          =     0.0000

                                      (Std. Err. adjusted for 863 clusters in Findid)
-------------------------------------------------------------------------------------
                    |               Robust
                 WB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
              worry |
         A little   |   .3973488   .3345473     1.19   0.235    -.2592738    1.053971
           Rather   |  -.0733212   .3574259    -0.21   0.838    -.7748481    .6282058
             Very   |    .349525   .2923644     1.20   0.232    -.2243045    .9233545
                    |
             worry2 |
          A little  |  -.1780035    .364739    -0.49   0.626    -.8938839    .5378768
            Rather  |   .0579736   .3533665     0.16   0.870     -.635586    .7515331
              Very  |   .5641763   .2816948     2.00   0.046     .0112883    1.117064
                    |
           security |
          Moderate  |    .901208   .2586017     3.48   0.001     .3936454    1.408771
               Low  |   .9134565   .3042455     3.00   0.003     .3163078    1.510605
                    |
         employment |
        Unemployed  |   .3395662   .3472676     0.98   0.328    -.3420228    1.021155
Out of labor force  |   .1233692   .4261581     0.29   0.772    -.7130597    .9597981
                    |
      income_change |
              Same  |  -.7486318   .2950736    -2.54   0.011    -1.327779    -.169485
        Increased   |  -.7543419   .4195771    -1.80   0.073    -1.577854    .0691704
                    |
              _cons |  -.9616271   .3645323    -2.64   0.008    -1.677102   -.2461523
--------------------+----------------------------------------------------------------
            sigma_u |  1.4555533
            sigma_e |  1.7793041
                rho |  .40091058   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------

. xtreg WB i.worry i.worry2 i.security i.employment i.income_change [pw= panel_ind_wt_1_2] if urban==1, fe

Fixed-effects (within) regression               Number of obs     =      1,058
Group variable: Findid                          Number of groups  =        531

R-sq:                                           Obs per group:
     within  = 0.1335                                         min =          1
     between = 0.1333                                         avg =        2.0
     overall = 0.1069                                         max =          2

                                                F(12,530)         =       2.97
corr(u_i, Xb)  = -0.1364                        Prob > F          =     0.0005

                                      (Std. Err. adjusted for 531 clusters in Findid)
-------------------------------------------------------------------------------------
                    |               Robust
                 WB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
              worry |
         A little   |   -.071595   .4542707    -0.16   0.875    -.9639871     .820797
           Rather   |  -.0551513   .4471418    -0.12   0.902     -.933539    .8232363
             Very   |  -.0349684   .3849639    -0.09   0.928    -.7912107    .7212739
                    |
             worry2 |
          A little  |   .0180118   .4520092     0.04   0.968    -.8699377    .9059612
            Rather  |   .1315093   .4307555     0.31   0.760    -.7146883    .9777069
              Very  |     .73125   .3425631     2.13   0.033     .0583019    1.404198
                    |
           security |
          Moderate  |   .9168551   .3300422     2.78   0.006     .2685037    1.565207
               Low  |   .8619262   .4042321     2.13   0.033     .0678324     1.65602
                    |
         employment |
        Unemployed  |   .6122631   .5107989     1.20   0.231    -.3911758    1.615702
Out of labor force  |   .4533399   .5879297     0.77   0.441    -.7016185    1.608298
                    |
      income_change |
              Same  |  -1.042022   .3677464    -2.83   0.005    -1.764442    -.319603
        Increased   |  -.9039354   .5179174    -1.75   0.082    -1.921358    .1134875
                    |
              _cons |  -.8637337   .4470105    -1.93   0.054    -1.741863    .0143961
--------------------+----------------------------------------------------------------
            sigma_u |  1.4705306
            sigma_e |  1.8264579
                rho |  .39328839   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------

. xtreg WB i.worry i.worry2 i.security i.employment i.income_change [pw= panel_ind_wt_1_2] if urban==2, fe

Fixed-effects (within) regression               Number of obs     =        662
Group variable: Findid                          Number of groups  =        332

R-sq:                                           Obs per group:
     within  = 0.2171                                         min =          1
     between = 0.0339                                         avg =        2.0
     overall = 0.0549                                         max =          2

                                                F(12,331)         =       4.00
corr(u_i, Xb)  = -0.2380                        Prob > F          =     0.0000

                                      (Std. Err. adjusted for 332 clusters in Findid)
-------------------------------------------------------------------------------------
                    |               Robust
                 WB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
              worry |
         A little   |   1.412041   .3319679     4.25   0.000     .7590083    2.065074
           Rather   |  -.0654846   .4504375    -0.15   0.884    -.9515657    .8205966
             Very   |   1.200007   .3787382     3.17   0.002       .45497    1.945045
                    |
             worry2 |
          A little  |  -.2452824   .4828167    -0.51   0.612    -1.195058    .7044937
            Rather  |   .0413419   .5635478     0.07   0.942    -1.067245    1.149929
              Very  |   .3929816   .4538273     0.87   0.387    -.4997678    1.285731
                    |
           security |
          Moderate  |   .8459582   .3064525     2.76   0.006     .2431181    1.448798
               Low  |   1.225383    .363332     3.37   0.001     .5106518    1.940114
                    |
         employment |
        Unemployed  |   .0732419    .410372     0.18   0.858    -.7340241    .8805079
Out of labor force  |  -.4721134   .3913379    -1.21   0.229    -1.241936    .2977097
                    |
      income_change |
              Same  |   .0561762   .2956212     0.19   0.849    -.5253571    .6377096
        Increased   |  -.3017318   .5034221    -0.60   0.549    -1.292042    .6885784
                    |
              _cons |  -1.613676   .5644371    -2.86   0.005    -2.724012   -.5033397
--------------------+----------------------------------------------------------------
            sigma_u |  1.5786666
            sigma_e |  1.5804861
                rho |  .49942404   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------

.
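
Whether a predictor varies over time within a person, and so can be kept in a fixed-effects model, can be checked before fitting. The lines below are a rough sketch: they assume the data are already xtset (as the xtreg output above implies) and borrow the demographic variable names that appear in the random-effects model later in the thread.

```stata
* Decompose each variable into between-person and within-person variation
xtsum sex educ agecat marital urban

* A "within" standard deviation of 0 means the variable never changes
* inside a person, so -xtreg, fe- cannot estimate its coefficient
```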

Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

31 Jan 2022, 06:45

Another thing I find strange is that you have at most two observations per group. I suppose this isn't exactly illegal, but typically we use FE estimators to (ostensibly) adjust for unobserved, time-invariant, unit-specific confounding, typically over multiple periods of time.

Are you sure xtreg is the way to go here? You honestly could likely get away with just pooling this with normal OLS.
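
For reference, pooling with OLS means using -regress- rather than -xtreg-. A minimal sketch, assuming the same predictors, weight, and person identifier (Findid) as in the fixed-effects runs above, and written for a do-file so that the /// line continuation works:

```stata
* Pooled OLS: -regress- ignores the panel structure but accepts pweights;
* clustering on the person identifier allows for within-person correlation
regress WB i.worry i.worry2 i.security i.employment i.income_change ///
    [pw = panel_ind_wt_1_2], vce(cluster Findid)
```

Because nothing is differenced out here, time-invariant covariates could also be added to this specification.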

Daria lisaholm

Join Date: Jul 2021
Posts: 18

31 Jan 2022, 06:56

Yes, I only have two waves of data, so two observations per person. I've tried running the pooled OLS; I think I have the code right, but the R-squared still seems low even with added covariates, and I am no longer able to use pweights with xtreg.

Code:

xtset ID wave

xtreg WB i.worry i.worry2 i.security i.employment i.income_change i.wave i.marital i.agecat i.educ i.urban i.sex, vce (cluster ID)

Random-effects GLS regression                   Number of obs     =      1,720
Group variable: ID                              Number of groups  =        863

R-sq:                                           Obs per group:
     within  = 0.0637                                         min =          1
     between = 0.1661                                         avg =        2.0
     overall = 0.1226                                         max =          2

                                                Wald chi2(22)     =     234.55
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

                                          (Std. Err. adjusted for 863 clusters in ID)
-------------------------------------------------------------------------------------
                    |               Robust
                 WB |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
              worry |
         A little   |   .2709719   .1244742     2.18   0.029      .027007    .5149369
           Rather   |   .2405991   .1575753     1.53   0.127    -.0682428    .5494409
             Very   |   .2619095   .1278226     2.05   0.040     .0113818    .5124372
                    |
             worry2 |
          A little  |   .0472099   .1682596     0.28   0.779    -.2825728    .3769926
            Rather  |    .001488   .1800994     0.01   0.993    -.3515002    .3544763
              Very  |   .5868517   .1475392     3.98   0.000     .2976801    .8760233
                    |
           security |
          Moderate  |   .6379472   .1193082     5.35   0.000     .4041074     .871787
               Low  |   .8828169    .130804     6.75   0.000     .6264458    1.139188
                    |
         employment |
        Unemployed  |   .5433968   .1325648     4.10   0.000     .2835746     .803219
Out of labor force  |   .1512903   .1270076     1.19   0.234    -.0976401    .4002207
                    |
      income_change |
              Same  |  -.2670846   .1140901    -2.34   0.019     -.490697   -.0434721
        Increased   |  -.3734626   .1925077    -1.94   0.052    -.7507707    .0038456
                    |
               wave |
            Wave 2  |  -.0639719   .0850457    -0.75   0.452    -.2306584    .1027145
                    |
            marital |
 Currently Married  |   .1672897   .1300435     1.29   0.198    -.0875908    .4221703
  Widowed/divorced  |  -.0470623   .2268863    -0.21   0.836    -.4917512    .3976266
                    |
             agecat |
             30-40  |  -.0737299   .1276281    -0.58   0.563    -.3238764    .1764167
             41-64  |  -.2624763   .1379174    -1.90   0.057    -.5327896    .0078369
                    |
               educ |
             Basic  |  -.1280346    .138925    -0.92   0.357    -.4003227    .1442535
         Secondary  |  -.0411468   .1532769    -0.27   0.788     -.341564    .2592704
  Higher education  |   .1037788   .1624885     0.64   0.523    -.2146927    .4222504
                    |
              urban |
             rural  |  -.1553741   .1079372    -1.44   0.150    -.3669272     .056179
                    |
                sex |
            Female  |   .2121433   .1199667     1.77   0.077    -.0229872    .4472737
              _cons |  -.9983781   .2179307    -4.58   0.000    -1.425514   -.5712418
--------------------+----------------------------------------------------------------
            sigma_u |  .69735626
            sigma_e |  1.7581332
                rho |  .13594067   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------

.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#9

31 Jan 2022, 07:00

Daria:
your last example is a -xtreg,re- equation, not a pooled OLS.
Please, test, just out of curiosity, whether -xttest0- after -xtreg,re- supports the evidence of a panel-wise effect (I do not think so). Thanks.
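
The sequence being suggested can be sketched as follows, assuming the variable names used earlier in the thread. The null hypothesis of -xttest0- is Var(u) = 0, that is, no panel-level variance component, in which case pooled OLS would do.

```stata
* Declare the panel structure, fit the random-effects model, then run the
* Breusch-Pagan Lagrange multiplier test for random effects
xtset ID wave
xtreg WB i.worry i.worry2 i.security i.employment i.income_change i.wave ///
    i.marital i.agecat i.educ i.urban i.sex, re
xttest0
```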

Kind regards,
Carlo
(StataNow 18.5)

Daria lisaholm

Join Date: Jul 2021
Posts: 18

#10

31 Jan 2022, 07:13

Thank you for your continued help, Carlo. I have run xttest0 and the result is statistically significant (p = 0.0001), which I understand means that I reject the null of no random effect, so my model should not be estimated as a pooled OLS (regress, vce(cluster ID))?

Code:

Breusch and Pagan Lagrangian multiplier test for random effects

        WB[ID,t] = Xb + u[ID] + e[ID,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                      WB |   4.026931       2.006721
                       e |   3.091032       1.758133
                       u |   .4863057       .6973563

        Test:   Var(u) = 0
                             chibar2(01) =    15.11
                          Prob > chibar2 =   0.0001

So would I just be better off sticking with the FE model even with low within R squared?

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#11

31 Jan 2022, 07:27

Daria:
despite my (wrong) impression, the -xttest0- outcome points you towards -xtreg,re-.
That said, I would take the following steps:
1) check the functional form of the regressand (just replicate the procedure detailed under the -linktest- entry, Stata .pdf manual);
2) type:

Code:

xi: xtreg WB i.worry i.worry2 i.security i.employment i.income_change i.wave i.marital i.agecat i.educ i.urban i.sex, vce(cluster ID)
xtoverid

As far as 2) is concerned, please note that:
a) the -xi:- prefix is required because the community-contributed module -xtoverid- does not support -fvvarlist- notation;
b) if you have not downloaded -xtoverid- yet, just type -search xtoverid- to spot and install it (along with the other community-contributed modules that support -xtoverid-);
c) the -xtoverid- null is that -xtreg,re- is the way to go.
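
Step 1 can be sketched by hand along the lines of what -linktest- does. This is only an illustration under the assumption that the model is the random-effects specification above; fitted and sq_fitted are hypothetical variable names, and the /// continuation requires a do-file.

```stata
* Fit the model and store the linear prediction
xtreg WB i.worry i.worry2 i.security i.employment i.income_change i.wave ///
    i.marital i.agecat i.educ i.urban i.sex, re vce(cluster ID)
predict double fitted, xb

* Regress the outcome on the prediction and its square
generate double sq_fitted = fitted^2
xtreg WB fitted sq_fitted, re vce(cluster ID)

* A significant squared term hints at a misspecified functional form
test sq_fitted
```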

Kind regards,
Carlo
(StataNow 18.5)
Daria lisaholm

Join Date: Jul 2021

Posts: 18
#12

31 Jan 2022, 07:43

Thank you very much for your help Carlo. I got a P-value of 0.42 which means I can't reject the null and will go with the random effects model xtreg, re
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#13

31 Jan 2022, 07:46

Daria:
correct, provided that your model is correctly specified (as per my previous point 1).

Kind regards,
Carlo
(StataNow 18.5)