Dummy Variables In a regression

Beatriz Gomes

Join Date: Nov 2022

Posts: 9
#1

Dummy Variables In a regression

10 Nov 2022, 15:43

Hello!
I want to control for education in a OLS regression. Right now I have all of my data in a single education variable.

1. Does it make sense to divide it in 2 dummies: secondary school and college, and include both on the regression? Or do I do both of them together as 1 dummy only? Is there any difference?
2. If I do the 2 dummies (with values of eitheir 1 or 0) do I have to put as missing values the college data in the secondary school dummy variable or can it be the zero (0). The same for the college dummy variable, is it ok that the secondary school data is coded as zero?

Thank you
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

10 Nov 2022, 16:26

In any modern version of Stata, there is no need to create any indicator ("dummy") variables to use in a regression. Rather, use factor-variable notation. That is, leave your education data as a single variable* and in your regression enter i.education (replace italicized part by the actual name of your education variable) as one of the predictors in your regression. Stata will calculate the indicator variables for you on the fly and enter them into the regression. It's much quicker and easier.

If you wish to combine some categories in your education variables, do that by creating a new education variable that has the fewer categories. As to whether this is advisable, there are a few issues to think about in the context of your specific research:
When you combine two different levels, like secondary school and college, into a single level, you are imposing the assumption in your model that their effects on the outcome variable are the same. Depending on what that outcome variable is, this may or may not be a reasonable assumption. If you are not sure, and if you are not bound by a pre-determined research protocol that commits you to one specific way to handle the education variable, you might want to first do the analysis keeping secondary school and college separate. Then look at their coefficients. If they are pretty similar, you probably won't lose much by re-doing the analysis with a new education variable that combines them into a single "secondary or more." category. But if their coefficients differ appreciably, you would be best advised to maintain them as separate.

Sample size may also bear on the matter. There are various rules of thumb about how many observations you need per variable to avoid overfitting issues. If your sample size isn't large enough to accommodate the most fine-grained version of your education variable, combining some categories will reduce the number of variables and improve the situation.

As for the missing values, if you use factor-variable notation, you don't have to think about this because Stata handles it correctly for you automatically. For your learning there are a few rules here:
If the value of the education variable is missing altogether, then all of the indicators for education (regardless how many, and whether you create them yourself or let Stata's factor-variable notation do it for you) are missing values, and the observation is excluded from the regression.

In any observation where the education variable is not missing, there will be no missing values for any of the indicators. Each indicator will be 0, except for a single one corresponding to the education category in that observation, which will have a value of 1.

Do read -help fvvarlist- to learn about how to use factor-variable notation in greater detail.

*It must be a numeric variable. If what you have is a string variable, use the -encode- command to make a numeric equivalent and use that in the factor-variable notation.
Comment

Sam Murgatroyd

Join Date: Oct 2023
Posts: 33

12 Jun 2024, 14:36

Hello,

I was looking for an answer to my own question on this forum and came across this post. Is there any resource you can point to about the rules of thumb mentioned in this comment?

Originally posted by Clyde Schechter View Post

Sample size may also bear on the matter. There are various rules of thumb about how many observations you need per variable to avoid overfitting issues. If your sample size isn't large enough to accommodate the most fine-grained version of your education variable, combining some categories will reduce the number of variables and improve the situation.

An answer to this question is not what brought me here; but I am interested in the answer and I do feel that my original question fits under the rather broad topic started on this thread. If this is not the correct place to post my original question, I apologise in advance and will happily start a new topic if advised to do so.

My situation is as follows: I want to investigate what happens to my dependent variable, price dispersion, when the share of specific taxes in total excise taxes increases by 1%, while controlling for other elements of policy that can impact price dispersion. The problem that I am having is that some observations are coded as missing for some dummies simply because these observations are not relevant for what the dummy represents. These observations are, however, relevant for the other dummies that I want to include in the regression. So basically, even if an observation is missing by one criterion that establishes 1 and 0 in a dummy, I need it to stay in the regression because it has a 1 or 0 meaning for another criterion that matters in my regression.

In terms of the specifics of my data, I have three types of tax structures: pure specific, pure ad valorem and mixed (which combines both specific and ad valorem). Below is a description of my data

[CODE]dataex price_dispersion_use specific ad_valorem mixed_excise specificshare autoinfl_adj retailpisbase_inadv more_spec_inmixed

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double price_dispersion_use float(specific ad_valorem mixed_excise) double specificshare float(autoinfl_adj retailpisbase_inadv more_spec_inmixed)
                15 . . .                  . . . .
                 . . . .                  . . . .
13.333333333333334 . . .                  . . . .
 11.76470588235294 . . .                  . . . .
                 . . . .                  . . . .
 44.44444444444444 1 0 0                100 . . .
56.666666666666664 1 0 0                100 0 . .
              62.5 1 0 0                100 0 . .
 60.60606060606061 1 0 0                100 1 . .
                60 1 0 0                100 1 . .
 33.33333333333333 1 0 0                100 0 . .
35.714285714285715 0 0 1  65.45226432334859 . 0 1
                15 0 0 1  60.25537348371993 . 0 1
                50 0 0 1   60.7168458781362 0 0 1
 48.57142857142857 0 0 1  56.60921404817905 0 0 1
 72.85714285714285 1 0 0                100 0 . .
 72.85714285714285 1 0 0                100 0 . .
 77.77777777777777 1 0 0                100 0 . .
 68.44993141289439 1 0 0                100 0 . .
 69.86301369863014 1 0 0                100 0 . .
                75 . . .                  . . . .
                 . . . .                  . . . .
                 . . . .                  . . . .
                25 0 1 0                  0 . 0 .
                25 0 1 0                  0 . 0 .
                75 . . .                  . . . .
                50 . . .                  . . . .
            66.875 . . .                  . . . .
 62.05673758865249 . . .                  . . . .
63.829787234042556 . . .                  . . . .
41.935483870967744 0 1 0                  0 . 0 .
             37.75 0 1 0                  0 . 0 .
 45.34920634920635 0 1 0                  0 . 0 .
18.726114649681527 0 1 0                  0 . 0 .
13.384615384615383 0 1 0                  0 . 0 .
                30 1 0 0                100 0 . .
26.666666666666668 1 0 0                100 0 . .
42.857142857142854 0 1 0                  0 . 0 .
              47.5 1 0 0                100 1 . .
 48.23529411764706 1 0 0                100 1 . .
 78.93318965517241 1 0 0                100 1 . .
 73.84341637010677 1 0 0                100 1 . .
 82.34126984126985 1 0 0                100 1 . .
 71.02189781021899 1 0 0                100 1 . .
 68.45524542829644 1 0 0                100 1 . .
 80.61224489795919 0 0 1 28.479886080455678 . 1 0
                80 0 0 1  33.89830508474576 0 1 0
                80 0 0 1  35.99689681923962 0 1 0
 82.45614035087719 0 0 1  35.17816527672473 0 1 0
 68.35820895522387 0 0 1  42.44186046511624 0 1 0
                24 1 0 0                100 . . .
             56.25 1 0 0                100 0 . .
23.076923076923077 1 0 0                100 0 . .
 47.05882352941177 1 0 0                100 0 . .
 55.55555555555556 1 0 0                100 0 . .
48.658536585365916 1 0 0                100 . . .
 40.22346368715088 1 0 0                100 . . .
                 . . . .                  . . . .
 61.08949416342412 1 0 0                100 0 . .
                 . 1 0 0                100 0 . .
                50 . . .                  . . . .
 33.33333333333333 . . .                  . 0 . .
                40 0 1 0                  0 . 0 .
 34.78260869565218 0 1 0                  0 . 0 .
58.333333333333336 0 1 0                  0 0 1 .
15.789473684210526 0 1 0                  0 . 1 .
22.727272727272727 0 1 0                  0 . 1 .
 33.33333333333333 0 1 0                  0 . 1 .
32.142857142857146 0 1 0                  0 . 1 .
                25 0 1 0                  0 . 1 .
 79.32850559578671 1 0 0                100 . . .
             81.25 1 0 0                100 . . .
 45.23433385992628 1 0 0                100 . . .
                 . . . .                  . . . .
 78.84615384615384 1 0 0                100 . . .
            35.625 1 0 0                100 0 . .
31.914893617021278 1 0 0                100 0 . .
30.645161290322577 1 0 0                100 0 . .
 25.71428571428572 1 0 0                100 0 . .
23.958333333333332 1 0 0                100 1 . .
 80.82901554404145 0 0 1 13.916991582118477 . 1 0
 81.64556962025317 0 0 1   21.4458712242379 0 1 0
 83.33333333333334 0 0 1  32.83624195666059 0 1 0
 85.29411764705883 0 0 1  32.80721956160032 0 1 0
              72.5 0 0 1  35.99035698884507 0 1 0
 41.66666666666667 . . .                  . . . .
 41.66666666666667 1 0 0                100 0 . .
                40 1 0 0                100 . . .
                50 1 0 0                100 0 . .
50.391644908616186 1 0 0                100 0 . .
                 . 0 1 0                  0 . 0 .
                20 0 1 0                  0 . 0 .
              22.5 0 1 0                  0 . 0 .
47.368421052631575 0 1 0                  0 . 0 .
                 . 0 1 0                  0 . 0 .
                 . . . .                  . . . .
                 . . . .                  . . . .
                 . . . .                  . . . .
                 . 0 1 0                  0 . 0 .
                 . 0 1 0                  0 . 0 .
end

price_dispersion_use is my dependent variable and is a percentage that measures the share of the cheapest price in the the most expensive price, it can range from zero to 100. A higher percentage means a smaller gap between the cheap and expensive brands and vice versa.

specific is a binary indicator showing that a pure specific system is in place; ad valorem is a binary indicator showing that a pure ad valorem system is in place and mixed_excise shows that a mixed system is in place.

autoinfl_adj_of_spec can only defined be defined as zero or 1 if the variable called specific==1 or mixed==1. If the tax is levied as an ad valorem tax, then autoinfl is Not Applicable (N/A) in practice and represented by missing in the dataset.

retailpisbase_inadv identifies if the price used in the base of the ad valorem tax is the retail price, as opposed to a 'pre-retail' price. It can only be defined, either as 1 or zero, if ad valorem==1 or mixed_excise==1. If a country implements a specific tax, whether the retail price is the base or not is Not Applicable (N/A) in practice, and represented by missing in the dataset.

more_specific_inmixed defines whether, in a mixed excise system, which combines both specific and ad valorem taxes, the share of specific taxes in total taxes is larger than the share of ad valorem taxes in total taxes. A value of one indicates that a larger proportion of total excise taxes in the mixed system is made up of specific taxes; a value of zero indicates that a larger proportion of total excise taxes in the mixed system is made up of ad valorem taxes. If a country implements a pure ad valorem, or a pure specific system, this variable is Not Applicable in practice and represented by missing in the dataset.

specificshare is the total share of all excise taxes made up specific taxes. It’s continuous. For a specific tax, this will be 100%, if it's ad valorem it will be 0% and if it's a mixed system then 0%<specificshare<100%.

This is what the data looks like:

Code:

xtset id year, delta(2)

Panel variable: id (unbalanced)
 Time variable: year, 2014 to 2022
         Delta: 2 units

. 
. xtsum price_dispersion_use specific ad_valorem mixed_excise specificshare autoinfl_adj retailpisbase_inadv more_spec_inmixed

Variable         |      Mean   Std. dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
price_~e overall |   54.1633   23.68788          4        100 |     N =     821
         between |             21.47537   7.836111   92.83242 |     n =     192
         within  |             10.13382   16.45002   99.53367 | T-bar = 4.27604
                 |                                            |
specific overall |  .3879717   .4875756          0          1 |     N =     848
         between |             .4659136          0          1 |     n =     181
         within  |             .1368678  -.4120283   1.187972 | T-bar = 4.68508
                 |                                            |
ad_val~m overall |  .2570755   .4372793          0          1 |     N =     848
         between |             .4275308          0          1 |     n =     181
         within  |             .1601849  -.5429245   1.057075 | T-bar = 4.68508
                 |                                            |
mixed_~e overall |  .3549528   .4787814          0          1 |     N =     848
         between |             .4448862          0          1 |     n =     181
         within  |             .1631064  -.4450472   1.154953 | T-bar = 4.68508
                 |                                            |
speci~re overall |  58.37707   41.80718          0        100 |     N =     846
         between |             40.85851          0        100 |     n =     181
         within  |             13.11557  -21.62293   138.3771 | T-bar = 4.67403
                 |                                            |
autoin~j overall |   .232381   .4227534          0          1 |     N =     525
         between |             .3400162          0          1 |     n =     145
         within  |             .2234645   -.567619   1.032381 | T-bar = 3.62069
                 |                                            |
retail~v overall |  .4484053   .4977981          0          1 |     N =     533
         between |             .4720523          0          1 |     n =     127
         within  |             .1402647  -.3515947   1.248405 | T-bar = 4.19685
                 |                                            |
more_s~d overall |  .5949367   .4916829          0          1 |     N =     316
         between |             .4436007          0          1 |     n =      78
         within  |             .2174893  -.2050633   1.394937 | T-bar = 4.05128

.

When I run the regression, I am now in the situation where only complete observations are used; but I do not want this; I want to keep all observations that have data, even if a particular dummy variable sees the observation as missing. How do I do this?

Code:

xtreg price_dispersion_use specificshare  i.autoinfl_adj i.retailpisbase_inadv i.more_spec_inmixed, fe cluster(id)

Fixed-effects (within) regression               Number of obs     =        266
Group variable: id                              Number of groups  =         75

R-squared:                                      Obs per group:
     Within  = 0.0433                                         min =          1
     Between = 0.1085                                         avg =        3.5
     Overall = 0.1130                                         max =          5

                                                F(3,74)           =          .
corr(u_i, Xb) = -0.6257                         Prob > F          =          .

                                             (Std. err. adjusted for 75 clusters in id)
---------------------------------------------------------------------------------------
                      |               Robust
 price_dispersion_use | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
----------------------+----------------------------------------------------------------
        specificshare |   .0302317   .1436667     0.21   0.834    -.2560305    .3164939
       1.autoinfl_adj |   5.235087   2.946485     1.78   0.080    -.6359117    11.10609
1.retailpisbase_inadv |  -15.07363   .5243044   -28.75   0.000    -16.11833   -14.02893
  1.more_spec_inmixed |  -1.764627   3.135576    -0.56   0.575    -8.012399    4.483145
                _cons |   72.60878   8.022124     9.05   0.000     56.62435    88.59322
----------------------+----------------------------------------------------------------
              sigma_u |  24.217516
              sigma_e |  9.3652246
                  rho |  .86990801   (fraction of variance due to u_i)
---------------------------------------------------------------------------------------

Thank you!

Sam

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

12 Jun 2024, 16:47

I think this is a different topic that should have been begun in a new thread. Be that as it may, it's an interesting question, clearly explained, and I'll try to answer it here.

First, there are some problems with your data. You have observations where the tax type (specific, ad_valorem, or mixed_excise) is not specified at all: all three of those variables are missing. Next, you say that autoinfl_adj is only definable for specific and mixed taxes, but you have it as 1, not missing, in an observation where ad_valorem = 1 and price_dispersion_use = 58.33333. So the first priority is to fix up your data so that it actually conforms to the specification you gave. Don't proceed until your data can pass the following consistency checks:

Code:

// VERIFY THAT ad_valorem, mixed_excise, AND specific CONSTITUTES A // MUTUALLY EXCLUSIVE AND EXHAUSTIVE CLASSIFICATION OF OBSERVATIONS foreach v of varlist specific ad_valorem mixed_excise { assert inlist(`v', 0, 1) } egen check = rowtotal(specific ad_valorem mixed_excise) assert check == 1 drop check

Next, having three separate dichotomous variables for the three tax types is not useful. You need a single 3-level variable for that:

Code:

// CREATE A 3-LEVEL TAX TYPE VARIABLE label define tax_type 1 "ad valorem" 2 "mixed" 3 "specific" gen byte tax_type:tax_type = 1 if ad_valorem == 1 replace tax_type = 2 if mixed_excise == 1 replace tax_type = 3 if specific == 1

Next, to keep all those observations from dropping out, you need to replace the missing values of autoinfl_adj retailpisbase_inadv and more_spec_inmixed by something non-missing. For the approach that I'm showing here it actually doesn't matter what you replace them with, so long as it's not missing. For simplicity, I'll replace them all with zeroes, but, again, any non-missing value will serve.

Code:

// REPLACE MISSING VALUES BY AN ARBITRARY NUMBER (0 USED FOR SIMPLICITY) foreach v of varlist autoinfl_adj retailpisbase_inadv more_spec_inmixed { replace `v' = 0 if missing(`v') }

Now, you only want the value of autoinfl_adj to matter for mixed and specific taxes. So the trick is that you don't directly use the autoinfl_adj variable: you instead use only its interactions with the mixed and specific levels of tax_type. That way the value of autoinfl_adj is ignored when tax_type = ad_valorem. Analogous considerations for retailpisbase_inadv and more_spec_inmixed lead to:

Code:

reg price_dispersion_use specificshare i.autoinfl_adj#i(2 3).tax_type /// i.retailpisbase_inadv#i(1 2).tax_type i.more_spec_inmixed#2.tax_type

Note: I changed the model from -xtreg, fe- to -reg- because your example data did not include the id variable, so I could not -xtset id-. But the same approach will work with -xtreg, fe-. The only thing you may encounter differently with -xtreg, fe- is that it is possible that some of these interaction terms could prove invariant within id, in which case they will be omitted due to colinearity with the id fixed effect.

Also be aware that at least in your example data, some of the possible combinations of tax type and the other variables do not actually occur in the data, which leads to some terms being omitted as a result. Presumably in your full data set, that will not happen.
Comment
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#5

12 Jun 2024, 20:52

Dear Clyde,

It’s going to take me a little while to implement this; but now that I have read through it I want to thank you very much for taking the time to provide such a detailed response to my question. This is incredibly helpful!

Sam
Comment

Announcement