
  • 3 Dummy variables in panel data

    Hello everyone,

    I have a question. Since I need 3 dummy variables, one of them must always be omitted, and that one becomes the constant. My question is how to set this up in Stata. I have created 2 dummy variables, low and high, and I would like to include medium in my xtreg as well, but I don't know how. My professor told me it becomes the constant and that I don't need to create another dummy variable for it. I'm a bit lost; maybe someone could help me here?

    Code:
    //L Low M Medium H High
    gen Built_L = Built_area < 18.27
    gen Built_H = Built_area > 25.981
    gen Built_M = 1 - Built_L - Built_H
    
    gen Agri_L = Agri_area < 48.90
    gen Agri_H = Agri_area > 54.75
    gen Agri_M = 1 - Agri_L - Agri_H
    
    gen Forest_L = NaturalForest_area < 7.3
    gen Forest_H = NaturalForest_area > 15.82
    gen Forest_M = 1 - Forest_H - Forest_L
    
    gen low_dev = Built_L & Agri_H & Forest_H
    gen high_dev = Built_H & Agri_L & Forest_L

  • #2
    Unless you are using some ancient version of Stata, there is no reason to create dummy variables at all. Use factor-variable notation instead. I illustrate the approach for the variable built_level, but you can use the same approach with each of the other variables.

    Code:
    label define LMH    1   "Low"   2   "Medium"    3   "High"
    
    gen built_level:LMH = 1 if Built_area < 18.27
    replace built_level = 3 if Built_area > 25.981 & !missing(Built_area)
    replace built_level = 2 if missing(built_level) & !missing(Built_area)
    
    regress some_outcome_variable i.built_level
    So built_level is a new variable that takes on values 1, 2, and 3, corresponding to, and labeled as, Low, Medium, and High, respectively. In the regression, Stata will select one of these as the omitted category (probably the Low category) and leave it out for you. It's all done automatically, and it's bullet-proof. If you specifically want the medium category to be the omitted category, you can force that with
    Code:
    regress some_outcome_variable ib2.built_level
    Remember, though, that the choice of omitted category is just an aesthetic matter. All estimable regression results are unaffected by that choice. Sometimes it is convenient in terms of the way people think about the model to choose one category over the others: for example, in an experiment, it is common to make the control group the omitted category. But mathematically it doesn't matter.

    I would also recommend that you not rely on the notion that the omitted category "becomes the constant." That is only true if you have only one set of such variables in the model. To avoid any confusion, if you want to see the modeled expected values of the outcome variable by all categories, use the -margins- command:
    Code:
    margins built_level
    Finally, I want to point out that your way of defining the variables Built_L, Built_H, and Built_M in #1 will give incorrect results if there are missing values for Built_area. All of those would be misclassified as Built_H using your code.
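
    If you did want explicit dummies anyway, a missing-safe version of those definitions might look like this (a sketch, untested):
    Code:
    * Sketch: the -if- qualifier keeps observations with missing Built_area
    * as missing rather than silently coding them as 0 or 1
    gen Built_L = Built_area < 18.27 if !missing(Built_area)
    gen Built_H = Built_area > 25.981 if !missing(Built_area)
    gen Built_M = 1 - Built_L - Built_H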



    • #3
      Originally posted by Clyde Schechter
      Unless you are using some ancient version of Stata, there is no reason to create dummy variables at all. Use factor-variable notation instead. [...]
      Thank you so much! This approach seems to work for all three. I have run the following:

      Code:
      //L Low M Medium H High
      label define LMH    1   "Low"   2   "Medium"    3   "High"
      
      gen built_level:LMH = 1 if Built_area < 18.27
      replace built_level = 3 if Built_area > 25.981 & !missing(Built_area)
      replace built_level = 2 if missing(built_level) & !missing(Built_area)
      
      label define ALMH   1   "Low"   2   "Medium"    3   "High"
      
      gen agri_level:ALMH = 1 if Agri_area < 48.90
      replace agri_level = 3 if Agri_area > 54.75 & !missing(Agri_area)
      replace agri_level = 2 if missing(agri_level) & !missing(Agri_area)
      
      label define NLMH   1   "Low"   2   "Medium"    3   "High"
      
      gen natural_level:NLMH = 1 if NaturalForest_area < 7.3
      replace natural_level = 3 if NaturalForest_area > 15.82 & !missing(NaturalForest_area)
      replace natural_level = 2 if missing(natural_level) & !missing(NaturalForest_area)
      I only changed the label define, adding A for agricultural and N for NaturalForest. Now my question is: how can I classify a city as low development on the condition that agri_level is high, built_level is low, and natural_level is high?
      Code:
      gen low_devel=built_level==1 + agri_level==3 + natural_level==3
      If I run the code above I get:

      Code:
      . regress Unemployment_rate Interest  ib2.low_devel
      note: 0.low_devel omitted because of collinearity
      note: 2b.low_devel identifies no observations in the sample
      
            Source |       SS           df       MS      Number of obs   =     3,210
      -------------+----------------------------------   F(1, 3208)      =    419.82
             Model |  112.282202         1  112.282202   Prob > F        =    0.0000
          Residual |  857.999978     3,208  .267456352   R-squared       =    0.1157
      -------------+----------------------------------   Adj R-squared   =    0.1154
             Total |   970.28218     3,209  .302362786   Root MSE        =    .51716
      
      ------------------------------------------------------------------------------
      Unemployme~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          Interest |   .2919402   .0142484    20.49   0.000     .2640034    .3198771
                   |
         low_devel |
                0  |          0  (omitted)
                2  |          0  (empty)
                   |
             _cons |   .6126895   .0586884    10.44   0.000     .4976189    .7277601
      ------------------------------------------------------------------------------
      By the way, I find your code very handy; I never knew about these commands. It's a pity they don't teach this at my university. Thank you for that!



      • #4
        I forgot to ask: since Built_area, Agri_area, and NaturalForest_area remain fairly constant across years for every city, I want every year of a city to get the same classification. I do have missing values for some years, but I have looked into the data and the values do not change much.
        For example:

        Code:
        * This city should get High for built_level in all years, and likewise Low for agri_level in all years
        GM_naam        Year  Built_area  Semi_Built_area  Agri_area  NaturalForest_area  HousePrice  built_level  agri_level  natural_level
        's-Gravenhage  2015  58.2        2.8              2          13                  184000      High         Low
        's-Gravenhage  2014                                                              188000
        's-Gravenhage  2016                                                              188000
        's-Gravenhage  2017                                                              197000
        's-Gravenhage  2013                                                              199000
        's-Gravenhage  2012  57.5        3.4              2.6        12.9                207000      High         Low
        's-Gravenhage  2011                                                              210000
        's-Gravenhage  2018                                                              212000
        's-Gravenhage  2010  57.7        3.6              3.1        12                  213000      High         Low
        's-Gravenhage  2019                                                              242000
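
        Would something like this work to fill in the gaps? (A sketch, untested; it assumes the classification really is constant within each city.)
        Code:
        * Sketch: missing sorts last in Stata, so after sorting within city
        * the first observation holds a non-missing level whenever one exists
        bysort GM_naam (built_level):   replace built_level   = built_level[1]   if missing(built_level)
        bysort GM_naam (agri_level):    replace agri_level    = agri_level[1]    if missing(agri_level)
        bysort GM_naam (natural_level): replace natural_level = natural_level[1] if missing(natural_level)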



        • #5
          Adam:
          1) your -generate- code is wrong, as it lacks an -if- qualifier.
          You should try (caveat emptor: code untested):
          Code:
          gen low_devel=1 if built_level==1 & agri_level==3 & natural_level==3
          2) you should become your own Stata teacher (as many on this list have)
          Kind regards,
          Carlo
          (StataNow 18.5)



          • #6
            I think Carlo's advice is just a tad off. It will create a 1/. variable for low_devel, which will result in no effect of low_devel in the regression: the base value 1 will be omitted, and the remaining observations will be dropped due to missing values. It should be:
            Code:
            gen low_devel = (built_level==1 & agri_level==3 & natural_level==3)
            However, his key point, that you needed to connect those logical expressions with &, not +, is spot on.
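
            For the record, the original line misbehaved because of operator precedence: in Stata, + binds more tightly than ==, and relational operators then evaluate left to right, so it was not three separate conditions at all:
            Code:
            * How Stata actually parses the original line:
            * gen low_devel = ((built_level == (1 + agri_level)) == (3 + natural_level)) == 3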



            • #7
              Originally posted by Clyde Schechter
              I think Carlo's advice is just a tad off. [...]
              Thank you very much; looking back, that was a stupid mistake. I need to get used to Stata. Unfortunately, I have only a basic knowledge of Stata.



              • #8
                Unfortunately, I have only a basic knowledge of Stata.
                None of us was born knowing Stata. And it can be a steep learning curve, particularly if it is your first brush with a programming language. Part of the reason I can help others with Stata now is that I've been using it since 1994 and I've already made most of the mistakes I see here myself and can recognize my own experience with them. That said, an even bigger part is what I learned from Statalist--which I read faithfully, posting questions but not answers, for many, many years.



                • #9
                  Originally posted by Clyde Schechter
                  None of us was born knowing Stata. [...]
                  I agree. I have learned a lot on this forum from others, especially from you, and I am very thankful for all the free help, and for the time people take to respond and replicate the data. As soon as I have learned most of the syntax, I'll try to help others on this forum.



                  • #10
                    Originally posted by Clyde Schechter
                    None of us was born knowing Stata. [...]

                    Code:
                    . xtreg log_realHP CPI_percentage Unemployment_rate real_interest logReal_income logrealconsind logpop i.low_dev i.Year, fe cluster(GM_code)
                    note: 2017.Year omitted because of collinearity
                    note: 2018.Year omitted because of collinearity
                    note: 2019.Year omitted because of collinearity
                    
                    Fixed-effects (within) regression               Number of obs     =      2,250
                    Group variable: GM_code                         Number of groups  =        282
                    
                    R-sq:                                           Obs per group:
                         within  = 0.8386                                         min =          7
                         between = 0.0223                                         avg =        8.0
                         overall = 0.0019                                         max =          8
                    
                                                                    F(11,281)         =    1037.91
                    corr(u_i, Xb)  = -0.3609                        Prob > F          =     0.0000
                    
                                                       (Std. Err. adjusted for 282 clusters in GM_code)
                    -----------------------------------------------------------------------------------
                                      |               Robust
                           log_realHP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    ------------------+----------------------------------------------------------------
                       CPI_percentage |   .1060544     .00435    24.38   0.000     .0974917    .1146172
                    Unemployment_rate |    .036659   .0089862     4.08   0.000     .0189702    .0543479
                        real_interest |   .0584772   .0058742     9.95   0.000     .0469141    .0700402
                       logReal_income |   .2265326   .0964005     2.35   0.019     .0367738    .4162914
                       logrealconsind |   .3836205   .0301206    12.74   0.000     .3243299    .4429112
                               logpop |    .097883   .0651707     1.50   0.134    -.0304017    .2261677
                            1.low_dev |   .0131723   .0033689     3.91   0.000     .0065409    .0198038
                                      |
                                 Year |
                                2013  |  -.0414103    .004665    -8.88   0.000     -.050593   -.0322276
                                2014  |  -.0064646   .0023115    -2.80   0.006    -.0110146   -.0019146
                                2015  |   .0040245   .0033488     1.20   0.230    -.0025675    .0106164
                                2016  |   .0167621   .0024059     6.97   0.000     .0120262     .021498
                                2017  |          0  (omitted)
                                2018  |          0  (omitted)
                                2019  |          0  (omitted)
                                      |
                                _cons |   3.952419   1.338873     2.95   0.003     1.316925    6.587913
                    ------------------+----------------------------------------------------------------
                              sigma_u |   .2533457
                              sigma_e |  .02864636
                                  rho |  .98737607   (fraction of variance due to u_i)
                    -----------------------------------------------------------------------------------
                    This is the regression result I get for low development; however, I am a bit worried about the coefficients for interest and unemployment. I would have expected them to be negative: higher unemployment means people are less able to buy a house, and if interest rates increase, then the cost of borrowing increases, so people are less likely to buy a house, demand for housing decreases, and so do house prices.



                    • #11
                      But that's not correct reasoning. You are reasoning about the isolated effects of interest rates and unemployment on housing prices. If your substantive reasoning is correct (which makes sense to me), and if you have no major problems in the data set, then if you just ran -xtreg log_realHP real_interest, fe cluster(GM_code)- and, separately, -xtreg log_realHP Unemployment_rate, fe cluster(GM_code)-, you would find the signs in accord with your expectations. Try that. If you do not find that, then either your expectations about these relationships are, for some reason, wrong, or something is wrong with your data.
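
                      Spelled out as a sketch (untested, using the variable names from your output):
                      Code:
                      * The two bivariate fixed-effects checks suggested above
                      xtreg log_realHP real_interest, fe cluster(GM_code)
                      xtreg log_realHP Unemployment_rate, fe cluster(GM_code)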

                      But adding other variables to the model can change the findings. It can even flip them to the opposite sign, and by huge amounts. The expectation that there will be any resemblance whatsoever between the direct, unadjusted relationship and a relationship that is adjusted for other factors is simply unfounded. This is known as Simpson's paradox, or, in the context of regression, Lord's paradox. The Wikipedia page on Simpson's paradox is very good, and although it is framed in terms of discrete variables and contingency tables, it illustrates beautifully how this sort of thing can happen.

                      In addition to that, I see another potential reason why those coefficients might not be unbiased estimates of the causal effects of those factors. You have a lot of relationships among the covariates here. My guess is that as unemployment rises or housing prices rise, population would be caused to shrink. If that's right, this makes logpop a collider in a graph of causal relationships in your model. The inclusion of a collider in a model results in biased estimates of the causal relationships.



                      • #12
                        Originally posted by Clyde Schechter
                        But that's not correct reasoning. [...]
                        Thank you very much for the information; it makes a lot of sense to me!
                        I have run the regressions separately as you suggested, and indeed the signs are as expected. Would you suggest just omitting logpop, since you stated that including it results in biased estimates?

                        So would it be correct if I ran log real house price on Unemployment_rate in isolation, and used this procedure with the other variables in isolation as well?
                        Last edited by Adam Klaas; 03 Feb 2022, 04:41.



                        • #13
                          Would you suggest just omitting logpop, since you stated that including it results in biased estimates?
                          First, we need to be clear about your research goals. If you are building a model for purposes of prediction only, then it doesn't matter whether the coefficients make sense. The only issue is predictive accuracy. And in that case, you would not worry about the sign of this or that coefficient as long as the model fits the data reasonably well and can be cross-validated in another sample.

                          But if you are seeking to understand causal relationships, things quickly get complicated with multiple regression models. A model can produce unbiased estimates of causal effects for one particular relationship but be biased with respect to others. This means that you have to be clear about what your primary goal is. If you want to estimate the causal relationship between X1 and Y1, you may need to include or exclude covariates differently than you would to estimate the causal relationship between X1 and Y2, or X2 and Y1, or X2 and Y2. And if you need to do all of those, it is entirely possible that there is no single model that will get all of those right simultaneously. This is a problem that is typically brushed under the rug in publications. We routinely publish tables showing all of the regression coefficients, but typically (mea culpa on this as well) fail to point out that while the regression coefficient for our primary focused relationship is (hopefully) unbiased, the other coefficients might not be, and, in some models, some of them definitely are not.

                          So, if estimating a causal effect of unemployment or interest rates on housing prices is the goal, then, because logpop is a collider,* you should remove it from the model. But if unemployment or interest rates are in the model solely as nuisance variables and the real focus of the study is on the causal effect of some other variable for which logpop is not a collider, then it could be better to leave logpop in the model and just accept that the coefficients of unemployment or interest rates may not be good estimators of their causal effects. You need to be clear on which causal relationship(s) you need to estimate, carefully review a directed acyclic graph (DAG) showing the presumed causal effects among all of the variables in your study, and then select your model covariates accordingly. You must include confounders. You must exclude variables that lie directly on the causal path between your main variables of interest, and you must exclude any collider of the focused causal relationship. If you have multiple causal effects that you need to estimate, then you may have to make different choices of covariates and estimate different models for them.

                          *My assertion that logpop is a collider is based on my lay understanding of the variables in your model and the relationships among them. That understanding may be incorrect. You need to draw your own graph of the causal relationships based on your professional understanding. If you are not confident, obtain advice from people more experienced in that domain.

                          So would it be correct if I ran log real house price on Unemployment_rate in isolation, and used this procedure with the other variables in isolation as well?
                          Frequently in research we report these bivariate associations in addition to our multivariable models. Just remember that, depending on the relationships among all of the different variables, these bivariate estimates may or may not reflect causal estimates--they may be biased by confounding (omitted variable bias).

