Question about interpreting factor variables in Stata output

Thomas Robert

Join Date: Apr 2023

Posts: 9
#1

28 Apr 2023, 13:49

Hello,

I have a quick question about interpreting multi-category categorical variables (polytomous variables) in Stata. I looked for previous posts asking about this, but did not find any. Let's say that I am interested in estimating a linear model of average July temperature in U.S. cities based on region. I will use the citytemp internal dataset in Stata. In the dataset, there is the continuous variable "tempjuly" that is the average temperature in July and the variable "region" that is a categorical variable with four categories where 1=NE, 2=N Centrl, 3=South, and 4=West. I estimate my model using tempjuly as the outcome and region as the predictor, and I do this using the "i." prefix for "region" to tell Stata to treat this as a factor variable.

Code:

sysuse citytemp
tab tempjuly
tab region
reg tempjuly i.region

Stata defaults to using 1=NE as the reference category for region, and then includes three dummy variables for the remaining categories. I know that, in interpreting my results, I need to compare each of these dummy variable coefficients with the baseline reference category. My question is: when reporting the result for one dummy variable, should I describe the other two dummy variables as control variables? For example, the coefficient for "South" is 7.6396. Two options for how I can report this result are:

1. "As the coefficient shows, a city being in the South region is estimated to have an average temperature in July that is 7.64 degrees higher relative to a city being in the NE region."

Or

2. "As the coefficient shows, a city being in the South region is estimated to have an average temperature in July that is 7.64 degrees higher relative to a city being in the NE region, after controlling for the regions N Centrl and West."

Which is the more precise way to report results?

Here is why I ask: in multiple statistics textbooks I have read, when they discuss making polytomous variables into dummy variables and then conducting a regression, they report results as if the other dummy variables are not controls (i.e., they report results of the dummy variables as if they are each bivariate regressions). I think the reason for this is, theoretically, each of these dummies is really part of a larger polytomous variable. Mathematically, however, each of these dummies is a control variable for each of the other dummies. Is option #2 the more precise, correct way to report results?
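As a side note on the reference category: Stata takes the lowest-valued level of a factor variable as the base by default, which is why NE (region 1) plays that role here. Below is a minimal sketch, using standard factor-variable syntax, of how to display the base level in the output or switch it. Choosing South or West as the alternative base is only for illustration.

```stata
sysuse citytemp, clear

* Default: the lowest level (1 = NE) is the base; baselevels prints it in the table
regress tempjuly i.region, baselevels

* Same model with South (level 3) as the base category instead
regress tempjuly ib3.region

* Or use the last level (4 = West) as the base
regress tempjuly ib(last).region
```

Changing the base changes which comparison each coefficient expresses, but not the fitted values or the overall fit of the model.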

Best,
Thomas

Last edited by Thomas Robert; 28 Apr 2023, 13:52.
Tags: categorical, categorical variable, dummy variable, factor
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

28 Apr 2023, 14:54

You should report the results as #1. You are correct that, strictly speaking, the results for each region are adjusted for the others, but when the indicators in question are all part of a single categorical variable, everybody will understand what is going on. If you use #2, it will seem weird to people.
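One way to see why #1 is adequate when region is the only predictor: each coefficient is then simply the difference between that region's mean and the NE mean. Below is a minimal sketch with the same dataset that checks this for South. The scalar names are illustrative, and the joint test at the end is just one way of treating the three indicators as a single variable.

```stata
sysuse citytemp, clear
regress tempjuly i.region

* Group means of July temperature by region
tabstat tempjuly, by(region) statistics(mean n)

* Coefficient on South versus the raw difference in means (South - NE)
summarize tempjuly if region == 3, meanonly
scalar mean_south = r(mean)
summarize tempjuly if region == 1, meanonly
scalar mean_ne = r(mean)
display "coefficient on 3.region: " _b[3.region]
display "mean(South) - mean(NE):  " mean_south - mean_ne

* Joint test of the three region indicators as one variable
testparm i.region
```

The two displayed numbers should agree up to rounding. This equality holds only because the model contains no predictors other than region.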

By the way, I realize that the term control variable is in widespread use and my personal campaign to stamp it out is doomed. Nevertheless, at least keep it in the front of your mind that in observational data you are not controlling anything. Only experiments produce control (and sometimes even they don't). You are adjusting for the effects of these other variables; it is not possible to control them in observational data. And the term covariate is preferable to the term control variable as it does not imply something that isn't possible.
Thomas Robert

Join Date: Apr 2023

Posts: 9
#3

28 Apr 2023, 17:51

Hello Clyde,

Thanks for your input! I had wondered why statistics textbooks do not report the other dummies as controls. Also, I appreciate your comment about the term "control variables." As one who strives to be precise about how I report results (hence the reason for the original post), I'm wondering what phrasing and terms you would use instead. I'm careful to avoid using the phrase "when other variables are held constant" to describe other independent variables, given the criticism here: https://journals.sagepub.com/doi/pdf...867X1601600103. In the article, the author advocates for describing a beta coefficient as "...how Y responds to change in X after adjusting for simultaneous linear change in the other predictors..." (p. 7). The issue, of course, is that this is somewhat verbose compared to using words like "controlling for" or "held constant."

So, instead of stating, "the coefficient of beta is the effect that X has on Y controlling for the other independent variables," how would you phrase it? One idea that I thought of based on your comment is to just say "the coefficient of beta is the effect that X has on Y adjusting for covariates." The word "adjusting" might not be perfect here, but it gets a little closer to being precise, and this also uses the word "covariates" instead of "control" as you recommend. I'm curious to know what you think.

Best,
Thomas
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

28 Apr 2023, 18:49

the coefficient of beta is the effect that X has on Y adjusting for covariates

This would be excellent.
Thomas Robert

Join Date: Apr 2023

Posts: 9
#5

01 May 2023, 13:24

Hello Clyde,

I like this suggestion. Thanks again for the quick reply. I thought of another quick question about a multi-level categorical variable. Hypothetically, let's say that I want to predict township tax revenue based on the county a township is in, and I have a dataset of 300 townships and three counties. But let's say that a few of the townships straddle two different counties (i.e., the county dividing line cuts through the township). So, I have a categorical variable for "county" that is coded as 1: County A, 2: County B, 3: County C, 4: County B and County C.

One option is to use "i." in Stata for the county variable, which makes County A the reference category, and then I can interpret the results for townships in County B only, County C only, and County B and C together, relative to County A. Another option is that I create three dummy variables for County A, B, and C instead of using "i." in Stata. In this second option, do I still need to leave out one dummy variable as a reference category? Or can I include all three dummy variables, since the County B and County C dummies are not mutually exclusive? I know that using either of these options will change the interpretation of results, but do you know if one option is more mathematically appropriate than the other? Or is the decision to use "i." or to separate these into individual dummies driven entirely by theory, depending on how I want to interpret the results?

Best,
Thomas
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

01 May 2023, 15:21

Another option is that I create three dummy variables for County A, B, and C instead of using "i." in Stata. In this second option, do I still need to leave out one dummy variable as a reference category? Or, can I include all three dummy variables since the County B and County C dummies are not mutually exclusive?

You would not leave out one, because with B and C not being mutually exclusive, there is no collinearity here.

However, there is another modeling issue with using separate variables for A, B, and C: the effect of being split between county B and county C is probably not the same as the sum of the separate county B and county C effects. So modeling this way would be a mis-specification of the process. You would also need to include a county B # county C interaction term.

My inclination would be to use a four-level county variable (A, B, C, B&C) and use i.county. This is the simplest representation of the problem, and the interpretation of the coefficients is as clear as possible. Mathematically, it would be equivalent to doing i.countyA, i.countyB, i.countyC, and i.countyB#i.countyC. The latter would be preferable if the focus is specifically on the synergy or interference, as the case may be, resulting from being in two counties. But otherwise, it's just more complicated with no added value I can see.
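The equivalence of the two codings can be checked on made-up data. The sketch below is only an illustration: the dataset is simulated, and the variable names (county, revenue, inB, inC) and effect sizes are hypothetical. One assumption to flag: with this coding the County A indicator equals 1 - B - C + B*C, so once the B#C interaction is in the model it is redundant, and the sketch leaves it out rather than letting Stata omit a term.

```stata
clear
set seed 12345
set obs 300

* Hypothetical data: county coded 1 = A, 2 = B, 3 = C, 4 = B and C
generate county = 1 + mod(_n, 4)
generate revenue = 100 + 10*(county == 2) + 20*(county == 3) ///
    + 45*(county == 4) + rnormal(0, 5)

* Coding 1: a single four-level factor variable
regress revenue i.county
predict double fit1

* Implied "interaction" from coding 1: (B and C) minus B minus C
lincom 4.county - 2.county - 3.county

* Coding 2: separate B and C indicators plus their interaction
generate byte inB = inlist(county, 2, 4)
generate byte inC = inlist(county, 3, 4)
regress revenue i.inB##i.inC
predict double fit2

* The two codings give the same fitted values
assert reldif(fit1, fit2) < 1e-8
```

The lincom estimate from the first model should match the coefficient on 1.inB#1.inC in the second. That is the sense in which the second coding isolates the synergy or interference from being in two counties, while the first reports each group relative to County A.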
Thomas Robert

Join Date: Apr 2023

Posts: 9
#7

04 May 2023, 20:52

Hello Clyde,

I was also leaning towards keeping the categorical variable as it is, so this is helpful. Thanks again for your input!