Hello,
I have a quick question about interpreting multi-category categorical (polytomous) variables in Stata. I looked for previous posts on this but did not find any. Suppose I want to estimate a linear model of average July temperature in U.S. cities based on region, using the citytemp dataset that ships with Stata. The dataset contains the continuous variable "tempjuly" (average July temperature) and the categorical variable "region", which has four categories: 1=NE, 2=N Centrl, 3=South, and 4=West. I estimate the model with tempjuly as the outcome and region as the predictor, using the "i." prefix so that Stata treats region as a factor variable.
By default, Stata uses 1=NE as the reference category for region and includes three dummy variables for the remaining categories. I know that when interpreting my results, I need to compare each of these dummy coefficients with the reference category. My question is: when reporting the result for one dummy, should I describe the other two dummy variables as control variables? For example, the coefficient for "South" is 7.6396. Two options for how I can report this result are:
1. "As the coefficient shows, a city in the South region is estimated to have an average July temperature that is 7.64 degrees higher than a city in the NE region."
Or
2. "As the coefficient shows, a city in the South region is estimated to have an average July temperature that is 7.64 degrees higher than a city in the NE region, after controlling for the regions N Centrl and West."
Which is the more precise way to report results?
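For reference, here is a minimal sketch of how I would check what the South coefficient is being compared against. It uses only the citytemp dataset and built-in commands, and I have not pasted any output. With region as the only predictor, the constant should equal the mean of tempjuly in NE, and each coefficient should equal that region's mean minus the NE mean.

```stata
sysuse citytemp, clear

* Mean July temperature within each region
tabstat tempjuly, by(region) statistics(mean n)

* Default coding: NE (region == 1) is the base category
regress tempjuly i.region

* South minus NE, taken directly from the fitted model
lincom 3.region

* Same model with South as the base category instead
regress tempjuly ib3.region
```

Changing the base category with "ib3." changes which comparisons the coefficients express, but the fitted region means stay the same.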
Here is why I ask: in multiple statistics textbooks I have read, when they discuss turning a polytomous variable into dummy variables and then running a regression, they report results as if the other dummy variables were not controls (i.e., as if each dummy coefficient came from its own bivariate regression). I think the reason is that, theoretically, each of these dummies is really part of a larger polytomous variable. Mathematically, however, each of these dummies is a control variable for each of the other dummies. Is option #2 the more precise, correct way to report results?
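To make the distinction concrete, the sketch below contrasts the full set of region dummies with a regression on a single South indicator. The variable name "south" is one I made up for illustration. In the second model the comparison group is every non-South city pooled together, so its coefficient answers a different question than the South coefficient in the first model.

```stata
sysuse citytemp, clear

* Full factor variable: South is compared with NE only
regress tempjuly i.region

* Hypothetical single indicator: 1 if South, 0 for any other region
generate byte south = (region == 3) if !missing(region)

* Bivariate regression: South is compared with NE, N Centrl, and West pooled
regress tempjuly south
```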
Best,
Thomas
Code:
sysuse citytemp
tab tempjuly
tab region
reg tempjuly i.region