  • Collinearity from area dummies

    Hello everyone!

    I am estimating a model for my undergraduate thesis. In the model, I have dummy variables for urban, rural, and mixed areas to categorize my observations (the data are cross-sectional). This is how I store them in my table: if an observation is urban, the 'urban' column is 1 and the others are 0; if an observation is mixed, the 'mixed' column is 1 and the others are 0; if an observation is rural, the 'rural' column is 1 and the others are 0. I just realized that this causes collinearity: Stata omitted one of my dummies, the rural one, as well as the interaction between that dummy and my main explanatory variable. I am planning to use these dummies to moderate my main explanatory variable, to see how the effect changes across different recipient areas. As you might have guessed, I'm not really good with statistics lol. So my question is: is there a workaround for this?

    Thanks in advance!

  • #2
    Unless you are using some archaic version of Stata, you should not create these indicator ("dummy") variables yourself. You should instead have a single variable, area_type, set to 1 for urban, 2 for mixed, and 3 for rural. Then in your regression analyses, you can refer to that variable as i.area_type, and Stata will create appropriate indicators "on the fly" in the regression. Similarly, the interaction with other variables will involve notation like i.area_type##other_variable. This is called factor variable notation. Read -help fvvarlist- to see how it is used. You may then want to look at the outcome values in each category or marginal effects using the -margins- command. While the Stata documentation for -margins- is complete and thorough, I think an easier way to learn about it is from Richard Williams' https://www3.nd.edu/~rwilliam/stats/Margins01.pdf.
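
    A minimal sketch of that workflow, on simulated data so it runs on its own. The indicator names urban, mixed, and rural follow the description in #1; the outcome y, the explanatory variable x, the helper grp, and all the numbers are hypothetical placeholders. It also assumes the main explanatory variable is continuous, which is why it carries the c. prefix inside the interaction (without a prefix, Stata treats variables in an interaction as categorical).

    ```stata
    clear
    set seed 12345
    set obs 300

    * Hypothetical data: three mutually exclusive 0/1 indicators, as described in #1
    generate byte grp   = 1 + mod(_n, 3)
    generate byte urban = (grp == 1)
    generate byte mixed = (grp == 2)
    generate byte rural = (grp == 3)
    generate x = rnormal()
    generate y = 1 + 0.5*x + 0.3*mixed*x + 0.6*rural*x + rnormal()
    drop grp

    * Collapse the three indicators into one categorical variable
    generate byte area_type = 1*urban + 2*mixed + 3*rural
    label define area_type 1 "urban" 2 "mixed" 3 "rural"
    label values area_type area_type

    * Sanity check: every observation falls in exactly one category
    assert inlist(area_type, 1, 2, 3)

    * Factor-variable notation: main effects plus interaction; c. marks x as continuous
    regress y i.area_type##c.x

    * Slope of x within each area type
    margins area_type, dydx(x)
    ```

    With real data, only the lines from "Collapse the three indicators" onward would be needed, with y and x replaced by the actual variable names.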

    On a more conceptual level, whenever you have a categorization involving N levels, the representation of that by indicator variables involves N-1 variables. One of the levels is always omitted as the reference level, and the other variables' coefficients represent the differences between their corresponding levels and the reference level. (The effect on the reference level is absorbed into the regression's constant term.)
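
    To make this concrete for the three area types, here is a sketch in which the symbols y (outcome), x (main explanatory variable), and the betas are illustrative names, not taken from the original post, and urban is taken as the reference level purely for illustration. The three indicators always sum to one, which is exactly the constant column, so all three cannot be estimated alongside a constant term:

    ```latex
    % Source of the collinearity: for every observation i
    \mathrm{urban}_i + \mathrm{mixed}_i + \mathrm{rural}_i = 1

    % Model with N - 1 = 2 indicators (urban as reference) and interactions
    y_i = \beta_0 + \beta_1\,\mathrm{mixed}_i + \beta_2\,\mathrm{rural}_i
          + \beta_3\,x_i + \beta_4\,(\mathrm{mixed}_i \times x_i)
          + \beta_5\,(\mathrm{rural}_i \times x_i) + \varepsilon_i

    % Implied slope of x in each area type
    \text{urban: } \beta_3, \qquad
    \text{mixed: } \beta_3 + \beta_4, \qquad
    \text{rural: } \beta_3 + \beta_5
    ```

    So nothing is lost by dropping one indicator: the slope for the reference level is the coefficient on x itself, and each interaction coefficient is the difference between that level's slope and the reference slope.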


    • #3
      Thank you so much Dr Clyde Schechter for the response and suggestion! It helps me a lot! I wasn't aware I could do that, as I'm still not quite familiar with the software haha
