  • Need advice on recoding categorical variables into continuous variables and vice versa

    I am new to data analysis and Stata, so I apologise if I have not explained this very well.

    I have two independent variables I need some advice on: one is a ratio variable on age and the other is an ordinal variable on income. My dependent variable is a feeling thermometer on attitudes towards the economy/environment, running from 0 "Prioritise the economy" to 10 "Prioritise the environment".

    I have recoded age into a categorical variable, as I felt it would be better to have age groups - rather than 0-100 - in my multiple regression. My logic was that by doing so, even though I would lose some information, I could better see a difference between age groups and their relationship to my dependent variable. Is this a good idea?

    However, I have also recoded income into a discrete variable (I think). Previously, it was in categories such as "Under GBP 2,600" and "GBP 2,600 - GBP 5,199" etc. I recoded it by using the median of each category: "1300" and "3900". I felt this was a good idea as the income intervals were not equal. I'm hoping this will help when I do other tests, such as difference of means or non-parametric tests. I'm worried that I have done this needlessly, though. I thought that if I turned income into a discrete variable, it would make it easier for me to look at control variables, but I feel I have contradicted this thought process by grouping age.

    Any advice would be much appreciated!

  • #2
    I have recoded age into a categorical variable, as I felt it would be better to have age groups - rather than 0-100 - in my multiple regression. My logic was that by doing so, even though I would lose some information, I could better see a difference between age groups and their relationship to my dependent variable. Is this a good idea?
    In a word, no.

    The only circumstance where it is truly a good idea to make categories out of an inherently continuous independent variable is if, in the real world, something truly discontinuous happens to the dependent variable when the boundary defining adjacent categories is crossed. For example, if your outcome variable were a behavior or other attribute that abruptly changes upon becoming a legal adult (age 18 where I am), then creating separate categories for age < 18 and age >= 18 could be sensible, although really only if nothing much happens within the range of ages over 18. If the outcome continues to change with age after 18, then combining everybody 18 and over into a single group is a catastrophic loss of information.

    There are some other circumstances where it is a sort-of good idea to make categories out of an inherently continuous independent variable. Foremost among these is if the relationship to the outcome variable is expected to be non-linear, and if you then make a large number of categories, this enables you to capture the non-linearity correctly. However, the number of categories must be large enough, and the width of each category small enough, that, for practical purposes, there is no relationship of the outcome to that independent variable within categories. If the number of categories becomes too large to manage when they are made narrow enough to suppress the relationship within categories, then you need to use, instead, either a linear spline or a cubic spline or some other device for representing non-linear relationships. (And, frankly, even when there is no such difficulty to be overcome, using linear splines is easier and just as good, if not better.)
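As a rough illustration of the linear spline idea, here is a minimal sketch in Python (chosen only for being language-neutral; in Stata, the `mkspline` command builds comparable variables). The function name `linear_spline_terms` and the knots at ages 30 and 50 are hypothetical, picked for the example and not taken from the thread. Each term captures the part of age that falls within one segment, so a regression on the terms can fit a different slope per segment while remaining continuous at the knots.

```python
def linear_spline_terms(x, knots):
    """Split x into piecewise-linear terms, one per segment defined by knots.

    The first term is x capped at the first knot, each middle term is the
    part of x lying between two adjacent knots, and the last term is the
    part of x beyond the final knot. The terms always sum to x.
    """
    knots = sorted(knots)
    terms = [min(x, knots[0])]
    for lower, upper in zip(knots, knots[1:]):
        # Amount of x that falls inside the interval [lower, upper].
        terms.append(max(min(x, upper), lower) - lower)
    terms.append(max(x, knots[-1]) - knots[-1])
    return terms


# Hypothetical knots at ages 30 and 50 (an assumption, not from the thread).
for age in (20, 40, 65):
    print(age, linear_spline_terms(age, [30, 50]))
```

Entering these terms as regressors in place of a set of age-group dummies keeps the within-segment information that grouping would discard.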

    Now, what you did with income is a different matter. You started with a truly ordinal variable. By recoding it with median values for each category, you have attempted to create an interval variable: you are trying to at least take the notion of the "difference" between the categories seriously in a quantitative way. This may or may not work well, depending on whether the relationship to outcome you are looking for is, in fact, linear in those median values. It might or might not be: that's an empirical question. If it's not, then using those median values to represent the categories gains you nothing, and might well introduce mis-specification bias. You might explore some graphs of the outcome against income as you recoded it, to see if linearity seems at least approximately right. If it does not, you should probably stick to the categorical variable you started with.
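One crude numeric companion to such graphs is to compare the mean outcome in each income category and see whether the change per unit of recoded income is roughly constant between adjacent categories. The sketch below is in Python and rests on stated assumptions: the helper names are hypothetical, the data pairs are invented purely for illustration, and the third value, 6500, is an assumed value for a further band that the thread does not list. It is an informal check, not a formal test of linearity.

```python
from collections import defaultdict


def mean_outcome_by_value(pairs):
    """Return {income value: mean outcome}, ordered by income value."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in pairs:
        sums[x][0] += y
        sums[x][1] += 1
    return {x: total / n for x, (total, n) in sorted(sums.items())}


def adjacent_slopes(means):
    """Change in mean outcome per unit of income between adjacent categories."""
    xs = list(means)
    return [(means[b] - means[a]) / (b - a) for a, b in zip(xs, xs[1:])]


# Invented (recoded income, thermometer score) pairs, for illustration only.
data = [(1300, 4), (1300, 6), (3900, 5), (3900, 7), (6500, 6), (6500, 6)]

means = mean_outcome_by_value(data)
slopes = adjacent_slopes(means)
print(means)
print(slopes)
```

If the slopes differ substantially from one pair of categories to the next, a single linear term in recoded income is probably a poor summary, and the original categories are the safer choice.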

    I thought that if I turned income into a discrete variable, it would make it easier for me to look at control variables...
    What specifically do you have in mind here? I don't grasp what you're getting at--maybe I'm missing something.
    Last edited by Clyde Schechter; 19 Dec 2021, 16:25.


    • #3
      Thank you so much for your feedback! I really appreciate the time you took to reply to me. I'll use the variables as they originally were, as changing them is not really helping me.

      As for your question, I'm looking for control variables that might further improve my model. I was looking at using a categorical variable on education level as a control variable for income. As I had to turn the ordinal income variable into dummy variables - to do multiple regression with my DV - I thought it would be less complicated if I attempted to turn income into a discrete variable. I'm not really sure what I was thinking, though. I'm still a newbie to data analysis, so I'm not surprised if my thought process doesn't make sense.
