ordered factor variable as a regressor

Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#1

06 May 2016, 10:25

I have an age category variable that codes individuals' ages into the following categories: 20-40, 40-60, 60-80, 80+. The data are panel data, so individuals can change age categories, and since we're dealing with age, the categories have an ordinal meaning and not just a nominal one (meaning that the 40-60 category is "above" 20-40). If I run something like:
xtreg y i.age_group

Stata decomposes age_group into dummies, which doesn't take into account that the difference between the age groups has meaning. How can I estimate something like that?
Richard Williams

Join Date: Apr 2014

Posts: 4946
#2

06 May 2016, 10:47

We just had a very similar discussion. See

http://www.statalist.org/forums/foru...dent-variables

http://www3.nd.edu/~rwilliam/xsoc739...ndependent.pdf

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#3

06 May 2016, 10:47

Ariel:
the first question that I would ask myself is: are those aged 40-60 twice as much of something as those aged 20-40?

PS: Crossed in cyberspace with Richard's reply.

Kind regards,
Carlo
(Stata 19.0)
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#4

06 May 2016, 10:51

Crossed with Carlo & Richard

The usual options are xtreg y i.age_group OR xtreg y age_group.

There was a recent thread on some similar issues here: http://www.statalist.org/forums/foru...dent-variables.

Richard Williams has written a document about it and offers some options, including the use of sheaf coefficients:
http://www3.nd.edu/~rwilliam/xsoc739...ndependent.pdf

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
wbuchanan

Join Date: Mar 2014

Posts: 1361
#5

06 May 2016, 16:04

The difference between values on an ordinal scale is mathematically undefined. Another issue with your variable (as explained above) is that the categories are not mutually exclusive. In other words, someone who is 40 years old would be categorized in both the 20-40 and 40-60 categories.
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#6

07 May 2016, 01:40

Thanks for the great replies, especially the thread and the document by Richard Williams. When I have access to the data again I will test whether or not I can treat the age group regressor as continuous.

I'll just elaborate a bit on my specific issue: I do have the original age values (as a continuous variable denoting number of years since birth), but due to privacy concerns with the data, I was asked to categorize this variable in order to lower the chance of identifying individuals. So my question is also how to code this. Is keeping the spacing the same across categories important? Obviously I would not code this as 1,100,924,3000 but rather something like 1,2,3,4,5 (I'm not sure whether to code the lowest category as 0, for example). So any advice regarding "best practices" coding for such a situation would be greatly appreciated.

wbuchanan - This is of course an error on my part in writing the post. The categories are mutually exclusive and in this example are: 20-40, 41-60, 61-80, 81+
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#7

07 May 2016, 01:49

Ariel:
there's a well-known caveat about categorizing a continuous predictor like age: http://www.ncbi.nlm.nih.gov/pubmed/16217841.
Can't you explore other ways to conceal the identity of those included in your dataset? It sounds strange to me that age alone can be tagged as a sensitive datum.

Kind regards,
Carlo
(Stata 19.0)
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#8

07 May 2016, 04:25

Unfortunately, it's not my choice to make. Information Security indicates age as "highly probable to lead to identification", and so it's up to me how to deal with it.
I can say that age is just a covariate and not my main variable of interest, and that if I estimate:

xtreg y x age covars
or
xtreg y x i.age_group covars

The estimates for my main variable of interest (x) and also the other covars are pretty much the same (the coefficient for x, which is a count variable, goes down from 321 to 319 if I recall correctly). The trouble is that a fellow researcher told me that by using i.age_group I'm introducing dummy categories which, as I said in the main post, (a) introduce too many regressors to the model and (b) don't take into account the ordinal nature of the categories.

Last edited by Ariel Karlinsky; 07 May 2016, 05:11.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#9

07 May 2016, 04:37

Ariel:
thanks for providing more details.
I would treat age as a continuous variable, if that choice can reduce your problems.

Kind regards,
Carlo
(Stata 19.0)
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#10

07 May 2016, 05:13

Thanks! And then the coefficient for age_group would be "the change in Y if an individual moves one age group upwards"?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#11

07 May 2016, 05:24

Ariel:
not quite, in that I would leave the age category variable out of the set of predictors.
If you treat age as continuous, the related coefficient explains the variation of the dependent variable given a 1-year change, other things being equal.

Kind regards,
Carlo
(Stata 19.0)
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#12

07 May 2016, 05:29

The issue is that I cannot use the original continuous (year-level) variable age due to privacy issues, and have to recode it in some way. The question is then how to estimate with the recoded variable. Or maybe I didn't understand what you were saying?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17675

#13

07 May 2016, 06:45

Ariel:
my previous reply was related to your concerns about treating -age_group- as a categorical variable; hence, I thought that treating -age- as a continuous predictor (as you proposed in the first equation reported in #8) could save you time and worries.
Now I notice that your purpose is to use -age_group- (not -age-) as continuous (as Carole suggested in #4); in that case, as you wrote,

...the coefficient for age_group would be "the change in Y if an individual moves one age group upwards"...

However, setting aside for a while the entirely legitimate privacy issues, you should be aware of the different estimates that you get following one of those two approaches, as reported in the following toy example:

Code:

. sysuse auto.dta
(1978 Automobile Data)

. reg price i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =      0.24
       Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
    Residual |   568436416        64     8881819   R-squared       =    0.0145
-------------+----------------------------------   Adj R-squared   =   -0.0471
       Total |   576796959        68  8482308.22   Root MSE        =    2980.2

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       rep78 |
          2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
          3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
          4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
          5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
             |
       _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
------------------------------------------------------------------------------


. reg price c.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(1, 67)        =      0.00
       Model |  24770.7652         1  24770.7652   Prob > F        =    0.9574
    Residual |   576772188        67  8608540.12   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =   -0.0149
       Total |   576796959        68  8482308.22   Root MSE        =      2934

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       rep78 |   19.28012   359.4221     0.05   0.957    -698.1295    736.6897
       _cons |   6080.379    1274.06     4.77   0.000     3537.345    8623.413
------------------------------------------------------------------------------
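
Since the c.rep78 model is nested in the i.rep78 model (one slope versus four dummies), one way to check formally whether the linear coding loses anything is a likelihood-ratio test. A minimal sketch on the same toy data follows; the stored-estimate names lin and cat are arbitrary, and whether lrtest is appropriate after a particular xtreg estimator is an assumption that would need checking separately.

```stata
sysuse auto, clear

* Linear (one-slope) specification: a single parameter for rep78
regress price c.rep78
estimates store lin

* Unconstrained specification: one dummy per level beyond the base
regress price i.rep78
estimates store cat

* H0: the linear constraint holds (3 restrictions: 4 dummies vs 1 slope)
lrtest lin cat
```

A large p-value here would suggest that treating the ordinal variable as continuous does not fit appreciably worse than the dummies.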

Kind regards,
Carlo
(Stata 19.0)


Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#14

07 May 2016, 08:35

Here's my two cents. You state in #8 that the result of principal interest changes by only a negligible amount when you switch from including age as a continuous covariate to using i.age_gp. Your fellow researcher is correct that using age groups this way is not a fully faithful way of specifying age because it overlooks the ordinal properties of age_gp. But, as you have already demonstrated, it actually makes no difference in this case. So my response to your fellow researcher would be--good point in general, but in this case it is inconsequential. However, if your fellow researcher is, perhaps, your supervisor or somebody that you cannot afford to be sassy with, here's another compromise approach. Your age groups are defined so as to cover approximately equal ranges (exactly equal except for the 81+ group). So if you code them as 1, 2, 3, 4 (or any 4 equally spaced integers), you can include age_gp in your regression as if it were a continuous variable. I'm pretty confident, given what happened when you used it as a nominal-level variable, that your main result will still be essentially unchanged from before, and should be a satisfactory response to your colleague's concerns. If you want to make it even more transparent what's going on, you could code the age groups as 30, 50, 70, 90, these numbers corresponding to the midpoints of the ranges in the groups, and treat it as a continuous variable.
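
The two codings described above can be sketched as follows. Everything here is simulated and hypothetical: the panel, the variable names (id, wave, age, x, y, age_gp, age_mid) and the data-generating values are assumptions made only so that the example runs on its own.

```stata
clear
set seed 12345

* Simulated panel (hypothetical data): 200 people observed in 5 waves
set obs 200
generate id = _n
generate u = rnormal()                      // person-level effect
generate age0 = floor(20 + 60*runiform())   // age at first wave, 20-79
expand 5
bysort id: generate wave = _n
generate age = age0 + 5*(wave - 1)
generate x = rpoisson(3)
generate y = 2*x + 0.1*age + u + rnormal()
xtset id wave

* Groups 20-40, 41-60, 61-80, 81+ coded 1, 2, 3, 4
* irecode() returns 0 if age<=40, 1 if 40<age<=60, 2 if 60<age<=80, 3 if age>80
generate byte age_gp = 1 + irecode(age, 40, 60, 80)

* Same groups coded at the approximate midpoints 30, 50, 70, 90
generate age_mid = 10 + 20*age_gp

* Both codings entered as continuous regressors
xtreg y x c.age_gp
xtreg y x c.age_mid
```

Because age_mid is an exact linear function of age_gp, the two fits are the same model in different units: the slope on age_gp should be 20 times the slope on age_mid, and the coefficient on x should be identical. The midpoint coding only makes the age coefficient read as "per year" rather than "per group".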

Another approach would be to take the age variable and add a small amount of random noise to it (jittering). That would make it less useful for identification purposes, and would slightly degrade its associations with other variables in your data set, but probably less than coarsely categorizing it would. I imagine that this approach, too, would leave the estimates for your main variables largely unaffected.
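
A rough sketch of the jittering idea, using the nlsw88 dataset shipped with Stata only because it happens to contain an age variable. The noise standard deviation of 2 years and the choice of tenure as the "variable of interest" are arbitrary assumptions for illustration; whether any given amount of noise would satisfy the data owners is not something the code can settle.

```stata
sysuse nlsw88, clear
set seed 2016

* Add mean-zero normal noise to age (SD of 2 years is a hypothetical choice)
generate age_jit = age + rnormal(0, 2)

* Compare the coefficient on the variable of interest across the two fits
regress wage tenure age
regress wage tenure age_jit
```

Comparing the tenure coefficient between the two regressions gives a feel for how much the jitter disturbs the estimate of interest.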

All of that said, I'm baffled by what your information security people are telling you. You already have the actual age variable. That is where the security risk comes from; the horse is already out of the barn, or, perhaps a better analogy is that the barn door is unlocked and the horse might bolt at any time. If somebody hacks your computer, they have access to a variable that can be used to identify people. I get that. But I don't understand why anyone would think that a regression analysis that uses the continuous variable would pose a risk of identification but the same analysis using a categorized version would not. In fact, I don't see how the results of a regression analysis on a sample of any appreciable size could be used to identify people, no matter what variables it contains. Do they know what a regression analysis is? Have they ever looked at some output? What are they thinking?

I do plenty of research with human subjects data, some of which is sensitive. I have had my IRB tell me that I can't have certain variables at all, or that the owner of the data must obscure them in some way before giving me a de-identified data set, but they have never said that it was OK for me to have a variable but that I must transform it in some information-destroying way when I use it. That doesn't make any sense.
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#15

07 May 2016, 08:47

I understand where your confusion is: I don't "already have the age variable on my computer". The data is currently only on an offline station with private data. For me to take the data out (per journal requirements) I had it undergo a privacy evaluation by the data's owners/"keepers", and they flagged age as a cause for concern that something must be done about.

Obviously, if I only needed to output the regression tables, such a recoding would be very silly, but journals in my field today require a do-file plus data file for publication, so I must balance these somehow.