Problem with factor variables syntax

Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#1

17 Mar 2016, 03:34

I'm fitting a model that has a factor variable as a control variable. I now wish to estimate the same model, adding an interaction between one of the factor's levels and my main variable of interest.
However, when I use the standard fvvarlist notation, Stata drops all the other levels of the covariate that I use as controls. There's a simple way around this: generating a new binary variable for the level of the covariate I'm interested in. But I still don't understand why the "standard" syntax does this.

This is on Stata 14.1 SE, Windows 7 64-bit. The following code shows the issue (the computer with Stata is not connected to the internet).

Example:

Code:

sysuse auto2
reg price mpg i.rep78
reg price c.mpg##1.rep78 i.rep78
gen bin = 1.rep78
reg price c.mpg##bin i.rep78

Note how the second regression doesn't estimate any of rep78's levels (except for 1.rep78), yet the third model does, though to my understanding the two models should be equivalent.
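One way to compare the two specifications side by side is to store the coefficient names from each fit and print them. This is only a sketch: the macro names names_fv and names_bin are made up for illustration, and the indicator is built with an explicit comparison rather than with 1.rep78. Since the indicator, the remaining rep78 levels, and the constant are linearly dependent, Stata may flag one term as omitted in the second fit.

```stata
sysuse auto2, clear

* Explicit 0/1 indicator for rep78 == 1, left missing where rep78 is missing
generate byte bin = (rep78 == 1) if !missing(rep78)

* Specification reported in this thread to lose the other rep78 levels
regress price c.mpg##1.rep78 i.rep78
local names_fv : colnames e(b)

* Workaround: interact mpg with the separate indicator instead
regress price c.mpg##i.bin i.rep78
local names_bin : colnames e(b)

* Print both coefficient lists for comparison
display "factor-variable fit: `names_fv'"
display "indicator fit:       `names_bin'"
```

Comparing the two printed lists shows directly which rep78 terms each command kept.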
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#2

17 Mar 2016, 04:05

The following doesn't do what I would expect, either

Code:

regress price 2b.rep78 3.rep78 4.rep78 5.rep78 c.mpg##1.rep78
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

17 Mar 2016, 07:04

My experimentation leads me to agree that Stata seems to be confounding differing references to factor levels for the same variable in a single command in a way that is not predicted by the documentation in the User's Guide. I think you should direct this question to Stata Technical Support, and please let us know the outcome.
Comment
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#4

17 Mar 2016, 07:17

I've sent this to tech support. I'll update when I get a reply.
Comment
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#5

18 Mar 2016, 02:15

I got a reply from Rose at StataCorp:

When you specify level values for a factor variable you are telling Stata not to look for any other values in the dataset.
As a result when you type 1.rep78, Stata only includes 1.rep78 in the model.
One way to include all the dummy variables in a model, but include only one level of the interaction is to constrain all other coefficients for the interaction to 0.
Below is an example of how to do this.

Code:

*** Begin example ***
sysuse auto
constraint 2 _b[2.rep78#c.mpg]=0
constraint 3 _b[3.rep78#c.mpg]=0
constraint 4 _b[4.rep78#c.mpg]=0
cnsreg price c.mpg##ib5.rep78 , const(2/4)
*** End example ***
```
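When the factor has many levels, the constraints in the example above could be generated in a loop rather than typed one by one. The following is a minimal sketch under assumptions: the macro names keep, base, levels, and clist are made up for illustration, and the constraints are simply numbered 1, 2, 3 in the order they are created (overwriting any existing constraints with those numbers).

```stata
sysuse auto, clear

* Level whose interaction with mpg stays unconstrained (assumed: 1)
local keep 1
* Base level of rep78, as in the example above
local base 5

* Collect the observed, nonmissing levels of rep78
levelsof rep78, local(levels)

local k = 0
local clist
foreach l of local levels {
    * Constrain every interaction except the kept level and the base level
    if `l' != `keep' & `l' != `base' {
        local ++k
        constraint `k' _b[`l'.rep78#c.mpg] = 0
        local clist `clist' `k'
    }
}

cnsreg price c.mpg##ib`base'.rep78, constraints(`clist')
```

This only automates the bookkeeping. It still relies on an estimation command that accepts constraints, as cnsreg does here.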
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#6

18 Mar 2016, 07:50

Is this what you want?

Code:

xi: reg price c.mpg##1.rep78 i.rep78

Labelling is gone, but the other groups are in there now.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

18 Mar 2016, 07:54

Thank you for closing the loop. That is a disappointing answer at best. Even having seen that explanation, it is not supported by a very close reading of the documentation for factor variables in section 11.4 of the Stata User's Guide. With further experimentation, I was able to get similarly counterintuitive results in the opposite direction:

Code:

reg price c.mpg i(1 2 3 4 5)b5.rep78 c.mpg#i1.rep78

causes levels 1-4 of rep78 to appear as factor variables, as desired, but mpg is now interacted with those same 4 levels of rep78, rather than the interaction being limited to level 1.

Added in edit: my reply crossed Jorrit's in cyberspace. Seeing it, without attempting any further testing, suggests to me that the behavior of factor variables in the absence of xi: is a bug rather than a feature. The documentation for xi recommends using factor variable notation directly in those commands that support it; no mention of any difference in outcome. That makes two sources where the documentation diverges from the behavior.

Last edited by William Lisowski; 18 Mar 2016, 07:58.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4955
#8

18 Mar 2016, 08:05

I find factor variables pretty confusing if you try to get past the basic i. notation. Like Rose at StataCorp said, using constraints is sometimes a way to get what you want.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#9

19 Mar 2016, 03:49

The way I read it, the xi: prefix should be entirely redundant for reg, and for most other commands, really. This use of it was just something I tried; I don't see it explained as a 'feature' in the xi: or fvvarlist guides. Before trying it, I would have fully expected these two to be equivalent.

Code:

xi: reg price c.mpg##1.rep78 i.rep78
reg price c.mpg##1.rep78 i.rep78

Edit:
The only potential difference is the remark in the xi: manual, which states that it creates new temporary variables in the background, so it is more closely equivalent to Ariel's

Code:

gen bin = 1.rep78
reg price c.mpg##bin i.rep78

Then again, I imagine the i.var notation would do pretty much the same thing.

Last edited by Jorrit Gosens; 19 Mar 2016, 04:01.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#10

19 Mar 2016, 10:12

Jorrit: -xi- does not create temporary variables in the background: that's what factor variable notation does. -xi- creates new variables in the dataset, having names beginning with _I (or some other prefix you can specify as an option to -xi-). The problem with this is that -margins- does not understand these _I variables and cannot work properly with interaction terms or quadratics and other powers if you use them. Also, when you specify the -xi:- prefix, you are actually blocking factor-variable notation, because all of the i.varname terms are first converted to these _I* variables before the modeling command is even parsed, so the parser never gets to see those i.varname terms.

Bottom line is that -xi- is an almost entirely obsolete command. Its use really should be restricted to situations where factor variable notation simply cannot be applied. Most of those situations are some user-written commands or a few very old official Stata commands (e.g. -loneway-) that don't support factor variable notation. And, at least in my experience, all of the things that those old official Stata commands do can be done just as easily, if not more so, with more modern commands that do support factor variable notation.
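The difference described above between -xi- variables and factor-variable indicators can be seen by counting the variables in the dataset before and after each command. This is a minimal sketch using the auto dataset; the macro name k0 is made up for illustration.

```stata
sysuse auto, clear

* Number of variables before any estimation
local k0 = c(k)

* Factor-variable notation: the indicators are virtual, so nothing is added
regress price mpg i.rep78
display "variables added by factor-variable notation: " c(k) - `k0'

* xi: real _I* indicator variables are added to the dataset
xi: regress price mpg i.rep78
display "variables added by xi: " c(k) - `k0'

* Inspect the indicators that xi left behind
describe _I*
```

The _I* variables remain in the dataset after the command finishes, which is why later commands see them as ordinary variables rather than as levels of rep78.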
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

19 Mar 2016, 10:34

Clyde -

Not meaning to be argumentative, but I would be hard pressed to characterize the solution using factor variable notation given at post #5 as accomplishing the objective "just as easily, if not more so" than the solution given at post #9. And posts #2 and #8 suggest that there is some confusion about the workings of factor variable notation, a sense that the outcomes of complicated expressions are unintuitive.

And so I persist in wondering if this difference between the old way and the new way really is a design feature, versus an oversight on the part of the developers. I have considered raising this with Tech Support and asking them if the developers have any comment on this, since there doesn't appear to be any mention in the Version 14 documentation for factor variables or for the xi: prefix.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#12

19 Mar 2016, 10:44

William Lisowski, I agree that the problem that initiated this thread is an instance where factor variables don't work well, and that it probably should be fixed. The workaround in #5 is, I agree, kludgy. I guess my statement that, "in my experience," all the things that call for -xi:- can be done as or more easily with factor variables has this one instance as an exception. Then again, this situation is only my experience in the vicarious sense.

I was really responding to Jorrit's implication that there is a broad equivalence between xi: and factor variable notation. There isn't, and the places where the former is superior are, in my view, few and far between. I suspect you don't disagree with me on that.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

19 Mar 2016, 10:59

Clyde Schechter, I indeed agree with you, and given that you and I agree that it "probably should be fixed," I will take it on myself, at some point, to re-raise the issue with Tech Support and see what, if anything, can be done.

I don't think I read Jorrit as asserting broad equivalence between xi and factor notation, just that he expected equivalence between the models specified using the reg command with the two syntaxes shown at the top of post #9.

Last edited by William Lisowski; 19 Mar 2016, 11:03.
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#14

19 Mar 2016, 19:46

xi can be handy at times. A poster on Stack Overflow had the following problem. To make counterfactual predictions, he wanted to exclude one set of interactions from a model but include others. He was able to exclude the first set by defining them with factor variable notation, then setting to zero the values of beta whose names contained "#". The question was how to keep the second set. The solution was to create them with xi, which uses an asterisk to define interactions.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#15

20 Mar 2016, 04:48

I wholeheartedly agree that factor variable notation is superior to xi on every level imaginable. The transition from xi to fvvarlist in Stata 11, for example, was great for me.
As for me, I replied to Stata's tech support to voice again the sentiment in this thread that this behavior is unexpected and undocumented.

Also, specifically for me, the solution of duplicating the factor variable (bin) and then interacting on a specific level of bin is currently the best one, as xi is incredibly slow and the constraints suggested by tech support are also problematic (impossible?) because I'm running panel (xt) models.
Also, my factor variable has a lot of levels, so specifying a constraint for each level that should not be interacted is extremely cumbersome.

Last edited by Ariel Karlinsky; 20 Mar 2016, 04:51.
Comment
