Oaxaca-Blinder decomposition and categorical variables

giulia Cescon

Join Date: Apr 2018

Posts: 13
#1

17 May 2018, 04:07

Hi!
I am using the two-fold Oaxaca-Blinder decomposition in Stata. I read the article 'A Stata implementation of the Blinder-Oaxaca decomposition' by Jann (2008). What I would like to do is avoid the omitted-category bias when using categorical variables. Therefore, I think I should use the option categorical(varlist)...
My commands are the following:

tabulate degreegrade, nofreq generate(degreegrade)
tabulate father, nofreq generate(father)
tabulate atheneumregion, nofreq generate(atheneumregion)
tabulate diploma, nofreq generate(diploma)
tabulate voto_diploma_all, nofreq generate(voto_diploma_all)
tabulate eta_laurea_mfr, nofreq generate(eta_laurea_mfr)
tabulate tiplau_micro, nofreq generate(tiplau_micro)
tabulate gruppo_micro, nofreq generate(gruppo_micro)
tabulate domicile, nofreq generate(domicile)
tabulate frequency, nofreq generate(frequency)
tabulate method, nofreq generate(method)

oaxaca lnhourlywage citt_mfr expmobility degreeontime permanent privatecourses privatesector livingorigin doctoralstudies specializationschool masterafterdegree studyscholarship_workgrant stage trainership_practicum professionaltraining trainingcourse mesi_i_lavoro durata_lavoro_mesi weeklyhours degreegrade1-degreegrade5 father1-father5 atheneumregion1-atheneumregion3 diploma1-diploma8 voto_diploma_all1-voto_diploma_all3 eta_laurea_mfr1-eta_laurea_mfr4 tiplau_micro1-tiplau_micro3 gruppo_micro1-gruppo_micro14 domicile1-domicile3 frequency1-frequency3 method1-method11 if [people_over==1], by(sesso) weight(1) categorical(father?, degreegrade?, atheneumregion?, diploma?, voto_diploma_all?, eta_laurea_mfr?, tiplau_micro?, gruppo_micro?, domicile?, frequency?, method?) relax

However, when I run the command I get the error '3300 argument out of range'.
Can someone help me, please?
Thanks a lot,
Giulia

Last edited by giulia Cescon; 17 May 2018, 04:33.
depado

Join Date: May 2014

Posts: 7
#2

17 May 2018, 09:24

If no one answers, I would simply try renaming the excluded category, e.g.
rename degreegrade1 degreegrade_baseline
I am afraid the command tries to estimate everything, including the excluded category, and then the problem is multicollinearity. Of course, I am not sure this is the problem.
giulia Cescon

Join Date: Apr 2018

Posts: 13
#3

17 May 2018, 10:18

Hi! First, thanks for your reply! I read some articles after writing the question, and I do not think the problem is multicollinearity. I use this method precisely because I do not want a reference category, since decomposition results have been shown to depend heavily on the reference category chosen; this method should help with that. Moreover, if I leave 'gruppo_micro' and 'method' (the categorical variables with the most categories) out of the model, everything works fine. The problem seems to be those two variables; however, I cannot understand why this happens or which solution to apply.
FernandoRios

Join Date: Apr 2014

Posts: 2494
#4

17 May 2018, 12:29

Hi Giulia,
Based on your command, and since I do not have access to your data, it is difficult to say exactly why your command is giving you that error. My recommendation would be to take a step back and introduce one set of dummies at a time. Perhaps only one set in particular is causing the problem.
For instance, do you have the same problem when you run the following command?

Code:

oaxaca lnhourlywage citt_mfr expmobility degreeontime permanent privatecourses privatesector livingorigin doctoralstudies specializationschool masterafterdegree studyscholarship_workgrant stage trainership_practicum professionaltraining trainingcourse mesi_i_lavoro durata_lavoro_mesi weeklyhours degreegrade1-degreegrade5 if [people_over==1], by(sesso) weight(1) categorical( degreegrade? ) relax

Second, using the "categorical" option does not really solve the baseline dummy problem; it just changes its nature. In my experience, it can help when few dummies are used, but it can become problematic when you try to interpret the results.
Best
Fernando
giulia Cescon

Join Date: Apr 2018

Posts: 13
#5

17 May 2018, 16:01

Hi Fernando, thanks a lot for your reply. As I explained in my previous reply, the error message appears only when I include gruppo_micro and method, which are the categorical variables with the most categories. It also happens if I include only one of those variables, i.e.
oaxaca lnhourlywage gruppo_micro2-gruppo_micro14, by(sesso) categorical(gruppo_micro?) relax
oaxaca lnhourlywage method2-method11, by(sesso) categorical(gruppo_micro?) relax
Both commands give the same error. I checked the construction of the categorical variables and it seems OK to me. I found similar questions on the forum that attributed the problem to a high number of categories, but no reply was ever given.
I actually used this option because I didn't want the results to depend on the omitted category; I think that interpreting results that depend on the omitted category may be even more difficult than interpreting results that do not depend on any category. Do you agree?
I am just afraid that I cannot choose the base category arbitrarily; in fact, with this method the decomposition results change completely compared with using a reference category for each categorical variable.
Thanks in advance
Giulia
FernandoRios

Join Date: Apr 2014

Posts: 2494
#6

17 May 2018, 20:11

Hi Giulia,
I see. I missed that about gruppo_micro and method. The reason is that when you type gruppo_micro?, you capture only the variables gruppo_micro1 to gruppo_micro9. The symbol ? matches any ONE character, but beyond "9" the suffix has two characters.
Try this instead:
oaxaca lnhourlywage method1-method11 gruppo_micro1-gruppo_micro14, by(sesso) categorical(method1-method11, gruppo_micro1-gruppo_micro14) relax
HTH
Fernando
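The wildcard behaviour described above can be checked directly. The following is a minimal sketch on a toy dataset: the fourteen gruppo_micro# variables are created here purely as placeholders and do not reproduce the original data.

```stata
* Toy data: 14 placeholder indicator variables (assumption, not the original data)
clear
set obs 1
forvalues i = 1/14 {
    generate byte gruppo_micro`i' = 0
}

* "?" matches exactly one character: gruppo_micro1 to gruppo_micro9
unab onechar : gruppo_micro?
local n_onechar : word count `onechar'
display "gruppo_micro?  expands to `n_onechar' variables"

* "??" matches exactly two characters: gruppo_micro10 to gruppo_micro14
unab twochar : gruppo_micro??
local n_twochar : word count `twochar'
display "gruppo_micro?? expands to `n_twochar' variables"

* An explicit range lists every variable stored between the two endpoints
unab range : gruppo_micro1-gruppo_micro14
local n_range : word count `range'
display "gruppo_micro1-gruppo_micro14 expands to `n_range' variables"
```

In this toy setup the three counts should be 9, 5 and 14. Note that a range such as gruppo_micro1-gruppo_micro14 follows the order in which the variables are stored in the dataset, so it only covers all dummies if they were created consecutively.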
giulia Cescon

Join Date: Apr 2018

Posts: 13
#7

18 May 2018, 08:03

Thanks, it works perfectly! I got the same reply from the creator of the oaxaca package and from you almost simultaneously. Using that method completely solves the problem. As an alternative to specifying gruppo_micro1-gruppo_micro14, one can write gruppo_micro*? among the categorical inputs. Thanks again a lot for your reply, Fernando; I would never have thought that the error could be caused by the symbol ?
Best
Giulia
FernandoRios

Join Date: Apr 2014

Posts: 2494
#8

18 May 2018, 08:32

Just one last note on that change. If you use gruppo_micro*, it will match all variables with that prefix, including the original variable!
In that regard, something I do is change the name of the "dummy" variables slightly, for example:
tabulate degreegrade, nofreq generate(degreegrade_)
That way, when using the "*" you can call only the dummies, and not the original variable.
Fernando
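The naming trick above can be tried on Stata's shipped auto dataset. This is only an illustrative sketch; rep78 stands in for a categorical variable such as degreegrade.

```stata
* Illustration on the auto dataset; rep78 is a stand-in categorical variable
sysuse auto, clear

* The underscore in the stub yields rep78_1, ..., rep78_5
tabulate rep78, nofreq generate(rep78_)

* rep78* also picks up the original variable rep78
unab with_original : rep78*
display "rep78*  -> `with_original'"

* rep78_* picks up only the generated indicators
unab dummies_only : rep78_*
display "rep78_* -> `dummies_only'"
```

The first list should contain rep78 itself in addition to the five indicators, while the second should contain the indicators only.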
Magashi Joseph

Join Date: May 2018

Posts: 12
#9

09 Feb 2020, 01:56

Hello FernandoRios
Is there a recommended way to incorporate a categorical variable (i.e., one with more than two categories) when decomposing with oaxaca_rif? I have tried several times and I keep getting the error "option categorical() not allowed".
I have read your Working Paper No. 927, but I have not been able to see how one can include such a variable in the decomposition.

Best regards,
Jose

Sven-Kristjan Bormann

Join Date: Jul 2018
Posts: 310

#10

10 Feb 2020, 04:42

The option "categorical" still works for the oaxaca command, but the preferred and documented way is

Code:

Normalization of categorical variables

    For categorical regressors, the detailed decomposition results depend on the choice of the (omitted) base category. A solution is to compute the decomposition based on "normalized"
    effects, i.e. effects that are expressed as deviation contrasts from the grand mean (Yun 2005). To "normalize" the effects for a set of indicator variables representing a categorical
    variable include the indicator variables in the list of regressors using syntax

        ... normalize(spec) ...

    where spec usually simply is the list of indicator variables.  Note that an indicator variable has to be supplied for every category (including the base category). For example, you
    could type

        . tabulate isco, generate(isco) nofreq
        . oaxaca lnwage educ exper normalize(isco1-isco9), by(female)

    The tabulate, generate() command is a convenient way to generate a set of indicator variables from a categorical variable (such as the 9 major group ISCO-88 job classification). The
    base category to be omitted from model estimation can be designated using the b. operator, but this should not affect the decomposition results. For example, you could type

        ... normalize(married b.single divorced) ...

This way works for oaxaca_rif as well.
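To see why deviation contrasts remove the dependence on the base category, here is a minimal sketch of the idea for a single regression on the auto dataset. It is an illustration of the principle only, not the internal code of oaxaca or oaxaca_rif; the scalar names are hypothetical.

```stata
* Illustration of deviation-from-grand-mean normalization (not package internals)
sysuse auto, clear
tabulate rep78, nofreq generate(rep78_)

* Base category rep78_1: its coefficient is implicitly 0
regress price weight rep78_2-rep78_5
scalar c1 = (_b[rep78_2] + _b[rep78_3] + _b[rep78_4] + _b[rep78_5]) / 5
scalar norm_base1 = _b[rep78_3] - c1

* Base category rep78_5 instead
regress price weight rep78_1-rep78_4
scalar c2 = (_b[rep78_1] + _b[rep78_2] + _b[rep78_3] + _b[rep78_4]) / 5
scalar norm_base5 = _b[rep78_3] - c2

* The raw coefficients differ across the two fits, the normalized ones should not
display "normalized effect of rep78_3, base 1: " norm_base1
display "normalized effect of rep78_3, base 5: " norm_base5
```

Each raw coefficient is a contrast with the omitted category, so subtracting the average of all five coefficients (counting the omitted one as zero) re-expresses it as a deviation from the grand mean. The two displayed values should agree up to numerical precision.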


FernandoRios

Join Date: Apr 2014
Posts: 2494

#11

10 Feb 2020, 06:25

Thank you Sven!
I was just about to answer too. I'm guessing categorical is an option from an older version of oaxaca that has been replaced by the option "normalize".
So, to confirm: oaxaca_rif does not allow the option "categorical", but it does work with normalize. Here are a couple of examples:

Code:

use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
* using old version categorical
oaxaca lnwage educ exper tenure, by(female) categorical(single married divorced) w(0)
est sto m1
* using newer syntax 
oaxaca lnwage educ exper tenure normalize(single married divorced) , by(female) w(0)
est sto m2
* using oaxaca_rif
oaxaca_rif lnwage educ exper tenure normalize(single married divorced) , by(female) w(0) rif(mean)
est sto m3
esttab m1 m2 m3

Hope this helps.
Best Regards


Magashi Joseph

Join Date: May 2018

Posts: 12
#12

11 Feb 2020, 21:10

Thank you very much, Sven-Kristjan Bormann and FernandoRios, for the clarification. Your answers really helped.
Korbi Nagel

Join Date: Mar 2020

Posts: 1
#13

23 Mar 2020, 10:57

Hi.
Thanks for the helpful explanations.
I noticed that "normalize" seems to work only if every value of the categorical regressor appears in both groups; for example, if the status "divorced" appears among both "females" and "males".
If this condition is violated, the results of the decomposition are (still) sensitive to the choice of the base category (or the order of the dummy variables). Consider the following simulated example:

Code:

. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen x = runiform(1,100)
. gen group = runiformint(0,1)
. gen cat = runiformint(1,400)
. gen y = 10 + .5*x + 0.2 *group + 0.05 * x * group + 0.005*cat + runiform(-5,5)
. qui tab cat,gen(dummy)
. rename dummy1 base_cat
. set matsize 5000
. qui oaxaca y x normalize(base_cat dummy*), by(group) relax
. dis _b[interaction]
-.11314982
. qui oaxaca y x normalize(dummy* base_cat), by(group) relax
. dis _b[interaction]
-.78982842

Is there any trick to get insensitive results, even if some values of the categorical variable are group specific?
Kind regards,

Korbi
FernandoRios

Join Date: Apr 2014

Posts: 2494
#14

23 Mar 2020, 11:12

I don't think there is.
Keep in mind that normalize simply imposes some restrictions on the joint coefficients for identification, so when a category does not exist for one of the groups, the restriction is no longer valid.
One may argue that even the OB decomposition itself may not be valid in this kind of case.
Remember that one of the basic assumptions of the OB decomposition is that the characteristics (control variables) of the two groups overlap.
HTH
Fernando
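Whether the overlap condition holds for a categorical regressor can be inspected before running the decomposition. The sketch below regenerates data of the same shape as the simulated example in #13; the seed and the helper variable names (in_group0, in_group1, common) are assumptions made for illustration.

```stata
* Simulated data shaped like the example in #13 (seed chosen arbitrarily)
clear
set seed 2020
set obs 1000
generate group = runiformint(0, 1)
generate cat = runiformint(1, 400)

* For every category, flag whether it is observed in each group
bysort cat: egen byte in_group0 = max(group == 0)
bysort cat: egen byte in_group1 = max(group == 1)
generate byte common = in_group0 & in_group1

* common == 0 marks observations whose category appears in only one group
tabulate common group
```

If many observations have common == 0, the overlap assumption is doubtful for that variable. Restricting the sample to common == 1 or merging sparse categories are possible responses, but both change what is being estimated, so they are judgment calls rather than a fix.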
Belen Fontenez

Join Date: Jul 2020

Posts: 5
#15

01 Jul 2020, 12:06

Hi FernandoRios! I hope you are doing well. I'm new to this forum, but I've been reading it for some weeks now. I'm working on my dissertation, which analyses the gender wage gap between mothers and fathers, on the one hand, and between non-mothers and non-fathers, on the other. My aim is to find evidence of sticky-floor and glass-ceiling effects in both groups, which is why I'm applying the RIF OB decomposition using the command that you created. I'm running the following code for mothers and fathers at the 10th quantile (q10):

Code:

oaxaca_rif lnrwage exper expersqr yr_school married dgba dnoa dnea dcuyo dpampeana in_employee self_empl, rif(q(10)) by(fem) w(1) relax

Where dgba, dnoa, dnea, dcuyo and dpampeana are regional dummy variables, leaving out dpatagonia as a reference category. I'm not quite sure how to interpret them. For example, I get the following results for two of these variables:

Code:

Dummy    Explained  Unexplained    Total
dgba        0.0175       -0.008   0.0095
dcuyo      -0.0012      -0.0133  -0.0145

Would this mean that mothers in gba are paid less than fathers in comparison to what mothers and fathers are paid in Patagonia? And would this be the other way around for cuyo?

Thank you so much for your help!!

Kind regards,
Belen