  • How to correctly residualize before running a regression

    I am interested in how the performance of a student i on a test changes over the course of the test (more precisely, as a function of the position of the question in the test). I have multiple observations for every individual i (one observation per question in the test). My data looks like this:

    [example.png: screenshot of the example data]
    where the first column identifies the individual taking the test, the second column is the position of the question in the test, the third column is my measure of performance, a dummy that takes the value 1 if the question was answered correctly and 0 otherwise, and the fourth column identifies the question asked in each position of the test. The question identifier matters because each test-taker receives 1 of 4 possible booklets, and the order of the questions differs across booklets.

    An example dataset is:

    Code:
    clear all
    input id pos corr item
    1 1 1 1
    1 2 1 2
    1 3 0 3
    
    2 1 0 2
    2 2 1 1
    2 3 1 3
    
    3 1 1 3
    3 2 0 2
    3 3 1 1
    end
    I am interested in whether, as time goes by, individuals get tired and make more mistakes on the test. I want an estimate of the change in performance for each individual over the course of the test. So one option is to run a regression per individual, like this:

    C_q = α + β Pos_q + ε_q

    where Pos_q ∈ {1, 2, ...} is the position of question q in the test. Each regression (one per student) would yield a β. Using the dataset above, that would be:

    Code:
    reg corr pos if id == 1
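    (In practice I would need the slope for every student, not just student 1. A minimal sketch of how I might collect them with statsby; beta_pos is just a name I made up, and note that statsby replaces the data in memory:)

    Code:
    * one regression per student, keeping each student's slope on pos
    * (replaces the dataset in memory with one row per id)
    statsby beta_pos=_b[pos], by(id) clear: regress corr pos
    list id beta_pos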
    An equivalent way of estimating β for each individual is to run a single pooled regression on all individuals, including an individual fixed effect and an individual-specific slope on position, like this:

    Code:
    reg corr c.pos#id i.id
    The coefficient on the id#c.pos interaction for id == 1 is -0.5, just like in the first regression.

    Now, suppose that I want to do the same exercise, but controlling for the question each individual is answering (column 4 of the table). Including question fixed effects is straightforward in the pooled regression. I can simply run:

    Code:
    areg corr c.pos#id i.id, absorb(item)
    In this case, the coefficient for the first individual is now -0.25.

    For computational reasons, I can't run this pooled regression. Instead, I need to run one regression per person. So I tried to "partial out" the question fixed effects first, and then run the regression at the student level. More precisely, what I tried is:

    Code:
    areg pos, absorb(item)
    predict pos_hat1
    gen pos_residual = pos - pos_hat1
    
    reg corr pos_residual if id == 1

    I thought this was going to work. However, for a given student, the coefficient from the "partial-out" approach (-0.5) does not match the coefficient from the pooled regression with question fixed effects (-0.25). So I must be doing something wrong! Any thoughts?

    Thanks!

  • #2
    The problem is that you are residualising a different regression. Your pooled specification includes i.id (a set of regressors, one per id) and c.pos#id (again one regressor per id), but you have not residualised those; you have residualised only a single variable, pos.

    You must either work at the level of the whole dataset, residualising the fixed effects and their interactions, or work at the level of a single id. You cannot mix the two levels in the residualisation.
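
    Concretely, a minimal sketch of what residualising at the level of the whole dataset would look like for your specification (the posXid* and *_res names are just illustrative). The point estimates should match those from areg corr c.pos#id i.id, absorb(item); the standard errors will not, because the degrees of freedom are off.

    Code:
    * residualise the outcome and every pos-by-id interaction on ALL of the
    * other regressors: the id dummies plus the absorbed item dummies
    areg corr i.id, absorb(item)
    predict corr_res, residuals
    forvalues i = 1/3 {
        gen posXid`i' = pos*(id==`i')
        areg posXid`i' i.id, absorb(item)
        predict posXid`i'_res, residuals
    }

    * by Frisch-Waugh-Lovell, the coefficients on the residualised
    * interactions should reproduce the pooled coefficients on c.pos#id
    reg corr_res posXid1_res posXid2_res posXid3_res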



    • #3
      Thanks Joro,

      I tried your suggestion, but unfortunately it does not seem to yield the right answer.

      As an example dataset:


      Code:
      clear all
      input id pos corr item
       1 1 1 1 
       1 2 1 2
       1 3 0 3
       
       2 1 0 2
       2 2 1 1
       2 3 1 3
       
       3 1 1 3
       3 2 0 2
       3 3 1 1
      end
      The following regression yields the coefficients I am interested in (the coefficients on the inter_subj_* variables):

      Code:
      * Create dummies and interaction between dummy and position
      forvalues i = 1(1)3 {
          gen id_`i' = id==`i'
          gen inter_subj_`i' = id_`i'*pos
      }
      
      areg corr id_2 id_3 inter_subj_1 inter_subj_2 inter_subj_3, absorb(item)
      For individual 1, the coefficient is -0.25; for individual 2 it is 0.5; and for individual 3 it is -0.25.


      Next, I tried to residualize the variables and then run the regression:


      Code:
      * Residualize the variables
      foreach x in id_1 id_2 id_3 inter_subj_1 inter_subj_2 inter_subj_3 {
          areg `x', absorb(item)
          predict `x'_hat
          gen `x'_res = `x'-`x'_hat 
          drop `x'_hat
      }
      
      reg corr id_2_res id_3_res inter_subj_1_res inter_subj_2_res inter_subj_3_res
      Unfortunately, the coefficients from this regression differ from those in the benchmark regression: for individual 1 the coefficient is -0.5, for individual 2 it is 0.5, and for individual 3 it is 0.
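
      One thing I am unsure about (and I have not verified that it is the whole story): after areg, predict with no option returns xb, which excludes the absorbed item effects, so `x'_res above ends up being `x' minus only a constant rather than `x' net of the item fixed effects. A version of the loop that uses the residuals option of predict instead would be:

      Code:
      * as above, but taking residuals net of the absorbed item fixed effects
      foreach x in id_1 id_2 id_3 inter_subj_1 inter_subj_2 inter_subj_3 {
          areg `x', absorb(item)
          predict `x'_res2, residuals
      }

      reg corr id_2_res2 id_3_res2 inter_subj_1_res2 inter_subj_2_res2 inter_subj_3_res2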

