Generating Predicted Values based on a subset of regressors

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

16 Jul 2018, 12:17

Hi All,

I have data that looks like the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(y x1 x2)
 23 23  42
123  3 324
  3 23   .
 21 32  32
  3  4   3
212 32   2
  2  3  32
end

In the above data, I have a regressand y, which I regress on covariates x1 and x2. In the data construction, I need to work with the projections of y on x1 and x2 (I will be using these fitted values as a regressor in an auxiliary regression). However, I face the usual constraint: the prediction (y hat) is not generated for row 3, where the value of x2 is missing. While I understand why this is reasonable, I still need a fitted value for that row, so I would like to compute the predicted value despite the missing x2. One way to do this would be, for instance:

Code:

* Fit on the complete cases first, then zero-fill x2 and predict
regress y x1 x2
replace x2 = 0 if missing(x2)
predict yhat, xb

This would indeed generate a predicted value, as the product between 0 and beta(hat) is going to be 0. I was wondering if there is any prebuilt command that would calculate the predicted values even for rows in which one or more regressor values are missing.

Thanks!
Chinmay

Last edited by Chinmay Sharma; 16 Jul 2018, 12:19.
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

16 Jul 2018, 12:59

This would indeed generate a predicted value, as the product between 0 and beta(hat) is going to be 0.

Yes, but it would be an incorrect value because there is no reason to suppose that the missing values of x2 are really zeroes, and you are applying regression coefficients from a sample where x2 is not missing to observations where it is. But it is entirely possible that the coefficient of x1 is different in the subpopulation where x2 is missing.
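To make concrete what the zero-fill produces, here is a minimal sketch using the variable names from the example data (yhat0 is a hypothetical name chosen for illustration). With x2 set to 0, the linear prediction reduces to the intercept plus the x1 term from the complete-case fit, so it can be computed directly without overwriting x2:

```stata
* Fit on the complete cases (rows with missing x2 are dropped automatically)
regress y x1 x2

* The value -predict, xb- would return after replacing missing x2 with 0:
* the intercept plus the x1 term, with the x2 term contributing nothing
generate yhat0 = _b[_cons] + _b[x1]*x1 if missing(x2)
```

Written this way, it is easier to see that the x1 coefficient is borrowed from observations where x2 is observed, and that x2 is implicitly treated as 0.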

I think the preferred approach would be to use multiple imputation for your regression, followed by -mi predict-, provided it is plausible that the missing values are missing at random. Alternatively, and less satisfactorily, but perhaps viable if missingness at random is not plausible, you could try something like this:

Code:

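* Full model: fitted values where both x1 and x2 are observed
regress y x1 x2
predict yhat, xb

* Rows missing x2: fit on x1 alone within that subsample
regress y x1 if missing(x2)
predict xb, xb
replace yhat = xb if missing(yhat)
drop xb

* Rows missing x1: fit on x2 alone within that subsample
regress y x2 if missing(x1)
predict xb, xb
replace yhat = xb if missing(yhat)
drop xb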

At least this way, if one of the x variables is missing, you will be applying a coefficient derived from a sample that has that missingness.
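For the multiple-imputation route mentioned above, a rough sketch might look like the following. It assumes that only x2 has missing values and that a linear imputation model of x2 on y and x1 is reasonable. The number of imputations, the seed, the file name miest, and the variable name yhat_mi are arbitrary choices for illustration, not recommendations:

```stata
* Declare the data as mi data and register the variable to be imputed
mi set wide
mi register imputed x2

* Impute x2 from y and x1 (assumed imputation model), 20 imputations
mi impute regress x2 y x1, add(20) rseed(12345)

* Fit the regression on each imputed dataset and save the estimates
mi estimate, saving(miest, replace): regress y x1 x2

* Linear predictions combined across imputations, including rows
* where x2 was originally missing
mi predict yhat_mi using miest
```

On the 7-row example above this would run, but the results would not be meaningful. The imputation model and the number of imputations need to be chosen with the real data in mind.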

Last edited by Clyde Schechter; 16 Jul 2018, 13:00. Reason: Added a line to the code to -drop xb- after the first -replace yhat = xb...-
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#3

16 Jul 2018, 13:05

Awesome, Clyde, many thanks. I will try the mi command!