Negative predictions of wage

George Ford

Join Date: Aug 2014

Posts: 3152
#16

07 Nov 2023, 07:24

When you transform from the log to the linear form, you have to adjust by the variance of the regression.

You need to make sure e(rmse) is still in the ereturn results when you make the calculation, or else save it in a local for later use.
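A minimal sketch of that retransformation, assuming normal, homoskedastic errors on the log scale. The names log_y, x1 and x2 are placeholders, not variables from this thread.

```stata
* Sketch only: log_y, x1 and x2 are hypothetical variable names
regress log_y x1 x2

* Save the residual variance now, before another estimation command
* overwrites the e() results
local s2 = e(rmse)^2

* Linear prediction on the log scale
predict double xb_hat, xb

* Retransform to levels: E[y|x] = exp(xb) * exp(sigma^2/2) under normality
generate double y_hat = exp(xb_hat) * exp(`s2'/2)
```

Storing the variance in a local means later estimation commands cannot silently change the adjustment factor.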

Clyde's idea is a clever solution, since predict will give you what you want directly. Jeff's proposal to simplify the first stage is probably a good one.

If your idea is to predict Yi, then it may make sense just to focus on that and not worry so much about selection bias, etc. Prediction is a different game than hypothesis testing.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#17

07 Nov 2023, 08:59

Re #11. No, that's not what I meant. Sorry I wasn't clear. I meant doing a Poisson regression with the untransformed outcome variable. So not -poisson log_wage...-, but rather -poisson wage...-. But, very important, you must do this with -vce(robust)-.
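A minimal sketch of that suggestion, with wage, x1 and x2 as placeholder names rather than variables from this thread:

```stata
* Sketch only: wage, x1 and x2 are hypothetical variable names
* Poisson regression on the untransformed outcome, with robust standard errors
poisson wage x1 x2, vce(robust)

* The default prediction after poisson is exp(xb), so it is already on the
* wage scale and can never be negative
predict double wage_hat, n
```

No retransformation step is needed here, because the model is fitted to the mean of wage directly rather than to the mean of its logarithm.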
1 like
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2469
#18

07 Nov 2023, 09:45

Something else that just occurred to me:
for prediction, you probably want to recalculate the IMR the usual way instead of using the score.
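One reason the two can differ: the score after probit is the derivative of the log likelihood with respect to the index, which equals phi(zg)/Phi(zg) when the outcome is 1 but -phi(zg)/(1 - Phi(zg)) when it is 0. A minimal sketch of the comparison, with working, z1 and z2 as placeholder names:

```stata
* Sketch only: working, z1 and z2 are hypothetical variable names
probit working z1 z2

* Index from the selection equation
predict double zg, xb

* Inverse Mills ratio computed "the usual way", defined for every observation
generate double imr = normalden(zg) / normal(zg)

* Score (generalized residual), for comparison
predict double gres, score

* The two should coincide where working == 1 and differ where working == 0
summarize imr gres if working == 1
summarize imr gres if working == 0
```

If predictions are made for people who are not working, the two versions feed different values into the wage equation, so the choice matters for out-of-sample prediction.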
Comment
Facundo Duran

Join Date: Sep 2023

Posts: 28
#19

08 Nov 2023, 07:22

Originally posted by George Ford

When you transform from the log to the linear form, you have to adjust by the variance of the regression.

You need to make sure e(rmse) is still in the ereturn results when you make the calculation, or else save it in a local for later use.

Clyde's idea is a clever solution, since predict will give you what you want directly. Jeff's proposal to simplify the first stage is probably a good one.

If your idea is to predict Yi, then it may make sense just to focus on that and not worry so much about selection bias, etc. Prediction is a different game than hypothesis testing.

Thank you, your answer is very clear!
So would you recommend that I not correct for selection bias? Instead, should I just do an xtreg of the wage as a function of the explanatory variables?
Comment
Facundo Duran

Join Date: Sep 2023

Posts: 28
#20

08 Nov 2023, 07:23

Originally posted by Clyde Schechter

Re #11. No, that's not what I meant. Sorry I wasn't clear. I meant doing a Poisson regression with the untransformed outcome variable. So not -poisson log_wage...-, but rather -poisson wage...-. But, very important, you must do this with -vce(robust)-.

Thank you very much, now I understand the idea!
Comment
Facundo Duran

Join Date: Sep 2023

Posts: 28
#21

08 Nov 2023, 07:25

Originally posted by FernandoRios

Something else that just occurred to me:
for prediction, you probably want to recalculate the IMR the usual way instead of using the score.

Ah, what would be the reason? Isn't it equivalent to doing it with the score?
Another question: in this model there is no way to include categorical variables, since I would then have to add the average of each variable to the regression, right?
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2167
#22

08 Nov 2023, 11:21

Several issues here. In labor economics when these selection methods are applied, the goal is to consistently estimate the parameters in the wage OFFER function. But the wage offer = wage only when someone is working. If, later, you wish to predict the wage offer, you wouldn't include the inverse Mills ratio in the prediction. If you want to predict wage conditional on being in the workforce, then the IMR would be included. I can't tell which you're interested in.

George is correct that you can't just exponentiate the fitted value for the logarithm. Just multiplying by a constant adjustment factor might not be enough. At a minimum, you might estimate a different variance for each time period. This is not easy when using a sample selection correction.

As Clyde suggested, it would be more direct to use Poisson regression with an exponential mean and use this for prediction. However, straight Poisson won't allow you to do a Heckman correction. Just adding the IMR to Poisson regression -- actually, its natural log -- is valid as a test, but only approximately as a selection correction (and the quality of the approximation isn't generally known). Equation (18.48) in my 2010 MIT Press book shows the equation you'd have to estimate to do the adjustment for sample selection.
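The distinction in the first paragraph, between predicting the wage offer and predicting the wage conditional on working, can be sketched as follows. This assumes a log-wage regression that includes the estimated IMR as a regressor. The names log_wage, x1, x2 and lambda are placeholders, and both predictions are still on the log scale.

```stata
* Sketch only: log_wage, x1, x2 and lambda (the estimated IMR) are
* hypothetical variable names
regress log_wage x1 x2 lambda

* (a) Log wage conditional on working: keep the IMR term in the index
predict double lw_cond, xb

* (b) Log wage offer: switch the IMR term off before predicting
generate double lambda_saved = lambda
replace lambda = 0
predict double lw_offer, xb

* Restore the original IMR values
replace lambda = lambda_saved
drop lambda_saved
```

This works because predict evaluates the fitted index at the current values of the regressors, so zeroing lambda removes the selection term without re-estimating anything.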
1 like
Comment

Facundo Duran

Join Date: Sep 2023
Posts: 28

#23

13 Nov 2023, 05:02

Originally posted by Jeff Wooldridge

Several issues here. In labor economics when these selection methods are applied, the goal is to consistently estimate the parameters in the wage OFFER function. But the wage offer = wage only when someone is working. If, later, you wish to predict the wage offer, you wouldn't include the inverse Mills ratio in the prediction. If you want to predict wage conditional on being in the workforce, then the IMR would be included. I can't tell which you're interested in.

George is correct that you can't just exponentiate the fitted value for the logarithm. Just multiplying by a constant adjustment factor might not be enough. At a minimum, you might estimate a different variance for each time period. This is not easy when using a sample selection correction.

As Clyde suggested, it would be more direct to use Poisson regression with an exponential mean and use this for prediction. However, straight Poisson won't allow you to do a Heckman correction. Just adding the IMR to Poisson regression -- actually, its natural log -- is valid as a test, but only approximately as a selection correction (and the quality of the approximation isn't generally known). Equation (18.48) in my 2010 MIT Press book shows the equation you'd have to estimate to do the adjustment for sample selection.

Thank you so much for your help!
I am trying to model the logarithm and then exponentiate the prediction, correcting with the rmse.
However, when I make predictions, the predicted values are very large.
In case I explained myself badly or was unclear: the objective is to consistently estimate what the salary of people who do not work would be if they worked, as one would get by running predict after the xtheckman or xtheckmanfe command. My goal is to consider a counterfactual scenario in which workers do not suffer the interruptions in their formal work history that occur in Argentina.

These are the descriptive statistics of the variable I want to predict:

Code:

. summ wb_tot

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      wb_tot |  3,012,636    1035.359    1071.342   52.82394   9999.839

Then I run 312 probits, one per period, of the variable working (= 1 if working, 0 otherwise) on age and unemployment plus the averages of age and unemployment, to obtain the inverse Mills ratio (IMR):

Code:

gen lambda2 = .
* one probit per period; the score is stored as the selection term
forvalues i = 1/312 {
    di "Periodo=" `i'
    probit working c.edad c.desempleo c.medad c.mdesempleo if periodo == `i'
    predict IMRR, score
    replace lambda2 = IMRR if periodo == `i'
    drop IMRR
}

After that, I run the regression of the log of the salary on age, experience (tenure), productivity by branch of activity, the means of age, unemployment, productivity and tenure, the time dummies, and the Mills ratio interacted with the period variable:

Code:

reg log_wb_tot tenure edad productividad medad mtenure mproductividad mdesempleo t1 t2 t3 t4 t5 ... t312 periodo#c.lambda2

Code:

      Source |       SS           df       MS      Number of obs   = 3,012,636
-------------+----------------------------------   F(629, 3012006) =   2686.86
       Model |  761761.228       629  1211.06714   Prob > F        =    0.0000
    Residual |  1357622.28 3,012,006  .450736909   R-squared       =    0.3594
-------------+----------------------------------   Adj R-squared   =    0.3593
       Total |   2119383.5 3,012,635  .703498268   Root MSE        =    .67137

-----------------------------------------------------------------------------------
       log_wb_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
           tenure |   .0016643   .0000139   119.68   0.000      .001637    .0016915
             edad |   .0950091   .2080477     0.46   0.648     -.312757    .5027752
    productividad |   .0005882   .0000227    25.91   0.000     .0005437    .0006326
            medad |  -.0730807   .2080539    -0.35   0.725    -.4808591    .3346977
          mtenure |    .004262   .0000155   274.94   0.000     .0042316    .0042924
   mproductividad |   .0058454   .0000413   141.62   0.000     .0057645    .0059263
       mdesempleo |   -1.85624   .0096591  -192.17   0.000    -1.875172   -1.837309
               t1 |   2.902968   5.114918     0.57   0.570    -7.122091    12.92803
               t2 |   2.837135   5.114904     0.55   0.579    -7.187896    12.86217
               t3 |   2.786675   5.114892     0.54   0.586    -7.238334    12.81168
               t4 |   2.825018   5.114884     0.55   0.581    -7.199974    12.85001
               t5 |   2.880242   5.114897     0.56   0.573    -7.144777    12.90526
               ...
             t311 |  -.3498321   .1961563    -1.78   0.075    -.7342916    .0346273
             t312 |          0  (omitted)
                  |
periodo#c.lambda2 |
               1  |   -.361941   .0513046    -7.05   0.000    -.4624961   -.2613859
               2  |  -.3207056   .0502389    -6.38   0.000    -.4191721    -.222239
               3  |  -.2925191   .0497737    -5.88   0.000    -.3900739   -.1949644
               4  |   -.339305   .0491191    -6.91   0.000    -.4355768   -.2430333
               5  |  -.3843913   .0500555    -7.68   0.000    -.4824983   -.2862843
               ...
             311  |  -.1281175   .1095578    -1.17   0.242    -.3428469     .086612
             312  |  -.1074686   .1047756    -1.03   0.305    -.3128251    .0978879
                  |
            _cons |   3.031493   2.514698     1.21   0.228    -1.897226    7.960212

After that I make the prediction and correct with the exponential

Code:

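* linear prediction on the log scale
predict double yhat, xb
* retransform to levels; assumes normal, homoskedastic errors
replace yhat = exp(yhat) * exp(e(rmse)^2 / 2)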

But the predicted salary comes out far too high:

Code:

    
. summ yhat

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        yhat |  8,487,648    142926.1     3045515   14.85537   1.81e+08

However, if I fit the following fixed-effects model and then predict, the predictions are quite coherent:

Code:

xtset id_trabajador periodo
xtreg log_wb_tot edad tenure productividad desempleo , fe

Code:

Fixed-effects (within) regression               Number of obs     =  3,012,636
Group variable: id_trabaja~r                    Number of groups  =     26,327

R-sq:                                           Obs per group:
     within  = 0.2147                                         min =          1
     between = 0.1100                                         avg =      114.4
     overall = 0.1284                                         max =        312

                                                F(5,2986304)      =  163257.39
corr(u_i, Xb)  = -0.1257                        Prob > F          =     0.0000

-------------------------------------------------------------------------------
   log_wb_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
         edad |  -.1126464   .0005679  -198.34   0.000    -.1137596   -.1115333
       tenure |   .0045672   .0000186   244.98   0.000     .0045306    .0046037
productividad |   .0070082   .0000145   482.24   0.000     .0069798    .0070367
    desempleo |  -2.979933   .0091035  -327.34   0.000    -2.997775    -2.96209
        _cons |   9.261534   .0147454   628.10   0.000     9.232634    9.290435
--------------+----------------------------------------------------------------
      sigma_u |  .67761924
      sigma_e |  .49677159
          rho |  .65042557   (fraction of variance due to u_i)
-------------------------------------------------------------------------------
F test that all u_i=0: F(26326, 2986304) = 163.07            Prob > F = 0.0000

And if I predict the salary, it looks pretty similar to the original wb_tot

Code:

. sum yhatxt

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      yhatxt |  8,487,648    698.8388    314.9869   86.53976   8355.217

I don't understand why there is such a difference between the predicted salary corrected by Heckman and using fixed effects.

Last edited by Facundo Duran; 13 Nov 2023, 05:44.
