
  • #16
    When you transform from the log to the linear form, you have to adjust by the variance of the regression.

    Need to make sure you have e(rmse) in ereturn before you make the calculation, or else save it as a local for later use.

    Clyde's idea is a clever solution, since predict will give you what you want directly. Jeff's proposal to simplify the first stage is probably a good one.

    If your idea is to predict Yi, then it may make sense just to focus on that and not worry so much about selection bias, etc. Prediction is a different game than hypothesis testing.
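A minimal sketch of the retransformation described above, on simulated data. The variable names (x, wage, log_wage) and the data-generating process are assumptions made for illustration, not the thread's data. The smearing line is Duan's alternative, which is not mentioned in the post and is shown only for comparison.

```stata
* Simulated data (illustrative only): log wage is linear in x with normal errors
clear
set seed 12345
set obs 5000
gen x = rnormal()
gen wage = exp(1 + 0.5*x + rnormal(0, 0.6))
gen log_wage = ln(wage)

regress log_wage x
local rmse = e(rmse)          // save now; a later estimation command overwrites e()

predict double xb_hat, xb
gen double wage_naive = exp(xb_hat)                 // ignores the error variance
gen double wage_adj   = exp(xb_hat + `rmse'^2/2)    // normal-theory adjustment

* Duan's smearing factor: sample mean of exp(residual), no normality assumed
predict double uhat, residuals
gen double exp_u = exp(uhat)
summarize exp_u, meanonly
gen double wage_smear = exp(xb_hat) * r(mean)

summarize wage wage_naive wage_adj wage_smear
```

Comparing the means in the final summarize shows how far the naive prediction falls below the observed mean wage.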


    • #17
      Re #11. No that's not what I meant. Sorry I wasn't clear. I meant doing a Poisson regression with the untransformed outcome variable. So not -poisson log_wage...-, rather -poisson wage...-. But, very important, you must do this with -vce(robust)-.
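A rough sketch of this suggestion on simulated data; the variables x and wage are hypothetical stand-ins. Stata will note that the dependent variable is not a count, which is expected here.

```stata
* Simulated data (illustrative only): a positive, continuous wage
clear
set seed 2023
set obs 5000
gen x = rnormal()
gen wage = exp(1 + 0.5*x + rnormal(0, 0.6))

* Poisson quasi-ML on the level of wage; robust SEs because wage is not a count
poisson wage x, vce(robust)

* Predicted mean exp(xb), already on the wage scale: no retransformation step
predict double wage_hat, n
summarize wage wage_hat
```

Because the model includes a constant, the mean of wage_hat should match the sample mean of wage in the estimation sample.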


      • #18
        Something else that just occurred to me: for prediction, you probably want to recalculate the IMR the usual way instead of using the score.
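One possible reading of this advice, sketched on simulated data (the variables z and working are hypothetical): after probit, the score equals the inverse Mills ratio only for observations with working == 1. For working == 0 it is a negative quantity, so the two differ exactly where out-of-sample predictions are needed.

```stata
* Simulated selection equation (illustrative only)
clear
set seed 42
set obs 5000
gen z = rnormal()
gen working = (0.3 + 0.8*z + rnormal() > 0)

probit working z
predict double zg, xb          // linear index
predict double sc, score       // derivative of the log likelihood w.r.t. the index

* Inverse Mills ratio computed the usual way, defined for every observation
gen double imr = normalden(zg) / normal(zg)

* What the score is: phi/Phi if working == 1, -phi/(1 - Phi) if working == 0
gen double sc_check = cond(working == 1, imr, -normalden(zg) / normal(-zg))

summarize sc imr if working == 1
summarize sc imr if working == 0
```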


        • #19
          Originally posted by George Ford
          When you transform from the log to the linear form, you have to adjust by the variance of the regression.

          Need to make sure you have e(rmse) in ereturn before you make the calculation, or else save it as a local for later use.

          Clyde's idea is a clever solution, since predict will give you what you want directly. Jeff's proposal to simplify the first stage is probably a good one.

          If your idea is to predict Yi, then it may make sense just to focus on that and not worry so much about selection bias, etc. Prediction is a different game than hypothesis testing.
          Thank you, your answer is very clear!
          So would you recommend that I not correct for selection bias? Instead, should I just do an xtreg of the wage as a function of the explanatory variables?


          • #20
            Originally posted by Clyde Schechter
            Re #11. No that's not what I meant. Sorry I wasn't clear. I meant doing a Poisson regression with the untransformed outcome variable. So not -poisson log_wage...-, rather -poisson wage...-. But, very important, you must do this with -vce(robust)-.
            Thank you very much, now I understand the idea!


            • #21
              Originally posted by FernandoRios
              Something else that just occurred to me
              for prediction you probably want to recalculate the imr the usual way instead of using score
              Ah, what would be the reason? Isn't it equivalent to doing it with the score?
              Another question: there is no way to add categorical variables to the model, since I would then also have to add the average of each variable to the regression, right?


              • #22
                Several issues here. In labor economics when these selection methods are applied, the goal is to consistently estimate the parameters in the wage OFFER function. But the wage offer = wage only when someone is working. If, later, you wish to predict the wage offer, you wouldn't include the inverse Mills ratio in the prediction. If you want to predict wage conditional on being in the workforce, then the IMR would be included. I can't tell which you're interested in.
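The distinction between the two prediction targets can be sketched with a hand-rolled two-step on simulated data. Everything here is an assumption for illustration: the variables z, x, v, u, the coefficients, and the error correlation are invented, and the standard errors of the manual second stage are not adjusted for the estimated IMR.

```stata
* Simulated selection model (illustrative only)
clear
set seed 7
set obs 10000
gen z = rnormal()                       // affects selection only
gen x = rnormal()
matrix C = (1, 0.5 \ 0.5, 1)
drawnorm v u, corr(C)                   // correlated selection and wage errors
gen working = (0.5 + z + 0.5*x + v > 0)
gen log_wage = 1 + 0.5*x + 0.6*u if working   // observed only for workers

* First stage: probit and inverse Mills ratio for every observation
probit working z x
predict double zg, xb
gen double imr = normalden(zg) / normal(zg)

* Second stage: runs on workers only, since log_wage is missing otherwise
regress log_wage x imr

* Wage OFFER prediction (log scale): leave the IMR term out
gen double offer_hat = _b[_cons] + _b[x]*x

* Prediction conditional on working (log scale): keep the IMR term
predict double cond_hat, xb

summarize offer_hat cond_hat
```

Both predictions are available for non-workers as well, but only offer_hat answers the counterfactual question of what they would be offered.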

                George is correct that you can't just exponentiate the fitted value for the logarithm. Just multiplying by a constant adjustment factor might not be enough. At a minimum, you might estimate a different variance for each time period. This is not easy when using a sample selection correction.
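The retransformation algebra behind this point, as a sketch. The symbols $w_{it}$, $x_{it}$, $\beta$, $u_{it}$, $\sigma$ and $\sigma_t$ are notation introduced here, not taken from the thread. A constant factor is correct only when $E[\exp(u_{it}) \mid x_{it}]$ does not depend on $x_{it}$ or $t$.

```latex
\begin{align*}
\log w_{it} &= x_{it}\beta + u_{it}, \\
E[w_{it} \mid x_{it}] &= \exp(x_{it}\beta)\, E[\exp(u_{it}) \mid x_{it}], \\
E[w_{it} \mid x_{it}] &= \exp\!\left(x_{it}\beta + \sigma^2/2\right)
  \quad \text{if } u_{it} \mid x_{it} \sim N(0, \sigma^2), \\
E[w_{it} \mid x_{it}] &= \exp\!\left(x_{it}\beta + \sigma_t^2/2\right)
  \quad \text{if } u_{it} \mid x_{it} \sim N(0, \sigma_t^2).
\end{align*}
```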

                As Clyde suggested, it would be more direct to use Poisson regression with an exponential mean and use this for prediction. However, straight Poisson won't allow you to do a Heckman correction. Just adding the IMR to Poisson regression -- actually, its natural log -- is valid as a test, but only approximately as a selection correction (and the quality of the approximation isn't generally known). Equation (18.48) in my 2010 MIT Press book shows the equation you'd have to estimate to do the adjustment for sample selection.


                • #23
                  Originally posted by Jeff Wooldridge
                  Several issues here. In labor economics when these selection methods are applied, the goal is to consistently estimate the parameters in the wage OFFER function. But the wage offer = wage only when someone is working. If, later, you wish to predict the wage offer, you wouldn't include the inverse Mills ratio in the prediction. If you want to predict wage conditional on being in the workforce, then the IMR would be included. I can't tell which you're interested in.

                  George is correct that you can't just exponentiate the fitted value for the logarithm. Just multiplying by a constant adjustment factor might not be enough. At a minimum, you might estimate a different variance for each time period. This is not easy when using a sample selection correction.

                  As Clyde suggested, it would be more direct to use Poisson regression with an exponential mean and use this for prediction. However, straight Poisson won't allow you to do a Heckman correction. Just adding the IMR to Poisson regression -- actually, its natural log -- is valid as a test, but only approximately as a selection correction (and the quality of the approximation isn't generally known). Equation (18.48) in my 2010 MIT Press book shows the equation you'd have to estimate to do the adjustment for sample selection.

                  Thank you so much for your help!
                  I am trying to model the logarithm of the salary and then exponentiate the prediction, correcting with the RMSE.
                  However, when I make predictions, the predicted values are very large.
                  In case I explained myself badly or was not clear: the objective is to consistently estimate what the salary of people who do not work would be if they worked, as one would get from predict after the xtheckman or xtheckmanfe command. My aim is to consider a counterfactual scenario in which workers do not suffer interruptions in their formal work history, as happens in Argentina.


                  These are the descriptive statistics of the variable I want to predict.

                  Code:
                  . summ wb_tot
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                        wb_tot |  3,012,636    1035.359    1071.342   52.82394   9999.839

                  Then I run 312 probits, one per period, of the variable working (=1 if working, 0 otherwise) on age and unemployment and the averages of age and unemployment, to obtain the inverse Mills ratio (IMR).

                  Code:
                  gen lambda2 = .
                  * One probit per period; store that period's score in lambda2
                  forvalues i = 1/312 {
                      di "Periodo=" `i'
                      probit working c.edad c.desempleo c.medad c.mdesempleo if periodo == `i'
                      predict IMRR, score
                      replace lambda2 = IMRR if periodo == `i'
                      drop IMRR
                  }
                  After that, I regress the logarithm of the salary on age, experience (tenure), productivity by branch of activity, the means of age, unemployment, productivity and tenure, the time dummies, and the Mills ratio interacted with the period variable.

                  Code:
                  reg log_wb_tot tenure edad productividad medad mtenure mproductividad mdesempleo t1 t2 t3 t4 t5 ... t312 periodo#c.lambda2
                  Code:
                        Source |       SS           df       MS      Number of obs   = 3,012,636
                  -------------+----------------------------------   F(629, 3012006) =   2686.86
                         Model |  761761.228       629  1211.06714   Prob > F        =    0.0000
                      Residual |  1357622.28 3,012,006  .450736909   R-squared       =    0.3594
                  -------------+----------------------------------   Adj R-squared   =    0.3593
                         Total |   2119383.5 3,012,635  .703498268   Root MSE        =    .67137
                  
                  -----------------------------------------------------------------------------------
                         log_wb_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  ------------------+----------------------------------------------------------------
                             tenure |   .0016643   .0000139   119.68   0.000      .001637    .0016915
                               edad |   .0950091   .2080477     0.46   0.648     -.312757    .5027752
                      productividad |   .0005882   .0000227    25.91   0.000     .0005437    .0006326
                              medad |  -.0730807   .2080539    -0.35   0.725    -.4808591    .3346977
                            mtenure |    .004262   .0000155   274.94   0.000     .0042316    .0042924
                     mproductividad |   .0058454   .0000413   141.62   0.000     .0057645    .0059263
                         mdesempleo |   -1.85624   .0096591  -192.17   0.000    -1.875172   -1.837309
                                 t1 |   2.902968   5.114918     0.57   0.570    -7.122091    12.92803
                                 t2 |   2.837135   5.114904     0.55   0.579    -7.187896    12.86217
                                 t3 |   2.786675   5.114892     0.54   0.586    -7.238334    12.81168
                                 t4 |   2.825018   5.114884     0.55   0.581    -7.199974    12.85001
                                 t5 |   2.880242   5.114897     0.56   0.573    -7.144777    12.90526
                                 ...
                               t311 |  -.3498321   .1961563    -1.78   0.075    -.7342916    .0346273
                               t312 |          0  (omitted)
                                    |
                  periodo#c.lambda2 |
                                 1  |   -.361941   .0513046    -7.05   0.000    -.4624961   -.2613859
                                 2  |  -.3207056   .0502389    -6.38   0.000    -.4191721    -.222239
                                 3  |  -.2925191   .0497737    -5.88   0.000    -.3900739   -.1949644
                                 4  |   -.339305   .0491191    -6.91   0.000    -.4355768   -.2430333
                                 5  |  -.3843913   .0500555    -7.68   0.000    -.4824983   -.2862843
                                 ...
                               311  |  -.1281175   .1095578    -1.17   0.242    -.3428469     .086612
                               312  |  -.1074686   .1047756    -1.03   0.305    -.3128251    .0978879
                                    |
                              _cons |   3.031493   2.514698     1.21   0.228    -1.897226    7.960212
                  After that, I make the prediction and retransform it with the exponential:
                  Code:
                  predict yhat, xb
                  replace yhat = exp(yhat) * exp((`e(rmse)'^2)/2)
                  But the predicted salary comes out very high:
                  Code:
                      
                  . summ yhat
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          yhat |  8,487,648    142926.1     3045515   14.85537   1.81e+08
                  By contrast, if I fit the following fixed-effects model and then predict, the predictions look quite coherent.

                  Code:
                  xtset id_trabajador periodo
                  xtreg log_wb_tot edad tenure productividad desempleo , fe
                  Code:
                  Fixed-effects (within) regression               Number of obs     =  3,012,636
                  Group variable: id_trabaja~r                    Number of groups  =     26,327
                  
                  R-sq:                                           Obs per group:
                       within  = 0.2147                                         min =          1
                       between = 0.1100                                         avg =      114.4
                       overall = 0.1284                                         max =        312
                  
                                                                  F(5,2986304)      =  163257.39
                  corr(u_i, Xb)  = -0.1257                        Prob > F          =     0.0000
                  
                  -------------------------------------------------------------------------------
                     log_wb_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------+----------------------------------------------------------------
                           edad |  -.1126464   .0005679  -198.34   0.000    -.1137596   -.1115333
                         tenure |   .0045672   .0000186   244.98   0.000     .0045306    .0046037
                  productividad |   .0070082   .0000145   482.24   0.000     .0069798    .0070367
                      desempleo |  -2.979933   .0091035  -327.34   0.000    -2.997775    -2.96209
                          _cons |   9.261534   .0147454   628.10   0.000     9.232634    9.290435
                  --------------+----------------------------------------------------------------
                        sigma_u |  .67761924
                        sigma_e |  .49677159
                            rho |  .65042557   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------
                  F test that all u_i=0: F(26326, 2986304) = 163.07            Prob > F = 0.0000
                  And if I predict the salary, it looks pretty similar to the original wb_tot

                  Code:
                  . sum yhatxt
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                        yhatxt |  8,487,648    698.8388    314.9869   86.53976   8355.217

                  I don't understand why there is such a difference between the salary predicted with the Heckman correction and the salary predicted with fixed effects.
                  Last edited by Facundo Duran; 13 Nov 2023, 05:44.
