  • Transformation of a highly skewed dependent variable and marginal effects

    Dear members of the list,

    I am currently working with panel data in which my dependent variable is overeducation. It has many 0s (corresponding to individuals in the right job, i.e. matched) and positive values (1, 2, 3, 4, ...) giving the estimated number of years of education that overeducated workers have in excess of the years required for their jobs.

    As you can imagine, this dependent variable is heavily skewed to the right.

    [Attached image: distribution of the dependent variable]


    The zero values are important to me, but so are the rest. I could collapse all the non-zero values into 1 and treat my dependent variable as binary, but that would lose information: the values from 1 to 6 represent different levels of mismatch, which is valuable information.

    I have been told that I should transform my dependent variable using:
    Code:
    gen [new_dep_var] = asinh([old_dep_var])
    The problem arises when interpreting the coefficients resulting from this transformation. I am aware of a paper on this matter by Edward C. Norton. It offers a way to estimate the marginal effects of the variables in the model on the original scale of the dependent variable, but the routine seems quite complicated to me. Norton provides a loop for this purpose in the paper. Does anyone know whether such a loop has already been incorporated into Stata?

    Besides, is there any other way of addressing the problem created by the particular distribution of this dependent variable? How would you deal with it? Any advice would be greatly appreciated.

    Thanks for your attention

    Luis Ortiz



  • #2
    Dear Luis Ortiz,

    This looks like count data, so why not use Poisson regression?

    Best wishes,

    Joao


    • #3
      Many thanks, Joao

      The distribution shown on the following page about Poisson regression in Stata certainly looks similar to the one that I posted:

      https://stats.oarc.ucla.edu/stata/da...itional%20mean.

      But I'm not quite sure my dependent variable is a count variable. The variable captures either job match (value 0, i.e. no excess years of education relative to those required by the job) or mismatch at different levels, from low mismatch (only one year of education in excess of the number of years required by the job) to six excess years.

      The distribution is clearly (and reasonably) skewed, but I suspect the variable is not a count.

      Thanks a lot for your suggestion and attention

      Best wishes

      Luis Ortiz



      • #4
        Dear Luis Ortiz,

        This still looks like count data, but with an upper bound at 6, so Poisson is not suitable. Have a look at the glm command (family: binomial).
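
A minimal sketch of what the glm suggestion might look like, assuming the outcome is stored in a variable called overed (integers from 0 to 6), with hypothetical covariates x1 and x2 and a hypothetical panel identifier id. Using 6 as the binomial denominator and clustering the standard errors by individual are assumptions made here for illustration, not part of the advice above:

```stata
* Hypothetical names: overed (0-6 excess years), x1 x2 (covariates), id (panel identifier)
* family(binomial 6) treats the outcome as bounded between 0 and 6
glm overed x1 x2, family(binomial 6) link(logit) vce(cluster id)

* Average marginal effects of each covariate on the predicted mean of overed
margins, dydx(*)
```

With family(binomial 6), the default prediction after glm should be the expected number of excess years, so margins reports effects on the original 0 to 6 scale and no back-transformation is involved.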

        Best wishes,

        Joao


        • #5
          A more general point here -- entirely consistent I think with Joao Santos Silva's advice -- is that outcomes that are binary or ordered categorical variables can be quite highly skewed, at least compared with others, but transformation is rarely the best treatment (and I write as someone more positive about transformations than is common).

          Consider, for example, the auto data. On moment-based skewness, foreign (Car origin) appears quite skewed, and the same can be said using the undeservedly not-so-well-known L-moment-based measure t_3. But if foreign is an outcome, the best remedy is usually just to fit a model, such as one using a logit or probit link function (which does in effect fit on a transformed scale, and reports in ways that don't require back-transformation).

          Code:
          . sysuse auto, clear
          (1978 automobile data)
          
          . moments
          
          -----------------------------------------------------------------------
                          n = 69 |       mean          SD    skewness    kurtosis
          -----------------------+-----------------------------------------------
                           Price |   6146.043    2912.440       1.688       5.032
                   Mileage (mpg) |     21.290       5.866       0.995       3.997
              Repair record 1978 |      3.406       0.990      -0.057       2.678
                  Headroom (in.) |      3.000       0.853       0.197       2.144
           Trunk space (cu. ft.) |     13.928       4.343      -0.044       2.159
                   Weight (lbs.) |   3032.029     792.851       0.118       2.073
                    Length (in.) |    188.290      22.747      -0.076       2.000
               Turn circle (ft.) |     39.797       4.441       0.071       2.228
          Displacement (cu. in.) |    198.000      93.148       0.581       2.354
                      Gear ratio |      2.999       0.463       0.279       2.109
                      Car origin |      0.304       0.464       0.850       1.723
          -----------------------------------------------------------------------
          
          . lmoments
          
          -----------------------------------------------------------------------
                          n = 69 |        l_1         l_2         l_3         l_4
          -----------------------+-----------------------------------------------
                           Price |   6146.043    1432.006     607.807     313.122
                   Mileage (mpg) |     21.290       3.209       0.612       0.501
              Repair record 1978 |      3.406       0.538       0.016       0.068
                  Headroom (in.) |      3.000       0.486       0.025       0.016
           Trunk space (cu. ft.) |     13.928       2.488      -0.061       0.152
                   Weight (lbs.) |   3032.029     456.479       9.005      12.074
                    Length (in.) |    188.290      13.121      -0.324       0.439
               Turn circle (ft.) |     39.797       2.535      -0.009       0.092
          Displacement (cu. in.) |    198.000      52.500       8.227       1.495
                      Gear ratio |      2.999       0.265       0.020       0.017
                      Car origin |      0.304       0.215       0.087      -0.014
          -----------------------------------------------------------------------
          
          -----------------------------------------------------------
                          n = 69 |          t         t_3         t_4
          -----------------------+-----------------------------------
                           Price |      0.233       0.424       0.219
                   Mileage (mpg) |      0.151       0.191       0.156
              Repair record 1978 |      0.158       0.030       0.127
                  Headroom (in.) |      0.162       0.051       0.034
           Trunk space (cu. ft.) |      0.179      -0.025       0.061
                   Weight (lbs.) |      0.151       0.020       0.026
                    Length (in.) |      0.070      -0.025       0.033
               Turn circle (ft.) |      0.064      -0.003       0.036
          Displacement (cu. in.) |      0.265       0.157       0.028
                      Gear ratio |      0.088       0.075       0.063
                      Car origin |      0.706       0.403      -0.063
          -----------------------------------------------------------
          moments is a convenience wrapper from SSC for calls to summarize.

          lmoments is also from SSC.
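
The remark about fitting a logit model directly can be sketched with the same data. The choice of weight and mpg as predictors is arbitrary and made here only for illustration:

```stata
sysuse auto, clear

* Model the binary outcome directly; the logit link plays the role of the transformation
logit foreign weight mpg

* Average marginal effects on the probability scale, with no back-transformation step
margins, dydx(*)
```

The skewness of foreign never has to be addressed explicitly: the link function handles the scale, and margins reports results as changes in the probability that a car is foreign.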



          • #6
            Dear Joao and Nick,

            Many thanks for your advice.

            I take note of your recommendation not to transform the dependent variable. I will explore the GLM option that Joao points out.

            Again, many thanks

            Luis Ortiz
