Transformation of a highly skewed dependent variable and marginal effects

Luis Ortiz

Join Date: Dec 2014

Posts: 97
#1

Transformation of a highly skewed dependent variable and marginal effects

14 Jun 2023, 12:41

Dear members of the list,

I am currently doing research with panel data in which my dependent variable is overeducation. It has many 0s (corresponding to individuals in the right job, i.e. matched) and positive values (1, 2, 3, 4...) giving the estimated number of years of education that overeducated workers have above the years required for their jobs.

As you can imagine, this dependent variable is heavily skewed to the right.

The zero values are important to me, but the rest of the values are important too. I could merge all the non-zero values into 1, treating my dependent variable as binary, but this would mean losing information, because the range from 1 to 6 captures different levels of mismatch, and this is valuable information.

I have been told that I should transform my dependent variable using...

Code:

gen [new_dep_var] = asinh([old_dep_var])
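For reference, a small sketch of what this transformation does to the values in question. The variable names (y, y_asinh, y_back) are made up for the illustration; asinh() and sinh() are built-in Stata functions. It shows that 0 is mapped to 0 and that sinh() undoes the transformation for a single value, which is not the same thing as recovering marginal effects on the original scale after a regression.

```stata
* Hypothetical illustration: values 0 to 6, as in the dependent variable described above
clear
set obs 7
generate double y = _n - 1

* Inverse hyperbolic sine, defined at zero unlike log()
generate double y_asinh = asinh(y)

* sinh() inverts asinh() value by value
generate double y_back = sinh(y_asinh)

list, noobs
```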

The problem comes when interpreting the coefficients resulting from this transformation. I am aware of a paper on this matter by Edward C. Norton. It offers a way to estimate the marginal effects of the variables in the model on the original scale of the dependent variable, but the routine seems quite complicated to me. Norton provides a loop for this purpose in the paper. Does anyone know whether such a loop has already been incorporated into Stata?

Besides, is there any other way of addressing the problem created by the particular distribution of this dependent variable? How would you deal with it? Any advice would be greatly welcome.

Thanks for your attention

Luis Ortiz
Tags: average marginal effects, functions, skewness, variables transformation
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#2

14 Jun 2023, 12:57

Dear Luis Ortiz,

This looks like count data, so why not use Poisson regression?

Best wishes,

Joao
Luis Ortiz

Join Date: Dec 2014

Posts: 97
#3

14 Jun 2023, 13:32

Many thanks, Joao

The distribution shown on the following page about Poisson regression in Stata certainly looks similar to the one that I posted:

https://stats.oarc.ucla.edu/stata/da...itional%20mean.

But I'm not quite sure my dependent variable is a count variable. The variable captures either job match (value 0, i.e. no excess years of education relative to those required by the job) or mismatch at different levels, from low mismatch (one excess year of education relative to the years required by the job) up to six excess years.

The distribution is clearly (and reasonably) skewed, but I suspect the variable is not a count variable.

Thanks a lot for your suggestion and attention

Best wishes

Luis Ortiz
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#4

14 Jun 2023, 15:07

Dear Luis Ortiz,

This still looks like count data, but with an upper bound at 6, so Poisson is not suitable. Have a look at the glm command (family: binomial).
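As a rough illustration of this suggestion, here is a minimal sketch on simulated data. The variable names (id, x, overed) and the data-generating process are made up for the example; only glm with its family(binomial 6) and link(logit) options, and margins, are standard Stata. The outcome is treated as the number of excess years out of a maximum of 6.

```stata
* Simulated data: all names and parameter values are hypothetical
clear
set seed 12345
set obs 1000
generate id = ceil(_n/5)                               // 200 made-up individuals, 5 waves each
generate x = rnormal()                                 // one made-up covariate
generate overed = rbinomial(6, invlogit(-2 + 0.5*x))   // excess years, 0 to 6

* Binomial GLM with denominator 6; standard errors clustered on the individual
glm overed x, family(binomial 6) link(logit) vce(cluster id)

* Average marginal effect of x on the expected number of excess years
margins, dydx(x)
```

After glm, margins should work by default with the predicted mean of the outcome, so the effect is reported in years of overeducation and no back-transformation is involved.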

Best wishes,

Joao

Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

14 Jun 2023, 16:52

A more general point here -- entirely consistent I think with Joao Santos Silva's advice -- is that outcomes that are binary or ordered categorical variables can be quite highly skewed, at least compared with others, but transformation is rarely the best treatment (and I write as someone more positive about transformations than is common).

Consider for example the auto data. On moment-based skewness foreign (Car origin) appears quite skewed and the same can be said using the undeservedly not-so-well-known L-moment-based measure t_3. But if foreign is an outcome, the best remedy is usually just a model fit such as using a logit or probit link function (which does in effect fit on a transformed scale, and report in ways that don't require back-transformation).

Code:

. sysuse auto, clear
(1978 automobile data)

. moments

-----------------------------------------------------------------------
                n = 69 |       mean          SD    skewness    kurtosis
-----------------------+-----------------------------------------------
                 Price |   6146.043    2912.440       1.688       5.032
         Mileage (mpg) |     21.290       5.866       0.995       3.997
    Repair record 1978 |      3.406       0.990      -0.057       2.678
        Headroom (in.) |      3.000       0.853       0.197       2.144
 Trunk space (cu. ft.) |     13.928       4.343      -0.044       2.159
         Weight (lbs.) |   3032.029     792.851       0.118       2.073
          Length (in.) |    188.290      22.747      -0.076       2.000
     Turn circle (ft.) |     39.797       4.441       0.071       2.228
Displacement (cu. in.) |    198.000      93.148       0.581       2.354
            Gear ratio |      2.999       0.463       0.279       2.109
            Car origin |      0.304       0.464       0.850       1.723
-----------------------------------------------------------------------

. lmoments

-----------------------------------------------------------------------
                n = 69 |        l_1         l_2         l_3         l_4
-----------------------+-----------------------------------------------
                 Price |   6146.043    1432.006     607.807     313.122
         Mileage (mpg) |     21.290       3.209       0.612       0.501
    Repair record 1978 |      3.406       0.538       0.016       0.068
        Headroom (in.) |      3.000       0.486       0.025       0.016
 Trunk space (cu. ft.) |     13.928       2.488      -0.061       0.152
         Weight (lbs.) |   3032.029     456.479       9.005      12.074
          Length (in.) |    188.290      13.121      -0.324       0.439
     Turn circle (ft.) |     39.797       2.535      -0.009       0.092
Displacement (cu. in.) |    198.000      52.500       8.227       1.495
            Gear ratio |      2.999       0.265       0.020       0.017
            Car origin |      0.304       0.215       0.087      -0.014
-----------------------------------------------------------------------

-----------------------------------------------------------
                n = 69 |          t         t_3         t_4
-----------------------+-----------------------------------
                 Price |      0.233       0.424       0.219
         Mileage (mpg) |      0.151       0.191       0.156
    Repair record 1978 |      0.158       0.030       0.127
        Headroom (in.) |      0.162       0.051       0.034
 Trunk space (cu. ft.) |      0.179      -0.025       0.061
         Weight (lbs.) |      0.151       0.020       0.026
          Length (in.) |      0.070      -0.025       0.033
     Turn circle (ft.) |      0.064      -0.003       0.036
Displacement (cu. in.) |      0.265       0.157       0.028
            Gear ratio |      0.088       0.075       0.063
            Car origin |      0.706       0.403      -0.063
-----------------------------------------------------------

moments is a convenience wrapper from SSC for calls to summarize.

lmoments is also from SSC.
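To make the logit remark above concrete, a minimal sketch with the same auto data follows. The choice of mpg and weight as predictors is arbitrary and only for illustration.

```stata
sysuse auto, clear

* Logit model for the skewed binary outcome; predictors chosen arbitrarily
logit foreign mpg weight

* Average marginal effects on the probability scale, with no back-transformation
margins, dydx(*)
```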


Luis Ortiz

Join Date: Dec 2014

Posts: 97
#6

15 Jun 2023, 01:47

Dear Joao and Nick,

Many thanks for your advice.

I take note of your recommendation not to transform the dependent variable. I will explore the GLM option that Joao points out.

Again, many thanks

Luis Ortiz
