log transformation generating missing data

ruth matko

Join Date: Mar 2018

Posts: 15
#1

18 Jun 2018, 07:57

Hello,

I'm using Stata version 12.

I want to take logs of my variables, but then Stata generates many missing values and I cannot run the regression. Before taking the logarithm, those values are 0.

I heard that I cannot take log(0), but is it possible to change my missing values into 0, or to ignore the missing values somehow, so I can run the regression?

Thank you.

Best regards,
Ruth

Last edited by ruth matko; 18 Jun 2018, 08:00.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#2

18 Jun 2018, 08:23

Hi Ruth,
I think you already answered your own question. You cannot take the logarithm of zero, as it is undefined. You can certainly change those missing values to zero using code like:

Code:

replace log_var=0 if log_var==.

But then it would be difficult to interpret the results, because this is no longer a proper log transformation.
Regarding ignoring the missing values, that is what Stata already does. It drops observations with missing values when estimating whichever model you are fitting.
The better question here is: what type of model are you trying to estimate? Why do you need a log transformation?
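To see where the missing values come from, here is a minimal sketch on made-up data. The names x and log_x are only for illustration and are not from your dataset. ln() returns missing for zero, and count shows how many observations an estimation command would then skip.

```stata
* Toy data (values are made up): two zeros, two positive values
clear
input double x
0
0
0.5
2
end

* ln(0) is undefined, so Stata stores missing (.) for those rows
generate double log_x = ln(x)

* Observations that any model using log_x would drop
count if missing(log_x)

* The same observations, identified before transforming
count if x <= 0
```

Running the second count before transforming tells you in advance how much of the sample a log would cost.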
HTH
Fernando
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

18 Jun 2018, 09:05

You wish to log-transform your variables, but you didn't mention the reason. Apart from the fact that zero and negative values will pose an obstacle, there are several situations (perhaps the majority) where log-transforming won't be necessary, even when the variable is skewed.

Best regards,

Marcos
ruth matko

Join Date: Mar 2018

Posts: 15
#4

18 Jun 2018, 10:01

Ok, thank you.

I'm trying to estimate the following equation by GMM:

Code:

xtabond2 DepVar l.DepVar l2.DepVar MigrEU l.MigrEU l2.MigrEU MigrNoEU l.MigrNoEU l2.MigrNoEU if country==40, gmm(MigrEU MigrNoEU, lag(3 3)) iv(PropImm) small robust

At first only my DepVar was log-transformed, but then my coefficients were too big, so I thought transforming all variables into logs would help.

Best regards,
Ruth
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

18 Jun 2018, 10:44

If size of the coefficients is the only problem, change the units of measurement to something more congenial.
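As a rough illustration on made-up data (not your panel, and plain regress rather than xtabond2): suppose a predictor is stored as a proportion. Expressing it in percentage points divides its coefficient by 100 without changing the fit. The names y, MigrEU_pct, b_prop and b_pct below are hypothetical.

```stata
* Toy data standing in for the real panel (all values are made up)
clear
set obs 100
set seed 12345
generate double MigrEU = runiform() * 0.03
generate double y = -4 + 15 * MigrEU + rnormal(0, 0.1)

* Predictor as a proportion: large coefficient
regress y MigrEU
scalar b_prop = _b[MigrEU]

* Same predictor in percentage points: coefficient is 100 times smaller
generate double MigrEU_pct = 100 * MigrEU
regress y MigrEU_pct
scalar b_pct = _b[MigrEU_pct]

* Ratio of the two slopes: 100, up to rounding error
display scalar(b_prop) / scalar(b_pct)
```

The t statistics and P-values are the same in both regressions; only the scale of the coefficient and its standard error changes.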
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#6

18 Jun 2018, 13:29

If part of the reason for using log() is to have a functional form in which the absolute value of the slope of Y with respect to X decreases as the predictor X increases, using sqrt(X) rather than log(X) can be useful. I once had a situation like yours, and by using sqrt(X), I not only avoided losing lots of observations to missing values, but also improved the fit of the model.
ruth matko

Join Date: Mar 2018

Posts: 15
#7

19 Jun 2018, 01:52

Ok thank you.

When I look at my descriptive statistics I can see that I have small values and little variation in the x variables. Do you think this explains the big coefficients? If so, do you know how I can solve the problem? I'm not sure whether taking log(x) or sqrt(x) would reduce the variation even more.

I don't know if this helps, but my DepVar is the ratio of the employment rate to the unemployment rate of native labour, and MigrEU and MigrNoEU represent the migrant worker rate.

Best regards,
Ruth
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#8

19 Jun 2018, 02:52

It seems the question is not specifically about whether or not to log-transform. Apparently, it is about modeling.

I recommend following the FAQ advice, particularly the topic about sharing data/command/output.

Best regards,

Marcos
Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

19 Jun 2018, 03:19

Show us the results of

Code:

su DepVar MigrEU MigrNoEU, detail
graph matrix MigrEU MigrNoEU DepVar

The question of wanting a logarithm-like transformation when there are zeros or negative values arises frequently and the answers range widely, including

* why do you want to do that (if only because anything other than logarithm is hard to interpret simply)

* square root (Mike Lacy's suggestion, and in many ways a natural suggestion statistically when a variable is counted)

* cube root (stronger than square root)

* neglog (sign(x) * log(1 + |x|))

* inverse hyperbolic sine (asinh()).

The last three have the signal advantage of being defined for negative, zero and positive values alike. That isn't true of square roots (as complex numbers aren't helpful in this context).
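As a rough sketch of the last three on toy data (x is a hypothetical variable and the new variable names are my own choices, nothing more):

```stata
* Toy data with negative, zero and positive values: -2, -1, 0, 1, 2
clear
set obs 5
generate double x = _n - 3

* Cube root: take the root of |x| and restore the sign,
* because x^(1/3) alone is missing for negative x in Stata
generate double cuberoot_x = sign(x) * abs(x)^(1/3)

* neglog: sign(x) * log(1 + |x|)
generate double neglog_x = sign(x) * ln(1 + abs(x))

* Inverse hyperbolic sine
generate double asinh_x = asinh(x)

list x cuberoot_x neglog_x asinh_x
```

All three return 0 at x = 0 and preserve the sign of x, so no observations are lost to missing values.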

ruth matko

Join Date: Mar 2018
Posts: 15

#10

19 Jun 2018, 04:07

Thank you.

I want to use the log because I thought it would help to make my coefficients smaller.

Here you can see the summary statistics and the graph:

Code:

                           DepVar
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -5.455321      -6.100319
 5%    -5.375278      -6.100319
10%    -5.293305      -6.100319       Obs                3218
25%    -4.804021      -6.100319       Sum of Wgt.        3218

50%    -4.199705                      Mean          -4.123614
                        Largest       Std. Dev.      .8922244
75%    -3.491444      -1.489479
90%    -2.913902      -1.483287       Variance       .7960644
95%    -2.508437      -1.270463       Skewness       .4502615
99%    -1.930486      -1.225612       Kurtosis       2.519077

                           MigrEU
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                3548
25%            0              0       Sum of Wgt.        3548

50%            0                      Mean           .0036834
                        Largest       Std. Dev.      .0061859
75%     .0073529       .0353982
90%      .012605        .036036       Variance       .0000383
95%     .0173913        .036036       Skewness       1.975739
99%      .025641        .036036       Kurtosis       6.876026

                          MigrNoEU
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                3548
25%            0              0       Sum of Wgt.        3548

50%            0                      Mean           .0002441
                        Largest       Std. Dev.      .0013774
75%            0        .009901
90%            0        .009901       Variance       1.90e-06
95%            0        .009901       Skewness       5.694731
99%     .0083333       .0151515       Kurtosis       34.86684

[Image: Graph.png]

mehmed han

Join Date: May 2018

Posts: 9
#11

19 Jun 2018, 04:13

Hi;
I also have negative and zero values in my series. To transform them I used the method in Busse, M. and Hefeker, C. (2007). Political risk and foreign direct investment. European Journal of Political Economy, 23, 397-415. You can apply this formula in Excel or Stata. That is my suggestion.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#12

19 Jun 2018, 06:03

Mehmed: "this formula" is not explained, so you're expecting people to know the paper already or to look it up.

I'll short-circuit that by mentioning that on p.404 the transformation is explained (not very well) as (translating to Stata) ln(x + sqrt(x^2 + 1)) which is indeed just the inverse hyperbolic sine.

The method long predates that paper.
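A quick check of that identity on toy data, assuming asinh() is available in your Stata (x and the two ihs_ names are hypothetical):

```stata
* Toy values from -2 to 2 in steps of 0.5
clear
set obs 9
generate double x = (_n - 5) / 2

* The formula as given in the paper, translated to Stata
generate double ihs_formula = ln(x + sqrt(x^2 + 1))

* Stata's built-in inverse hyperbolic sine
generate double ihs_builtin = asinh(x)

* The two agree to within rounding error
assert reldif(ihs_formula, ihs_builtin) < 1e-10
```

So either expression can be used inside generate; the built-in is just shorter to type.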
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#13

19 Jun 2018, 07:24

It seems the last 2 variables are basically a sequence of zeros or near-zero values. Maybe the approach suggested in #5 still applies to this situation, at least to some of the variables. Additionally, if 95% of the values are equal to zero and the largest value is 0.015, I fear it is practically a "zero-value" variable. Finally, you still haven't told us how big the coefficients are. I have no experience with dynamic models, but I gather that applying lags and instruments to variables with so few "changes" may turn out to be difficult to handle, even mathematically speaking.

Best regards,

Marcos

ruth matko

Join Date: Mar 2018
Posts: 15

#14

19 Jun 2018, 07:36

Just to show you what my output looks like:

Code:

Dynamic panel-data estimation, one-step system GMM
------------------------------------------------------------------------------
Group variable: groupvaria~e                    Number of obs      =       794
Time variable : quarter                         Number of groups   =        54
Number of instruments = 57                      Obs per group: min =         1
F(8, 53)      =      9.53                                      avg =     14.70
Prob > F      =     0.000                                      max =        26
------------------------------------------------------------------------------
             |               Robust
      DepVar |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      DepVar |
         L1. |   .4731976   .1382863     3.42   0.001     .1958303    .7505648
         L2. |   .2507907   .1840918     1.36   0.179    -.1184506     .620032
             |
      MigrEU |
         --. |   .3225744   5.543193     0.06   0.954    -10.79566    11.44081
         L1. |   14.66429   9.731885     1.51   0.138    -4.855399    34.18398
         L2. |   1.333202   6.702954     0.20   0.843    -12.11122    14.77763
             |
    MigrNoEU |
         --. |  -125.6486   60.03794    -2.09   0.041    -246.0694     -5.2277
         L1. |  -74.77027   254.0151    -0.29   0.770    -584.2601    434.7196
         L2. |   47.28871   44.73174     1.06   0.295     -42.4318    137.0092
             |
       _cons |  -1.198504   .6964889    -1.72   0.091    -2.595484    .1984756
------------------------------------------------------------------------------
Instruments for first differences equation
  Standard
    D.PropImm
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L3.(MigrEU MigrNoEU)
Instruments for levels equation
  Standard
    PropImm
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    DL2.(MigrEU MigrNoEU)
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -2.77  Pr > z =  0.006
Arellano-Bond test for AR(2) in first differences: z =  -0.15  Pr > z =  0.883
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(48)   =  63.17  Prob > chi2 =  0.070
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(48)   =  37.75  Prob > chi2 =  0.856
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(20)   =  15.91  Prob > chi2 =  0.722
    Difference (null H = exogenous): chi2(28)   =  21.84  Prob > chi2 =  0.789
  iv(PropImm)
    Hansen test excluding group:     chi2(47)   =  37.56  Prob > chi2 =  0.836
    Difference (null H = exogenous): chi2(1)    =   0.19  Prob > chi2 =  0.666


Nick Cox

Join Date: Mar 2014

Posts: 35698
#15

19 Jun 2018, 10:51

As a matter of curiosity I produced quantile normal plots from your 1 5 10 25 50 75 90 95 99% points.

Those and the summarize results underline a very large fraction of zeros for your predictors, as Marcos Almeida also flags. Hence if you transform your predictors it is essential to use a transformation that allows zeros, which rules out bare logarithms. That said, it's not clear that a transformation would necessarily help much with the model.

What bizarre units are associated with the response variable DepVar?