  • log transformation generating missing data

    Hello,

    I'm using Stata version 12.

    I want to take logs of my variables, but then Stata generates many missing values and I cannot run the regression. The values that become missing are 0 before I take the logarithm.

    I have heard that log(0) cannot be taken, but is it possible to change my missing values to 0, or to ignore the missing values somehow, so that I can run the regression?

    Thank you.

    Best regards,
    Ruth
    Last edited by ruth matko; 18 Jun 2018, 09:00.

  • #2
    Hi Ruth,
    I think you already answered your own question. You cannot take the logarithm of zero, as it is undefined. You can certainly change those missing values to zero using code like:
    Code:
    replace log_var=0 if log_var==.
    But then it would be difficult to interpret the results, because this is no longer a proper log transformation.
    Regarding ignoring the missing values, that is what Stata already does. It drops observations with missing values when estimating whichever model you are trying to fit.
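    A minimal sketch of both points on made-up data (the variables x, y and log_x are hypothetical, not taken from the question): ln() returns missing for zero, and regress then leaves that observation out of the estimation sample.

    ```stata
    * Hypothetical toy data: x takes the values 0, 1, 2, 3, 4
    clear
    set obs 5
    generate x = _n - 1
    generate y = 3 + 2 * x
    * ln(0) is undefined, so Stata stores missing (.) for the first observation
    generate log_x = ln(x)
    count if missing(log_x)
    * Observations with missing log_x are excluded automatically
    regress y log_x
    display e(N)
    ```

    Comparing e(N) with the total number of observations shows how many rows were lost to the transformation.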
    The better question here is: what type of model are you trying to estimate? Why do you need a log transformation?
    HTH
    Fernando

    • #3
      You wish to log-transform your variables, but you didn't mention the reason. Apart from the fact that zero and negative values pose an obstacle, there are several situations (perhaps the majority) where log-transforming isn't necessary, even if the variable is skewed.
      Best regards,

      Marcos

      • #4
        Ok, thank you.

        I'm trying to estimate a GMM model for the following equation:

        Code:
            xtabond2 DepVar l.DepVar l2.DepVar MigrEU l.MigrEU l2.MigrEU MigrNoEU l.MigrNoEU l2.MigrNoEU if country==40, gmm(MigrEU MigrNoEU, lag(3 3)) iv(PropImm) small robust
        At first only my DepVar was log-transformed, but then my coefficients were too big, so I thought transforming all variables into logs would be helpful.

        Best regards,
        Ruth

        • #5
          If size of the coefficients is the only problem, change the units of measurement to something more congenial.
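
          As a rough sketch of what a change of units does, assuming purely for illustration that the predictor is a proportion (the data below are simulated, and y and MigrEU_pct are hypothetical names): multiplying a predictor by 100 divides its coefficient by 100 and leaves the fit unchanged.

          ```stata
          * Simulated stand-in data, not the poster's dataset
          clear
          set obs 100
          set seed 12345
          generate double MigrEU = runiform() * 0.03
          generate double y = 1 + 15 * MigrEU + rnormal(0, 0.1)
          * Coefficient with the predictor measured as a proportion
          regress y MigrEU
          scalar b_prop = _b[MigrEU]
          * Same predictor measured in percentage points
          generate double MigrEU_pct = 100 * MigrEU
          regress y MigrEU_pct
          scalar b_pct = _b[MigrEU_pct]
          * The ratio of the two coefficients is the rescaling factor
          display b_prop / b_pct
          ```

          The t statistics and P-values are the same in both regressions; only the scale of the coefficient and its standard error changes.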

          • #6
            If part of the reason for using the log() is to have a functional form for which the absolute value of the slope with respect to Y decreases as the predictor X increases, using sqrt(X) rather than log(X) can be useful. I once had a situation like yours, and by using sqrt(X), I not only avoided losing lots of observations to missing values, but also improved the fit of the model.

            • #7
              Ok thank you.

              When I look at my descriptive statistics I can see that I have small values and little variation in the x variables. Do you think this explains the big coefficients? And if so, do you know how I can solve the problem? I'm not sure whether taking log(x) or sqrt(x) would reduce the variation even more.

              I don't know if this helps, but my DepVar is the employment/unemployment rate ratio of native labour, and MigrEU and MigrNoEU represent the migrant worker rate.

              Best regards,
              Ruth


              • #8
                It seems the question is not specifically about whether or not to log-transform. Apparently, it is about modeling.

                I recommend following the FAQ advice, particularly the section on sharing data, commands and output.
                Best regards,

                Marcos

                • #9
                  Show us the results of

                  Code:
                  su DepVar MigrEU MigrNoEU, detail

                  graph matrix MigrEU MigrNoEU DepVar
                  The question of wanting a logarithm-like transformation when there are zeros or negative values arises frequently and the answers range widely, including

                  * why do you want to do that (if only because anything other than logarithm is hard to interpret simply)

                  * square root (Mike Lacy's suggestion, and in many ways a natural suggestion statistically when a variable is counted)

                  * cube root (stronger than square root)

                  * neglog (sign(x) * log(1 + |x|))

                  * inverse hyperbolic sine (asinh()).

                  The last three have the signal advantage of being defined for negative, zero and positive values alike. That isn't true of square roots (as complex numbers aren't helpful in this context).
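
                  A minimal sketch of how the listed transformations could be generated, on made-up values that include negatives and zero (the variable names are hypothetical, and computing the cube root as sign(x) * abs(x)^(1/3) is an assumption made here so that negative values are handled):

                  ```stata
                  * Hypothetical toy data: x takes the values -2, -1, 0, 1, 2
                  clear
                  set obs 5
                  generate x = _n - 3
                  * Square root: missing for negative x
                  generate sqrt_x = sqrt(x)
                  * Cube root, written so that it is defined for negative x
                  generate curt_x = sign(x) * abs(x)^(1/3)
                  * neglog: sign(x) * log(1 + |x|)
                  generate neglog_x = sign(x) * ln(1 + abs(x))
                  * Inverse hyperbolic sine
                  generate asinh_x = asinh(x)
                  list
                  ```

                  Only sqrt_x has missing values here; the other three are defined for every observation and map zero to zero.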

                  • #10
                    Thank you.

                    I want to use the log because I thought it would help to make my coefficients smaller.

                    Here you can see the summary statistics and the graph:

                    Code:
                                               DepVar
                    -------------------------------------------------------------
                         Percentiles       Smallest
                     1%    -5.455321      -6.100319
                     5%    -5.375278      -6.100319
                    10%    -5.293305      -6.100319       Obs                3218
                    25%    -4.804021      -6.100319       Sum of Wgt.        3218

                    50%    -4.199705                      Mean          -4.123614
                                            Largest       Std. Dev.      .8922244
                    75%    -3.491444      -1.489479
                    90%    -2.913902      -1.483287       Variance       .7960644
                    95%    -2.508437      -1.270463       Skewness       .4502615
                    99%    -1.930486      -1.225612       Kurtosis       2.519077

                                               MigrEU
                    -------------------------------------------------------------
                         Percentiles       Smallest
                     1%            0              0
                     5%            0              0
                    10%            0              0       Obs                3548
                    25%            0              0       Sum of Wgt.        3548

                    50%            0                      Mean           .0036834
                                            Largest       Std. Dev.      .0061859
                    75%     .0073529       .0353982
                    90%      .012605        .036036       Variance       .0000383
                    95%     .0173913        .036036       Skewness       1.975739
                    99%      .025641        .036036       Kurtosis       6.876026

                                              MigrNoEU
                    -------------------------------------------------------------
                         Percentiles       Smallest
                     1%            0              0
                     5%            0              0
                    10%            0              0       Obs                3548
                    25%            0              0       Sum of Wgt.        3548

                    50%            0                      Mean           .0002441
                                            Largest       Std. Dev.      .0013774
                    75%            0        .009901
                    90%            0        .009901       Variance       1.90e-06
                    95%            0        .009901       Skewness       5.694731
                    99%     .0083333       .0151515       Kurtosis       34.86684

                    [Attached image: Graph.png]

                    • #11
                      Hi,
                      I also have negative and zero values in my series. To transform them I used the method of Busse, M. and Hefeker, C. (2007). Political risk and foreign direct investment. European Journal of Political Economy, 23, 397-415. You can apply this formula in Excel or Stata. That is my suggestion.

                      • #12
                        Mehmed: "this formula" is not explained, so you're expecting people to know the paper already or to look it up.

                        I'll short-circuit that by mentioning that on p.404 the transformation is explained (not very well) as (translating to Stata) ln(x + sqrt(x^2 + 1)) which is indeed just the inverse hyperbolic sine.
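
                        A quick numerical check of that equivalence, sketched on made-up values (x, a1 and a2 are hypothetical names):

                        ```stata
                        * Hypothetical toy data: x takes the values -2, -1, 0, 1, 2
                        clear
                        set obs 5
                        generate double x = _n - 3
                        * Built-in inverse hyperbolic sine
                        generate double a1 = asinh(x)
                        * Explicit formula as given in the paper
                        generate double a2 = ln(x + sqrt(x^2 + 1))
                        * The two versions agree up to rounding error
                        assert reldif(a1, a2) < 1e-10
                        list x a1 a2
                        ```

                        The explicit formula can serve as a fallback wherever asinh() is not available.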

                        The method long predates that paper.

                        • #13
                          It seems the last 2 variables are basically a sequence of zeroes or almost-zero values. Maybe the approach suggested in #5 still applies to this situation, at least for some of the variables. Additionally, if 95% of the values are equal to zero and the largest value is 0.015, I fear it is practically a "zero-value" variable. Finally, you still haven't told us how big the coefficients are. I have no experience with dynamic models, but I gather that using lags and instruments with variables that show so few changes may turn into something difficult to handle, even mathematically speaking.
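
                          One way to quantify how concentrated at zero a predictor is, sketched on simulated stand-in data (the 97% share of zeros and the value 0.01 are made up for illustration, not taken from the posted summary):

                          ```stata
                          * Simulated stand-in for a mostly-zero predictor
                          clear
                          set obs 1000
                          set seed 2018
                          generate MigrNoEU = cond(runiform() < 0.97, 0, 0.01)
                          * Share of observations that are exactly zero
                          count if MigrNoEU == 0
                          display r(N) / _N
                          * Number and frequency of distinct values
                          tabulate MigrNoEU
                          ```

                          On the real data the same count and tabulate commands would show how little variation is left for the lags and instruments to work with.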
                          Best regards,

                          Marcos

                          • #14
                            Just to show you what my output looks like:

                            Code:
                            Dynamic panel-data estimation, one-step system GMM
                            ------------------------------------------------------------------------------
                            Group variable: groupvaria~e                    Number of obs      =       794
                            Time variable : quarter                         Number of groups   =        54
                            Number of instruments = 57                      Obs per group: min =         1
                            F(8, 53)      =      9.53                                      avg =     14.70
                            Prob > F      =     0.000                                      max =        26
                            ------------------------------------------------------------------------------
                                         |               Robust
                                  DepVar |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                                  DepVar |
                                     L1. |   .4731976   .1382863     3.42   0.001     .1958303    .7505648
                                     L2. |   .2507907   .1840918     1.36   0.179    -.1184506     .620032
                                         |
                                  MigrEU |
                                     --. |   .3225744   5.543193     0.06   0.954    -10.79566    11.44081
                                     L1. |   14.66429   9.731885     1.51   0.138    -4.855399    34.18398
                                     L2. |   1.333202   6.702954     0.20   0.843    -12.11122    14.77763
                                         |
                                MigrNoEU |
                                     --. |  -125.6486   60.03794    -2.09   0.041    -246.0694     -5.2277
                                     L1. |  -74.77027   254.0151    -0.29   0.770    -584.2601    434.7196
                                     L2. |   47.28871   44.73174     1.06   0.295     -42.4318    137.0092
                                         |
                                   _cons |  -1.198504   .6964889    -1.72   0.091    -2.595484    .1984756
                            ------------------------------------------------------------------------------
                            Instruments for first differences equation
                              Standard
                                D.PropImm
                              GMM-type (missing=0, separate instruments for each period unless collapsed)
                                L3.(MigrEU MigrNoEU)
                            Instruments for levels equation
                              Standard
                                PropImm
                                _cons
                              GMM-type (missing=0, separate instruments for each period unless collapsed)
                                DL2.(MigrEU MigrNoEU)
                            ------------------------------------------------------------------------------
                            Arellano-Bond test for AR(1) in first differences: z =  -2.77  Pr > z =  0.006
                            Arellano-Bond test for AR(2) in first differences: z =  -0.15  Pr > z =  0.883
                            ------------------------------------------------------------------------------
                            Sargan test of overid. restrictions: chi2(48)   =  63.17  Prob > chi2 =  0.070
                              (Not robust, but not weakened by many instruments.)
                            Hansen test of overid. restrictions: chi2(48)   =  37.75  Prob > chi2 =  0.856
                              (Robust, but weakened by many instruments.)
                            
                            Difference-in-Hansen tests of exogeneity of instrument subsets:
                              GMM instruments for levels
                                Hansen test excluding group:     chi2(20)   =  15.91  Prob > chi2 =  0.722
                                Difference (null H = exogenous): chi2(28)   =  21.84  Prob > chi2 =  0.789
                              iv(PropImm)
                                Hansen test excluding group:     chi2(47)   =  37.56  Prob > chi2 =  0.836
                                Difference (null H = exogenous): chi2(1)    =   0.19  Prob > chi2 =  0.666

                            • #15
                              As a matter of curiosity I produced quantile normal plots from your 1 5 10 25 50 75 90 95 99% points.


                              [Attached image: transformation2.png]

                              Those and the summarize results underline a very large fraction of zeros for your predictors, as Marcos Almeida also flags. Hence if you transform your predictors it is essential to use a transformation that allows zeros, which can't mean bare logarithms. That said, it's not clear that a transformation would necessarily help much with the model.

                              What bizarre units are associated with the response variable DepVar?
