
  • FDI Inflows has negative values, how to transform with log

    Hi StataList,

    My dependent variable is FDI inflows as a % of GDP, and I'm trying to predict how different uncertainty measures affect FDI. Some of the FDI values are negative, and because the data is skewed to the right, I want to take logs. I'm unsure how to do so without creating missing values.
    I saw in another thread that log(1+FDI), or adding any other constant, is a bad idea because it makes comparisons with other papers difficult.


    Thanks
    Nour

  • #2
    There isn't a solution here without a downside. The downsides include (1) needing inversion, (2) being arbitrary, (3) needing interpretation, and (4) the question of what else you are assuming, tacitly or otherwise, about the data generation process.

    Cube roots or any other odd integer roots preserve the sign and pull in each tail.

    sign(y) * ln(1 + abs(y)) does the same but 1 is not a neutral player. For anything in currency units, does it mean 1 cent, 1 dollar, 1 million dollars, or what? In other words why 1 rather than any other constant?

    asinh() -- which some economists want to call IHS -- is yet another beast that has similar promise and a similar pitfall of needing to explain why one constant was used rather than any other. Note that asinh(y) is naturally asinh(1 * y) so there is still a question why 1 not any other k.

    In my view the worst solution of all, which I won't defend even half-heartedly, is to add some number big enough to ensure that logarithms can be taken of everything. The word arbitrary is too weak to describe such a tactic.

    On cube roots, watch out. Stata, like most other software I know about, doesn't have a cube root function. It calculates cube roots as powers, for which it first takes logarithms. That will fail with negative arguments. It's elementary that, say, the cube root of -8 is -2, but in Stata you need code like

    Code:
    sign(y) * (abs(y))^(1/3)
    as a work-around to ensure correct results.
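
    In the same spirit, here is a minimal sketch of the other candidates mentioned above, for some variable y; the constants shown are placeholders and are exactly the choices that need defending:

    Code:
    * hedged sketch; y is a placeholder variable name
    gen double curt_y   = sign(y) * abs(y)^(1/3)     // cube root, sign preserved
    gen double neglog_y = sign(y) * ln(1 + abs(y))   // signed log; the 1 is not neutral
    gen double asinh_y  = asinh(y/1000)              // the 1000 is purely illustrative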

    The best I can advise for any candidate transformation is to plot the transform against your data. You need a strong sense that the transform is if not sensible, then defensible. It's easier to know when the transform is crazy, as when it squeezes in the tails far too much.

    John Mullahy, Joao Santos Silva and I are the usual suspects in this territory. I am a little more positive than they are about the scope for useful transformation, but not much.

    Comment


    • #3
      as Nick explains, this is a question with no good answers, which is weird because it's quite common.

      that said, I'm not sure why you'd take a log of a percentage. You can, and the interpretation (directionally) is the same, but the coefficients become more complicated to interpret.

      as a percentage, I'd leave it as is.

      Comment


      • #4
        We're getting into economics, which is not at all my field, but if this were my problem I would start with FDI and use GNP as one predictor among several. Just scaling FDI by GNP could easily create quite as many problems as it solves.

        Comparison with other papers is already fraught because there are so many solutions here.

        Comment


        • #5
          Dear Nour Mohamed,

          In addition to Nick Cox's great comments, I would just point out that, as long as the expected value is positive, an exponential model will be an attractive option even if the dependent variable has a few negative values. The point, obviously, is that you need to be confident that a negative expected value is implausible.
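
          A minimal sketch of what such an exponential conditional mean could look like in Stata, with placeholder variable names fdi, x1, x2; note that Stata's -poisson- command does not accept a negative dependent variable, so the sketch uses routes that tolerate a few negative values:

          Code:
          * hedged sketch; fdi, x1, x2 are placeholder names
          * fit E[fdi | x] = exp(b0 + b1*x1 + b2*x2); a few negative values of fdi
          * are tolerable so long as the fitted mean itself stays positive
          glm fdi x1 x2, family(gaussian) link(log) vce(robust)
          
          * the same conditional mean by explicit nonlinear least squares
          nl (fdi = exp({b0} + {b1}*x1 + {b2}*x2)), vce(robust)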

          You do not say why you are doing this; if it is for school work, you should follow the advice of your supervisor. Anyway, John Mullahy is the authority in this field and hopefully he will be able to help.

          Best wishes,

          Joao

          Comment


          • #6
            Joao Santos Silva is being characteristically generous—and in this case quite possibly incorrect—in referring to me as an authority on the topic.

            That aside, I'd like to raise a question. I'm not familiar with the FDI literature so this may come across as naive and/or irrelevant.

            It seems like the model you are planning to estimate is of the form:
            Code:
            FDI / GDP = f(x) + u
            I wonder whether an alternative model could be interesting:
            Code:
            FDI = g(x, GDP) + v
            If so, then negative values of FDI would seem to create no particular concerns.

            Obviously getting a reasonable functional form for g(…) would be important but I could imagine a linear specification with, for instance, main effects in x and GDP and interactions of GDP with x.

            Or perhaps even better—along the lines Joao proposed—an exponential mean model
            Code:
            E[FDI | x, GDP] = exp(b*x)*exp(a*GDP)
            As Joao noted, you'd need to be able to maintain that the conditional mean of FDI could never be nonpositive for any values of (x, GDP). But were that reasonable then the exp(a*GDP) term has a perhaps-nice interpretation as a multiplicative scaling factor.

            GDP could more restrictively be specified (e.g. using Stata's -poisson- command) as either an exposure or an offset variable. As an exposure variable one has:
            Code:
            E[FDI | x, GDP] = GNP*exp(b*x)
            which is more or less Joao's suggestion.
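
            A hedged sketch of the exposure and offset versions in Stata, with placeholder names fdi, gdp, x1, x2; note that -poisson- requires a nonnegative dependent variable, so this applies directly only where that holds:

            Code:
            * hedged sketch; fdi, gdp, x1, x2 are placeholder names
            * GDP as an exposure: the coefficient on ln(gdp) is constrained to 1
            poisson fdi x1 x2, exposure(gdp) vce(robust)
            
            * equivalently, supply ln(gdp) yourself as an offset
            gen double ln_gdp = ln(gdp)
            poisson fdi x1 x2, offset(ln_gdp) vce(robust)
            
            * unconstrained version: let the GDP elasticity be estimated freely
            poisson fdi x1 x2 ln_gdp, vce(robust)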

            Of course if the relevant literature (or at least its editors and referees) insists on FDI/GDP as the dependent variable then these ideas are for naught.


            P.S. added as an edit: While Nick Cox and I occasionally don't see eye-to-eye on how to handle nonpositive dependent variables, I just saw his suggestion in #4 and concur with it. Perhaps our worldviews are converging?!
            Last edited by John Mullahy; 08 Jan 2025, 13:14.

            Comment


            • #7
              The few studies I looked at used the level of FDI with market size covariates.

              Comment


              • #8
                It would be interesting and relevant to see the results of

                Code:
                su FDI, detail 
                
                count if FDI < 0
                where I mean the raw (unscaled) version of FDI and naturally you should use whatever variable name is used in your data.

                Comment


                • #9
                  Originally posted by Nick Cox
                  It would be interesting and relevant to see the results of

                  Code:
                  su FDI, detail
                  
                  count if FDI < 0
                  where I mean the raw (unscaled) version of FDI and naturally you should use whatever variable name is used in your data.

                  I ran it for the FDI inflows/GDP variable and the other two forms of FDI (net inflows and net). Probably not the most relevant point, but what caught my eye is that they're not exactly equal, which feels odd, though the difference isn't large.
                  When I log FDI inflows/GDP I lose 10 observations. I want to preserve as many observations as I can because, as you can see from the results, I'm dealing with a somewhat small sample.

                  Code:
                  . su FDI, detail 
                  
                        Foreign direct investment, net inflows (% of GDP)
                                     [BX.KLT.DINV.WD.GD.ZS]
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -1.855686      -3.606928
                   5%     .0612933       -2.75744
                  10%     .3813196      -1.855686       Obs                 275
                  25%     .9557686      -1.332574       Sum of wgt.         275
                  
                  50%     2.302984                      Mean           4.531348
                                          Largest       Std. dev.      8.157018
                  75%     4.056448       41.06495
                  90%     8.253737       41.53184       Variance       66.53695
                  95%     25.35682        44.5507       Skewness       3.674657
                  99%     41.53184       58.51837       Kurtosis       17.53948
                  
                  . 
                  . 
                  . 
                  . count if FDI < 0
                    10
                  
                  . su FDI_NetIn, detail
                  
                          Foreign direct investment, net inflows (BoP,
                                current US$) [BX.KLT.DINV.CD.WD]
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -4.29e+09      -2.51e+10
                   5%     5.47e+08      -4.55e+09
                  10%     1.61e+09      -4.29e+09       Obs                 275
                  25%     4.10e+09      -2.98e+09       Sum of wgt.         275
                  
                  50%     9.50e+09                      Mean           2.98e+10
                                          Largest       Std. dev.      5.21e+10
                  75%     3.13e+10       2.53e+11
                  90%     6.81e+10       2.68e+11       Variance       2.71e+21
                  95%     1.56e+11       2.80e+11       Skewness       3.057231
                  99%     2.68e+11       2.91e+11       Kurtosis       12.69021
                  
                  . su FDI_net, detail
                  
                        Foreign direct investment, net (BoP, current US$)
                                        [BN.KLT.DINV.CD]
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -1.86e+11      -2.32e+11
                   5%    -6.81e+10      -2.18e+11
                  10%    -4.06e+10      -1.86e+11       Obs                 273
                  25%    -1.16e+10      -1.76e+11       Sum of wgt.         273
                  
                  50%    -2.56e+09                      Mean          -5.73e+09
                                          Largest       Std. dev.      4.60e+10
                  75%     2.45e+09       1.38e+11
                  90%     2.11e+10       1.45e+11       Variance       2.11e+21
                  95%     6.05e+10       1.55e+11       Skewness      -.3052543
                  99%     1.45e+11       2.18e+11       Kurtosis       11.32349
                  
                  .
                  Thank you for your first response; it was very useful. I'll think about what I'm willing to trade off. I want an intuitive interpretation above all.
                  Will it cause problems if I just leave the dependent variable as is?

                  Comment


                  • #10
                    Originally posted by George Ford
                    as Nick explains, this is a question with no good answers, which is weird because it's quite common.

                    that said, I'm not sure why you'd take a log of a percentage. You can, and the interpretation (directionally) is the same, but the coefficients become more complicated to interpret.

                    as a percentage, I'd leave it as is.
                    For your first comment, based on the literature it seems to almost be the norm.


                    For your second comment, I'm also using trade openness as one of my explanatory variables (specifically among my macroeconomic variables), as that seems to be the "canonical model" used in FDI studies.

                    Comment


                    • #11
                      Originally posted by John Mullahy
                      Joao Santos Silva is being characteristically generous—and in this case quite possibly incorrect—in referring to me as an authority on the topic.

                      That aside, I'd like to raise a question. I'm not familiar with the FDI literature so this may come across as naive and/or irrelevant.

                      It seems like the model you are planning to estimate is of the form:
                      Code:
                      FDI / GDP = f(x) + u
                      I wonder whether an alternative model could be interesting:
                      Code:
                      FDI = g(x, GDP) + v
                      If so, then negative values of FDI would seem to create no particular concerns.

                      Obviously getting a reasonable functional form for g(…) would be important but I could imagine a linear specification with, for instance, main effects in x and GDP and interactions of GDP with x.

                      Or perhaps even better—along the lines Joao proposed—an exponential mean model
                      Code:
                      E[FDI | x, GDP] = exp(b*x)*exp(a*GDP)
                      As Joao noted, you'd need to be able to maintain that the conditional mean of FDI could never be nonpositive for any values of (x, GDP). But were that reasonable then the exp(a*GDP) term has a perhaps-nice interpretation as a multiplicative scaling factor.

                      GDP could more restrictively be specified (e.g. using Stata's -poisson- command) as either an exposure or an offset variable. As an exposure variable one has:
                      Code:
                      E[FDI | x, GDP] = GNP*exp(b*x)
                      which is more or less Joao's suggestion.

                      Of course if the relevant literature (or at least its editors and referees) insists on FDI/GDP as the dependent variable then these ideas are for naught.


                      P.S. added as an edit: While Nick Cox and I occasionally don't see eye-to-eye on how to handle nonpositive dependent variables, I just saw his suggestion in #4 and concur with it. Perhaps our worldviews are converging?!

                      I hope I'm not misinterpreting your second line of code, but I think that's what I'm doing. Most studies I looked at do the same.

                      FDI = f(macro, institutional quality)
                      or
                      FDI = f(macro, institutional quality, X)
                      with the second one being what I'm doing, since X can be a matrix of other controls, which in my case is uncertainty, my main point of concern.

                      Sorry for the crude formatting

                      Comment


                      • #12
                        Originally posted by Joao Santos Silva
                        Dear Nour Mohamed,

                        In addition to Nick Cox's great comments, I would just point out that, as long as the expected value is positive, an exponential model will be an attractive option even if the dependent variable has a few negative values. The point, obviously, is that you need to be confident that a negative expected value is implausible.

                        You do not say why you are doing this; if it is for school work, you should follow the advice of your supervisor. Anyway, John Mullahy is the authority in this field and hopefully he will be able to help.

                        Best wishes,

                        Joao
                        Yes, it's for school work. My supervisor hasn't really given me guidance on how to transform my data; I suppose it seems a bit basic? He suggested some models to use (FE, GMM, IV).

                        Thank you all for your answers. It's been very useful for me, and I hope it helps future Stata users who are stuck on this part.

                        Comment


                        • #13
                          I will illustrate some technique on one of your variables, FDI_net.

                          You have 9 calculated quantiles (percentiles), for 1, 5, 10, 25, 50, 75, 90, 95 and 99%, plus information on the range.

                          A plot of cube root indicates that it's a modest transformation.

                          A plot of sign(y) * ln(1 + abs(y)) (sometimes called neglog) shows that 1 is far from harmless: you're essentially mapping to a binary summary.

                          Similarly asinh(y) is almost binary in result. asinh(y / 1e11) is almost linear. The choice of constant in asinh(k * y) or asinh(y / k) is absolutely crucial.

                          I will give just the first graph: the rest are easy technique.

                          Code:
                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input float p double y
                                 .01 -1.860e+11
                                 .05 -6.810e+10
                                  .1 -4.060e+10
                                 .25 -1.160e+10
                                  .5 -2.560e+09
                                 .75  2.450e+09
                                  .9  2.110e+10
                                 .95  6.050e+10
                                 .99  1.450e+11
                          .001831502 -2.320e+11
                          .005494506 -2.180e+11
                           .00915751 -1.860e+11
                          .012820513 -1.760e+11
                            .9871795  1.380e+11
                            .9908425  1.450e+11
                            .9945055  1.550e+11
                            .9981685  2.180e+11
                          end
                          
                          gen curt_y = sign(y) * abs(y)^(1/3)
                          
                          scatter curt_y y in 1/9 || function sign(x) * abs(x)^(1/3), ra(y)
                          
                          gen neglog_y = sign(y) * log(1 + abs(y))
                          
                          scatter neglog_y y in 1/9 || function sign(x) * log1p(abs(x)), ra(y)
                          
                          twoway function asinh(x), ra(y)
                          
                          twoway function asinh(x/1e9), ra(y)
                          
                          twoway function asinh(x/1e11), ra(y)
                          
                          twoway function asinh(x/1e10), ra(y)
                          [Attached figure: curt.png, the cube root transformation plotted against FDI_net]

                          Comment


                          • #14
                            Perhaps I should try some extra big-picture comments. Here read response for outcome if you wish and dependent variable for outcome if you must.

                            Leaving out observations because they are awkward for a naive analysis (e.g. zero or negative values when logarithms are contemplated) is rarely defensible. Modifying the analysis to fit the data beats modifying the data to fit the analysis, all else aside.

                            Outcome distributions that (a) are skewed or (b) are long-tailed or (c) contain outliers can be awkward. Note that all combinations of (a) (b) (c) can occur.

                            Moment-based skewness is one single measure of skewness -- which can still be misleading. Always plot the data too.

                            Moment-based kurtosis is one single measure of tail weight -- which can still be (very!) misleading. Always plot the data too.

                            Working on a transformed scale can be a good idea. That need not mean prior transformation; it might mean working with (in generalized linear model jargon, sometimes used elsewhere) a particular link function. Poisson regression is one such method.
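
                            As a minimal sketch of that distinction, with placeholder names y, x1, x2: the first route transforms the outcome before fitting, while the second leaves the outcome alone and applies the logarithm to its mean through the link.

                            Code:
                            * hedged sketch; y, x1, x2 are placeholder names
                            * (1) prior transformation: models E[ln y | x] and needs y > 0
                            gen double ln_y = ln(y)
                            regress ln_y x1 x2
                            
                            * (2) transformed scale via a link function: models ln E[y | x]
                            *     directly; Poisson regression needs a nonnegative outcome
                            poisson y x1 x2, vce(robust)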

                            Although it is always a good idea to look at the distribution of the outcome, mistakes in any direction can be made.

                            * A common mistake is not to realise that a transformation would help. It is frequent in some quarters that people have never been told about logarithms (perhaps it's assumed that they remember them from a previous incarnation or from previous teaching they never had or have forgotten) -- or are unduly diffident about using them because they (feel they) don't understand them -- or have the mistaken idea that taking logarithms is somehow cheating -- and so on, and so on. Logarithms are just the leading example here, and similar things could be said about some other transformations.

                            * A different common mistake is not to realise that a particular transformation is a bad idea. For example log(x + fudge constant), where fudge constant is big enough to ensure positive arguments, is almost never a good way to fix negative values, as a supposed problem. Plotting the effect of a transformation will usually be a way to spot a really bad idea and avoid it.

                            * Although apparent outliers should always be noted -- and removed from the data if and only if they are obvious mistakes or impossible values -- often outliers are, as it were, explained by outliers. A big outcome is often explained by big predictors.

                            What was a good idea will only become evident in terms of a model, including plots of fitted outcome versus observed outcome and of residuals too. The usual tables of coefficient estimates, errors and confidence intervals never carry the whole truth about a model fit.
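
                            A minimal sketch of such checks after a straightforward fit, with placeholder names y, x1, x2:

                            Code:
                            * hedged sketch; y, x1, x2 are placeholder names
                            regress y x1 x2
                            predict double yhat, xb           // fitted outcome
                            predict double res, residuals     // residuals
                            scatter y yhat                    // observed versus fitted
                            rvfplot, yline(0)                 // residuals versus fitted after -regress-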

                            Textbooks that are great on theory and deep modelling ideas aren't always helpful on how to handle messy data.

                            Comment
