transforming data

Jonathan David

Join Date: Jul 2014

Posts: 23
#1

19 Nov 2014, 19:14

Hi, I tried to transform my data using the log transformation command:
gen inforaging=ln(ForagingPercentage)

It improved the distribution slightly, but it still has a way to go, and I am not sure what to do next.
Jack Stiles

Join Date: May 2014

Posts: 24
#2

19 Nov 2014, 19:20

Not sure what your end goal is, but I recommend adding 1 inside the log, because otherwise you will get missing values whenever ForagingPercentage = 0.

Code:

gen inforaging=ln(1+ForagingPercentage)
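
A minimal sketch of the point about zeros, using three made-up values rather than the real data (the variable names lnforaging and ln1foraging are purely illustrative):

```stata
clear
* Three made-up values, including a zero (not the real foraging data)
input ForagingPercentage
0
5
50
end

gen lnforaging  = ln(ForagingPercentage)      // missing where the value is 0
gen ln1foraging = ln(1 + ForagingPercentage)  // defined at 0, where it equals 0

count if missing(lnforaging)                  // counts the observations lost to ln(0)
list
```

Whether adding 1 is a sensible fix depends on the scale of the variable and the model that follows; the sketch only shows where the missing values come from.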
Jonathan David

Join Date: Jul 2014

Posts: 23
#3

19 Nov 2014, 19:28

well my end goal is to make it normally distributed
ben earnhart

Join Date: May 2014

Posts: 1027
#4

19 Nov 2014, 19:44

Well, (to paraphrase Inigo Montoya) I am not sure you want what you think you want, but check this out and see if it seems the right direction. Really, best off finding a model that fits your data without radical transformations. For example, in this totally clean example below, the end product only correlates at about .9 with the original -- and this is under ideal circumstances. Nowadays, with Poisson, Negative Binomial, zero-inflated whatever, you can find a model that works, without destroying the original distribution.

Code:

clear
set obs 1000
*=======start normal
gen x=rnormal()
hist x
*=======make it non-normal
replace x=exp(x)
hist x
*=======bring it back to uniform -- you can make almost *anything* uniform.
xtile x2=x, nq(100)
hist x2
*=======make it normal. Need to make it range 0 to 1, thus divide by 100.
replace x2=invnorm(x2/100)
hist x2
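
One detail of the xtile approach: the top group gets the value 100, so x2/100 equals 1 there, and invnorm(1) is missing. A possible variant, offered only as a sketch and not as the code from the post, uses ranks shifted by one half so that every probability lies strictly between 0 and 1. The names r and z and the seed are arbitrary choices for illustration.

```stata
clear
set seed 12345
set obs 1000

gen x = exp(rnormal())            // skewed (lognormal) variable, as in the example above
egen r = rank(x)                  // ranks 1..N; ties would receive average ranks
gen z = invnormal((r - 0.5)/_N)   // rank-based normal scores, argument strictly inside (0,1)

count if missing(z)               // no missing values are expected here
corr x z                          // how far the normalized version departs from the original
hist z
```

The same caveat applies as before: this forces the marginal distribution toward normality, which the rest of the thread argues is rarely what a model needs.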

Last edited by ben earnhart; 19 Nov 2014, 20:13.
Clyde Schechter

Join Date: Apr 2014

Posts: 29825
#5

19 Nov 2014, 21:03

While I strongly endorse Ben's comment that you are probably better off fitting a model that has a logarithmic link function than log-transforming your data, if your goal is to normalize, and you are getting nearly satisfactory results with log(), and if zero or near-zero values are present in your data, you might look into the asinh() [inverse hyperbolic sine] function. It is near-logarithmic away from zero, but well behaved at zero and is often useful for normalizing percentages.
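
A small sketch of how asinh() compares with ln() on a few made-up values (the variable names are illustrative only). It relies on the identity asinh(y) = ln(y + sqrt(y^2 + 1)), which is why asinh(y) approaches ln(2y) for large y but stays defined at zero.

```stata
clear
* Made-up values chosen to include zero and a wide range
input y
0
0.5
1
10
100
end

gen ln_y    = ln(y)       // missing at y = 0
gen asinh_y = asinh(y)    // defined at 0, where it equals 0
gen ln_2y   = ln(2*y)     // asinh(y) gets close to this as y grows
list
```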

But again, I think you should think long and hard about whether you really have good reason to try to normalize your data.
Maarten Buis

Join Date: Mar 2014

Posts: 3415
#6

20 Nov 2014, 01:24

Originally posted by Jonathan David

well my end goal is to make it normally distributed

The end goal is usually to see how some variable influences another variable. So the distribution of a variable is usually an intermediate goal, if ever. However, making the marginal distribution normal is almost always a bad idea. If anything should be normally distributed, then it is the residuals, but if you have a reasonable sample size (> 30) that usually does not matter.

If your variable is the explained/dependent/response/left-hand-side/y-variable then I strongly recommend against the transformation log(1+y), as there is no easy way to backtransform your coefficients to the original metric. Instead I would recommend using glm with the options link(log) vce(robust) or better yet if you model a proportion: link(logit) vce(robust). The latter will estimate a fractional logit model.
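
A minimal sketch of the two glm calls on simulated data. The variables x, prop, and y and all coefficient values are invented for illustration, and family(binomial) is spelled out for the fractional logit; none of this is the poster's data or a definitive specification.

```stata
clear
set seed 12345
set obs 500

gen x = rnormal()

* Simulated proportion strictly inside (0,1)
gen double prop = invlogit(-1 + 0.5*x + rnormal())

* Fractional logit: binomial family, logit link, robust standard errors
glm prop x, family(binomial) link(logit) vce(robust)
margins, dydx(x)          // average marginal effect on the proportion scale

* Log link on a positive outcome (glm's default gaussian family)
gen double y = exp(1 + 0.3*x + rnormal(0, 0.5))
glm y x, link(log) vce(robust) eform   // eform reports exponentiated coefficients
```

Because the link is applied to the mean rather than to the data, the coefficients and margins refer to the original metric, which is the advantage over log(1+y) described above.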

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35256
#7

20 Nov 2014, 01:34

There are many reasons to transform and better advice here depends on knowing more about the distribution of the variable, and even more importantly the functional form of the model(s) contemplated.

The name ForagingPercentage suggests a measured but bounded variable for which models with logit link are likely to be useful (if indeed it is the response variable; people seem to be assuming that, but I didn't see Jonathan stating that).

See e.g. http://www.stata-journal.com/sjpdf.h...iclenum=st0147

Conversely, although data with very skewed distributions often benefit from transformations or non-identity link functions, it is not because marginal normality is required by any of the usual models.

The word percentage however covers at least two kinds of variables.

One is "percentage of", necessarily bounded by 0 and 100%, as in percentage of males.

The other is percentage change, as in (new - old) / old, where changes can be positive or negative. Setting aside problems if old is ever zero, such data occasionally include very large changes of either sign.

asinh() could be useful for change data but for the first kind it has negligible effect on distribution shape. Here the similar sounding but quite different arcsine of square root lingers on in some literatures.

It seems most likely that Jonathan's foraging data are the first kind.

Last edited by Nick Cox; 20 Nov 2014, 02:11.
Jonathan David

Join Date: Jul 2014

Posts: 23
#8

20 Nov 2014, 05:26

Hi, perhaps it would be easier for me to just show you:

DV = %Foraging; IVs = infantage, PP, Temp, Troopsize, year. I then add the interactions infantageXPP, infantageXTemp, TempXPP, and infantageXPPXTemp. The random effect is MotherID.

essentially as I said before, I need to transform %Foraging percentage, but I am not sure what to do.
Attached Files

CombinedBEHAVIOURDATA.xlsx (226.3 KB, 1 view)
ben earnhart

Join Date: May 2014

Posts: 1027
#9

20 Nov 2014, 05:52

Did you try my code? As I feared, with all the zeros in there, making it normal was not possible. But if you can live with some skewness, it's pretty normal. Or, see below for code that ignores the 0's. The wisdom of doing this transformation is obviously questionable based on all the responses above, but it is possible. Then you run a selection model to account for the zeros. The interpretation and marginal effects are all messed up, but getting at a crude "x has a positive effect on y, but we don't know how much of an effect" should be possible.

But if you can find a model that fits your data, you can have it all: marginal effects and meaningful betas.

Code:

*=======bring it back to uniform -- you can make almost *anything* uniform.
xtile ForagingPercentageU=ForagingPercentage if ForagingPercentage!=0, nq(100)
hist ForagingPercentageU
*=======make it normal. Need to make it range 0 to 1, thus divide by 100.
gen ForagingPercentageN=invnorm(ForagingPercentageU/100)
hist ForagingPercentageN
Maarten Buis

Join Date: Mar 2014

Posts: 3415
#10

20 Nov 2014, 06:52

Originally posted by Jonathan David

DV = %Foraging [...] essentially as I said before, I need to transform %Foraging percentage

That is incorrect: you should not transform %foraging to make it normally distributed. I assume you want to use this variable in a linear regression, and a linear regression only "requires" the residuals to be normally distributed. I have put requires in quotes, because in practice it does not matter much if your dataset is large enough, and yours is.

Anyhow, as I said before, you should probably reconsider using linear regression, and use a fractional logit model instead.
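
As a rough sketch of what that could look like for the model described in #8: the data below are simulated stand-ins that only borrow the variable names from the thread, foragingprop is a hypothetical name, and cluster-robust standard errors on the mother are swapped in here as a simple substitute for the random effect, which is an assumption and not an equivalent model.

```stata
clear
set seed 2014
set obs 600

* Simulated stand-ins for the variables named in the thread (not the real data)
gen momID     = ceil(_n/20)          // 30 hypothetical mothers, 20 rows each
gen Infantage = 12*runiform()
gen PP        = rnormal()
gen Temp      = rnormal(20, 5)
gen Troopsize = rpoisson(30)
gen Year      = 2010 + mod(_n, 3)
gen double ForagingPercentage = 100*invlogit(-1 + 0.05*Infantage + 0.2*PP + rnormal())

* Rescale the percentage to a proportion for the fractional logit
gen double foragingprop = ForagingPercentage/100

* c.A##c.B##c.C expands to all main effects plus the two-way and three-way interactions
glm foragingprop c.Infantage##c.PP##c.Temp Troopsize i.Year, ///
    family(binomial) link(logit) vce(cluster momID)
```

The factor-variable notation produces exactly the interaction set listed in #8; whether clustering is an adequate replacement for a random intercept depends on the question being asked.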

Nick Cox

Join Date: Mar 2014

Posts: 35256
#11

20 Nov 2014, 07:29

My post #7 was evidently written at around the same time as Maarten's #6 but it's no surprise (to me, at any rate) that we said very similar things.

Jonathan: I don't see that you're engaging with any of the points made by contributors beyond confirming that foraging percent is what you are trying to explain.

Last edited by Nick Cox; 20 Nov 2014, 08:10.
ben earnhart

Join Date: May 2014

Posts: 1027
#12

20 Nov 2014, 10:05

Seems to cry out for Poisson, see the distribution. And when I ran:

Code:

encode MotherID, gen(momID)
xtset momID
xtreg ForagingPercentage Infantage PP Temp Year, re
xtpoisson ForagingPercentage Infantage PP Temp Year, re

the Z-scores were over twice as large for the poisson model.

I also ran it against the version forced to normality; the Z-scores were even smaller than those from xtreg on the untransformed variable. The log-transformed variable performed worst of all of them, judging by significance/Z-scores.

Last edited by ben earnhart; 20 Nov 2014, 10:10.
Jonathan David

Join Date: Jul 2014

Posts: 23
#13

20 Nov 2014, 11:45

Apologies, the time difference does make this a hard conversation to be a part of. I appreciate the advice; it is clearly something that could have many interpretations. I went with the glm option (binomial/logit), as it seemed the most viable given the skewness of my data.