  • transforming data

    Hi, I am trying to transform my data. I tried the log transformation command:
    gen inforaging=ln(ForagingPercentage)

    It improved things slightly, but it still has a way to go, and I am not sure what to do next.

  • #2
    Not sure what your end goal is, but I recommend adding 1 inside the log, because otherwise you will get missing values when ForagingPercentage is 0.
    Code:
    gen inforaging=ln(1+ForagingPercentage)


    • #3
      well my end goal is to make it normally distributed


      • #4
        Well, (to paraphrase Inigo Montoya) I am not sure you want what you think you want, but check this out and see if it seems the right direction. Really, best off finding a model that fits your data without radical transformations. For example, in this totally clean example below, the end product only correlates at about .9 with the original -- and this is under ideal circumstances. Nowadays, with Poisson, Negative Binomial, zero-inflated whatever, you can find a model that works, without destroying the original distribution.

        Code:
        clear
        
        set obs 1000
        
        *=======start normal
        gen x=rnormal()
        hist x
        
        
        *=======make it non-normal
        replace x=exp(x)
        hist x
        
        
        *=======bring it back to uniform -- you can make almost *anything* uniform.
        xtile x2=x, nq(100)
        hist x2
        
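        *=======make it normal.  Need values strictly between 0 and 1: x2/100 equals 1 in the top
        *       percentile and the inverse normal of 1 is missing, so use the midpoint (x2-0.5)/100.
        replace x2=invnormal((x2-0.5)/100)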
        hist x2
        Last edited by ben earnhart; 19 Nov 2014, 20:13.


        • #5
          While I strongly endorse Ben's comment that you are probably better off fitting a model that has a logarithmic link function than log-transforming your data, if your goal is to normalize, and you are getting nearly satisfactory results with log(), and if zero or near-zero values are present in your data, you might look into the asinh() [inverse hyperbolic sine] function. It is near-logarithmic away from zero, but well behaved at zero and is often useful for normalizing percentages.

          But again, I think you should think long and hard about whether you really have good reason to try to normalize your data.
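
          A minimal sketch of how the candidate transformations behave at zero, assuming ForagingPercentage is a nonnegative percentage that includes zeros. The new variable names (ln_forage, ln1p_forage, asinh_forage) are illustrative only and do not come from the thread.

          ```stata
          * ln() is missing at zero; ln(1 + x) and asinh() are both defined there
          generate double ln_forage    = ln(ForagingPercentage)
          generate double ln1p_forage  = ln(1 + ForagingPercentage)
          generate double asinh_forage = asinh(ForagingPercentage)

          * how many observations does the plain log lose?
          count if missing(ln_forage) & !missing(ForagingPercentage)

          * compare skewness and kurtosis of the two zero-safe versions
          summarize ln1p_forage asinh_forage, detail
          ```

          Away from zero, asinh(x) is close to ln(2x), so it differs from the plain log mainly by a constant shift there.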


          • #6
            Originally posted by Jonathan David
            well my end goal is to make it normally distributed
            The end goal is usually to see how some variable influences another variable. So the distribution of a variable is usually an intermediate goal, if ever. However, making the marginal distribution normal is almost always a bad idea. If anything should be normally distributed, then it is the residuals, but if you have a reasonable sample size (> 30) that usually does not matter.

            If your variable is the explained/dependent/response/left-hand-side/y-variable, then I strongly recommend against the transformation log(1+y), as there is no easy way to backtransform your coefficients to the original metric. Instead I would recommend using glm with the options link(log) vce(robust) or, better yet if you model a proportion, link(logit) vce(robust). The latter will estimate a fractional logit model.
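
            A hedged sketch of what such a fractional logit could look like. It assumes ForagingPercentage is recorded on a 0-100 scale and borrows the covariate names Infantage, PP, Temp and Year from later posts in this thread; forage_prop is a hypothetical name, and family(binomial) is the family usually paired with the logit link for this model.

            ```stata
            * rescale the percentage to a proportion in [0, 1]
            generate double forage_prop = ForagingPercentage / 100

            * fractional logit: binomial family, logit link, robust standard errors
            glm forage_prop Infantage PP Temp Year, family(binomial) link(logit) vce(robust)

            * average marginal effects, reported on the proportion scale
            margins, dydx(*)
            ```

            Because margins works on the predicted mean, the effects come out in the original metric, which is the backtransformation that log(1+y) does not offer.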
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------


            • #7
              There are many reasons to transform and better advice here depends on knowing more about the distribution of the variable, and even more importantly the functional form of the model(s) contemplated.

              The name ForagingPercentage suggests a measured but bounded variable for which models with logit link are likely to be useful (if indeed it is the response variable; people seem to be assuming that, but I didn't see Jonathan stating that).

              See e.g. http://www.stata-journal.com/sjpdf.h...iclenum=st0147

              Conversely, although data with very skewed distributions often benefit from transformations or non-identity link functions, it is not because marginal normality is required by any of the usual models.

              The word percentage however covers at least two kinds of variables.

              One is "percentage of", necessarily bounded by 0 and 100%, as in percentage of males.

              The other is percentage change, as in (new - old) / old, where changes can be positive or negative. Setting aside problems if old is ever zero, such data occasionally include very large changes of either sign.

              asinh() could be useful for change data but for the first kind it has negligible effect on distribution shape. Here the similar sounding but quite different arcsine of square root lingers on in some literatures.

              It seems most likely that Jonathan's foraging data are the first kind.
              Last edited by Nick Cox; 20 Nov 2014, 02:11.


              • #8
                Hi, perhaps it would be easier for me to just show you.

                DV = %Foraging; IVs = infantage, PP, Temp, Troopsize, year. Then I add interactions: infantage X PP, infantage X Temp, Temp X PP, and infantage X PP X Temp. The random effect is MotherID.

                essentially as I said before, I need to transform %Foraging percentage, but I am not sure what to do.
                Attached Files


                • #9
                  Did you try my code? As I feared, with all the zeros in there, making it normal was not possible. But if you can live with some skewness, it's pretty normal. Or, see below for code that ignores the 0's. The wisdom of doing this transformation is obviously questionable based on all the responses above, but it is possible. Then you run a selection model to account for the zeros. The interpretation and marginal effects are all messed up, but getting at a crude "x has a positive effect on y, but we don't know how much of an effect" should be possible.

                  But if you can find a model that fits your data, you can have it all: marginal effects and meaningful betas.

                  Code:
                  *=======bring it back to uniform -- you can make almost *anything* uniform.
                  xtile ForagingPercentageU=ForagingPercentage if ForagingPercentage!=0, nq(100)
                  hist ForagingPercentageU
                  
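                  *=======make it normal.  Need values strictly between 0 and 1: U/100 equals 1 in the top
                  *       percentile and the inverse normal of 1 is missing, so use the midpoint (U-0.5)/100.
                  gen ForagingPercentageN=invnormal((ForagingPercentageU-0.5)/100)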
                  hist ForagingPercentageN


                  • #10
                    Originally posted by Jonathan David
                    DV = %Foraging [...] essentially as I said before, I need to transform %Foraging percentage
                    That is incorrect, you should not transform %foraging to make it normally distributed. I assume you want to use this variable in a linear regression, and a linear regression only "requires" the residuals to be normally distributed. I have put requires in quotes, because in practice it does not matter much if your dataset is large enough, and yours is.

                    Anyhow, as I said before, you should probably reconsider using linear regression, and use a fractional logit model instead.

                    • #11
                      My post #7 was evidently written at around the same time as Maarten's #6 but it's no surprise (to me, at any rate) that we said very similar things.

                      Jonathan: I don't see that you're engaging with any of the points made by contributors beyond confirming that foraging percent is what you are trying to explain.
                      Last edited by Nick Cox; 20 Nov 2014, 08:10.


                      • #12
                        Seems to cry out for Poisson, see the distribution. And when I ran:

                        Code:
                        encode MotherID, gen(momID)
                        xtset momID
                        xtreg ForagingPercentage Infantage PP Temp Year, re
                        xtpoisson ForagingPercentage Infantage PP Temp Year, re
                        the Z-scores were over twice as large for the Poisson model.

                        I also ran it against the version forced to normality; the Z-scores were even smaller than those from xtreg on the untransformed variable. The log-transformed variable performed worst of all, judging by significance/Z-scores.



                        [Attached image: monkeys.png]
                        Last edited by ben earnhart; 20 Nov 2014, 10:10.


                        • #13
                          Apologies, the time difference does make this a hard conversation to be a part of. I appreciate the advice; this is clearly something that can be interpreted in many ways. In the end I went with the glm option (binomial/logit), as it seemed the most viable given the skewness of my data.
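
                          For readers following along, one way a binomial/logit glm with the interactions described in #8 might be written. This is only a sketch under assumptions: the predictors are treated as continuous, the variable names follow #12 plus Troopsize from #8, and mother_num and forage_frac are hypothetical names. Clustering the standard errors on mother is a stand-in used here for illustration; it is not the same as the MotherID random effect mentioned in #8.

                          ```stata
                          * numeric mother identifier built from the string MotherID
                          encode MotherID, generate(mother_num)

                          * proportion outcome, assuming the percentage is on a 0-100 scale
                          generate double forage_frac = ForagingPercentage / 100

                          * ## expands to the main effects plus all two- and three-way interactions
                          glm forage_frac c.Infantage##c.PP##c.Temp Troopsize Year, family(binomial) link(logit) vce(cluster mother_num)
                          ```

                          The factor-variable notation lets margins handle the interaction terms correctly afterwards, which hand-built product variables would not.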
