  • kernel density normal doesn't appear normal

    Dear All,

    I am checking for normality in my survey data. The linktest, vif and swilk tests all pass. However, my kdensity plot looks a little off. There are no missing data. Given the results of linktest, vif and swilk, can I say my model is all right and proceed with the analysis?





  • #2
    Please don't post Word attachments. Why not and what to do instead are all explained at https://www.statalist.org/forums/help#stata

    I would recommend qnorm as a much (much) better way than any you mention to check for normality. I like kernel density estimation as much as most, but one needs to see the raw data too (in your case, the residuals).

    The evidence of the graph you show is that your residuals are about as good as anyone ever gets, but I really can't say whether your model is as good as could be. I don't know what analysis you intend if you've fitted a model and are looking at the residuals.



    • #3
      Dear Nick,

      Thanks for your reply and apologies for posting in word.

      My data are survey data and I am trying to assess gender differences in productivity. One of the variables I use to measure productivity is yield, and the kdensity normal plot that I provided earlier is based on a regression on logyield. I have no missing data.

      I have applied qnorm and even that does not appear normal. I am using an OLS model.



      • #4
        You're never going to get a perfect normal distribution. One way forward is a line-up check. Generate 8 samples with the same sample size as you have and the same mean and SD but from normal distributions. Now plot 9 plots, those 8 and your data, in an array. If your data look really, really different you may have a problem. (For 8 and 9, read 15 and 16, or 24 and 25, or whatever else is convenient.)

        Or if you get closer to a normal distribution of residuals, the price may be a less parsimonious model that is harder to fit and to interpret.



        • #5
          Perhaps I should also add that yield ranges from 160 to 27122, with a mean of 3400 and a standard deviation of not more than 2500. I think this could be the reason for the non-normal distribution?
          The total number of observations is 5000 households.
          Last edited by Patricia Ali; 25 Apr 2019, 06:17.



          • #6
            I wouldn't expect yield of anything to be normal, although lognormal may be nearer the mark.



            • #7
              Thanks Nick. I am not familiar with the lognormal, but from what I have just read I decided to run the following command:

              . gen noryield= exp(rnormal(160 27,122))

              where 160 is the minimum yield and 27122 is the maximum yield. However, the newly generated variable has 16 missing values.

              Am I on the right track here? If so, what would you recommend for the missing values?



              • #8
                Hi Patricia,
                I think something might be getting off track. The first thing to ask yourself is why you need normality.
                Second, what Nick suggests is that variables such as income, wages or consumption are often not expected to follow a normal distribution, but they could follow a lognormal distribution. This means that the natural log of the variable of interest (say log(income)) follows a normal distribution.
                HTH
                Fernando
                Last edited by FernandoRios; 25 Apr 2019, 07:51.



                • #9
                  I agree with FernandoRios.

                  What you are doing is some way off, but it is easy to fix.

                  1. rnormal() expects a mean and SD, not at all a minimum and maximum. This is documented.

                  2. Inserting a comma like that will at best be misunderstood as presenting two arguments, 16027 and 122. If you exponentiate numbers like 16027 you will get really big results. Wrong any way.

                  3. For a lognormal the mean and SD should be on log scale, not the original scale.
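
                  The three points above can be put together in a minimal sketch. The sample size of 5000 matches the number of households mentioned earlier in the thread, but the log-scale mean of 8 and SD of 0.6 are assumptions chosen purely for illustration, as are the variable names sim_yield and log_sim.

                  ```stata
                  clear
                  set seed 2803
                  set obs 5000

                  * Assumed log-scale parameters, for illustration only
                  scalar mu = 8
                  scalar sigma = 0.6

                  * rnormal(m, s) takes a mean and an SD, in that order;
                  * exponentiating a normal draw gives a lognormal draw
                  gen double sim_yield = exp(rnormal(scalar(mu), scalar(sigma)))

                  * Logging the simulated variable recovers the normal scale
                  gen double log_sim = ln(sim_yield)

                  summarize sim_yield log_sim
                  qnorm log_sim
                  ```

                  With parameters on the log scale the exponentiated values stay finite, so no missing values should arise from overflow.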

                  But that is irrelevant. If you are checking that residuals are roughly normal, what the original untransformed data are like is a different issue.

                  Here is an analogue of what I think you want. This is all code you can reproduce yourself.

                  Code:
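                  sysuse auto, clear
                  
                  gen log_price = ln(price)
                  
                  regress log_price weight
                  
                  * residuals from the regression just fitted
                  predict residual, residual
                  
                  * keep the mean and SD of the residuals for the simulations
                  su residual
                  scalar mean = r(mean)
                  scalar SD = r(sd)
                  
                  * normal quantile plot of the real residuals
                  qnorm residual, name(res, replace) subtitle(real) ytitle("")
                  
                  set seed 2803
                  
                  * collect graph names for graph combine
                  local names res
                  
                  * 8 fake samples from a normal with the same mean and SD
                  forval j = 1/8 {
                      gen res`j' = rnormal(scalar(mean), scalar(SD))
                      qnorm res`j', name(res`j', replace) subtitle(fake `j') ytitle("")
                      local names `names' res`j'
                  }
                  
                  * show the real plot and the 8 fakes in one array
                  graph combine `names'
                  graph drop `names'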

                  The easy thing to see in this example is that while the real residuals wiggle a bit, that is what even samples from a normal distribution do.
                  But this is just like a health check or machine service. No news is good news, but not spotting a problem is not proof that none exists.

                  All that said, for a response variable like yours, I would prefer to use a logarithmic link, not a log transformation. See e.g. https://blog.stata.com/2011/08/22/us...tell-a-friend/ for the main argument.
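
                  As a rough sketch of that distinction, again with the auto data as a stand-in: a log transformation models the mean of the logged response, while a logarithmic link models the log of the mean response. Using poisson with vce(robust) here is an assumption about one way to fit a log link, not necessarily the only or best choice for yield data.

                  ```stata
                  sysuse auto, clear

                  * Log transformation: models the mean of ln(price)
                  gen log_price = ln(price)
                  regress log_price weight

                  * Logarithmic link: models the log of the mean of price;
                  * robust standard errors because price is not a count
                  poisson price weight, vce(robust)
                  ```

                  The slope coefficients are on a comparable scale, but predictions from the second model are directly on the scale of price, with no back-transformation needed. glm with the link(log) option is another route to the same kind of model.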

                  Note: The mean residual is essentially zero given what regression does, so we did not really need to calculate it. Indeed calculating the SD is just cosmetic, to get numbers in the same sort of interval.
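
                  A quick check of that note, as a sketch on the same auto example (predict is given double storage here only to keep rounding error small):

                  ```stata
                  sysuse auto, clear
                  gen log_price = ln(price)
                  regress log_price weight
                  predict double residual, residual

                  * With an intercept in the model, OLS residuals average to zero
                  * up to rounding error
                  summarize residual
                  display "mean residual   = " r(mean)
                  display "SD of residuals = " r(sd)
                  ```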
                  Last edited by Nick Cox; 25 Apr 2019, 08:19.

