  • Data transformation of heavily skewed data.

    In a panel data set, my dependent variable is the research intensity of a firm, calculated as the ratio of the firm's research expenses to its total sales. Missing values in research expenses were treated as zeros, following the practice in the literature. The resulting distribution of research intensity is heavily skewed, with a lot of zero values. I wish to transform it into normally distributed data. I already tried log transformations, but nothing helped. I also used the gladder command to look for an optimal transformation, but with no luck. Below are the summary of my variable and the histograms. I am running a linear random-effects regression for causal analysis on panel data, with research intensity as the dependent variable.
    [Image: Gladder.jpg]
    [Image: Graph.jpg]
    [Image: Screenshot 2024-01-06 053826.png]

  • #2
    Please note the longstanding and repeated request in FAQ documents to use your full real name.

    The goal here is unattainable and even misconceived. A normal distribution is not a reference state that makes sense for data like these. Somewhere between 50 and 75% of values are zeros, a massive spike that will remain a massive spike under any one-to-one transformation. Logarithms of zero are not defined. The rest of the values will remain a tail, and the most you could do is reduce (absolute) skewness and kurtosis. Adding some constant before transformation won’t really help either.
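    The point about the spike can be checked with a small simulation. This is only a sketch on made-up data: the 60% share of zeros, the lognormal shape of the positive values, and the helper `modal_share` are illustrative assumptions, not features of the data in the original post.

```python
import math
import random

random.seed(0)

# Hypothetical data: about 60% exact zeros, the rest positive and right-skewed.
n = 10_000
y = [0.0 if random.random() < 0.6 else random.lognormvariate(-3.0, 1.0)
     for _ in range(n)]


def modal_share(values):
    """Share of observations sitting on the single most common value."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(values)


# The identity plus three strictly increasing (hence one-to-one)
# transformations that are defined at zero.
transforms = {
    "identity": lambda v: v,
    "sqrt": math.sqrt,
    "log(1 + y)": math.log1p,
    "asinh": math.asinh,
}

shares = {name: modal_share([f(v) for v in y])
          for name, f in transforms.items()}

for name, share in shares.items():
    print(f"{name:>10}: share at modal value = {share:.3f}")
```

    Every transformation sends all the zeros to one common value, so the share of observations in the spike is the same in each row of the output. Only the shape of the positive tail changes.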

    I would leave the data as they come and consider starting with a Poisson model using robust standard errors. One strong alternative is a two-part model, which starts by modelling whether a value is positive rather than zero and then models the positive values. Such models seem especially common in health economics but could be quite natural for your application.
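    The logic of a two-part model can be sketched without any covariates. The data below are simulated and the variable names are hypothetical. In a real analysis each part would be a regression (for example a logit or probit for part 1, and a model fitted to the positive values for part 2).

```python
import random

random.seed(1)

# Hypothetical outcome: many exact zeros plus a right-skewed positive part.
y = [0.0 if random.random() < 0.6 else random.lognormvariate(-3.0, 1.0)
     for _ in range(5_000)]

# Part 1: how likely is a positive value at all?
positive = [v for v in y if v > 0]
p_positive = len(positive) / len(y)

# Part 2: how large is the outcome, given that it is positive?
mean_if_positive = sum(positive) / len(positive)

# The two parts recombine into the overall mean:
# E[y] = P(y > 0) * E[y | y > 0]
overall_mean = sum(y) / len(y)
combined = p_positive * mean_if_positive

print(f"P(y > 0)             = {p_positive:.3f}")
print(f"E[y | y > 0]         = {mean_if_positive:.4f}")
print(f"P(y > 0) * E[y|y>0]  = {combined:.4f}")
print(f"overall mean of y    = {overall_mean:.4f}")
```

    The last two printed values agree up to rounding, which is why the zeros and the positive values can be modelled separately without losing track of the overall mean.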

    In any case normality of the marginal distribution of an outcome variable is not a requirement or ideal for any model.

    • #3
      Dear Curious Student,

      Adding to Nick's great advice, and since you are curious, I suggest you have a look at this recent paper https://onlinelibrary.wiley.com/doi/10.1111/obes.12583.

      Best wishes,

      Joao
