data transformation

Luca Huber

Join Date: Feb 2015

Posts: 27
#1

27 Feb 2015, 00:20

Dear board members,

For a seminar paper at university, I am analysing the interaction effect of a policy and immigration on unemployment (30 countries, 5 years). To do so, I will fit a panel model and a cross-sectional model.

First of all, I have to check whether my variables (the residuals) are normally distributed. To do so, I use the Stata commands "gladder" and "ladder". Does each year have to be normally distributed (ladder immigration2002, ladder immigration2003, ...), or is it enough to run the ladder command before changing into the long format (only ladder immigration)?

I know how to continue when my data should be log-transformed: I have to read the changes in the regression output as percentage changes. ladder/gladder now advise me to use the square root transformation. How do I read the regression output when my variables have been transformed by taking the square root?

Thanks a lot for your help and kind regards
Nick Cox

Join Date: Mar 2014

Posts: 35362
#2

27 Feb 2015, 02:08

This is backwards: You should first tell us precisely which model or models you are fitting with which commands to get better advice.

It is not clear whether you understand the difference between the original variables and the residuals. Every introductory econometrics book should be clear on this.

ladder and gladder are commands that have the purpose of helping you decide whether transformation of a distribution is a good idea.

To check on normality of a distribution of residuals which is as it is, and not something that will ever be transformed, qnorm is a more direct command.
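A minimal sketch of that workflow, assuming Stata's bundled auto dataset as a stand-in for the poster's data (the model and the variable name resid are illustrative only, not the poster's):

```stata
* Illustrative only: the bundled auto dataset stands in for the real data
sysuse auto, clear

* Residuals exist only after a model has been fitted
regress price weight mpg

* Store the residuals (observed minus fitted values)
predict double resid, residuals

* Quantile-normal plot: points close to the reference line
* suggest approximately normal residuals
qnorm resid
```

The point of the sketch is the order of operations: fit the model first, then inspect the residuals, rather than running ladder or gladder on each original variable.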

A one-paragraph personal summary of a large and confusing literature is that while normality of errors often would be nice for the researcher, even when it is an assumption it is usually the least important assumption made in the model. Focusing on the distribution of residuals to excess is common, unfortunately, and for partly mysterious reasons.

Whether your variables would be better off transformed is a related but different question. Sometimes non-normality of residuals is a clue that one or more of the original variables should be transformed, but there are more direct ways of handling this.

But regression-type models do not include any assumption that any individual predictor or response is normally distributed. If that were so, then the use of indicator variables would be off-limits, etc. I mention this because it seems that you may be focusing on whether to transform your original variables.

Last edited by Nick Cox; 27 Feb 2015, 02:17.
Luca Huber

Join Date: Feb 2015

Posts: 27
#3

27 Feb 2015, 03:14

Dear Nick Cox

Thanks a lot for your help. First I want to find out whether there is a relationship between immigration (independent variable: foreign-born residents as a percentage of the population) and unemployment (as a percentage of the population) in 30 countries over 5 years. For this, I want to run a panel regression with xtreg unempl immigration controlvariables, fe/re/robust and a Hausman test to find out whether I have to use fixed or random effects.

Then I want to find out if a policy has an influence on the relation above: For this, I'll do a cross-sectional regression (reg unemployment2008 interactionterm2008 controlvariables2008) because this policy was only measured in 2008.

I know the difference between the variables and the residuals, but I read that not normally distributed variables are a cause of not normally distributed residuals. To find out whether I should transform the variables, I used gladder and ladder. They showed more or less clearly that I have to transform most of the variables by taking the square root. It is no problem for me to do so, but I don't really know how to interpret the coefficients then.

I'll try the qnorm command now.
Maarten Buis

Join Date: Mar 2014

Posts: 3421
#4

27 Feb 2015, 03:39

Are you familiar with the "ecological fallacy"? If not, you may want to look into that, because it sounds to me like you are falling into that trap.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Luca Huber

Join Date: Feb 2015

Posts: 27
#5

27 Feb 2015, 03:51

Yes, I think - that's why I control for a lot of other possible reasons for unemployment (like gdp growth...)
daniel klein

Join Date: Mar 2014

Posts: 3811
#6

27 Feb 2015, 03:56

[...]
but I read that not normally distributed variables are a cause of not normally distributed residuals.

Skewness of the response (dependent variable) might be one reason for non-normal residuals, but as Nick points out, this is usually better solved using generalized linear models with an appropriate link function.

I do not see how the distribution of predictors (independent variables) should affect the residuals. As Nick also already pointed out, if that were the case then you could not use categorical predictors in regression models. Whether there is a linear relationship between the predictors and the response is another question.

[...]
Then I want to find out if a policy has an influence on the relation above: For this, I'll do a cross-sectional regression (reg unemployment2008 interactionterm2008 controlvariables2008) because this policy was only measured in 2008.

From your description, it is not clear to me what is meant by interactionterm here. But note that statistical interactions, i.e. product terms, can only be interpreted correctly if the lower order terms are also included in the model.
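One way to make sure the lower-order terms are never forgotten is Stata's factor-variable notation. A minimal sketch, again assuming the bundled auto dataset as a stand-in (weight and mpg play the roles of the two interacting continuous variables; they are not the poster's variables):

```stata
* Illustrative only: the bundled auto dataset stands in for the real data
sysuse auto, clear

* c.weight##c.mpg expands to weight, mpg, and their product,
* so both lower-order terms enter the model automatically
regress price c.weight##c.mpg

* Marginal effect of weight on price at selected values of mpg
margins, dydx(weight) at(mpg=(15 20 25))
```

With a hand-made product variable, margins cannot know that the product depends on its components; the ## notation avoids that.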

Best
Daniel
Luca Huber

Join Date: Feb 2015

Posts: 27
#7

27 Feb 2015, 05:05

Thanks to you too. Yes, the regression will be: reg unemployment2008 interactionterm2008 immigration2008 policy2008 controlvariables2008, where interactionterm2008 = immigration2008*policy2008.
Maarten Buis

Join Date: Mar 2014

Posts: 3421
#8

27 Feb 2015, 05:09

Originally posted by Luca Huber:

Yes, I think - that's why I control for a lot of other possible reasons for unemployment (like gdp growth...)

How would controlling for variables solve the ecological inference problem?

Luca Huber

Join Date: Feb 2015

Posts: 27
#9

27 Feb 2015, 06:06

You think that more unemployment doesn't have to be caused by more immigration. If I now control for all the other factors which could cause unemployment, there shouldn't be a problem anymore, I think. And because I look at all variables from a collective/aggregated view (as percentages of the population), there shouldn't be an ecological inference problem, I hope.

But my problem stays the same: ladder immig2010 gives a chi2 value of 4.3 for square and square root but 8 for identity or log (which I know). Doesn't that mean that I have to transform the variable by taking its square root? The qnorm command gives me this picture (does it mean that the residuals are normally distributed when I take the inverse?):
Attached Files
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35362
#10

27 Feb 2015, 07:47

As before, immig2010 is one of your original variables, and is not a set of residuals.

Its marginal distribution not being normal is not a reason to transform it.
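On the question raised earlier in the thread of how to read coefficients after a square-root transformation, should one be used at all, a short worked derivation may help. The symbols a, b, x and y are generic notation for a single-predictor model and are not taken from the thread.

```latex
% Response transformed: the model is linear in \sqrt{y}
\sqrt{y} = a + b x
\;\Rightarrow\; y = (a + b x)^2,
\qquad \frac{dy}{dx} = 2b\,(a + b x) = 2b\sqrt{y}

% Predictor transformed: the model is linear in \sqrt{x}
y = a + b\sqrt{x}
\;\Rightarrow\; \frac{dy}{dx} = \frac{b}{2\sqrt{x}}
```

Unlike the logarithm, the square root gives no constant percentage interpretation: in both cases the effect of a one-unit change in x depends on the level at which it is evaluated, so it is usually reported at a few chosen values.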
Dirk Enzmann

Join Date: Apr 2014

Posts: 523
#11

27 Feb 2015, 15:02

Luca, I may be wrong, but reading the exchange of arguments between you and Maarten, it seems to me that your idea of the "ecological fallacy" (ecological inference problem, also termed fallacy of the wrong level) is not identical to what Maarten has in mind. For clarification, the Wikipedia article "ecological fallacy", especially the section "literacy and immigrants", may be a starting point.
Luca Huber

Join Date: Feb 2015

Posts: 27
#12

02 Mar 2015, 10:22

Dear Nick Cox, you're right, I understand now. With qnorm, all variables (or residuals?) are first over, then under, then again over and under the 45° line. How do I interpret that?
I spoke with a friend today and he thinks I don't have to transform any of my variables. We ran a cross-sectional regression and "predict m1p", then "gen m1r = unemployment - m1p". This residual variable didn't look very non-normally distributed.

Dear Dirk Enzmann and Maarten Buis: I still don't get why this should be a problem. I look at immigration (as a percentage of the host population) and unemployment (as a percentage of the host population). Of course, there are other factors which cause unemployment, but I can take them into account as control variables. Why should these two variables be on different levels?
Maarten Buis

Join Date: Mar 2014

Posts: 3421
#13

02 Mar 2015, 13:30

So your residuals look "normalish". Well, then you are done as far as normality is concerned.

To repeat my earlier question: Why do you think adding control variables solves the ecological inference problem? The answer is: it does not. It is a conceptual problem, so you need to solve it on the conceptual level. As long as you keep talking about the unemployment rate in a country and the percentage of immigrants, you are fine; as soon as you talk about immigration influencing an individual's chance of unemployment, you have fallen into the ecological fallacy trap.

The classic example is that there was a positive correlation between the percentage literate in US states and the percentage of immigrants. This does not mean that immigrants are more literate; they just went to more industrialized and urban states where there were more jobs for them, and which also happened to have a higher literacy rate than the more rural states. If you looked at the literacy of immigrants in individual-level data, then that is exactly what you would find. So to solve the ecological fallacy problem you need to make sure that your analysis is done on the level on which you want to draw conclusions. It is not solved by adding control variables.

Given that this classic example involves percentage immigration, you will have to live with the fact that teachers, advisors, reviewers, and people on Statalist will immediately suspect an ecological fallacy, and you would do well to have a good answer ready, i.e. not one involving control variables, as they are irrelevant to this problem, but one involving the level on which you want to draw conclusions (and make sure no individual-level interpretation has sneaked its way into the paper).

Luca Huber

Join Date: Feb 2015

Posts: 27
#14

03 Mar 2015, 00:07

He told me that they look "normalish", but how can I be sure they are? Is it enough to run the panel regression and get the residuals with "predict m1p" and "gen m1r = unemployment - m1p"? Aren't those only the residuals of the dependent variable? Is it enough if they look more or less "normalish"?

I thought I had to run this panel regression with fixed effects, but the Hausman test tells me to use random effects (p-value of the chi2 test above 0.05). With the RE model, it looks like immigration has a very small but positive effect on unemployment (relatively high R2 and significant coefficients). Do I have to check for heteroscedasticity too?
Then I have to include the interaction variable with the policy I want to look at.
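The fixed-versus-random-effects comparison and the residual check described above can be sketched as follows. This is a hedged illustration, not the poster's analysis: it assumes the Grunfeld example panel from the Stata documentation (fetched with webuse, which needs an internet connection), and the names fixed, random and e_it are made up for the example.

```stata
* Illustrative only: the Grunfeld panel stands in for the country-year data
webuse grunfeld, clear
xtset company year

* Fixed-effects fit, stored for the Hausman test
xtreg invest mvalue kstock, fe
estimates store fixed

* Random-effects fit, stored likewise
xtreg invest mvalue kstock, re
estimates store random

* Hausman test: a large p-value gives no evidence against random effects
hausman fixed random

* Idiosyncratic residuals from the random-effects fit, then a normality plot
estimates restore random
predict double e_it, e
qnorm e_it

* Cluster-robust standard errors are one common response to
* heteroscedasticity concerns in panel models
xtreg invest mvalue kstock, fe vce(cluster company)
```

Note that predict with no option after xtreg returns the linear prediction only, so subtracting it from the response gives the combined residual (panel effect plus idiosyncratic error) rather than the idiosyncratic residual that the e option returns.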

"As long as you keep talking about the unemployment rate in a country and the percentage of immigrants, you are fine" Yes, I never wanted to explain unemployment at the individual level and am interested in the unemployment rate of a country. But thank you - I think I have to explain that right at the beginning of the paper.

Last edited by Luca Huber; 03 Mar 2015, 00:29.
daniel klein

Join Date: Mar 2014

Posts: 3811
#15

03 Mar 2015, 01:04

Aren't those only the residuals of the dependent variable?

Which other residuals could there be?

Is it enough if they look more or less "normalish"?

As stated repeatedly, normality of the residuals is usually one of the least important assumptions underlying the linear model. Somewhat bell-shaped residuals usually do, even with moderate sample size. Given your sample size, I would be more concerned about the number of controls added to the model.

Best
Daniel