Dear all, I’m facing an issue and would like your advice. In my regression analysis, county-level population density is the key explanatory variable and county area is a control variable. I log-transformed both variables, and after the transformation both are approximately normally distributed.
The extreme values of county population, land area and population density in my data are not errors; they reflect actual characteristics of certain counties. Since my focus is population density, I think the log transformation already reduces the influence of extreme values. I also find no significant difference in the results between running the regression directly on log-transformed population density and first removing extreme values of this variable before taking the log.
My co-author thinks it is necessary to first remove extreme values of total population and area, then remove extreme values of population density, before running the regression. I think that even though population density is derived from population and area, it is meaningful as an independent variable in its own right. This is similar to GDP per capita: you might exclude extreme GDP per capita values, but it seems excessive to first remove extreme GDP and population values before analyzing GDP per capita. I would go further: if I use log-transformed population density in the regression, I do not think it is necessary to remove extreme values of population density before taking the log at all.
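For illustration, here is a minimal Python sketch of the comparison described above, run on simulated data rather than my actual county data. Everything in it is an assumption made for the example: the lognormal distributions, the slope of 0.5, the 1%/99% trimming cutoffs, and the helper names `ols_slope` and `trim_by_quantile`. The area control is left out to keep it short. Because the log is monotone, trimming on density and trimming on log density drop the same observations.

```python
import math
import random


def ols_slope(x, y):
    """Slope of a simple least-squares fit of y on x (one regressor plus intercept)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def trim_by_quantile(x, y, lower=0.01, upper=0.99):
    """Keep observations whose x lies between the lower and upper empirical quantiles."""
    ranked = sorted(x)
    lo = ranked[int(lower * (len(x) - 1))]
    hi = ranked[int(upper * (len(x) - 1))]
    kept = [(a, b) for a, b in zip(x, y) if lo <= a <= hi]
    return [a for a, _ in kept], [b for _, b in kept]


rng = random.Random(0)
n = 3000
TRUE_SLOPE = 0.5  # assumed effect of log density, used only to simulate the outcome

# Synthetic, right-skewed "county" data (not real figures).
population = [rng.lognormvariate(10.0, 1.5) for _ in range(n)]
area = [rng.lognormvariate(6.0, 1.0) for _ in range(n)]
density = [p / a for p, a in zip(population, area)]

# log(density) is exactly log(population) - log(area).
log_density = [math.log(d) for d in density]
outcome = [1.0 + TRUE_SLOPE * ld + rng.gauss(0.0, 1.0) for ld in log_density]

# Regression on the full sample versus the sample trimmed on density.
slope_full = ols_slope(log_density, outcome)
ld_trim, y_trim = trim_by_quantile(log_density, outcome)
slope_trimmed = ols_slope(ld_trim, y_trim)

print(f"full sample:    n={n}, slope={slope_full:.3f}")
print(f"trimmed sample: n={len(ld_trim)}, slope={slope_trimmed:.3f}")
```

In a simulation like this, where the log-linear model holds by construction, the two slopes should be close, which matches what I see in my own results. It does not show that trimming is harmless or unnecessary when the extreme counties follow a different relationship.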
I’d appreciate your thoughts on this approach. Thank you very much for your time and help!