Transformation of your Dependent and Independent variables

Anupam Ghosh

Join Date: Jan 2023

Posts: 119
#1

04 Jan 2024, 15:17

Hi,

I am trying to assess the effect of multiple treatments (treat1 = 1/0, ..., treat6 = 1/0) on county-level crime rates from 1990 to 2020. In doing so, I am stuck on a fairly fundamental question: is there any literature on why and when we should transform the dependent and independent variables?

Currently, I have standardized all my crime (dependent) and control variables by their respective county means and standard deviations. However, I would like to know whether my model's efficacy changes if I use other transformed dependent variables, such as growth rates in crime or per-capita crime rates.
Tags: data, panel, panel data, regression, Suggestion
Anupam Ghosh

Join Date: Jan 2023

Posts: 119
#2

04 Jan 2024, 15:34

I tried using both the standardized dependent variable (Black_Prop_std) and the percent-change dependent variable (Black_PropChg) in my regressions. With the former, I get a negative and insignificant effect for one of my key treatment variables (Coastal_Maj13), whereas with the latter I get a positive and significant effect for Coastal_Maj13. I am thus confused as to which one I should be using. Examples are pasted below:

---------------------------------------------------------------------------------
                |                 Robust
 Black_Prop_std | Coefficient  std. err.        t   P>|t|    [95% conf. interval]
----------------+----------------------------------------------------------------
  Coastal_Min13 |   -.2866702   .1088544    -2.63   0.008    -.5001079  -.0732325
  Coastal_Min49 |    -.275763   .1364059    -2.02   0.043    -.5432227  -.0083033
Coastal_Min1020 |   -.7377114   .2077075    -3.55   0.000    -1.144976  -.3304463
  Coastal_Maj13 |   -.5034967   .5525222    -0.91   0.362    -1.586862   .5798683
  Coastal_Maj49 |    .7073029   .3643629     1.94   0.052    -.0071262   1.421732
Coastal_Maj1020 |    .3225687   .4281318     0.75   0.451    -.5168961   1.162034

---------------------------------------------------------------------------------
                |                 Robust
  Black_PropChg | Coefficient  std. err.        t   P>|t|    [95% conf. interval]
----------------+----------------------------------------------------------------
  Coastal_Min13 |   -.9729181   .7979341    -1.22   0.223    -2.537478   .5916421
  Coastal_Min49 |    .7249219   .9756478     0.74   0.458    -1.188093   2.637937
Coastal_Min1020 |     3.99696   1.071004     3.73   0.000     1.896975   6.096945
  Coastal_Maj13 |    5.083457   1.918251     2.65   0.008     1.322221   8.844694
  Coastal_Maj49 |    3.728417   .7936676     4.70   0.000     2.172223   5.284612
Coastal_Maj1020 |    5.755003   1.270587     4.53   0.000     3.263682   8.246323
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#3

04 Jan 2024, 16:18

I think you are thinking about this the wrong way so your question itself is ill-conceived and should not be answered.

Standardizing a variable just shifts the constant term and rescales the coefficients. If the dependent variable, without standardization, has recognizable units of measurement, standardizing it usually serves no purpose other than obfuscating the results and lending an air of mathiness to the presentation. If, however, the dependent variable has arbitrary units of measurement that most people would not recognize, then standardization lends a frame of reference to understanding the results and is helpful. But in either case, there is no substantive change that results from standardization. All it does is change the units of measurement: it is like the difference between studying distance in meters vs centimeters or kilometers.
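To make the rescaling point concrete, here is a minimal sketch. It uses the auto dataset that ships with Stata, not the crime data from this thread, and all scalar names are made up for illustration. It assumes the outcome is standardized with a single overall mean and standard deviation; standardizing by each county's own mean and SD divides different observations by different constants, so it is not a pure change of units and this equivalence need not hold.

```stata
* Illustration only: auto.dta ships with Stata; this is not the crime data
sysuse auto, clear

* Fit with the outcome in its original units (dollars)
regress price weight foreign, vce(robust)
scalar b_raw = _b[weight]
scalar t_raw = _b[weight] / _se[weight]

* z-score the outcome using ONE overall mean and SD
egen double price_std = std(price)
quietly summarize price
scalar sd_price = r(sd)

* Same model, standardized outcome
regress price_std weight foreign, vce(robust)
scalar b_std = _b[weight]
scalar t_std = _b[weight] / _se[weight]

* The slope is the raw slope divided by SD(price); the t statistic is unchanged
display "raw slope / SD(price):        " b_raw / sd_price
display "slope, standardized outcome:  " b_std
display "t statistic, raw outcome:     " t_raw
display "t statistic, std. outcome:    " t_std
```

Note that egen's std() function cannot be combined with by, so a county-by-county standardization would have to be built by hand from within-county means and SDs.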

When you calculate a change variable, you are changing the research question altogether. This should not be thought of as a transformation of the variable. It is replacing the variable by something that is somewhat related but really quite different. Things that predict the levels of crime rates do not necessarily, and often don't, predict changes in crime rates, and vice versa. So the decision about whether to use a change variable is not some little tweak to the analysis of crime. It actually changes the substance of what you are investigating. So in connection with whether to use a change variable or not, there will be no literature to guide you on whether to do this kind of "transformation," because it is not a transformation in that sense. Rather you need to decide, ideally before you even look at the data, or even gather it, whether you are studying rates of crime or changes in rates of crime. (I included that this should be done, ideally before even gathering data, because sometimes different study designs are better for getting at changes than levels.) If you are already working with a settled data set, you need to decide on what your research question is, and then go about answering it with the corresponding dependent variable.
Anupam Ghosh

Join Date: Jan 2023

Posts: 119
#4

04 Jan 2024, 16:44

Hi Clyde,

Thank you for responding. The goal of my research is to analyze the effect of my treatment on crime rates. All treatment variables have been lagged. Here's an example of how I get my treatment variables:

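* Lags 1 to 20 of sustained wind speed within county (the L. operator requires xtset fips year)
forvalues i = 1/20 {
    by fips (year), sort: gen Hurr_Surface`i' = L`i'.vmax_sust
    * Lags that reach back before the start of the panel are missing; recode them to 0
    replace Hurr_Surface`i' = 0 if missing(Hurr_Surface`i')
}

* Indicator: wind speed in the 48-999 range in any of the previous 1 to 3 years
gen Major13 = inrange(Hurr_Surface1, 48, 999) | inrange(Hurr_Surface2, 48, 999) | inrange(Hurr_Surface3, 48, 999)

* One fixed-effects regression per dependent variable; results stored as HurrRegChg1, HurrRegChg2, ...
local i = 1
foreach dep_var in `dep_vars' {
    qui xtreg `dep_var' Minor13 Minor49 Minor1020 Major13 Major49 Major1020 L1.Total_Crime_std Black_Prop_std Labor_Force_std Unemp_Rate_std Tot_Establishments_std Avg_Monthly_Wage_std i.year, fe vce(cluster fips)
    estimates store HurrRegChg`i'
    local i = `i' + 1
}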

Hence, what I am ideally trying to estimate is the effect on crime rates in period t of a county having faced a hurricane at some point between t-1 and t-3. I have also generated logged values of per-capita crime rates. It would be really helpful if you could shed some more light on the suitability of my design.
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#5

04 Jan 2024, 19:22

Based on what you say in #4, there is clearly no role for using change in crime rates as your dependent variable. You are explicitly looking to see if hurricane exposure increases crime rates. The fact that your exposure variable is defined in terms of a look-back period of 3 years doesn't change that in any way. In fact, it is usually the case that the determinants of outcomes are factors that predate those outcomes, so when we are interested in looking for causal associations, it is quite common that the independent variables will involve lags. But that doesn't create a role for using the change in the outcome when what you are trying to study is the level of the outcome.

Now to the issue of log-transformation. There are a number of reasons that people are drawn to log-transform their outcome variables. One of them is to take a variable that has a very wide range of values and reduce the range. This may or may not be appropriate. What is crucial to understand when thinking about this is that when the range of values is wide, the untransformed and log-transformed outcome models cannot both be correct. At most one of them can properly represent the real world data generating process. And here's the key thing to understand. When you regress Y against X, you are saying that a constant difference in X is associated with some other constant additive difference in Y. When you regress log(Y) against X, you are saying that a constant difference in X is associated with some constant multiplicative difference in Y. Only one of these can be true of the actual relationship, and your decision to transform or not must be based on that.

As with the other things that have been discussed in this thread, ideally, these issues were all considered and decisions made prior to looking at the data, or, even better, before collecting the data. In the absence of that prospective approach, one can often learn a great deal by exploring the Y:X relationship graphically. Additive relationships look like straight lines. Multiplicative relationships look like exponential curves. (This graphical approach only works if the range of values of Y is large, at least a few orders of magnitude. If the range of values of Y is small, a small section of an exponential curve can look linear. In fact, if the range of values of Y is narrow, there is really no way to distinguish whether the Y:X or log(Y):X model is more appropriate as both will appear to fit the limited data well.)
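The graphical check described above can be sketched with simulated data. Everything here (variable names, sample size, parameter values) is made up for illustration and is not taken from the thread; the data are generated from a multiplicative process so that the outcome spans a few orders of magnitude.

```stata
* Simulated example: all names and parameter values are hypothetical
clear
set seed 12345
set obs 500
generate x = runiform(0, 10)

* Multiplicative process: each one-unit increase in x multiplies y by about exp(0.8)
generate y = exp(1 + 0.8*x + rnormal(0, 0.3))
generate ln_y = ln(y)

* On the raw scale the cloud bends upward like an exponential curve
twoway (scatter y x) (lowess y x), name(raw, replace) title("y against x")

* On the log scale the same data fall close to a straight line
twoway (scatter ln_y x) (lowess ln_y x), name(logged, replace) title("ln(y) against x")

graph combine raw logged
```

With real data, the same pair of plots can be drawn for the outcome against a continuous covariate; if both panels look about equally straight, the range of the outcome is probably too narrow for the graphs to settle the question.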

Another thing to consider when you contemplate using log(Y) is to stick with Y but use -poisson- instead of -regress-. (You must use robust vce when you do that.)
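As a rough sketch of that alternative, the two approaches can be run side by side. The example again uses the auto dataset shipped with Stata rather than the crime data, and the variable names in the commented-out panel command are placeholders, not names from this thread.

```stata
* Illustration only: auto.dta ships with Stata; this is not the crime data
sysuse auto, clear
generate ln_price = ln(price)

* Log-linear OLS: models the mean of ln(price); undefined where the outcome is 0
regress ln_price weight foreign, vce(robust)

* Poisson regression on the untransformed outcome: models ln of the mean of price,
* so exp(coefficient) is the multiplicative change in expected price per unit of x
poisson price weight foreign, vce(robust)

* A fixed-effects panel analogue would look something like this
* (y, x1, x2 are placeholder names):
* xtpoisson y x1 x2 i.year, fe vce(robust)
```

One practical difference is that the Poisson approach keeps observations where the outcome is zero, whereas taking logs drops them.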
Anupam Ghosh

Join Date: Jan 2023

Posts: 119
#6

04 Jan 2024, 21:09

Clyde, this is really helpful! Given that my key treatment variables are binary, do you still think plotting Y against the Xs can give me useful insights? Or would you plot only the control variables against the dependent variable?
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#7

04 Jan 2024, 22:04

If your treatment variables are binary, then, no, graphing them will not help. That only works for continuous variables. If you have a single covariate ("control variable"), then a graph of the outcome variable (or any transform of it you are contemplating) against that covariate can be helpful in deciding whether to do the transform. But if you have multiple covariates, then there is the possibility that some will be linearly related to Y and others will be better related to log(Y). You actually would be better off thinking about transforming these covariates and leaving the outcome variable alone. Each covariate can be transformed, if need be, in its own best way. And again, remember that these transformations only accomplish something if the range of values of the untransformed version of the variable is wide enough to exhibit a substantial departure from linearity with Y.
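A minimal sketch of what transforming covariates individually might look like, again on the auto dataset shipped with Stata. The particular functional forms chosen here (a log for weight, a quadratic for length) are arbitrary illustrations, not recommendations for the crime data.

```stata
* Illustration only: the choice of transformations here is arbitrary
sysuse auto, clear

* Inspect each continuous covariate against the untransformed outcome first
twoway (scatter price weight) (lowess price weight), name(g_weight, replace)
twoway (scatter price length) (lowess price length), name(g_length, replace)

* Outcome left on its original scale; each covariate gets its own functional form
generate ln_weight = ln(weight)
regress price ln_weight c.length##c.length i.foreign, vce(robust)
```

The factor-variable term c.length##c.length enters length and its square without creating a new variable, which keeps postestimation commands such as margins aware of the relationship between the two terms.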
Anupam Ghosh

Join Date: Jan 2023

Posts: 119
#8

09 Jan 2024, 22:58

Thanks Clyde!