
  • Comparing two dependent variables with different frequency distributions



    So I am running two OLS regressions on two DVs, each with the same IVs. The problem is that one DV is continuous, while the other is measured categorically on a five-point Likert scale (I converted both into indices ranging from -1 to 1, increasing from left-wing to right-wing). Here are the frequency distributions for both:

    Aboriginal Support Index:
    [histogram not shown]

    Racial Minority Support Index:
    [histogram not shown]
    And here are the regressions I run on both:
    Code:
     reg pes19_donerm_index i.immigrant_status i.region i.gender_status i.age_group i.education_status i.religious_status i.income_group i.urban_status i.party_id_status i.soc_net_vis_status c.ideology_index
    Code:
    reg aboriginal_support_index i.immigrant_status i.region i.gender_status i.age_group i.education_status i.religious_status i.income_group i.urban_status i.party_id_status i.soc_net_vis_status c.ideology_index
    Now my questions are: since one DV is categorical and the other is continuous, can I still compare the regression results from the two? Also, since the aboriginal support index is not normally distributed, will this affect my interpretation? Should I be z-standardizing the DVs to account for this? Finally, with regard to multicollinearity in my IVs, should I be fixing each IV at its mean value? Does anyone know how to do this for linear regressions in Stata?
    Last edited by MD Mubtasim-Fuad; 20 Mar 2022, 14:59.

  • #2
    No, you can't compare those regressions in a meaningful way. Standardizing the dependent variables will create the illusion, but not the reality, of comparable scales. Any attempt to make assertions like "immigrant status has more of an effect on pes19_donerm_index than it does on aboriginal_support_index" is just wrong and should not be attempted.

    The distribution of the aboriginal support index itself doesn't matter. What may matter is what the distribution of the residuals of that regression looks like, but probably not even that. If your sample size is large enough to truly support the large number of explanatory variables in your model, then it is large enough that even the normality of the residuals is irrelevant, because the central limit theorem will make all of the inferential statistics (standard errors, t-tests, p-values, confidence intervals) asymptotically correct. If your sample size isn't really big enough to support that (you want 50 or more observations in the estimation sample for each numerator degree of freedom in the regression), then that is a much bigger problem than the distribution of the residuals: you'd just be overfitting noise in the data.
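
    If it helps, that guideline is easy to check after estimation (a minimal sketch, reusing the variable list from your regressions above):
    Code:
    * rough check of the 50-observations-per-model-df guideline
    quietly reg aboriginal_support_index i.immigrant_status i.region i.gender_status ///
        i.age_group i.education_status i.religious_status i.income_group ///
        i.urban_status i.party_id_status i.soc_net_vis_status c.ideology_index
    display "N = " e(N) ", model df = " e(df_m) ", obs per df = " %4.1f e(N)/e(df_m)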

    Multicollinearity among indicator ("dummy") variables for categorical predictors is normal, expected, and not a problem. Just ignore it. Certainly, taking means of these IVs, which are inherently 0/1 indicators of categories, makes no sense at all.

    In short, resist the urge to tamper with the data here. None of the transformations you have proposed would be helpful, and all of them would just create new problems. The data are what they are; the only real issues here are the validity of your regression models and the adequacy of the size of the estimation sample.

    All of that said, there is a huge spike at -1 in the distribution of the aboriginal support index. While it is possible that the distributions of the explanatory variables will actually cause the model to fit that spike well, it would be unusual. You might want to consider undoing that -1 to 1 transformation (if necessary, adding or subtracting a constant to put that spike at zero) and then fitting a zero-inflated Poisson model rather than a linear regression model. My hunch is that you would end up with a better model if you did that.
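
    For concreteness, that respecification might look something like this (a sketch only: the inflate() predictors are placeholders, and since the shifted index is not a true count, -zip- may warn about noninteger values):
    Code:
    * sketch of the suggested zero-inflated Poisson respecification; inflation predictors are placeholders
    generate aboriginal_support_shifted = aboriginal_support_index + 1   // moves the spike from -1 to 0
    zip aboriginal_support_shifted i.immigrant_status i.region c.ideology_index, ///
        inflate(i.immigrant_status i.region) vce(robust)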



    • #3
      Thanks very much for this. I transformed the Aboriginal Support Index to be scaled from 0 to 2 instead of from -1 to 1. I also replaced the racial minority index with a continuous "Muslim Support Index" that follows the same 0 to 2 scaling. Here is its distribution (prior to rescaling it to 0 to 2):


      [attached histogram: muslim support freq.png]


      As you can see, the Aboriginal Support Index and the Muslim Support Index follow a similar distribution (that is, a large proportion of zero observations). Now, according to this article (What to do when you have excess zeros in the data! | by Moumita Ghorai | Data Science in a World of Chaos | Medium), the author recommends using a standard Poisson model instead of a zero-inflated Poisson model (zero-inflated negative binomial models are also an option) when there is no plausible predictor for the excess zeros, that is, no observations that have zero probability of being more than zero ("definite zeros"). In the context of my study, there do not seem to be any definite zeros (no Muslims or Aboriginals in my sample set who would obviously be zero observations). So my questions are: should I stick with a zero-inflated Poisson model or just use a standard Poisson model (or perhaps a ZINB model, as I mentioned)? And given that both DVs are now continuous variables on similar scales, would the regression results be comparable between the two?

      I have heard that a significant z-test from the vuong option in Stata, assuming I run a zero-inflated Poisson model, could help me decide between that and a standard Poisson model (Zero-inflated Poisson Regression | Stata Data Analysis Examples (ucla.edu)). Let me know if this is a good method for checking model applicability.
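
      In case it is useful, here is how I was thinking of putting the two models side by side with information criteria instead (predictor list abbreviated; I understand these commands expect count outcomes, so Stata may warn that the index is not a true count):
      Code:
      * planned comparison of plain Poisson vs. ZIP via AIC/BIC; predictor list abbreviated
      poisson muslim_support_index i.immigrant_status c.ideology_index
      estimates store m_pois
      zip muslim_support_index i.immigrant_status c.ideology_index, inflate(_cons)
      estimates store m_zip
      estimates stats m_pois m_zip   // lower AIC/BIC suggests the better-fitting model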



      • #4
        Not sure, but your histograms of the aboriginal-support and muslim-support indices resemble those of scores on a 10-cm visual analog scale (VAS). I can see the bumps at the markings at each centimeter. And clustering of responses at one end or the other, or in the middle, is not unusual, depending upon the nature of the prompt (item question) and how it's posed.

        If that's the case, then you'll probably want to avoid count regression models (Poisson, ZIP, ZINB, etc.), because they assume an open-ended response at the upper end. You could use something like
        Code:
        summarize muslim_support_index, meanonly   // also stores r(min) and r(max)
        generate double sco = (muslim_support_index - r(min)) / (r(max) - r(min))   // rescale to [0, 1]
        replace sco = cond(sco < 0, 0, cond(sco > 1, 1, sco)) if !mi(sco)   // clamp to [0, 1]; might not be needed
        glm sco <predictors>, family(binomial) vce(robust)   // fractional-response ("proportion") regression
        which can accept continuous response data from zero through one. Google "stata glm proportion" for more information.
        Last edited by Joseph Coveney; 21 Mar 2022, 06:24.



        • #5
          Thanks for the suggestion! I converted both DVs to a 0-1 proportion and ran the glm model as you described (without the third line of code, since there are no observations below 0 or above 1), and the results were quite close to the OLS specification, though some variables were closer to being significant. My question now would be: if I added the nolog option to the glm model, could I interpret the magnitude of effects straight from the regression table? For example, looking at the Aboriginal Support regression results and focusing on the region variable: if my regression spits out a significant coefficient of 0.06 for the Prairies (with Ontario as a reference), does this mean that shifting from Ontario to the Prairies, respondents have a propensity to rate Aboriginals 6% more poorly, all else held equal? Could I directly compare this with my Muslim Support regression results? Or should I stick to using a margins command to directly interpret this?



          • #6
            So the nolog option, it turns out, just suppresses the iteration log of the log pseudo-likelihood but does not change the results themselves. Since these are log pseudo-likelihoods, however, can I simply exponentiate them (i.e. through the margins command) to observe if the marginal effect (all else held equal) maps directly to the 0-1 proportional index?



            • #7
              Originally posted by MD Mubtasim-Fuad View Post
              . . . if my regression spits out a significant coefficient of 0.06 for the Prairies (with Ontario as a reference), does this mean that shifting from Ontario to the Prairies, respondents have a propensity to rate Aboriginals 6% more poorly, all else held equal?
              No, I think that you'd need to use -margins , post- and then use -lincom- to see the difference in terms of a proportion- or percentage-point-like metric.

              Originally posted by MD Mubtasim-Fuad View Post
              Could I directly compare this with my Muslim Support regression results? Or should I stick to using a margins command to directly interpret this?
              To me, you'd be comparing apples to oranges even with -margins- (see also what Clyde has to say above about comparing the corresponding regression coefficients), but maybe you have some subject-matter basis for considering them directly comparable.

              Originally posted by MD Mubtasim-Fuad View Post
              . . . can I simply exponentiate them (i.e. through the margins command) to observe if the marginal effect (all else held equal) maps directly to the 0-1 proportional index?
              -margins- by default will give you the predicted proportion-like metric here. I would not use the -predict()- option of -margins- to obtain the exponentiated predictions in this case.
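
              For example, a minimal sketch (the predictor list and the regions' level codes are assumed from the earlier posts):
              Code:
              * sketch of -margins , post- followed by -lincom-; level codes are assumed
              glm sco i.region i.immigrant_status c.ideology_index, family(binomial) vce(robust)
              margins region, post                 // predicted proportions for each region
              lincom _b[2.region] - _b[1.region]   // e.g., Prairies minus Ontario, in proportion units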
