Tri-modal/Bi-modal data

Fatima Alvi

Join Date: Jun 2014

Posts: 46
#1

02 Aug 2018, 05:08

My dependent variable (test) is bunched up at certain values (the values are ordered; higher is "better"). The plot looks something like this (3 distinct concentration points):

After running a simple OLS regression, both on "test" and on transformed versions of it, I am not convinced by the results. Here's what the residual plot looks like:

Here's a simplified version of the model I am running:

Code:

reg test i.literate i.married i.scst age agesq i.treatment i.village i.sex income
reg testsq i.literate i.married i.scst age agesq i.treatment i.village i.sex income
reg lntest i.literate i.married i.scst age agesq i.treatment i.village i.sex income

Any suggestions on how I can improve the results? I could try converting this into three bins and fitting an ordered probit, perhaps. I'd much rather let this remain continuous, though.

I appreciate any help on this.

Last edited by Fatima Alvi; 02 Aug 2018, 05:38.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

02 Aug 2018, 05:34

Please show graphs as .png. This advice is explicit within FAQ Advice #12.
Fatima Alvi

Join Date: Jun 2014

Posts: 46
#3

02 Aug 2018, 05:39

Sorry about that. I didn't realize I had attached it as a Stata graph. Fixed.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

02 Aug 2018, 05:59

I wouldn't worry that much about the modality. I would worry that you are manifestly fitting a plain regression to a bounded response. What transformation did you use? Beta regression or a logit(-like) link may make much more sense.

Your residuals are clearly bounded by the lines

residual = 2.2 or so MINUS fitted

residual = 0 MINUS fitted
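
The two bounding lines follow directly from the definition of a residual. As a brief sketch, write $a$ and $b$ for the lower and upper limits of the modelled response (symbols introduced here for illustration; the values read off the plot are roughly $a = 0$ and $b \approx 2.2$):

```latex
e_i = y_i - \hat{y}_i, \qquad a \le y_i \le b
\;\Longrightarrow\;
a - \hat{y}_i \;\le\; e_i \;\le\; b - \hat{y}_i .
```

So on a residual-versus-fitted plot every point must fall between two parallel lines of slope -1, which is the band described above.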
Fatima Alvi

Join Date: Jun 2014

Posts: 46
#5

02 Aug 2018, 06:17

I tried log, square, and square-root transformations, among others.

I'll look into beta regression. The data is indeed bounded (by design) by 0.05 at the left tail and 1.5 at the right.

By a logit-like link, do you mean a glm-type regression with a logit link? Would that be kosher on non-binary data?

Last edited by Fatima Alvi; 02 Aug 2018, 07:06.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

02 Aug 2018, 06:32

Which of those transformations did you use on the response? (It is hard to think of a problem in which rooting and squaring both spring to mind as solutions.)

I am not an authority on what is kosher.

But working on logit scale long predates logit regression for binary responses and is perfectly valid for (approximately) continuous proportions. See e.g.

SJ-8-2  st0147  Stata tip 63: Modeling proportions
        C. F. Baum
        Q2/08   SJ 8(2):299--303   (no commands)
        tip on how to model a response variable that appears as a proportion or fraction

https://www.stata-journal.com/sjpdf....iclenum=st0147

Beta regression is a model, rather than a transformation.

A better plot for your response would be quantile test1; then ties could all be seen explicitly.
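
As a minimal sketch of that suggestion, assuming the response is stored under the name test used in #1 (test1 above may refer to the same variable):

```stata
* Quantile plot: ordered values of test against their plotting positions.
* Tied values show up as horizontal runs of points.
quantile test

* Compact summary including the number of distinct values,
* a quick check on how heavy the ties are
codebook test, compact
```

Both are official Stata commands; nothing here depends on the fitted model.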
Fatima Alvi

Join Date: Jun 2014

Posts: 46
#7

02 Aug 2018, 07:10

I tried a bunch of transformations, but the fitted values are from a log transformation. Here's a quantile plot of the variable:
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

02 Aug 2018, 07:34

I am not yet clear how this response is measured. I am happy to think that the graph in #7 is clearer than that in #1. (General note: if the shape of the kernel is discernible in a density estimate, you often need a different technique, as a discernible kernel shape means a spike in the original data, better understood by looking at it directly.)

But consider a response bounded by 0 and 1 where the bounds are attainable. No logarithmic transformation works and log(response + constant) lacks the rationale that it has for a response that is zero or positive. I would work with the original data, scale them to [0, 1] and apply a logit link as in Kit Baum's article.
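
A minimal sketch of the rescaling step that would precede a logit-link model. It assumes the design bounds are 0 and 1.5, and test01 is a hypothetical name for the rescaled response:

```stata
* Rescale from [0, 1.5] to [0, 1]; 1.5 is the assumed design maximum
generate double test01 = test / 1.5

* Stop with an error if any nonmissing value falls outside [0, 1]
assert inrange(test01, 0, 1) if !missing(test01)

* Confirm the new minimum and maximum
summarize test01
```

If the lower design bound is not 0, the general form (test - lower) / (upper - lower) would be the natural replacement for the first line.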
Fatima Alvi

Join Date: Jun 2014

Posts: 46
#9

02 Aug 2018, 23:39

The variable test measures risk preference and has been constructed using answers to a set of lotteries. By design the values are bounded by 0 and 1.5.

I see Kit Baum's article mentions rescaling and using the binomial family with a logit link. In your responses elsewhere, Nick, you've advised using a continuous family with a logit link (Gaussian, perhaps?). How does one interpret the coefficients when one does the latter (Gaussian family)?

Actually, even with the former, what does my rescaled variable mean if the original variable was a measure of risk preference such that a higher value means lower risk aversion? Can the coefficients be interpreted as an x% increase in risk preference if some RHS variable increases?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#10

03 Aug 2018, 01:39

What's paramount here in my view is respecting the bounds. You have a better chance of getting predictions nearly right for 0 (new 0) and 1.5 (new 1) with a logit link. You don't cite any posts where I advised Gaussian family but f(binomial) vce(robust) looks the better deal here.

I am not clear that it's terribly easy to interpret the coefficients for a linear model with either the original scale or a transformed scale for the response when the response has arbitrary units! The percent interpretation is qualitatively wrong for this kind of response, so you lose little by abandoning it. Note that you only got a kind of logarithm by log(response + constant), so you already had a serious problem of interpretation.

Last edited by Nick Cox; 03 Aug 2018, 01:43.
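
One common way to make such a model easier to read is to report average marginal effects on the response scale instead of raw logit coefficients. The following is a sketch under assumptions, not a prescription from the thread: test01 is a hypothetical name for test divided by its assumed maximum of 1.5, the covariates are those listed in #1, and the /// continuation lines need to be run from a do-file.

```stata
* Create the rescaled response only if it does not already exist
capture confirm variable test01
if _rc generate double test01 = test / 1.5

* Binomial family, logit link, robust standard errors
glm test01 i.literate i.married i.scst age agesq i.treatment ///
    i.village i.sex income, family(binomial) link(logit) vce(robust)

* Average change in the predicted mean of test01 (0 to 1 scale)
* for each level of treatment relative to its base level
margins, dydx(treatment)

* The same contrast expressed in the original units of test (0 to 1.5)
margins, dydx(treatment) expression(1.5 * predict(mu))
```

The results are differences in the expected score, not percentages, which fits the point above that a percent interpretation does not suit this response.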