  • Tri-modal/Bi-modal data

    My dependent variable (test) is bunched at certain values (the values are ordered; higher is "better"). The plot looks something like this, with three distinct concentration points:
    [Image: test1.png]

    After running simple OLS regressions, including on transformed versions of test, I am not convinced by the results. Here is the residual plot:

    [Image: test2.png]

    Here's a simplified version of the model I am running:

    Code:
     reg test i.literate i.married i.scst age agesq i.treatment i.village i.sex income
    
     reg testsq i.literate i.married i.scst age agesq i.treatment i.village i.sex income
    
     reg lntest i.literate i.married i.scst age agesq i.treatment i.village i.sex income
    Any suggestions on how I can improve the results? I could try converting this into three bins and fitting an ordered probit, perhaps. I'd much rather keep the response continuous, though.

    I appreciate any help on this.
    Last edited by Fatima Alvi; 02 Aug 2018, 05:38.

  • #2
    Please show graphs as .png. This advice is explicit within FAQ Advice #12.


    • #3
      In reply to Nick Cox (#2):
      Sorry about that. I didn't realize I had attached it as a Stata graph. Fixed.


      • #4
        I wouldn't worry that much about the modality. I would worry that you are manifestly fitting a plain regression to a bounded response. What transformation did you use? Beta regression or a logit(-like) link may make much more sense.

        Your residuals are clearly bounded by the lines

        residual = 2.2 or so MINUS fitted

        residual = 0 MINUS fitted
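
        The reason for those lines is just arithmetic: residual = observed MINUS fitted, so a response confined to [lower, upper] forces every residual to lie between lower MINUS fitted and upper MINUS fitted. A minimal sketch with simulated data, not the data from #1; the variable names, the bounds 0 and 1.5, the coefficients and the seed are all assumptions made for illustration:

        ```stata
        * run from a do-file (the /// line continuations do not work interactively)
        clear
        set seed 2018
        set obs 500
        generate double x = rnormal()
        * hypothetical response forced into [0, 1.5], so it piles up at both bounds
        generate double y = min(max(0.75 + 0.5*x + rnormal(0, 0.4), 0), 1.5)

        regress y x
        predict double fitted, xb
        predict double resid, residuals

        * resid = y - fitted, so 0 <= y <= 1.5 implies these bounds (small tolerance for rounding)
        assert resid >= 0 - fitted - 1e-8
        assert resid <= 1.5 - fitted + 1e-8

        * residual-versus-fitted plot with the two bounding lines drawn in
        summarize fitted, meanonly
        local lo = r(min)
        local hi = r(max)
        twoway (scatter resid fitted, msymbol(oh))        ///
               (function y = 0 - x, range(`lo' `hi'))     ///
               (function y = 1.5 - x, range(`lo' `hi')),  ///
               ytitle("residual") xtitle("fitted")        ///
               legend(order(2 "0 MINUS fitted" 3 "1.5 MINUS fitted"))
        ```

        Points sitting exactly on either line are observations at a bound, which is why a plain linear fit cannot be made to look right for such a response.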


        • #5
          In reply to Nick Cox (#4):
          I tried log, square and square-root transformations, among others.

          I'll look into beta regression. The data is indeed bounded (by design) by 0.05 at the left tail and 1.5 at the right.

          By a logit-like link, do you mean a glm-type regression with a logit link? Would that be kosher on non-binary data?
          Last edited by Fatima Alvi; 02 Aug 2018, 07:06.


          • #6
            Which of those transformations did you use on the response? (It is hard to think of a problem in which rooting and squaring both spring to mind as solutions.)

            I am not an authority on what is kosher.

            But working on logit scale long predates logit regression for binary responses and is perfectly valid for (approximately) continuous proportions. See e.g.

            SJ-8-2 st0147 . . . . . . . . . . . . . . Stata tip 63: Modeling proportions
            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C. F. Baum
            Q2/08 SJ 8(2):299--303 (no commands)
            tip on how to model a response variable that appears
            as a proportion or fraction

            https://www.stata-journal.com/sjpdf....iclenum=st0147

            Beta regression is a model, rather than a transformation.

            A better plot for your response would be quantile test1; then ties could all be seen explicitly.
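
            For reference, that plot is a one-line command. A sketch, assuming the data from #1 are in memory and the response is stored as test, the name used in the regressions there (the tabulate line is an extra check, not part of the suggestion above):

            ```stata
            * ordered values against fraction of the data; ties appear as horizontal runs of points
            quantile test

            * frequency table of the distinct values, as a numeric check on the ties
            tabulate test
            ```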


            • #7
              In reply to Nick Cox (#6):
              I tried a bunch of transformations, but the fitted values are from a log transformation. Here's a quantile plot of the variable:

              [Image: test3.png]


              • #8
                I am not yet clear how this response is measured. I am happy to think that the graph in #7 is clearer than that in #1. (General note: if the shape of the kernel is discernible in a density estimate, you often need a different technique, as a discernible kernel shape means a spike in the original data, better understood by looking at it directly.)

                But consider a response bounded by 0 and 1 where the bounds are attainable. No logarithmic transformation works and log(response + constant) lacks the rationale that it has for a response that is zero or positive. I would work with the original data, scale them to [0, 1] and apply a logit link as in Kit Baum's article.
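
                A minimal sketch of that suggestion, assuming the data from #1 are in memory and that the response runs from 0 to 1.5 by design; test01 is a hypothetical name for the rescaled response, and the covariate list is simply copied from #1:

                ```stata
                * rescale by the design bounds, not by the sample minimum and maximum
                * (if the lower design bound is 0.05 rather than 0, use (test - 0.05) / (1.5 - 0.05) instead)
                generate double test01 = test / 1.5
                assert inrange(test01, 0, 1) if !missing(test01)

                * binomial family with logit link and robust standard errors;
                * Stata notes that the response has noninteger values and proceeds
                glm test01 i.literate i.married i.scst age agesq i.treatment i.village i.sex income, ///
                    family(binomial) link(logit) vce(robust)
                ```

                In Stata 14 and later, fracreg logit with the same variable list should give the same point estimates.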


                • #9
                  In reply to Nick Cox (#8):
                  The variable test measures risk preference and has been constructed using answers to a set of lotteries. By design the values are bounded by 0 and 1.5.

                  I see Kit Baum's article mentions rescaling and using binomial family with logit link. In your responses elsewhere Nick, you've advised to use a continuous family with logit link (gaussian perhaps?). How does one interpret the coefficients when one does the latter (gaussian family)?

                  Actually, even with the former, what does my rescaled variable mean if the original variable was a measure of risk preference such that a higher value means lower risk aversion? Can the coefficients be interpreted as an x% increase in risk preference if some RHS variable increases?


                  • #10
                    What's paramount here in my view is respecting the bounds. You have a better chance of getting predictions nearly right for 0 (new 0) and 1.5 (new 1) with a logit link. You don't cite any posts where I advised Gaussian family but f(binomial) vce(robust) looks the better deal here.

                    I am not clear that it's terribly easy to interpret the coefficients for a linear model with either the original scale or a transformed scale for the response when the response has arbitrary units! The percent interpretation is qualitatively wrong for this kind of response, so you lose little by abandoning it. Note that you only got a kind of logarithm by log(response + constant), so you already had a serious problem of interpretation.
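
                    One way to sidestep reading raw logit-scale coefficients is to look at predictions and average marginal effects on the bounded scale. A sketch only, assuming the data from #1 are in memory and that 0 and 1.5 are the design bounds; test01, phat and testhat are hypothetical names, and the covariates are copied from #1:

                    ```stata
                    generate double test01 = test / 1.5        // new 0 = old 0, new 1 = old 1.5
                    glm test01 i.literate i.married i.scst age agesq i.treatment i.village i.sex income, ///
                        family(binomial) link(logit) vce(robust)

                    * predicted means lie inside the bounds by construction of the logit link
                    predict double phat, mu
                    generate double testhat = 1.5 * phat       // back on the original 0 to 1.5 scale
                    assert inrange(testhat, 0, 1.5) if !missing(testhat)

                    * average marginal effects on the predicted mean of test01;
                    * multiply by 1.5 to express them in the original units of test
                    * (age is left out because agesq is entered as a separate variable)
                    margins, dydx(literate married treatment income)
                    ```

                    These are changes in the expected response in its own arbitrary units, not percent changes.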
                    Last edited by Nick Cox; 03 Aug 2018, 01:43.
