
  • PROBIT REGRESSION - BINARY AND CONTINUOUS VARIABLES - *Note: 2 failures and 0 successes completely determined.

    Hello everyone!
    I'm very grateful to have this channel to resolve my doubts!

    I'm working with a public dataset (5911 observations) from the Bank of Italy's Survey on Household Income and Wealth (SHIW).
    I'm running a probit regression.
    My dependent variable is binary (1 if the respondent holds stocks and 0 otherwise).
    I have 17 independent variables in total: 13 binary and 4 continuous (age, age squared, log of income, and log of wealth).

    After running the regression, this note appeared: "Note: 2 failures and 0 successes completely determined." I checked the Stata FAQ (https://www.stata.com/support/faqs/s...ic-regression/) and followed its instructions to identify any problematic covariate pattern or collinearity (I did this with only the binary variables). After eliminating one variable, I ran the regression again with only the binary variables, and everything seemed fine. But when I add the continuous variables to the model, the same note appears. I realized that the variable causing the message is the log of income (when I remove it, the note does not appear), and I don't know why. In this case I can't follow the instructions in the Stata FAQ, since with continuous variables in the model there are as many covariate patterns as observations.

    So, dear community, what can I do in this case? Is there any way to know why this happens? Is my model okay even if this note appears?

    Thank you in advance!!!!

  • #2
    My best guess is that at the very lowest levels of income nobody owns stock. You can check that as follows.
    Code:
    // FIRST RE-RUN THE REGRESSION, INCLUDING log_income
    keep if e(sample)
    graph twoway scatter stock_ownership log_income
    If you get a graph that looks like what is shown under "Case 1" at the link you included in #1, that is what is happening.

    With this situation, your results are valid, but only applicable for incomes above that threshold value. As you are only losing two extreme cases to this, this is probably not a major problem--that's for you to decide.

    But if you are concerned about it, and if your data is a single cross section, not panel data, you can deal with this by switching from logit/logistic to -firthlogit-, written by Joseph Coveney and available from SSC. -firthlogit- uses penalized maximum likelihood to estimate the model parameters. With simple logistic regression, when you have a threshold beyond which the outcome does not vary, the ordinary maximum likelihood estimate of the coefficient blows up to infinity (or negative infinity). That's why Stata looks for situations like this before proceeding, and eliminates those observations from the analysis. But the penalized maximum likelihood estimation that -firthlogit- uses produces finite estimates in this condition.
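
    A minimal sketch of that alternative, assuming the outcome variable is stock_ownership and using placeholder names for the other covariates (adjust both to your actual variable list):
    Code:
    * one-time installation of Joseph Coveney's command from SSC
    ssc install firthlogit

    * penalized maximum likelihood logit: estimates remain finite even when a
    * covariate threshold completely determines some outcomes
    firthlogit stock_ownership log_income log_wealth age age2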



    • #3
      Dear Clyde,

      I really appreciate your prompt response!

      As you suggested, I ran the keep if e(sample) command, and zero observations were deleted, which means that every observation in my dataset was part of the estimation sample, right?
      My graph doesn't look like what is shown under "Case 1"; it is a bit different (please see the attachment). I noticed that the threshold value is 5, and there are only 4 observations below this limit, so I decided to delete them. Then I ran the regression again, and the note no longer appeared (please see the second attachment).

      Can you give me any advice to assess this model? I'm aware that the interpretation of the coefficients is not like that of linear probability models.

      Thank you in advance!


      Attached Files



      • #4
        As you suggested, I ran the keep if e(sample) command, and zero observations were deleted, which means that every observation in my dataset was part of the estimation sample, right?
        Correct.

        My graph doesn't look like what is shown under "Case 1", it is a bit different (please see the attachment)..
        Yes, it is a bit different, but the essential point is the same: there is a threshold below which the outcome no longer varies.

        Can you give me any advice to assess this model? I'm aware that the interpretation of the coefficients is not like that of linear probability models.
        The coefficients of a probit model are, indeed, difficult to interpret. Grossly, their sign tells you the direction of the effect of a variable, and significance tests, if you use them, can be interpreted just as you would in a linear probability model. But that is rather weak tea. This is one of the reasons I rarely use probit models at all. Still, you can get more out of them by using the -margins- command. For example, if A is a dichotomous variable, -margins, dydx(A)- will give you the expected difference in probability of Y when A = 1 vs A = 0. And you can get the probability of Y at each of those values of A, adjusted for everything else in the model, by running -margins A-. If A is a continuous variable, -margins, dydx(A)- gives you the rate of change of the probability of Y per unit change in A.

        Well, that is, you could do that if you had set up the regression appropriately. I see two variables named Age and Age2. If Age2 is the square of Age, then this alone makes the model, as implemented, incompatible with -margins-. So the first thing I would do is go back and revise the command using factor variable notation. Read -help fvvarlist- for all the details. Briefly, the discrete independent variables get an i. prefix. And Age and Age2 (assuming Age2 is the square of Age) get removed and replaced by c.Age##c.Age.
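
        A sketch of that revision, using placeholder names for the discrete covariates (only Age, Age2, and the factor-variable idea come from this thread):
        Code:
        * i. marks discrete predictors; c.Age##c.Age replaces Age and Age2, so
        * -margins- knows the quadratic term is a function of Age itself
        probit stock_ownership i.education i.female c.Age##c.Age c.ln_income c.ln_wealth

        * average marginal effects for all predictors; the effect for Age now
        * correctly combines the linear and squared terms
        margins, dydx(*)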

        The -margins- command is quite complicated, and I think the easiest way to get started with it is to read the excellent Richard Williams' https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. But I think given the obscurity inherent in probit coefficients, the use of -margins- is well nigh indispensable for understanding the results of a probit regression. Which specific marginal effects and predictive margins you should calculate, of course, depends on your specific research questions.



        • #5
          Dear Clyde,

          Thank you so much for your help! I really appreciate it!... I have been a little busy during the week, but now I can finally work on my research.

          I read the presentation about -margins- that you suggested, and it was quite useful! I tried to implement it for my regression, but it did not work. I'm running the regression in a survey context, taking into account the sampling weights and jackknife replicate weights; I have been doing all my statistical analyses that way (considering the weights). But apparently the post-estimation analyses don't work for survey specifications. When I tried without the svy prefix it worked, but I need my results to account for the weights and the survey specification.

          I also wanted to assess the performance of my regression using the ROC curve and the classification table, but those commands did not work for a model fitted with the svy prefix.

          Can you please help me with this?

          Thank you in advance!!!
          Attached Files



          • #6
            I'm sorry, but I don't know how one would do these analyses with survey data and jackknife covariance estimation. If another Forum member is following this thread and knows the answer, I hope he or she will chime in.

            If you don't get a response by close of business on Monday (things can be slow here over the weekend), I recommend reposting as a new thread, and in the title make it clear that your question concerns the use of probit postestimation commands specifically with survey data and jackknife covariance estimation.



            • #7
              Dear Clyde,

              Thank you so much for trying to help me!

              Tomorrow I will open a new thread.

              Have a nice day



              • #8
                Dear Clyde,

                I want to tell you that I can finally run -margins- for my probit regressions. I realized that the svy prefix is not necessary for computing the margins after running a probit model with the survey data. The only things that do not work are the -lroc- and -estat classification- commands, because they are not allowed after svy estimation (I read that in the help files).

                Now I have another doubt regarding the margins. I was watching a video in which the presenter says that for continuous variables it is better to compute elasticities (-margins, eyex()-) instead of dydx, but that changes the interpretation (it becomes the percent change in y for a percent change in x), right?

                What do you suggest? Is it better to compute the eyex for continuous variables or use the dydx?

                Thank you in advance!!



                • #9
                  I reject out of hand the notion that either is inherently better than the other. They are different, slightly related, things and which is better to use depends on the nature of the underlying analysis and the research questions to be answered.

                  The first thing to remember is that in a probit analysis, neither dydx nor eyex is constant: they change with the values of the regressor variables. (This is unlike a linear regression, where dydx is constant, or a log-log linear regression, where eyex is constant.) So if you calculate dydx or eyex following probit without specifying -at()- values in your -margins- command, you are calculating an average value of a statistic whose value might vary considerably depending on the values of the independent variables.

                  The other thing to think about is which of these statistics answers your research question. Is your research question better answered by knowing on average how many percentage points (additive) difference in the probability of Y is associated with a unit (additive) difference in each regressor, or by knowing on average what percentage (multiplicative) change in the probability of Y is associated with a 1 percent (multiplicative) difference in each regressor? In fact, among your multiple regressors the answer might differ from one regressor to the next.

                  Let me also point out that two of your continuous regressors are overtly log-transformed, so if you calculate dydx for ln_vble, you are, in effect, calculating a semi-elasticity, dyex, for vble itself. In fact, for such variables, interpreting eyex(ln_vble) would be difficult as you would be, in effect, double-logging vble. There are, of course, some situations where log-log-vble is itself something meaningful and useful to talk about, though they are uncommon. All of this just re-emphasizes my main point: you have to thoughtfully consider many things to make these decisions. There is no a priori answer to this question.
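
                  To make the contrast concrete, here is how each statistic would be requested after the probit fit (the -at()- values below are arbitrary illustrations, not recommendations):
                  Code:
                  * additive: change in Pr(y), in probability units, per unit change in ln_income
                  margins, dydx(ln_income)

                  * multiplicative: percent change in Pr(y) per 1 percent change in ln_income;
                  * for an already-logged regressor this effectively double-logs income
                  margins, eyex(ln_income)

                  * either statistic can be evaluated at chosen reference values
                  * instead of averaged over the sample
                  margins, dydx(ln_income) at(ln_income = (9 10 11))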



                  • #10
                    Dear Clyde!


                    I really really want to thank you!!!

                    I understand clearly the interpretation of AME for the dummy variables, the dy/dx is the change in percentual points in the probability of Y given X = 1, right?... But I'm having trouble interpreting the marginal effects for the continuous variables (age, age2, ln_income, and ln_wealth), can you please explain this to me? (The results of the final probit model and the AME are attached)

                    Additionally, I would like to perform a MER for the quintiles of income and wealth.... particularly I would like to know what is the change in the probability of Y for a person with advanced financial literacy (F variable =1) given different levels of wealth and income (the quintiles). The survey staff provided a specific quantile variable (which takes values from 1 to 5), is there any way I can use this variable to run the MER??


                    Thank you in advance!!!

                    Attached Files



                    • #11
                      I understand clearly the interpretation of AME for the dummy variables, the dy/dx is the change in percentual points in the probability of Y given X = 1, right?...
                      Correct in principle. The exact wording is percentage points.

                      But I'm having trouble interpreting the marginal effects for the continuous variables (age, age2, ln_income, and ln_wealth), can you please explain this to me?
                      There is no marginal effect for age2. The AME for age takes into account the quadratic relationship. But let's focus on ln_income first, because there's no quadratic term to deal with.

                      One could, in principle, calculate the predicted probability of y as a function of ln_income, say, and plot it on a graph. You could approximate that, in fact, by running
                      Code:
                      margins, at(ln_income = (some list of numeric values that span the range of ln_income in the data and are reasonably close together))
                      marginsplot
                      At some values of ln_income the curve might be increasing steeply, and at others it might be flat, or close to flat. But at each value of ln_income, the curve has a certain slope. (If your calculus is rusty, this is the slope of a line tangent to the curve at that point, and its value is the first derivative of the function: dy/d(ln_income).) That is the rate at which the probability of y grows per unit difference in ln_income, starting from the given value of ln_income. So there is a single value of the marginal effect of ln_income at each value of ln_income itself. In fact, it is a bit more complicated than that. Because the probit function is non-linear, if you were to replot the same graph but also change the values of the other variables in your model, the curve would shift and also slightly change shape. So this would give you more marginal effects of ln_income. In the end, for every combination of values of all of the variables in your model there is a marginal effect of ln_income. The statistic that you are getting from -margins- is an average of those values: a weighted average, with the weights based on the joint distribution of all the variables in the data set.

                      The situation with age is in most respects similar. But because the relationship to probit(y) is quadratic, any probability of y vs age curve (with whatever values of the other model variables you select) will have a different kind of shape from what we saw with ln_income, and may well be U-shaped or, rather, in your situation, inverted-U shaped, with a peak around age 27. But the same principle applies. For every possible combination of values of all of the model variables other than age, you could graph the probability of y as a function of age and come up with a curve, probably an inverted-U-shaped one. And there is a slope of the tangent to the curve that is the marginal effect of age at that value of age. And again, the statistic you are getting from -margins- is a weighted average of all the possible marginal effects of age, weighted by the joint distribution of all of the model variables.
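
                      One way to see that shape directly is to trace and plot the adjusted probabilities over a grid of ages (the 20(5)80 grid is an arbitrary illustration; choose values that span your sample):
                      Code:
                      * predicted probability of y at each age in the grid, averaged over
                      * the sample distribution of the other model variables
                      margins, at(Age = (20(5)80))

                      * plot the resulting curve; the slope at any point on it is the
                      * marginal effect of age at that point
                      marginsplot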



                      • #12
                        Dear Clyde,


                        Thank you so much!!!

                        One last question, how do I know what my reference value is?... When you mentioned: "That is the rate at which the probability of y grows per unit difference in ln_income, starting from the given value of ln_income". What is the given value of ln_income?

                        Thank you in advance!!



                        • #13
                          The given value of ln_income is whatever you choose it to be. In the graphing exercise I described, you have many reference values--they are the numbers you put in the -at()- option. Each of those values serves as the reference value for one calculation of a marginal effect. Then all those marginal effects get averaged together.

                          By the way, I should add that in some situations there are specific values of a continuous variable that are of special interest, and one can just calculate the marginal effects at that or those specific values, as separate results, and discuss those. The choice of those reference values of interest, of course, depends on the content of your research questions.
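
                          For example, the marginal effect of the financial-literacy indicator F from #10 at a few chosen wealth levels might be requested like this (the -at()- values are placeholders, not recommendations):
                          Code:
                          * marginal effect of F evaluated at three reference values of ln_wealth
                          margins, dydx(F) at(ln_wealth = (10 12 14))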



                          • #14
                            Dear Clyde,


                            Thank you so much for your help!!

                            Right now I'm running a linear regression model to estimate the proportion of wealth invested in stocks (my dependent variable is continuous, since it is the value of stocks divided by total financial wealth), and the independent variables are the same as in the probit model (13 dummy variables and 3 continuous ones).

                            I already ran the model, and now I would like to assess its effectiveness, but the post-estimation commands used for this purpose, such as -rvfplot- or -estat hettest-, are not allowed after the svy prefix.

                            Do you know how I can assess the effectiveness of my regression model when it has been run with the svy prefix?

                            Thank you in advance!






                            • #15
                              I'm afraid I can't help you with that. I have not worked with -svy- data in a long time and am not up to speed on these things.
