  • How can I use control variables(dummy) in probit?

    Hello,

    I am working on my thesis, and my professor explained that a probit analysis is best for my dataset.
    The dependent variable is the presence of logos on wine labels, and the independent variable is the price of the product (available in either ordinal or numeric form).

    I have 4 control variables (region, country, white/red, store) and transformed them into dummies (60 variables).
    Some dummy variables have a very low occurrence, sometimes only one row.
    E.g. an unknown wine region will occur only once and has only one corresponding wine.

    When I run:
    probit logo price (+60 dummy control variables)
    I get many omitted/perfectly predicting variables in the output, I suspect partly because of the low N of the dummies.

    The goal of these control variables is to exclude other explanatory influences.

    Is this the right test/setting to analyse this?

    Looking forward to your replies!

  • #2
    When you have cut the data into many small classes, some of which are even singletons, the likelihood that some of these indicators ("dummies") will end up being perfect predictors is high. Probit regressions are estimated by maximum likelihood (as are logistic regressions) and the maximum likelihood estimate of the coefficient of a perfect predictor is infinite (positive or negative). So such an estimation cannot converge. Stata's solution to this dilemma is to first detect perfect predictors and then remove them from the model, while informing you of the problem so you know that your model does not apply to those observations and variables. Note that even without perfect prediction, in this same circumstance you are at appreciable risk of having other variables that have lopsided associations with the outcome so that they are "nearly" perfect predictors. In such a situation, the maximum likelihood estimates will be very large (positive or negative) and are known to be biased upward (in magnitude). Stata does not take any action in this situation, nor warn you of it. You use these models at your own risk and it is your responsibility to be on the alert for such things. When you see very large coefficients in the output of a logistic or probit regression, they should be investigated. (In the real world, coefficients of dichotomous predictors in logistic regressions greater than 4 in magnitude should be considered suspicious. I would similarly be suspicious of any probit regression result with a coefficient greater than about 2.25 in magnitude. Associations that strong are very uncommon in real world phenomena.)
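
    A quick way to see which categories are at risk is to look for levels in which the outcome never varies. The following is only a sketch: the variable names logo and region are assumed from #1, and region is assumed to be a single categorical variable rather than a set of dummies.

    ```stata
    * Cross-tabulate the categorical control against the 0/1 outcome.
    * A row with all observations in one column is a perfect predictor.
    tabulate region logo

    * Flag regions in which logo takes only one value (all 0 or all 1).
    egen byte logo_min = min(logo), by(region)
    egen byte logo_max = max(logo), by(region)
    egen byte one_per_region = tag(region)   // marks one observation per region
    list region logo_min if one_per_region & logo_min == logo_max
    ```

    Singleton regions will always show up in that list, because a single observation cannot vary.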

    One approach that is sometimes used is estimation by penalized maximum likelihood, which reduces the bias from near-perfect predictors and tolerates perfect prediction. For logistic regressions, Joseph Coveney's -firthlogit- program, available from SSC, implements this. I am not aware of any penalized maximum likelihood estimation programs for probit models, however. If somebody else knows of one, it would be great if he or she chimes in.

    Assuming you want to stick with probit, the solution would be along the lines of consolidating categories for some of your constructs. For example, you might take a few countries that are rarely represented in the data set and combine them into a single "country" called "other." Or you might combine different regions that are similar in their viticultural aspects into a single super-region to get a region variable with fewer levels.
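
    As an illustration of pooling rarely observed categories, here is a minimal sketch. It assumes country is a numeric variable whose codes are all below 999; the cutoff of 10 observations and the names n_country and country_c are arbitrary choices made for the example, not recommendations.

    ```stata
    * Count how many wines each country contributes.
    bysort country: generate long n_country = _N

    * Copy the variable (with its value label, if any) and pool rare countries.
    clonevar country_c = country
    replace country_c = 999 if n_country < 10   // 999 = "other" (hypothetical code)

    * Check the consolidated distribution before using it in the model.
    tabulate country_c
    ```

    Combining regions on substantive grounds (similar viticulture) cannot be automated this way; that recoding has to be written out by hand.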

    One technical point that has nothing to do with the problem you are raising: it is not necessary, and is inefficient, to create separate indicator variables for every region, every country, etc. Modern Stata has factor-variable notation. Read -help fvvarlist- to learn the details. The essence is that you should just work with a variable like country and let it take on all the different values it needs to, and then refer to it in your regression model as i.country. Stata will then automatically expand that into the appropriate "virtual" indicator variables "on the fly." The advantages of this approach are: less coding to do and, accordingly, less opportunity to make a coding error; cleaner output from the regression command; the ability to use the -margins- command following your regression; and no wasting memory on variables that are not needed for other purposes.
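
    In command form, the factor-variable version of the model might look like the sketch below. All variable names are assumed for illustration (only logo and price appear in #1), and region_str is a hypothetical string variable included to show the -encode- step.

    ```stata
    * Factor variables must be numeric. If a control is stored as a string,
    * convert it first; -encode- assigns integer codes with value labels.
    encode region_str, generate(region)

    * One variable per construct; i. expands each into indicators on the fly.
    probit logo price i.region i.country i.colour i.store
    ```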


    • #3
      Thank you for your post!

      I have a question regarding this part:
      "Assuming you want to stick with probit, the solution would be along the lines of consolidating categories for some of your constructs. For example, you might take a few countries that are rarely represented in the data set and combine them into a single "country" called "other." Or you might combine different regions that are similar in their viticultural aspects into a single super-region to get a region variable with fewer levels."

      Is it also possible to combine the dummy variables in one variable? Instead of all dummies with either 0 or 1, I have one variable and give a different score to each region. Will this be treated as a categorical or an ordinal variable?

      I have run a normal regression and a probit analysis with these new variables (5 instead of 60 dummies).
      It seems to look good, however, I am not sure whether what I did was right.
      The results are comparable but different, how do I know which test to choose?
      Thank you very much!


      • #4
        "Is it also possible to combine the dummy variables in one variable? Instead of all dummies with either 0 or 1, I have one variable and give a different score to each region."
        Well, I don't know quite what you mean by this. It depends on how you did it. You would have to show the code that you actually used for anyone to comment on it.

        "Will this be treated as a categorical or an ordinal variable?"
        That depends on how you specify it in the regression command. If you put it in with an i. prefix it will be treated as categorical; otherwise it will be entered as a continuous variable.
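
        To make the distinction concrete, here is a short sketch. The names logo, price, and region are assumed from #1, and region is assumed to hold integer codes.

        ```stata
        * Categorical: one coefficient per region, each relative to the base level.
        probit logo price i.region

        * Continuous: a single slope, which treats the region codes as a numeric score.
        probit logo price region

        * Same categorical model with region 3 as the base level instead of the lowest code.
        probit logo price ib3.region
        ```

        The second form is rarely sensible for a nominal variable, because the numeric codes assigned to regions are arbitrary.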


        • #5
          Thank you for your reply.

          When I add the i. prefix to my variables, they are split up, excluding the first category.
          How should I interpret that?

          I still get many omitted and wrong results when I do that.
          I also combined some variables into dummies, but I got these results.
          I don't think this is right?


          Variable |     Coef.   Std. err.      z    P>|z|    [95% conf. interval]
          ---------+--------------------------------------------------------------
          Price    |  .3834793   .0796339    4.82   0.000     .2273998    .5395589
          CountryD |  .5206123   .2609226    2.00   0.046     .0092133    1.032011
          StoreD1  |  2.055988   .5564076    3.70   0.000     .9654491    3.146527
          StoreD2  |  1.948694    .572622    3.40   0.001      .826375    3.071012
          StoreD3  |  1.879534    .632948    2.97   0.003     .6389791     3.12009
          Store4D  |  1.777864   .4892852    3.63   0.000     .8188822    2.736845
          Region1D |  5.170055   442.0454    0.01   0.991    -861.2229     871.563
          Region2D |  5.764943   442.0453    0.01   0.990     -860.628    872.1579
          Region3D |  5.695093   442.0454    0.01   0.990    -860.6979    872.0881
          _cons    | -10.58196   442.0464   -0.02   0.981    -876.9769     855.813





          I am rethinking what test I should use and asked that question on the forum.
          Do you think Probit is suitable for this?


          • #6
            I also tried to put all the dummies into 4 corresponding variables and added the i. as you suggested.
            This sadly gave a hard to interpret result.
            What should I do with this?

            variable coef. stand. z P>Z 95% confidence int.


            Price | .1290839 .0286728 4.50 0.000 .0722779 .18589
            |
            Country |
            2 | .7095612 .2997779 2.37 0.020 .115647 1.303475
            3 | 1.390062 .435532 3.19 0.002 .5271951 2.25293
            4 | -.2932175 .3647162 -0.80 0.423 -1.015786 .4293511
            5 | 1.154412 .3662747 3.15 0.002 .4287559 1.880068
            6 | 1.17062 .4451609 2.63 0.010 .2886758 2.052564
            7 | -.3091207 .4211214 -0.73 0.464 -1.143438 .5251968
            8 | -.1798728 .4417877 -0.41 0.685 -1.055134 .6953882
            9 | .8414237 .2712686 3.10 0.002 .3039918 1.378856
            |
            Store |
            2 | -.676145 .1356227 -4.99 0.000 -.944838 -.407452
            3 | -.971072 .1925783 -5.04 0.000 -1.352604 -.5895397
            4 | .0445692 .2757808 0.16 0.872 -.5018022 .5909407
            5 | -.4492983 .144791 -3.10 0.002 -.7361554 -.1624412
            |
            Region |
            2 | .0950511 .5051462 0.19 0.851 -.9057345 1.095837
            3 | .142673 .3995301 0.36 0.722 -.6488681 .9342141
            4 | .8943229 .4223713 2.12 0.036 .0575292 1.731117
            5 | -.1235267 .4186381 -0.30 0.768 -.9529242 .7058709
            6 | .3231016 .3177786 1.02 0.311 -.3064751 .9526782
            7 | -.570655 .3456619 -1.65 0.102 -1.255473 .1141635
            8 | .4386691 .2738695 1.60 0.112 -.1039158 .981254
            9 | -1.083224 .370672 -2.92 0.004 -1.817592 -.3488558
            10 | -.0004485 .3283427 -0.00 0.999 -.6509546 .6500577
            11 | -.6205461 .3292407 -1.88 0.062 -1.272831 .0317391
            12 | -1.265114 .359373 -3.52 0.001 -1.977097 -.5531317
            13 | .5964726 .4146174 1.44 0.153 -.2249592 1.417904
            14 | .1648142 .4527849 0.36 0.717 -.7322344 1.061863
            15 | -.550053 .4526868 -1.22 0.227 -1.446907 .3468012
            16 | -.1795469 .1944605 -0.92 0.358 -.5648083 .2057144
            17 | .8024304 .4018983 2.00 0.048 .0061973 1.598663
            18 | -1.476118 .3988627 -3.70 0.000 -2.266337 -.6858994
            19 | .1428049 .3849614 0.37 0.711 -.6198731 .905483
            20 | 0 (omitted)
            21 | .6023019 .3728376 1.62 0.109 -.1363566 1.34096
            22 | -.6708039 .346181 -1.94 0.055 -1.356651 .0150431
            23 | -.5355648 .2432168 -2.20 0.030 -1.017421 -.0537085
            24 | 0 (omitted)
            25 | .337532 .372546 0.91 0.367 -.4005489 1.075613
            26 | .3193889 .4157226 0.77 0.444 -.5042325 1.14301
            27 | 0 (omitted)
            28 | .8552782 .4402566 1.94 0.055 -.0169495 1.727506
            29 | .3743754 .2750131 1.36 0.176 -.1704751 .919226
            30 | 0 (omitted)
            31 | -.7986468 .3529532 -2.26 0.026 -1.497911 -.0993829
            32 | -.1203267 .3217028 -0.37 0.709 -.757678 .5170245
            33 | .0055573 .4172562 0.01 0.989 -.8211025 .8322171
            34 | .4539465 .2534728 1.79 0.076 -.0482288 .9561218
            35 | -.3150139 .2104912 -1.50 0.137 -.732035 .1020072
            36 | 0 (omitted)
            37 | -.237372 .3516688 -0.67 0.501 -.9340913 .4593473
            38 | 0 (omitted)
            39 | 0 (omitted)
            40 | -.3872518 .4228578 -0.92 0.362 -1.225009 .4505058
            41 | -1.221289 .4415654 -2.77 0.007 -2.09611 -.3464687
            42 | 0 (omitted)
            43 | 0 (omitted)
            |
            1.ColourD | .0055573 .0518961 0.11 0.915 -.0972583 .1083728
            _cons | -.6604965 .3512111 -1.88 0.063 -1.356309 .0353161


            • #7
              The model in #6 looks best to me. It treats Country, Store, and Region as categorical variables--which is what they are.

              The omission of 9 of the region indicators is normal and expected: every region lies within a single country. Consequently, one region from each country will be omitted from the analysis. Think of it this way. If Country A contains regions 1, 2, 3, and 4, you do not need four region variables to identify all the regions: if you know that the region is in country A and that it is not region 2, 3, or 4, then it must be region 1. If identifying specific region effects (as opposed to just adjusting for their influence) is important to your research question, then you can accomplish that by leaving Country out of the model. In that case, all of the regions will be represented (except for one, the base category). Country effects would still be adjusted for in the model, because their information is carried in the region effects--but they are not separately identifiable. The bottom line is that you cannot simultaneously estimate country and region effects because they are confounded with each other. Choose one.
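
              Both points can be checked directly. This is a sketch only: the predictor names are taken from the output in #6, while the outcome name logo is assumed from #1.

              ```stata
              * Confirm that regions are nested in countries: within each region,
              * the first and last Country values (after sorting) must agree.
              bysort Region (Country): assert Country[1] == Country[_N]

              * Model with region effects only. Country is left out, so every
              * region except the base level gets its own coefficient.
              probit logo Price i.Store i.Region i.ColourD
              ```

              If the -assert- fails, some region appears under more than one country, which usually points to a coding problem worth fixing before modeling.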

              In #1, you said "The dependent variable is the presence of logos on wine labels, and the independent variable is price of the product (available in either ordinal or numeric)." This is not really a precise statement of a statistical hypothesis, but it suggests that what you want to do is estimate the association of price with the probability of having a logo on the wine label, adjusting for country, store, and region effects. The output in #6 is such a model. Probit coefficients are hard to interpret in understandable terms: they tell you how different along the normal ogive the probabilities are depending on price--and most people have difficulty wrapping their minds around that. But you can get a simpler interpretation by using the -margins- command. If, following the model in #6 (or the same model removing Country) you run
              Code:
              margins, dydx(Price)
              you will get the average marginal effect of price on the probability of having a logo. That is, you will get an average value for the difference in the probability of having a logo associated with a unit increase in price. Now, that is an average value, and it doesn't really apply to any particular observation in your data, but it might be considered a handy overall summary statistic. You might want to look at more specific marginal effects corresponding to particular configurations of region, store and price. The -margins- command has plenty of machinery for handling those situations. It is, however, a complicated command. So I refer you to the excellent Richard Williams' uncommonly lucid explanation of how it works: https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. (The examples there do not include probit regressions, but they are handled exactly in the same way.)
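
              For the more specific marginal effects mentioned above, a sketch along these lines may help. The outcome name logo is assumed, and the price values 5, 10, and 20 are hypothetical; substitute values that are meaningful in your data.

              ```stata
              * Refit the model from #6 so that -margins- has estimates to work with.
              probit logo Price i.Country i.Store i.Region i.ColourD

              * Predicted probability of a logo for each store at three price points,
              * averaging over the observed values of the other predictors.
              margins Store, at(Price = (5 10 20))

              * Graph the results of the most recent -margins- command.
              marginsplot
              ```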

              As for whether to use probit, I think this is up to you. Nothing you've said so far gives any reason to think that probit isn't as good as any other model you might use for this kind of data. Have you looked yet at how the model fits the data? A bad fit would be a reason to change models. But usually the most effective way of dealing with bad fit here would be to change the specification of the variables in the model. Probit modeling is pretty flexible and can accommodate many types of relationships if you specify the predictors appropriately. There are, of course, other models that can be used with dichotomous outcomes. Switching to logistic is unlikely to prove helpful. The logistic and normal distributions are, except for a scale factor, almost the same shape, so both models tend to give similar predictions and p-values, and even the coefficients tend to be similar except for a scale factor of roughly pi/sqrt(3) (about 1.8). The logistic model has the advantage that its results can be interpreted as odds ratios, which are more intuitive than probits. That's probably the reason I use logistic often, and probit pretty seldom. But it's more of a cosmetic than a scientific reason. Linear probability models are rather different, but again the results tend to be similar unless you are dealing with observations for which the predicted probabilities are close to 0 or 1.
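
              If you want to see the similarity for yourself, a comparison could be sketched as follows. The outcome name logo and the stored-estimate names m_probit and m_logit are assumptions made for the example.

              ```stata
              * Fit both models with the same predictors and store the results.
              probit logo Price i.Store i.Region
              estimates store m_probit
              logit logo Price i.Store i.Region
              estimates store m_logit

              * Put the Price coefficients and standard errors side by side.
              estimates table m_probit m_logit, keep(Price) b(%9.4f) se(%9.4f)

              * The approximate ratio to expect between logit and probit coefficients.
              display _pi/sqrt(3)
              ```

              The ratio of the two Price coefficients will only be roughly equal to that value; it is a rule of thumb, not an identity.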

              All of which is a lengthy way of saying that in terms of the problems you have been talking about in this thread, switching from probit to some other model isn't going to help at all. You will encounter them all over again: they have to do with variable specification and have nothing to do with the choice of probit per se. There might be good reasons not to use probit here, but you haven't touched on any.


              • #8
                Maarten has started a second discussion about this topic, which overlaps with Clyde's excellent advice.

                https://www.statalist.org/forums/for...test-to-choose


                • #9
                  You wrote: "it suggests that what you want to do is estimate the association of price with the probability of having a logo on the wine label, adjusting for country, store, and region effects."
                  That is indeed a very good summary.
                  The tip on marginal effects is really good, I'll look into that.


                  • #10
                    I am trying to run oprobit for an analysis in which the dependent variable is ordered and the independent variables are a mix of ordered, categorical, and continuous types. How can I specify the model in Stata? Since there are categorical and ordered independent variables, do I have to construct dummies for them? I am confused between creating dummies and using 'i.'. Is there a difference between the two?


                    • #11
                      Factor-variable notation (which includes i. prefixes as well as other things) almost completely eliminates the need to create your own indicator ("dummy") variables in Stata. Conceptually, when you use factor variable notation, Stata creates, internally, an equivalent to a set of indicator variables "on the fly". There are several reasons, however, to prefer factor variable notation to creating your own indicators:

                      1. If your data set has a large number of variables, or if the number of indicators that would be created is large, the data set could become difficult to work with, or the operation could even fail by exceeding the maximum number of allowed variables. Factor-variable notation does not create actual variables, so the limits on number of variables don't apply.

                      2. Creating your own indicators, particularly if a variable has a large number of categories, is tedious and error prone. The simple use of i. prefixing is foolproof and requires only 2 keystrokes!

                      3. If you create your own indicators, they will not be handled correctly by the -margins- command. Now, since you're even raising the question, I'm guessing you have little or no experience with -margins-. But the -margins- command is one of the most important additions to Stata in the past several years, and factor-variable notation was designed specifically to work with it. -margins- greatly simplifies and improves the interpretation of regression results. And particularly for models like -oprobit- whose direct results are difficult to correctly interpret, you will need -margins- to make sense of your results.

                      That said, there are still a few dusty corners of Stata, little-used ancient commands that do not support factor-variable notation. For those things, you would need to create your own indicators. But nearly all of those commands' functionality can be accomplished with newer commands that do work with factor-variable notation. So I would relegate the practice of creating your own indicator variables to some remote corner of your mind--you might actually need it some day, but only in some pretty exotic circumstances. Go with factor-variable notation.
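
                      Applied to the question in #10, a specification might look like this sketch. The variable names outcome, occupation, education, and income are hypothetical stand-ins for an ordered outcome, a nominal predictor, an ordered predictor, and a continuous predictor.

                      ```stata
                      * Ordered outcome; nominal and ordered predictors entered as factor
                      * variables, continuous predictor marked with c. (the default).
                      oprobit outcome i.occupation i.education c.income

                      * Alternative: treat the ordered predictor as a linear score instead.
                      oprobit outcome i.occupation c.education c.income
                      ```

                      The first form makes no assumption about the spacing between levels of the ordered predictor; the second assumes each step has the same effect.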


                      • #12
                        [Image: oprobit.png]


                        I am trying to run an oprobit model. The dependent variable var1 has 3 categories, the independent variable health has 4 categories, and age is continuous. I have a conceptual question regarding oprobit. Is the interpretation of oprobit estimates based on reference or base outcomes? What do the coefficients in the above output mean, considering that the dependent variable has 3 categories?
                        Last edited by Anand Sunny; 13 Feb 2022, 03:11.


                        • #13
                          The coefficients of -oprobit-, like the coefficients of a -probit- model are difficult to interpret or explain. Because of the multiple levels of the outcome, they are even harder to explain.

                          For concreteness, let's call the three levels of your outcome variable low, medium, and high. An assumption of the oprobit model is that the effect of any variable on a low vs (medium or high) model is the same as the effect of that variable on a (low or medium) vs high model. That is, each coefficient can be seen as the effect of its variable on the distinction between being at or below a given level and being above that level.
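
                          Written out, that assumption looks like the following, using the parameterization documented for Stata's -oprobit-. The coding of low, medium, and high as 1, 2, and 3 and the symbols are my own notation for this sketch: \kappa_1 < \kappa_2 are the cutpoints reported as /cut1 and /cut2, and \Phi is the standard normal distribution function.

                          ```latex
                          \Pr(y \le j \mid x) = \Phi(\kappa_j - x\beta), \qquad j = 1, 2
                          % which implies, for the three outcome levels:
                          \Pr(y = 1 \mid x) = \Phi(\kappa_1 - x\beta)
                          \Pr(y = 2 \mid x) = \Phi(\kappa_2 - x\beta) - \Phi(\kappa_1 - x\beta)
                          \Pr(y = 3 \mid x) = 1 - \Phi(\kappa_2 - x\beta)
                          ```

                          The same \beta appears at both cutpoints; only \kappa_j changes. A positive coefficient therefore shifts probability toward the higher levels at every split.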

                          In -logit- or -ologit- models this can be made a little less confusing, because when you exponentiate the coefficients these become odds ratios. In the probit model we are talking about sliding along a normal curve, and there are no simple phrases to capture that part of it. But if you are comfortable with the ordinary probit model, this is the same thing applied to distinctions of at or below vs above any of the outcome levels.


                          • #14
                            Thank you for the reply. Can we interpret oprobit using marginal effects, similar to the logit case?
                            Code:
                            foreach i in 1 2 3 {
                            margins Health, predict(outcome(`i'))
                            }
                            Code:
                            foreach i in 1 2 3 {
                            margins, predict(outcome(`i')) dydx(Health)
                            }
                            Can I use this code to get margins and average marginal effects?
                            Can you suggest the modifications necessary in the code when I include more independent variables in the model?


                            • #15
                              Unless you have an old version of Stata, you can do all the outcomes at once rather than one by one.
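
                              As a sketch of what that looks like, assuming Stata 14 or later and the variable names var1, Health, and age from #12 and #14:

                              ```stata
                              * Health must enter as a factor variable for -margins Health- to work.
                              oprobit var1 i.Health age

                              * Predicted probability of each outcome level, by Health category.
                              * No loop is needed: all outcomes are reported in one table.
                              margins Health

                              * Average marginal effects on the probability of each outcome level.
                              * Additional predictors are simply added to the dydx() list.
                              margins, dydx(Health age)
                              ```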

                              This handout may be useful.

                              https://www3.nd.edu/~rwilliam/xsoc73994/Margins05.pdf
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              StataNow Version: 18.5 MP (2 processor)

                              WWW: https://www3.nd.edu/~rwilliam
