How can I use control variables(dummy) in probit?

Maarten Vander

Join Date: Dec 2018

Posts: 14
#1

19 Dec 2018, 07:56

Hello,

I am working on my thesis, and my professor explained that a probit analysis is best for my dataset.
The dependent variable is the presence of logos on wine labels, and the independent variable is the price of the product (available in either ordinal or numeric form).

I have 4 control variables (region, country, white/red, store) and transformed them into dummies (60 variables).
Some dummy variables have a very low occurrence, sometimes only one row.
E.g. an unknown wine region occurs only once and has only one corresponding wine.

When I run:
probit logo price (+60 dummy control variables)
many of the dummies are omitted or reported as predicting the outcome perfectly, which I suspect is partly due to the low N of those dummies.

The goal of these control variables is to exclude other explanatory influences.

Is this the right test/setting to analyse this?

Looking forward to your replies!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

19 Dec 2018, 09:44

When you have cut the data into many small classes, some of which are even singletons, the likelihood that some of these indicators ("dummies") will end up being perfect predictors is high. Probit regressions are estimated by maximum likelihood (as are logistic regressions), and the maximum likelihood estimate of the coefficient of a perfect predictor is infinite (positive or negative). So such an estimation cannot converge. Stata's solution to this dilemma is to first detect perfect predictors and then remove them, along with the observations they perfectly predict, from the model, while informing you of the problem so you know that your model does not apply to those observations and variables. Note that even without perfect prediction, in this same circumstance you are at appreciable risk of having other variables that have lopsided associations with the outcome so that they are "nearly" perfect predictors. In such a situation, the maximum likelihood estimates will be very large (positive or negative) and are known to be biased upward (in magnitude). Stata does not take any action in this situation, nor warn you of it. You use these models at your own risk and it is your responsibility to be on the alert for such things. When you see very large coefficients in the output of a logistic or probit regression, they should be investigated. (In the real world, coefficients of dichotomous predictors in logistic regressions greater than 4 in magnitude should be considered suspicious. I would similarly be suspicious of any probit regression result with a coefficient greater than about 2.25 in magnitude. Associations that strong are very uncommon in real world phenomena.)
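
A quick way to see which categories are behind the perfect-prediction messages is to cross-tabulate each control against the outcome. This is only a sketch: it assumes the outcome is a 0/1 variable named logo and that the controls are stored as single numeric categorical variables named region, country, and store (hypothetical names, not taken from the actual dataset). The cutoff of 5 is arbitrary.

```stata
* Any row with a zero in one of the two logo columns is a category
* that predicts the outcome perfectly in this sample.
tabulate region logo
tabulate country logo
tabulate store logo

* Count observations per region to find singletons and other tiny groups.
* n_region is a helper variable introduced for this illustration.
bysort region: generate n_region = _N
tabulate region if n_region < 5
```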

One approach that is sometimes used is estimation by penalized maximum likelihood, which reduces the bias from near-perfect predictors and tolerates perfect prediction. For logistic regressions, Joseph Coveney's -firthlogit- program, available from SSC, implements this. I am not aware of any penalized maximum likelihood estimation programs for probit models, however. If somebody else knows of one, it would be great if he or she chimes in.

Assuming you want to stick with probit, the solution would be along the lines of consolidating categories for some of your constructs. For example, you might take a few countries that are rarely represented in the data set and combine them into a single "country" called "other." Or you might combine different regions that are similar in their viticultural aspects into a single super-region to get a region variable with fewer levels.
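
Pooling rare categories could look something like the following. It is a sketch under assumptions: country is taken to be a numeric categorical variable (hypothetical name), 10 is an arbitrary minimum group size, and 999 is assumed to be a code not already in use.

```stata
* Size of each country group, stored on every observation.
bysort country: generate n_country = _N

* Copy country (including any value labels), then pool the small groups.
clonevar country2 = country
replace country2 = 999 if n_country < 10    // 999 = pooled "other" (assumed unused)

* Check the result: every remaining level should have a workable count.
tabulate country2
```

Which categories to pool, and at what threshold, is a substantive decision about the wines rather than a purely statistical one.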

One technical point that has nothing to do with the problem you are raising: it is not necessary, and is inefficient, to create separate indicator variables for every region, every country, etc. Modern Stata has factor-variable notation. Read -help fvvarlist- to learn the details. The essence is that you should just work with a variable like country and let it take on all the different values it needs to, and then refer to it in your regression model as i.country. Stata will then automatically expand that into the appropriate "virtual" indicator variables "on the fly." The advantages of this approach are: less coding to do and, accordingly, less opportunity to make a coding error; cleaner output from the regression command; the ability to use the -margins- command following your regression; and no memory wasted on variables that are not needed for other purposes.
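
For illustration, a model with all four controls in factor-variable notation might be written as below. The variable names are assumptions (one numeric categorical variable per control), not the names in the actual dataset.

```stata
* Each i. term expands into indicators for every level except the base level.
probit logo price i.region i.country i.colour i.store
```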
Maarten Vander

Join Date: Dec 2018

Posts: 14
#3

20 Dec 2018, 08:08

Thank you for your post!

I have a question regarding this part
"Assuming you want to stick with probit, the solution would be along the lines of consolidating categories for some of your constructs. For example, you might take a few countries that are rarely represented in the data set and combine them into a single "country" called "other." Or you might combine different regions that are similar in their viticultural aspects into a single super-region to get a region variable with fewer levels."

Is it also possible to combine the dummy variables into one variable? Instead of all dummies with either 0 or 1, I have one variable and give a different score to each region. Will this be treated as a categorical or an ordinal variable?

I have run a normal regression and a probit analysis with these new variables (5 instead of 60 dummies).
It seems to look good; however, I am not sure whether what I did was right.
The results are comparable but different. How do I know which test to choose?
Thank you very much!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

20 Dec 2018, 09:08

"Is it also possible to combine the dummy variables in one variable? Instead of all dummies with either 0 or 1, I have one variable and give a different score to each region."

Well, I don't know quite what you mean by this. It depends on how you did it. You would have to show the code that you actually used for anyone to comment on it.

"Will this be treated as a categorical or an ordinal variable?"

That depends on how you specify it in the regression command. If you put it in with an i. prefix it will be treated as categorical; otherwise it will be treated as continuous.
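
To make the distinction concrete, here is a minimal sketch, assuming a 0/1 outcome named logo and a numeric variable named region holding the region codes (both names are assumptions):

```stata
* Treated as categorical: one coefficient per level, relative to the base level.
probit logo price i.region

* Treated as a continuous score: a single slope, which assumes the numeric
* codes are meaningful and equally spaced.
probit logo price region
```

For arbitrary region codes, the second form is rarely what one wants.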
Maarten Vander

Join Date: Dec 2018

Posts: 14
#5

20 Dec 2018, 10:10

Thank you for your reply.

When I add the i. to my variables, each one is split into its categories, with the first category excluded.
How should I interpret that?

I still get many omitted and wrong results when I do that.
I also combined some variables into dummies, but I got these results.
I don't think this is right?

         |      Coef.   Std. Err.       z    P>|z|  [95% Conf.   Interval]
---------+-----------------------------------------------------------------
   Price |   .3834793    .0796339    4.82    0.000    .2273998    .5395589
CountryD |   .5206123    .2609226    2.00    0.046    .0092133    1.032011
 StoreD1 |   2.055988    .5564076    3.70    0.000    .9654491    3.146527
 StoreD2 |   1.948694     .572622    3.40    0.001     .826375    3.071012
 StoreD3 |   1.879534     .632948    2.97    0.003    .6389791     3.12009
 Store4D |   1.777864    .4892852    3.63    0.000    .8188822    2.736845
Region1D |   5.170055    442.0454    0.01    0.991   -861.2229     871.563
Region2D |   5.764943    442.0453    0.01    0.990    -860.628    872.1579
Region3D |   5.695093    442.0454    0.01    0.990   -860.6979    872.0881
   _cons |  -10.58196    442.0464   -0.02    0.981   -876.9769     855.813

I am rethinking which test I should use and have also asked that question on the forum.
Do you think probit is suitable for this?
Maarten Vander

Join Date: Dec 2018

Posts: 14
#6

20 Dec 2018, 10:40

I also tried combining all the dummies into 4 corresponding variables and added the i. prefix as you suggested.
Sadly, this gave a result that is hard to interpret.
What should I do with this?

Variable | Coef. Std. Err. z P>|z| [95% Conf. Interval]

Price | .1290839 .0286728 4.50 0.000 .0722779 .18589
|
Country |
2 | .7095612 .2997779 2.37 0.020 .115647 1.303475
3 | 1.390062 .435532 3.19 0.002 .5271951 2.25293
4 | -.2932175 .3647162 -0.80 0.423 -1.015786 .4293511
5 | 1.154412 .3662747 3.15 0.002 .4287559 1.880068
6 | 1.17062 .4451609 2.63 0.010 .2886758 2.052564
7 | -.3091207 .4211214 -0.73 0.464 -1.143438 .5251968
8 | -.1798728 .4417877 -0.41 0.685 -1.055134 .6953882
9 | .8414237 .2712686 3.10 0.002 .3039918 1.378856
|
Store |
2 | -.676145 .1356227 -4.99 0.000 -.944838 -.407452
3 | -.971072 .1925783 -5.04 0.000 -1.352604 -.5895397
4 | .0445692 .2757808 0.16 0.872 -.5018022 .5909407
5 | -.4492983 .144791 -3.10 0.002 -.7361554 -.1624412
|
Region |
2 | .0950511 .5051462 0.19 0.851 -.9057345 1.095837
3 | .142673 .3995301 0.36 0.722 -.6488681 .9342141
4 | .8943229 .4223713 2.12 0.036 .0575292 1.731117
5 | -.1235267 .4186381 -0.30 0.768 -.9529242 .7058709
6 | .3231016 .3177786 1.02 0.311 -.3064751 .9526782
7 | -.570655 .3456619 -1.65 0.102 -1.255473 .1141635
8 | .4386691 .2738695 1.60 0.112 -.1039158 .981254
9 | -1.083224 .370672 -2.92 0.004 -1.817592 -.3488558
10 | -.0004485 .3283427 -0.00 0.999 -.6509546 .6500577
11 | -.6205461 .3292407 -1.88 0.062 -1.272831 .0317391
12 | -1.265114 .359373 -3.52 0.001 -1.977097 -.5531317
13 | .5964726 .4146174 1.44 0.153 -.2249592 1.417904
14 | .1648142 .4527849 0.36 0.717 -.7322344 1.061863
15 | -.550053 .4526868 -1.22 0.227 -1.446907 .3468012
16 | -.1795469 .1944605 -0.92 0.358 -.5648083 .2057144
17 | .8024304 .4018983 2.00 0.048 .0061973 1.598663
18 | -1.476118 .3988627 -3.70 0.000 -2.266337 -.6858994
19 | .1428049 .3849614 0.37 0.711 -.6198731 .905483
20 | 0 (omitted)
21 | .6023019 .3728376 1.62 0.109 -.1363566 1.34096
22 | -.6708039 .346181 -1.94 0.055 -1.356651 .0150431
23 | -.5355648 .2432168 -2.20 0.030 -1.017421 -.0537085
24 | 0 (omitted)
25 | .337532 .372546 0.91 0.367 -.4005489 1.075613
26 | .3193889 .4157226 0.77 0.444 -.5042325 1.14301
27 | 0 (omitted)
28 | .8552782 .4402566 1.94 0.055 -.0169495 1.727506
29 | .3743754 .2750131 1.36 0.176 -.1704751 .919226
30 | 0 (omitted)
31 | -.7986468 .3529532 -2.26 0.026 -1.497911 -.0993829
32 | -.1203267 .3217028 -0.37 0.709 -.757678 .5170245
33 | .0055573 .4172562 0.01 0.989 -.8211025 .8322171
34 | .4539465 .2534728 1.79 0.076 -.0482288 .9561218
35 | -.3150139 .2104912 -1.50 0.137 -.732035 .1020072
36 | 0 (omitted)
37 | -.237372 .3516688 -0.67 0.501 -.9340913 .4593473
38 | 0 (omitted)
39 | 0 (omitted)
40 | -.3872518 .4228578 -0.92 0.362 -1.225009 .4505058
41 | -1.221289 .4415654 -2.77 0.007 -2.09611 -.3464687
42 | 0 (omitted)
43 | 0 (omitted)
|
1.ColourD | .0055573 .0518961 0.11 0.915 -.0972583 .1083728
_cons | -.6604965 .3512111 -1.88 0.063 -1.356309 .0353161
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

20 Dec 2018, 11:33

The model in #6 looks best to me. It treats Country, Store, and Region as categorical variables--which is what they are.

The omission of 9 of the regions is normal and expected: every region lies within a single country. Consequently, one region from each country will be omitted from the analysis. Think of it this way. If Country A contains regions 1, 2, 3, and 4, you do not need four region variables to identify all the regions: if you know that the region is in country A and that it is not region 2, 3, or 4, then it must be region 1. If identifying specific region effects (as opposed to just adjusting for their influence) is important to your research question, then you can accomplish that by leaving Country out of the model. In that case, all of the regions will be represented (except for the base region, 1). Country effects would still be adjusted for in the model, because their information is carried in the region effects, but they are not separately identifiable. The bottom line is that you cannot simultaneously estimate country and region effects because they are confounded with each other. Choose one.
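
The two alternatives could be written as follows. This is a sketch that reuses the variable names shown in the output in #6 and assumes the outcome variable is called logo, as in #1:

```stata
* Region only: country differences are absorbed by the region indicators.
probit logo Price i.Store i.Region i.ColourD

* Country only: a coarser adjustment, but far fewer parameters.
probit logo Price i.Store i.Country i.ColourD
```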

In #1, you said "The dependent variable is the presence of logos on wine labels, and the independent variable is price of the product (available in either ordinal or numeric)." This is not really a precise statement of a statistical hypothesis, but it suggests that what you want to do is estimate the association of price with the probability of having a logo on the wine label, adjusting for country, store, and region effects. The output in #6 is such a model. Probit coefficients are hard to interpret in understandable terms: they tell you how different along the normal ogive the probabilities are depending on price, and most people have difficulty wrapping their minds around that. But you can get a simpler interpretation by using the -margins- command. If, following the model in #6 (or the same model removing Country) you run

Code:

margins, dydx(Price)

you will get the average marginal effect of price on the probability of having a logo. That is, you will get an average value for the difference in the probability of having a logo associated with a unit increase in price. Now, that is an average value, and it doesn't really apply to any particular observation in your data, but it might be considered a handy overall summary statistic. You might want to look at more specific marginal effects corresponding to particular configurations of region, store and price. The -margins- command has plenty of machinery for handling those situations. It is, however, a complicated command. So I refer you to the excellent Richard Williams' uncommonly lucid explanation of how it works: https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. (The examples there do not include probit regressions, but they are handled exactly in the same way.)
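
As a rough sketch of what those more specific requests can look like, the commands below follow the version of the model without Country. The outcome name logo and the price values 5, 10, and 20 are placeholders I have made up, so substitute values that make sense on your price scale.

```stata
probit logo Price i.Store i.Region i.ColourD

margins, dydx(Price)            // average marginal effect of price
margins Store                   // average predicted probability of a logo, by store
margins, at(Price=(5 10 20))    // average predicted probability at chosen prices
```

Following any of the last two with -marginsplot- gives a graph of the predicted probabilities.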

As for whether to use probit, I think this is up to you. Nothing you've said so far gives any reason to think that probit isn't as good as any other model you might use for this kind of data. Have you looked yet at how the model fits the data? A bad fit would be a reason to change models. But usually the most effective way of dealing with bad fit here would be to change the specification of the variables in the model. Probit modeling is pretty flexible and can accommodate many types of relationships if you specify the predictors appropriately. There are, of course, other models that can be used with dichotomous outcomes. Switching to logistic is unlikely to prove helpful. The logistic and normal distributions are, except for a scale factor, almost the same thing, so both models tend to give similar predictions and p-values, and even the coefficients tend to be similar except for a scale factor of roughly pi/sqrt(3), which is about 1.8. The logistic model has the advantage that its results can be interpreted as odds ratios, which are more intuitive than probits. That's probably the reason I use logistic often, and probit pretty seldom. But it's more of a cosmetic than a scientific reason. Linear probability models are rather different, but again the results tend to be similar unless you are dealing with observations for which the predicted probabilities are close to 0 or 1.
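
If you want to check fit and see the probit/logit similarity for yourself, something along these lines should work. It is a sketch only: logo is an assumed outcome name, and m_probit and m_logit are arbitrary names for the stored estimates.

```stata
probit logo Price i.Store i.Region i.ColourD
estat classification            // share of observations classified correctly
estat gof, group(10)            // Hosmer-Lemeshow goodness-of-fit test
estimates store m_probit

logit logo Price i.Store i.Region i.ColourD
estimates store m_logit

* Side-by-side coefficients: the logit ones are typically larger
* by a roughly constant factor.
estimates table m_probit m_logit, b(%9.3f) se
display _pi/sqrt(3)             // ratio of the logistic to the normal standard deviation
```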

All of which is a lengthy way of saying that in terms of the problems you have been talking about in this thread, switching from probit to some other model isn't going to help at all. You will encounter them all over again: they have to do with variable specification and have nothing to do with the choice of probit per se. There might be good reasons not to use probit here, but you haven't touched on any.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

20 Dec 2018, 11:41

Maarten has started a second discussion about this topic, which overlaps with Clyde's excellent advice.

https://www.statalist.org/forums/for...test-to-choose
Maarten Vander

Join Date: Dec 2018

Posts: 14
#9

21 Dec 2018, 08:15

You wrote: "it suggests that what you want to do is estimate the association of price with the probability of having a logo on the wine label, adjusting for country, store, and region effects."
That is indeed a very good summary.
The tip on marginal effects is really good; I'll look into that.
Anand Sunny

Join Date: Feb 2021

Posts: 10
#10

13 Jan 2022, 08:08

I am trying to run oprobit for an analysis in which the dependent variable is ordered and the independent variables are a mix of ordered, categorical, and continuous types. How can I specify the model in Stata? Since there are categorical and ordered independent variables, do I have to construct dummies for them? I am confused between creating dummies and using 'i.'. Is there a difference between the two?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

13 Jan 2022, 09:49

Factor-variable notation (which includes i. prefixes as well as other things) almost completely eliminates the need to create your own indicator ("dummy") variables in Stata. Conceptually, when you use factor variable notation, Stata creates, internally, an equivalent to a set of indicator variables "on the fly". There are several reasons, however, to prefer factor variable notation to creating your own indicators:

1. If your data set has a large number of variables, or if the number of indicators that would be created is large, the data set could become difficult to work with, or the operation could even fail by exceeding the maximum number of allowed variables. Factor-variable notation does not create actual variables, so the limits on number of variables don't apply.

2. Creating your own indicators, particularly if a variable has a large number of categories, is tedious and error prone. The simple use of i. prefixing is foolproof and requires only 2 keystrokes!

3. If you create your own indicators, they will not be handled correctly by the -margins- command. Now, since you're even raising the question, I'm guessing you have little or no experience with -margins-. But the -margins- command is one of the most important additions to Stata in the past several years, and factor-variable notation was designed specifically to work with it. -margins- greatly simplifies and improves the interpretation of regression results. And particularly for models like -oprobit- whose direct results are difficult to correctly interpret, you will need -margins- to make sense of your results.

That said, there are still a few dusty corners of Stata, little-used ancient commands that do not support factor-variable notation. For those things, you would need to create your own indicators. But nearly all of those commands' functionality can be accomplished with newer commands that do work with factor-variable notation. So I would relegate the practice of creating your own indicator variables to some remote corner of your mind--you might actually need it some day, but only in some pretty exotic circumstances. Go with factor-variable notation.
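
For a model like the one you describe, the specification might look like this. All variable names here are made up for illustration: y is the ordered outcome, occupation a nominal predictor, education an ordered predictor, and income a continuous one.

```stata
* i. marks categorical predictors (nominal or ordered);
* c. marks continuous ones and is optional in a main effect.
oprobit y i.occupation i.education c.income
```

Entering the ordered predictor with i. makes no assumption about the spacing between its levels.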
Anand Sunny

Join Date: Feb 2021

Posts: 10
#12

13 Feb 2022, 01:52

I am trying to run an oprobit model. The dependent variable var1 has 3 categories, the independent variable health has 4 categories, and age is continuous. I have a conceptual question regarding oprobit: is the interpretation of oprobit estimates based on reference or base outcomes? What do the coefficients in the above output mean, considering that the dependent variable has 3 categories?

Last edited by Anand Sunny; 13 Feb 2022, 02:11.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#13

13 Feb 2022, 10:50

The coefficients of -oprobit-, like the coefficients of a -probit- model, are difficult to interpret or explain. Because of the multiple levels of the outcome, they are even harder to explain.

For concreteness, let's call the three levels of your outcome variable low, medium, and high. An assumption of the oprobit model is that the effect of any variable on a low vs (medium or high) model is the same as the effect of that variable on a (low or medium) vs high model. That is, each coefficient can be seen as the effect of its variable on at or below a given level vs above that level.

In -logit- or -ologit- models this can be made a little less confusing, because when you exponentiate the coefficients these become odds ratios. In the probit model we are talking about sliding along a normal curve, and there are no simple phrases to capture that part of it. But if you are comfortable with the ordinary probit model, this is the same thing applied to distinctions of at or below vs above any of the outcome levels.
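
Written out for the three-level case, and using the usual textbook notation rather than anything from your output (Φ is the standard normal CDF, β the coefficient vector, and κ1 < κ2 the two cutpoints that Stata reports as /cut1 and /cut2), the model is:

```latex
\begin{aligned}
\Pr(y \le \text{low} \mid x)    &= \Phi(\kappa_1 - x\beta),\\
\Pr(y \le \text{medium} \mid x) &= \Phi(\kappa_2 - x\beta),\\
\Pr(y = \text{medium} \mid x)   &= \Phi(\kappa_2 - x\beta) - \Phi(\kappa_1 - x\beta).
\end{aligned}
```

The same β appears in both cumulative probabilities; only the cutpoint differs. A positive coefficient lowers the probability of being at or below any given level, which means it shifts probability toward the higher categories.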
Anand Sunny

Join Date: Feb 2021

Posts: 10
#14

15 Feb 2022, 00:37

Thank you for the reply. Can we interpret oprobit using marginal effects, similar to the logit case?

Code:

foreach i in 1 2 3 {
    margins Health, predict(outcome(`i'))
}

Code:

foreach i in 1 2 3 {
    margins, predict(outcome(`i')) dydx(Health)
}

Can I use this code to obtain predictive margins and average marginal effects?
Can you suggest the modifications necessary in the code when I include more independent variables in the model?
Richard Williams

Join Date: Apr 2014

Posts: 4994
#15

15 Feb 2022, 05:36

Unless you have an old version of Stata, you can do all the outcomes at once rather than one by one.
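
As far as I know this applies from Stata 14 onward, where -margins- after -oprobit- reports every outcome probability by default. A minimal sketch, using the variable names from #12 and #14 (which I am assuming match your data, including the capitalization of Health):

```stata
oprobit var1 i.Health c.age

margins Health            // predicted probability of each outcome, by level of Health
margins, dydx(Health)     // average marginal effects of Health on each outcome probability
```

Adding more independent variables to the -oprobit- command does not require changing the -margins- commands; the other variables are simply averaged over as observed.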

This handout may be useful.

https://www3.nd.edu/~rwilliam/xsoc73994/Margins05.pdf

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
