Probit with variable that predicts failure perfectly

Wilson Lee

Join Date: Mar 2017

Posts: 3
#1


04 Mar 2017, 20:46

Dear Statalist,

I have a dataset with similar properties as the following:

TREAT GENDER AGE
1 1 32
1 1 43
1 1 23
1 1 55
1 1 33
1 1 23
1 1 56
0 1 34
0 2 54
0 2 40

I am running a probit with TREAT as the dependent variable. In this case, GENDER can take on 2 values - 1 or 2, but all the obs with GENDER=2 are untreated.

I tried running a probit followed by a predict

Code:

probit TREAT i.GENDER AGE
predict double score
summ score

However, no score is generated; the log is appended below:

note: 1.GENDER != 1 predicts failure perfectly
1.GENDER dropped and 2 obs not used

note: 2.GENDER omitted because of collinearity
Iteration 0: log likelihood = -3.0141613
Iteration 1: log likelihood = -2.9598644
Iteration 2: log likelihood = -2.9592964
Iteration 3: log likelihood = -2.9592962

Probit regression                               Number of obs     =          8
                                                LR chi2(1)        =       0.11
                                                Prob > chi2       =     0.7405
Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

------------------------------------------------------------------------------
       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    2.GENDER |          0  (empty)
         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
------------------------------------------------------------------------------

. predict double score
(option pr assumed; Pr(TREAT))
(10 missing values generated)

. summ score

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       score |          0

By changing the reference category, the score can be generated for observations with GENDER = 1:

Code:

probit TREAT ib2.GENDER AGE

Does anyone know why changing the base causes the output to differ in this case, and what the appropriate approach is for a dataset like this?

Thanks in advance.

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

05 Mar 2017, 05:59

Hello Wilson,

Welcome to the Stata Forum.

The answer is straightforward, since you gave it to us yourself.

In the first example, GENDER = 2 predicts the outcome perfectly (here "failure", that is, TREAT = 0).

In the second example, you changed the reference level; GENDER = 1 does not predict the outcome perfectly.

As a matter of fact, Stata already told you that, for we may read in your output

(please share output within CODE delimiters, as recommended in the FAQ):

Code:

note: 1.GENDER != 1 predicts failure perfectly

In short, everybody whose GENDER is 2 "failed", that is, was untreated.
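A quick cross-tabulation shows the same thing without fitting a model. This is only a sketch, assuming the ten observations posted in #1 are in memory; any level of GENDER whose row contains an empty cell predicts the outcome perfectly.

Code:

```stata
* Cross-tabulate the suspect predictor against the outcome.
* An empty cell in a row means that level of GENDER predicts TREAT perfectly.
tabulate GENDER TREAT

* Count treated observations among GENDER == 2.
* With the data posted in #1 this count is 0, which is what triggers the note.
count if GENDER == 2 & TREAT == 1
```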

Hopefully that helps.

Last edited by Marcos Almeida; 05 Mar 2017, 06:03.

Best regards,

Marcos

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17710
#3

05 Mar 2017, 07:08

Wilson:
as an aside to Marcos' helpful explanation, I can't follow your last statement,

"By changing the reference category, the score can be generated for observations with GENDER = 1"

since it would seem that the story remains the same regardless of the reference category:

Code:

. input TREAT GENDER AGE

         TREAT     GENDER        AGE
  1. 1 1 32
  2. 1 1 43
  3. 1 1 23
  4. 1 1 55
  5. 1 1 33
  6. 1 1 23
  7. 1 1 56
  8. 0 1 34
  9. 0 2 54
 10. 0 2 40
 11. end

. probit TREAT i.GENDER AGE

note: 1.GENDER != 1 predicts failure perfectly
      1.GENDER dropped and 2 obs not used

note: 2.GENDER omitted because of collinearity
Iteration 0:   log likelihood = -3.0141613 
Iteration 1:   log likelihood = -2.9598644 
Iteration 2:   log likelihood = -2.9592964 
Iteration 3:   log likelihood = -2.9592962 

Probit regression                               Number of obs     =          8
                                                LR chi2(1)        =       0.11
                                                Prob > chi2       =     0.7405
Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

------------------------------------------------------------------------------
       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    2.GENDER |          0  (empty)
         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
------------------------------------------------------------------------------

. probit TREAT ib1.GENDER AGE

note: 1.GENDER != 1 predicts failure perfectly
      1.GENDER dropped and 2 obs not used

note: 2.GENDER omitted because of collinearity
Iteration 0:   log likelihood = -3.0141613 
Iteration 1:   log likelihood = -2.9598644 
Iteration 2:   log likelihood = -2.9592964 
Iteration 3:   log likelihood = -2.9592962 

Probit regression                               Number of obs     =          8
                                                LR chi2(1)        =       0.11
                                                Prob > chi2       =     0.7405
Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

------------------------------------------------------------------------------
       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    2.GENDER |          0  (empty)
         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
------------------------------------------------------------------------------

. probit TREAT ib2.GENDER AGE

note: 1.GENDER != 1 predicts failure perfectly
      1.GENDER dropped and 2 obs not used

Iteration 0:   log likelihood = -3.0141613 
Iteration 1:   log likelihood = -2.9598644 
Iteration 2:   log likelihood = -2.9592964 
Iteration 3:   log likelihood = -2.9592962 

Probit regression                               Number of obs     =          8
                                                LR chi2(1)        =       0.11
                                                Prob > chi2       =     0.7405
Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

------------------------------------------------------------------------------
       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      GENDER |
          1  |          0  (omitted)
          2  |          0  (empty)
             |
         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
------------------------------------------------------------------------------

As a closing remark, it is better to use 0/1 rather than 1/2 coding when creating binary categorical variables.
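For instance, a 0/1 indicator could be built from the existing 1/2 variable along these lines. The name GENDER2 is made up for this sketch, and nothing in the thread says which value stands for which gender.

Code:

```stata
* Hypothetical indicator: 1 if GENDER == 2, 0 if GENDER == 1; missing stays missing
generate byte GENDER2 = (GENDER == 2) if !missing(GENDER)

* Check the recoding against the original variable
tabulate GENDER GENDER2, missing
```

The recoding only helps readability; the indicator still predicts failure perfectly, so the note from probit does not go away.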

Kind regards,
Carlo
(Stata 19.0)


Wilson Lee

Join Date: Mar 2017

Posts: 3
#4

12 Mar 2017, 19:03

Thanks Marcos and Carlo. Even though the story remains the same regardless of the reference category used, the output of predict differs depending on which reference category is used.

When I use the following commands

Code:

probit TREAT i.GENDER AGE
predict double score
summ score

the variable score is missing for all observations:

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       score |          0

When the base is reassigned

Code:

probit TREAT ib2.GENDER AGE
predict double score2
summ score2

the variable score2 is created for the records where GENDER = 1:

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      score2 |          8    .8751636    .0436722    .821793   .9341289

Does anyone know the reason for this?

Regards,
Wilson

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#5

12 Mar 2017, 22:12

I believe that it has to do with the error message

note: 2.GENDER omitted because of collinearity

that you see only in the first arrangement (i.GENDER) and not when you set the omitted category (ib2.GENDER).

I'm guessing that the underlying reason is technical and has to do with the way factor variables work behind the scenes. You might want to ask technical support about this.

Andrew Musau

Join Date: Oct 2014

Posts: 10195
#6

13 Mar 2017, 10:06

"You might want to ask technical support about this."

No, I would not suggest contacting technical support. Think about what you are doing! Given that you know Prob(TREAT=1|GENDER=2)=0, i.e., GENDER=2 predicts failure perfectly, on what grounds do you justify including GENDER as a covariate in the model? The only useful information that you can extract is for the subsample with GENDER=1, and therefore you should note that the coefficients and predictions that you end up with are for the model

Code:

probit TREAT AGE if GENDER==1

Inclusion of GENDER as a covariate contributes nothing, as Carlo's post in #3 illustrates.

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#7

13 Mar 2017, 17:27

I don't think that anything you're saying was ever in dispute, even by the OP. The question as I understood it was why Stata behaves differently with respect to producing predictions or not with different specifications of the omitted category.

You're not claiming that

Code:

ib2.GENDER

is synonymous with

Code:

if GENDER == 1

are you?

Andrew Musau

Join Date: Oct 2014
Posts: 10195
#8

14 Mar 2017, 06:34

Hello Joseph, I was directing the comment in #6 to the OP. My point is that this is not a bug and does not arise if the model is properly specified. In particular, knowing that GENDER is a binary variable and receiving the following warning from Stata

note: 1.GENDER != 1 predicts failure perfectly

should immediately lead one to exclude the variable from the model.

"You're not claiming that ib2.GENDER is synonymous with if GENDER == 1, are you?"

In general, no. But in the presence of such a misspecification, yes, I am claiming that the coefficients and predictions will be exactly the same. Once you disregard one category of a binary variable, you are in effect including a variable that is constant in the model, and it will automatically be omitted because of collinearity.

Code:

. clear

. input float TREAT GENDER AGE

         TREAT     GENDER        AGE
  1. 1 1 32
  2. 1 1 43
  3. 1 1 23
  4. 1 1 55
  5. 1 1 33
  6. 1 1 23
  7. 1 1 56
  8. 0 1 34
  9. 0 2 54
 10. 0 2 40
 11. end

.
.
.
. probit TREAT ib2.GENDER AGE

note: 1.GENDER != 1 predicts failure perfectly
      1.GENDER dropped and 2 obs not used

Iteration 0:   log likelihood = -3.0141613  
Iteration 1:   log likelihood = -2.9598644  
Iteration 2:   log likelihood = -2.9592964  
Iteration 3:   log likelihood = -2.9592962  

Probit regression                               Number of obs     =          8
                                                LR chi2(1)        =       0.11
                                                Prob > chi2       =     0.7405
Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

------------------------------------------------------------------------------
       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      GENDER |
          1  |          0  (omitted)
          2  |          0  (empty)
             |
         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
------------------------------------------------------------------------------

.
. predict double score, pr
(2 missing values generated)

.
. probit TREAT AGE if GENDER==1

Iteration 0:   log likelihood = -3.0141613  
Iteration 1:   log likelihood = -2.9598644  
Iteration 2:   log likelihood = -2.9592964  
Iteration 3:   log likelihood = -2.9592962  

Probit regression                               Number of obs     =          8
                                                LR chi2(1)        =       0.11
                                                Prob > chi2       =     0.7405
Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

------------------------------------------------------------------------------
       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
------------------------------------------------------------------------------

.
. predict double score2 if GENDER==1, pr
(2 missing values generated)

.
. list score score2, sep(10)

     +-----------------------+
     |     score      score2 |
     |-----------------------|
  1. | .86032445   .86032445 |
  2. |  .8991625    .8991625 |
  3. | .82179303   .82179303 |
  4. | .93182718   .93182718 |
  5. | .86422648   .86422648 |
  6. | .82179303   .82179303 |
  7. |  .9341289    .9341289 |
  8. |  .8680532    .8680532 |
  9. |         .           . |
 10. |         .           . |
     +-----------------------+

.
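The same probabilities can also be reproduced by hand from the stored coefficients, which makes explicit that only AGE and the constant enter the index. A minimal sketch, assuming the data and score2 from the listing above are still in memory; score3 is a name invented here.

Code:

```stata
* Refit on the subsample so that _b[] refers to this model
probit TREAT AGE if GENDER == 1

* Pr(TREAT = 1) = Phi(_cons + b_AGE * AGE); normal() is the standard normal CDF
generate double score3 = normal(_b[_cons] + _b[AGE]*AGE) if GENDER == 1

* Should pass silently if the hand computation agrees with predict
assert reldif(score3, score2) < 1e-8 if GENDER == 1
```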

Last edited by Andrew Musau; 14 Mar 2017, 07:20.


Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#9

15 Mar 2017, 00:02

If the omitted category is specified as 1.GENDER, Stata behaves similarly in its messages (except that it explicitly notes the collinearity) and in fitting the model. But when the user asks for predictions afterward, it behaves as if it first sets the dropped 1.GENDER indicator to missing for every observation and then uses it in computing the linear prediction, which yields all-missing predictions. Or maybe it just sets e(sample) to always return missing values; I don't know.

Whatever, there's something going on in the logic of how the factor variables are implemented such that in cases like this the behavior with respect to the generation of predictions afterward is affected in curious ways. It would be worthwhile to get feedback from technical support in order to better understand what to expect in these cases.
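In the meantime, one way to narrow this down would be to compare the estimation sample, the coefficient vector, and the linear prediction under the two specifications. This is only a diagnostic sketch with variable names invented here, and it makes no claim about what the comparison will show.

Code:

```stata
* Specification 1: default base level
probit TREAT i.GENDER AGE
matrix list e(b)                      // how the GENDER terms are labeled
generate byte insample1 = e(sample)   // 1 if the observation was used
predict double xb1, xb                // linear prediction

* Specification 2: base level set to 2
probit TREAT ib2.GENDER AGE
matrix list e(b)
generate byte insample2 = e(sample)
predict double xb2, xb

* Compare which observations were used and where predictions are missing
tabulate insample1 insample2
list GENDER AGE insample1 insample2 xb1 xb2, sepby(GENDER)
```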