  • Probit with variable that predicts failure perfectly

    Dear Statalist,

    I have a dataset with properties similar to the following:

    TREAT GENDER AGE
    1 1 32
    1 1 43
    1 1 23
    1 1 55
    1 1 33
    1 1 23
    1 1 56
    0 1 34
    0 2 54
    0 2 40

    I am running a probit with TREAT as the dependent variable. In this case, GENDER can take on 2 values - 1 or 2, but all the obs with GENDER=2 are untreated.

    I tried running a probit followed by a predict
    Code:
    probit TREAT i.GENDER AGE
    predict double score
    summ score
    However, no score is generated. The log is appended below:

    note: 1.GENDER != 1 predicts failure perfectly
          1.GENDER dropped and 2 obs not used

    note: 2.GENDER omitted because of collinearity
    Iteration 0:   log likelihood = -3.0141613
    Iteration 1:   log likelihood = -2.9598644
    Iteration 2:   log likelihood = -2.9592964
    Iteration 3:   log likelihood = -2.9592962

    Probit regression                               Number of obs     =          8
                                                    LR chi2(1)        =       0.11
                                                    Prob > chi2       =     0.7405
    Log likelihood = -2.9592962                     Pseudo R2         =     0.0182

    ------------------------------------------------------------------------------
           TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        2.GENDER |          0  (empty)
             AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
           _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
    ------------------------------------------------------------------------------

    . predict double score
    (option pr assumed; Pr(TREAT))
    (10 missing values generated)

    . summ score

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           score |          0

    By changing the reference category, the score can be generated for observations with GENDER = 1
    Code:
    probit TREAT ib2.GENDER AGE
    Does anyone know why changing the base causes the output to differ in this case, and what the appropriate solution is for a dataset like this?

    Thanks in advance.

  • #2
    Hello Wilson,

    Welcome to the Stata Forum.

    The answer is straightforward, since you gave it to us.

    In the first example, we see that gender 2 predicted "failure" (whatever that means in your data) perfectly.

    In the second example, you changed the reference level, and gender 1 does not predict the outcome perfectly.

    As a matter of fact, Stata already told you that, for we may read in your output

    (please prefer to share output under CODE delimiters as recommended in the FAQ):

    Code:
    note: 1.GENDER != 1 predicts failure perfectly

    In short, everybody whose gender is 2 did "fail", that is, has TREAT = 0.
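One way to see this directly, before fitting anything, is to cross-tabulate the outcome against GENDER. What follows is only a sketch using the ten observations from #1; the display text is illustrative and not part of the original thread.

```stata
* Sketch: look for empty cells before fitting (data from post #1).
clear
input TREAT GENDER AGE
1 1 32
1 1 43
1 1 23
1 1 55
1 1 33
1 1 23
1 1 56
0 1 34
0 2 54
0 2 40
end

* Cross-tabulate the binary covariate against the outcome.
tabulate GENDER TREAT

* Count treated observations within GENDER == 2. A count of zero means
* GENDER == 2 predicts failure (TREAT == 0) perfectly.
count if GENDER == 2 & TREAT == 1
display "treated observations with GENDER == 2: " r(N)
```

An empty cell in the cross-tabulation is the situation that the "predicts failure perfectly" note is reporting.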

    Hopefully that helps.
    Last edited by Marcos Almeida; 05 Mar 2017, 06:03.
    Best regards,

    Marcos



    • #3
      Wilson:
      as an aside to Marcos' helpful explanation, I can't follow your last statement,
      "By changing the reference category, the score can be generated for observations with GENDER = 1",
      since it would seem that the story remains the same regardless of the reference category:
      Code:
      . input TREAT GENDER AGE
      
               TREAT     GENDER        AGE
        1. 1 1 32
        2. 1 1 43
        3. 1 1 23
        4. 1 1 55
        5. 1 1 33
        6. 1 1 23
        7. 1 1 56
        8. 0 1 34
        9. 0 2 54
       10. 0 2 40
       11. end
      
      . probit TREAT i.GENDER AGE
      
      note: 1.GENDER != 1 predicts failure perfectly
            1.GENDER dropped and 2 obs not used
      
      note: 2.GENDER omitted because of collinearity
      Iteration 0:   log likelihood = -3.0141613 
      Iteration 1:   log likelihood = -2.9598644 
      Iteration 2:   log likelihood = -2.9592964 
      Iteration 3:   log likelihood = -2.9592962 
      
      Probit regression                               Number of obs     =          8
                                                      LR chi2(1)        =       0.11
                                                      Prob > chi2       =     0.7405
      Log likelihood = -2.9592962                     Pseudo R2         =     0.0182
      
      ------------------------------------------------------------------------------
             TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          2.GENDER |          0  (empty)
               AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
             _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
      ------------------------------------------------------------------------------
      
      . probit TREAT ib1.GENDER AGE
      
      note: 1.GENDER != 1 predicts failure perfectly
            1.GENDER dropped and 2 obs not used
      
      note: 2.GENDER omitted because of collinearity
      Iteration 0:   log likelihood = -3.0141613 
      Iteration 1:   log likelihood = -2.9598644 
      Iteration 2:   log likelihood = -2.9592964 
      Iteration 3:   log likelihood = -2.9592962 
      
      Probit regression                               Number of obs     =          8
                                                      LR chi2(1)        =       0.11
                                                      Prob > chi2       =     0.7405
      Log likelihood = -2.9592962                     Pseudo R2         =     0.0182
      
      ------------------------------------------------------------------------------
             TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          2.GENDER |          0  (empty)
               AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
             _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
      ------------------------------------------------------------------------------
      
      . probit TREAT ib2.GENDER AGE
      
      note: 1.GENDER != 1 predicts failure perfectly
            1.GENDER dropped and 2 obs not used
      
      Iteration 0:   log likelihood = -3.0141613 
      Iteration 1:   log likelihood = -2.9598644 
      Iteration 2:   log likelihood = -2.9592964 
      Iteration 3:   log likelihood = -2.9592962 
      
      Probit regression                               Number of obs     =          8
                                                      LR chi2(1)        =       0.11
                                                      Prob > chi2       =     0.7405
      Log likelihood = -2.9592962                     Pseudo R2         =     0.0182
      
      ------------------------------------------------------------------------------
             TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            GENDER |
                1  |          0  (omitted)
                2  |          0  (empty)
                   |
               AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
             _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
      ------------------------------------------------------------------------------
      As a closing-out remark, it is better to use 0/1 rather than 1/2 coding when creating categorical variables.
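A minimal sketch of that 0/1 recoding, assuming the ten observations from #1. The name gender2 is hypothetical and does not appear in the thread. Recoding only changes the labelling: the group with GENDER == 2 still has no treated observations, so the perfect-prediction note would remain.

```stata
* Sketch: recode a 1/2 variable as a 0/1 indicator (data from post #1).
clear
input TREAT GENDER AGE
1 1 32
1 1 43
1 1 23
1 1 55
1 1 33
1 1 23
1 1 56
0 1 34
0 2 54
0 2 40
end

* gender2 (hypothetical name) is 1 when GENDER == 2 and 0 when GENDER == 1;
* the if qualifier keeps missing values of GENDER missing.
generate byte gender2 = (GENDER == 2) if !missing(GENDER)

* Check the mapping between the old and the new coding.
tabulate GENDER gender2
```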
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        Thanks Marcos and Carlo. Even though the story remains the same regardless of the reference category used, the output of predict differs depending on which reference category is used.

        When I use the following command

        Code:
        probit TREAT i.GENDER AGE
        predict double score
        summ score
        The variable score is missing for all observations

        Code:
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
               score |          0
        When the base is reassigned

        Code:
        probit TREAT ib2.GENDER AGE
        predict double score2
        summ score2
        The score2 variable can be created for records where GENDER = 1
        Code:
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
              score2 |          8    .8751636    .0436722    .821793   .9341289
        Does anyone know the reason for this?

        Regards,
        Wilson



        • #5
          I believe that it has to do with the error message

          note: 2.GENDER omitted because of collinearity

          that you see only in the first arrangement (i.GENDER) and not when you set the omitted category (ib2.GENDER).

          I'm guessing that the underlying reason is technical and has to do with the way factor variables work behind the scenes. You might want to ask technical support about this.



          • #6
            "You might want to ask technical support about this."
            No, I would not suggest contacting technical support. Think about what you are doing! Given that you know Pr(TREAT=1 | GENDER=2) = 0, i.e., GENDER = 2 predicts failure perfectly, on what grounds do you justify including GENDER as a covariate in the model? The only useful information that you can extract is for the sub-sample with GENDER = 1, and therefore you should note that the coefficients and predictions that you end up with are for the model

            Code:
            probit TREAT AGE if GENDER==1
            Inclusion of GENDER as a covariate contributes nothing as Carlo's post in #3 illustrates.




            • #7
              I don't think that anything you're saying was ever in dispute, even by the OP. The question as I understood it was why Stata behaves differently with respect to producing predictions or not with different specifications of the omitted category.

              You're not claiming that
              Code:
              ib2.GENDER
              is synonymous with
              Code:
              if GENDER == 1
              are you?



              • #8
                Hello Joseph, I was directing the comment in #6 to the OP. My point is that this is not a bug and does not arise if the model is properly specified. In particular, knowing that GENDER is a binary variable and receiving the following warning from Stata

                note: 1.GENDER != 1 predicts failure perfectly
                should immediately lead one to exclude the variable from the model.

                "You're not claiming that ib2.GENDER is synonymous with if GENDER == 1, are you?"
                In general no, but in the presence of such a misspecification, yes: I am claiming that the coefficients and predictions will be exactly the same. Once you disregard one category of a binary variable, you are in effect including a variable that is constant in the estimation sample, and it will automatically be omitted because of collinearity.

                Code:
                . clear
                
                . input float TREAT GENDER AGE
                
                         TREAT     GENDER        AGE
                  1. 1 1 32
                  2. 1 1 43
                  3. 1 1 23
                  4. 1 1 55
                  5. 1 1 33
                  6. 1 1 23
                  7. 1 1 56
                  8. 0 1 34
                  9. 0 2 54
                 10. 0 2 40
                 11. end
                
                .
                .
                .
                . probit TREAT ib2.GENDER AGE
                
                note: 1.GENDER != 1 predicts failure perfectly
                      1.GENDER dropped and 2 obs not used
                
                Iteration 0:   log likelihood = -3.0141613  
                Iteration 1:   log likelihood = -2.9598644  
                Iteration 2:   log likelihood = -2.9592964  
                Iteration 3:   log likelihood = -2.9592962  
                
                Probit regression                               Number of obs     =          8
                                                                LR chi2(1)        =       0.11
                                                                Prob > chi2       =     0.7405
                Log likelihood = -2.9592962                     Pseudo R2         =     0.0182
                
                ------------------------------------------------------------------------------
                       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                      GENDER |
                          1  |          0  (omitted)
                          2  |          0  (empty)
                             |
                         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
                       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
                ------------------------------------------------------------------------------
                
                .
                . predict double score, pr
                (2 missing values generated)
                
                .
                . probit TREAT AGE if GENDER==1
                
                Iteration 0:   log likelihood = -3.0141613  
                Iteration 1:   log likelihood = -2.9598644  
                Iteration 2:   log likelihood = -2.9592964  
                Iteration 3:   log likelihood = -2.9592962  
                
                Probit regression                               Number of obs     =          8
                                                                LR chi2(1)        =       0.11
                                                                Prob > chi2       =     0.7405
                Log likelihood = -2.9592962                     Pseudo R2         =     0.0182
                
                ------------------------------------------------------------------------------
                       TREAT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                         AGE |   .0177287   .0551661     0.32   0.748    -.0903949    .1258524
                       _cons |   .5144595   2.008838     0.26   0.798    -3.422791     4.45171
                ------------------------------------------------------------------------------
                
                .
                . predict double score2 if GENDER==1, pr
                (2 missing values generated)
                
                .
                . list score score2, sep(10)
                
                     +-----------------------+
                     |     score      score2 |
                     |-----------------------|
                  1. | .86032445   .86032445 |
                  2. |  .8991625    .8991625 |
                  3. | .82179303   .82179303 |
                  4. | .93182718   .93182718 |
                  5. | .86422648   .86422648 |
                  6. | .82179303   .82179303 |
                  7. |  .9341289    .9341289 |
                  8. |  .8680532    .8680532 |
                  9. |         .           . |
                 10. |         .           . |
                     +-----------------------+
                
                .
                Last edited by Andrew Musau; 14 Mar 2017, 07:20.



                • #9
                  If the omitted category is specified as 1.GENDER, Stata behaves similarly in messages (with the exception that it explicitly warns of collinearity) and in fitting the model. But when the user asks for predictions afterward it behaves as if it first sets the dropped 1.GENDER cases to all-missing before going ahead and using it in computing the linear predictions to give all-missing predictions. Or maybe it just sets e(sample) to always return missing values, I don't know.

                  Whatever, there's something going on in the logic of how the factor variables are implemented such that in cases like this the behavior with respect to the generation of predictions afterward is affected in curious ways. It would be worthwhile to get feedback from technical support in order to better understand what to expect in these cases.
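Until that behavior is better understood, one way to avoid depending on how predict handles the dropped factor level is to fit the subsample model from #6 and compute the probabilities by hand. This is only a sketch under that assumption, using the data from #1. The names score_manual and score_pr are hypothetical.

```stata
* Sketch: probit probabilities computed by hand (data from post #1).
clear
input TREAT GENDER AGE
1 1 32
1 1 43
1 1 23
1 1 55
1 1 33
1 1 23
1 1 56
0 1 34
0 2 54
0 2 40
end

* Fit only where the outcome varies, as suggested in #6.
probit TREAT AGE if GENDER == 1

* A probit probability is the standard normal CDF of the linear predictor.
* Restricting to e(sample) leaves the two GENDER == 2 observations missing.
generate double score_manual = normal(_b[_cons] + _b[AGE]*AGE) if e(sample)

* Cross-check against predict on the same observations.
predict double score_pr if e(sample), pr
assert reldif(score_manual, score_pr) < 1e-12 if e(sample)

summarize score_manual score_pr
```

Because the formula is written out explicitly, the result does not depend on which base level a factor-variable specification would have used.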
