  • Automatic omission of category of categorical variable without specification in OLS

    Dear Stata experts,

    I could use some help with the following problem. I am running a multivariate OLS regression with (standardized) test scores as the dependent variable, and a set of continuous and categorical variables as independent variables. For some of the factor variables, I added an extra category for missing values. This works fine for most categorical variables; however, for the variable mum_age_deliv_cat (maternal age at delivery), this category is dropped from the Stata output automatically, without any note giving the reason (multicollinearity etc.).

    The code for the multivariate regression is the following:
    Code:
    regress zks4_GCSE_tot mum_smokes##c.zea1_pgs i.sex ib3.mum_age_deliv_cat zdepression ib3.mum_SES ib3.marital_st_mum ib3.mum_ed_add ib6.cig_change, robust allbaselevels
    
    Linear regression                               Number of obs     =      5,627
                                                    F(28, 5598)       =     156.00
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.1924
                                                    Root MSE          =     .85361
    
    ------------------------------------------------------------------------------------------
                             |               Robust
               zks4_GCSE_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------------+----------------------------------------------------------------
                  mum_smokes |
              doesn't smoke  |          0  (base)
                     smokes  |  -.1532702   .0622202    -2.46   0.014    -.2752459   -.0312945
                             |
                    zea1_pgs |   .0896178   .0126652     7.08   0.000     .0647892    .1144464
                             |
       mum_smokes#c.zea1_pgs |
              doesn't smoke  |          0  (base)
                     smokes  |    .055319   .0326839     1.69   0.091    -.0087542    .1193922
                             |
                         sex |
                       Male  |          0  (base)
                     Female  |   .2701045   .0228218    11.84   0.000     .2253649    .3148442
                             |
           mum_age_deliv_cat |
                        <20  |  -.1404631   .0941917    -1.49   0.136    -.3251154    .0441892
                      20-24  |   -.110315    .036715    -3.00   0.003    -.1822907   -.0383393
                      25-29  |          0  (base)
                      30-34  |   .0396163   .0277931     1.43   0.154    -.0148689    .0941014
                        35+  |   .1217735   .0380242     3.20   0.001     .0472314    .1963156
                             |
                 zdepression |  -.0516424   .0123182    -4.19   0.000    -.0757908    -.027494
                             |
                     mum_SES |
                          I  |   .1243631   .0588214     2.11   0.035     .0090503    .2396759
                         II  |   .0022687   .0313728     0.07   0.942     -.059234    .0637715
    III (non-manual labour)  |          0  (base)
        III (manual labour)  |  -.1506617    .049965    -3.02   0.003    -.2486125    -.052711
                         IV  |  -.1566356   .0502407    -3.12   0.002    -.2551268   -.0581443
                          V  |   -.380365   .1006184    -3.78   0.000    -.5776161   -.1831139
                    Missing  |  -.2539358   .0404962    -6.27   0.000    -.3333241   -.1745475
                             |
              marital_st_mum |
              Never married  |  -.1206388   .0385989    -3.13   0.002    -.1963076   -.0449699
                  Separated  |  -.1357422   .0572148    -2.37   0.018    -.2479053    -.023579
               Ever married  |          0  (base)
                    Missing  |  -.1130427   .1802733    -0.63   0.531    -.4664482    .2403628
                             |
                  mum_ed_add |
                 CSE / None  |  -.2884407   .0391122    -7.37   0.000    -.3651157   -.2117657
                 Vocational  |  -.1568851   .0447715    -3.50   0.000    -.2446547   -.0691156
                   O-levels  |          0  (base)
                   A-levels  |   .1809204   .0312505     5.79   0.000     .1196574    .2421835
                     Degree  |   .4228745   .0440691     9.60   0.000      .336482    .5092671
                    Missing  |  -.0050217   .0820701    -0.06   0.951     -.165911    .1558676
                             |
                  cig_change |
                Went off it  |  -.1052319   .0450578    -2.34   0.020    -.1935626   -.0169012
                   Cut down  |   .0007196   .0611025     0.01   0.991    -.1190651    .1205042
                Craved more  |  -.0448434   .2700721    -0.17   0.868    -.5742895    .4846027
                   Had more  |  -.4333814   .0764357    -5.67   0.000    -.5832251   -.2835377
                  NO Change  |  -.0952739   .0793533    -1.20   0.230     -.250837    .0602893
             Never has this  |          0  (base)
                             |
                       _cons |   .1129289   .0281212     4.02   0.000     .0578005    .1680574
    ------------------------------------------------------------------------------------------
    The Missing category for mum_age_deliv_cat is only omitted once I add zdepression or mum_smokes to the regression.

    For example:
    Code:
    regress zks4_GCSE_tot i.mum_age_deliv_cat sex i.marital_st_mum i.mum_ed_add i.mum_SES
    
          Source |       SS           df       MS      Number of obs   =    11,904
    -------------+----------------------------------   F(20, 11883)    =    134.48
           Model |  2197.07793        20  109.853896   Prob > F        =    0.0000
        Residual |  9707.18812    11,883   .81689709   R-squared       =    0.1846
    -------------+----------------------------------   Adj R-squared   =    0.1832
           Total |   11904.266    11,903  1.00010636   Root MSE        =    .90382
    
    ------------------------------------------------------------------------------------------
               zks4_GCSE_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------------+----------------------------------------------------------------
           mum_age_deliv_cat |
                      20-24  |   .0309369   .0453468     0.68   0.495    -.0579502     .119824
                      25-29  |   .1836477   .0453645     4.05   0.000     .0947259    .2725695
                      30-34  |    .246943     .04722     5.23   0.000     .1543841     .339502
                        35+  |   .2855106    .052452     5.44   0.000      .182696    .3883251
                    Missing  |   .6171147    .064453     9.57   0.000     .4907763    .7434531
                             |
                         sex |   .2518333   .0165871    15.18   0.000     .2193198    .2843467
                             |
              marital_st_mum |
                  Separated  |  -.0312996   .0435235    -0.72   0.472    -.1166129    .0540136
               Ever married  |   .2037728   .0250902     8.12   0.000     .1545919    .2529536
                    Missing  |  -.0736996   .0423146    -1.74   0.082    -.1566431    .0092439
                             |
                  mum_ed_add |
                 Vocational  |   .1721054   .0345848     4.98   0.000     .1043134    .2398973
                   O-levels  |   .3589388   .0260158    13.80   0.000     .3079437     .409934
                   A-levels  |   .5817595    .030404    19.13   0.000     .5221626    .6413564
                     Degree  |   .9064845   .0398891    22.73   0.000     .8282955    .9846736
                    Missing  |   .2245926   .0366603     6.13   0.000     .1527325    .2964527
                             |
                     mum_SES |
                         II  |  -.1239549   .0526468    -2.35   0.019    -.2271513   -.0207585
    III (non-manual labour)  |  -.0957971   .0546101    -1.75   0.079    -.2028419    .0112477
        III (manual labour)  |  -.2680993   .0634493    -4.23   0.000    -.3924703   -.1437283
                         IV  |   -.316245   .0617112    -5.12   0.000     -.437209   -.1952809
                          V  |   -.420445   .0846138    -4.97   0.000    -.5863018   -.2545882
                    Missing  |  -.3964954   .0566912    -6.99   0.000    -.5076195   -.2853714
                             |
                       _cons |  -.8257235   .0739581   -11.16   0.000    -.9706935   -.6807534
    ------------------------------------------------------------------------------------------
    shows the Missing category for mum_age_deliv_cat correctly.

    I (manually) checked in the data browser whether the observations with missing mum_age_deliv are the same observations that are missing mum_smokes or zdepression, but this is not the case. Also see:

    Code:
    tab mum_age_deliv_cat
    
         Age of |
      mother at |
      delivery, |
        grouped |      Freq.     Percent        Cum.
    ------------+-----------------------------------
            <20 |        656        4.21        4.21
          20-24 |      2,705       17.38       21.59
          25-29 |      5,440       34.95       56.54
          30-34 |      3,878       24.91       81.46
            35+ |      1,397        8.98       90.43
        Missing |      1,489        9.57      100.00
    ------------+-----------------------------------
          Total |     15,565      100.00
    Code:
    tab mum_smokes if mum_age_deliv_cat==6
    
    mother smokes |
    any amount of |
      cigs during |
        pregnancy |      Freq.     Percent        Cum.
    --------------+-----------------------------------
    doesn't smoke |        312       78.00       78.00
           smokes |         88       22.00      100.00
    --------------+-----------------------------------
            Total |        400      100.00
    Code:
    tab mum_age_deliv_cat if missing(mum_smokes)
    
         Age of |
      mother at |
      delivery, |
        grouped |      Freq.     Percent        Cum.
    ------------+-----------------------------------
            <20 |        177        6.20        6.20
          20-24 |        510       17.85       24.05
          25-29 |        572       20.02       44.07
          30-34 |        350       12.25       56.32
            35+ |        159        5.57       61.88
        Missing |      1,089       38.12      100.00
    ------------+-----------------------------------
          Total |      2,857      100.00

    Finally, when I try to run the regression with the Missing category set as the base level, this is the output I get:

    Code:
    . regress zks4_GCSE_tot mum_smokes ib6.mum_age_deliv_cat
    note: 5.mum_age_deliv_cat omitted because of collinearity
    note: 6b.mum_age_deliv_cat identifies no observations in the sample
    
          Source |       SS           df       MS      Number of obs   =     9,936
    -------------+----------------------------------   F(5, 9930)      =    161.76
           Model |  729.671972         5  145.934394   Prob > F        =    0.0000
        Residual |  8958.40157     9,930  .902155244   R-squared       =    0.0753
    -------------+----------------------------------   Adj R-squared   =    0.0749
           Total |  9688.07354     9,935  .975145802   Root MSE        =    .94982
    
    -----------------------------------------------------------------------------------
        zks4_GCSE_tot |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ------------------+----------------------------------------------------------------
           mum_smokes |  -.4360266   .0240728   -18.11   0.000    -.4832141   -.3888391
                      |
    mum_age_deliv_cat |
                 <20  |  -.6483968   .0581204   -11.16   0.000    -.7623246    -.534469
               20-24  |   -.466057   .0384945   -12.11   0.000    -.5415141   -.3905999
               25-29  |  -.2004997   .0344241    -5.82   0.000     -.267978   -.1330215
               30-34  |   -.055606   .0358554    -1.55   0.121    -.1258899    .0146779
                 35+  |          0  (omitted)
             Missing  |          0  (empty)
                      |
                _cons |   .3428121   .0312033    10.99   0.000     .2816473    .4039768
    -----------------------------------------------------------------------------------
    I am at a loss as to why this happens, and it now states that there are no observations in the sample. Hope someone can help me!

    PS: This is my first post, so I hope I formatted everything the right way. Apologies upfront if not!

    Kind regards,
    Wouter


  • #2
    Welcome to Statalist, Wouter.

    The problem is likely due to observations being dropped from your regressions because of missing values in variables other than mum_age_deliv_cat and mum_smokes. So
    Code:
    note: 6b.mum_age_deliv_cat identifies no observations in the sample
    is exactly correct - for the estimation sample, if not for the entire dataset.

    Note that you apparently have 15,565 observations in your dataset, but your final regression is being run on 9,936 observations. You are losing a lot of observations, more than just the 2,857 for which mum_smokes is missing. This tells me your dependent variable zks4_GCSE_tot also has substantial missing values.
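
    One way to check this directly is to tabulate the categorical variable over the estimation sample only. This is a minimal sketch, assuming it is run immediately after the regression so that e(sample) still refers to it; the variable names are the ones from your post.

    ```stata
    * rerun the regression so that e(sample) is defined
    regress zks4_GCSE_tot mum_smokes ib6.mum_age_deliv_cat

    * e(sample) is 1 for observations actually used in the estimation
    tab mum_age_deliv_cat if e(sample)

    * number of "Missing" (value 6) observations that made it into the sample
    count if mum_age_deliv_cat == 6 & e(sample)
    ```

    If the note about 6b.mum_age_deliv_cat is right, the tabulation should show no observations in the Missing row and the count should be zero.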

    Perhaps using the misstable command (see help misstable) will help you understand the patterns of missing values for your variables
    • zks4_GCSE_tot
    • mum_smokes
    • mum_age_deliv
    where a missing value of mum_age_deliv corresponds to a value of 6 for mum_age_deliv_cat.
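
    As a rough sketch of what I have in mind (the variable list is taken from your post; add any other regressors with missing values that you want to examine):

    ```stata
    * number of missing values for each variable that has any
    misstable summarize zks4_GCSE_tot mum_smokes mum_age_deliv

    * joint patterns of missingness; the frequency option reports counts instead of percentages
    misstable patterns zks4_GCSE_tot mum_smokes mum_age_deliv, frequency
    ```

    The patterns table is the more useful of the two here, because it shows which variables are missing together in the same observations.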



    • #3
      Thanks for the answer, William! I stared myself blind on this and was oblivious to this simple explanation. When I run misstable patterns, it shows the following:

      Code:
      misstable patterns zks4_GCSE_tot mum_smokes zdepression mum_age_deliv
      
         Missing-value patterns
           (1 means complete)
      
                    |   Pattern
          Percent   |  1  2  3  4
        ------------+-------------
             61%    |  1  1  1  1
                    |
             14     |  1  1  1  0
              7     |  1  0  0  1
              4     |  0  0  0  1
              3     |  0  0  0  0
              3     |  1  1  0  1
              2     |  1  0  0  0
              2     |  0  1  0  0
              2     |  1  0  1  1
             <1     |  1  1  0  0
             <1     |  1  0  1  0
             <1     |  0  1  1  0
             <1     |  0  0  1  0
        ------------+-------------
            100%    |
      
        Variables are  (1) mum_age_deliv  (2) mum_smokes  (3) zdepression
                       (4) zks4_GCSE_tot
      Can it safely be concluded from this that there is no observation in which all the other variables are present and mum_age_deliv is missing, and that this is why Stata omitted the category from the regression output?



      • #4
        I would word it slightly differently.

        In every case where mum_age_deliv is missing (a zero in the first column), so that mum_age_deliv_cat is 6, at least one of the other three variables is also missing, and the observation will be omitted from the regression. So there will be no observations with mum_age_deliv_cat equal to 6 in the estimation sample.
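
        The same reading of the pattern table can be confirmed with a single count. This is only a sketch using the four variables from your misstable command; missing() with several arguments is true when any one of them is missing, so its negation keeps observations where all three are present.

        ```stata
        * maternal age missing, but smoking, depression and test score all observed
        count if missing(mum_age_deliv) & !missing(mum_smokes, zdepression, zks4_GCSE_tot)
        ```

        If the table is being read correctly, this count should come back as zero, since no listed pattern has a 0 in the first column and a 1 in the other three.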
