Number of observations in my descriptive statistics (summarize statistics) is higher than the reported observations in my regression output?

YH jordaan

Join Date: Dec 2015
Posts: 46

08 Jan 2016, 10:18

Hi guys,

I'm still working on my thesis on how bank quality (measured by the dummy variable goodbank, proxied by the bond credit ratings of those banks) influences market liquidity and funding liquidity. I have an unbalanced panel data set containing bank-specific (panel) data and macroeconomic time series data (which apply to each bank, of course).

My question is the following:

How could it be that the lowest value of # of observations in my descriptive statistics (summarize statistics) is higher than the reported observations in my regression output?

See below first my summary statistics:

Code:

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        mliq |      4928    .0590628    .0609607   .0019011   .4217335
    goodbank |      5296    .6752266    .4683343          0          1
        size |      4952      11.043    1.730852   6.774847   15.17113
        racr |      4319    13.81973    2.870637       7.62      44.07
         nim |      4216    3.329013    1.299908      -8.87       15.1
-------------+--------------------------------------------------------
 Crisisdummy |      5131    .1641006    .3704029          0          1
changeinfl~n |      5063    .9989563    .3413335       .065        1.8
   DiffLibor |      4836   -.0959344    .4775776  -1.711507   .5692567
changeFedF~d |      5063   -.0826052    .4537597      -1.66        .57

And here my regression output:

Code:

. areg mliq goodbank size  racr nim  Crisisdummy  changeinflation  DiffLibor changeFedFund, absorb(gvkey) r

Linear regression, absorbing indicators           Number of obs   =       3465
                                                  F(   8,   3333) =      31.43
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.8097
                                                  Adj R-squared   =     0.8023
                                                  Root MSE        =     0.0285

---------------------------------------------------------------------------------
                |               Robust
           mliq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
       goodbank |  -.0082468   .0015715    -5.25   0.000     -.011328   -.0051657
           size |   .0109495   .0020169     5.43   0.000     .0069949     .014904
           racr |   .0028888    .000365     7.91   0.000     .0021731    .0036044
            nim |  -.0077708   .0014891    -5.22   0.000    -.0106904   -.0048513
    Crisisdummy |   -.011055   .0022849    -4.84   0.000     -.015535    -.006575
changeinflation |  -.0030651   .0015295    -2.00   0.045     -.006064   -.0000663
      DiffLibor |  -.0082581   .0018577    -4.45   0.000    -.0119003   -.0046158
  changeFedFund |  -.0039041   .0020787    -1.88   0.060    -.0079798    .0001716
          _cons |  -.0652826    .023603    -2.77   0.006    -.1115603   -.0190048
----------------+----------------------------------------------------------------
          gvkey |   absorbed                                     (124 categories)

Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#2

08 Jan 2016, 11:03

you asked:

How could it be that the lowest value of # of observations in my descriptive statistics (summarize statistics) is higher than the reported observations in my regression output?

However, that is not true in the output you showed - in your descriptive stats each variable has more than 4,000 observations while your regression has 3,465 - so, what is your question?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

08 Jan 2016, 11:27

Stata analytical commands like areg generally do case-wise deletion: if any variable is missing, the observation is omitted from the analysis. Suppose you have 10 observations of y, x1, and x2, but x1 is missing in observation 2 and x2 is missing in observations 8 and 9. Your summary statistics will show 10 observations for y, 9 for x1, and 8 for x2. But in the analysis, observations 2, 8, and 9 will all be omitted, leaving you with just 7 observations.

This is discussed in the section of the help missing output titled Estimation commands.
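
A minimal sketch of the 10-observation example above, using made-up data (the variable names y, x1, and x2 follow the description; the values themselves are arbitrary assumptions):

Code:

clear
set seed 12345
set obs 10
generate y  = rnormal()
generate x1 = rnormal()
generate x2 = rnormal()
* x1 is missing in observation 2, x2 in observations 8 and 9
replace x1 = . in 2
replace x2 = . in 8/9
* summarize counts each variable separately: 10, 9, and 8 observations
summarize y x1 x2
* regress drops any observation with a missing value in y, x1, or x2
regress y x1 x2
display e(N)

The final display should show 7, the number of complete cases.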

Last edited by William Lisowski; 08 Jan 2016, 11:30.
1 like
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#4

08 Jan 2016, 12:46

whoops - sorry for the misreading (and I even quoted it!)
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#5

08 Jan 2016, 20:53

Originally posted by William Lisowski

Stata analytical commands like areg generally do case-wise deletion: if any variable is missing, the observation is omitted from the analysis. Suppose you have 10 observations of y, x1, and x2, but x1 is missing in observation 2 and x2 is missing in observations 8 and 9. Your summary statistics will show 10 observations for y, 9 for x1, and 8 for x2. But in the analysis, observations 2, 8, and 9 will all be omitted, leaving you with just 7 observations.

This is discussed in the section of the help missing output titled Estimation commands.

Thanks for your answer, that makes sense. I assume "xtreg" behaves the same way? Or do all commands use this case-wise deletion?
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#6

08 Jan 2016, 20:54

Originally posted by Rich Goldstein

whoops - sorry for the misreading (and I even quoted it!)

No problem. I am not a native English speaker, so the question could probably have been phrased more clearly.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#7

09 Jan 2016, 06:04

re: your #5 - in Stata (afaik), all commands use casewise deletion (other than pwcorr)
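
To illustrate the casewise versus pairwise distinction, here is a small sketch on the auto data shipped with Stata (the choice of variables is only for illustration; rep78 has 5 missing values):

Code:

sysuse auto, clear
* correlate uses casewise deletion: one common sample for all pairs
correlate price length rep78
* pwcorr uses every available pair; the obs option prints each pair's N
pwcorr price length rep78, obs

correlate should report a single sample of 69 observations, while pwcorr should report 74 for the price-length pair and 69 for any pair involving rep78.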
Comment
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#8

09 Jan 2016, 06:24

Try: count if ~missing(mliq, goodbank, size, racr, nim, Crisisdummy, changeinflation, DiffLibor, changeFedFund, gvkey)
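
The same idea can be sketched on the auto data shipped with Stata, where only rep78 has missing values (the variable list here is an illustrative assumption, not the original poster's):

Code:

sysuse auto, clear
* observations with no missing value in any listed variable
count if !missing(price, length, rep78)
* observations with at least one missing value
count if missing(price, length, rep78)
* the estimation sample should match the first count
regress price length i.rep78
display e(N)

The two counts (69 and 5 here) add up to the 74 observations in the data, and the first one should equal e(N).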
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#9

09 Jan 2016, 15:50

Originally posted by Ariel Karlinsky

Try: count if ~missing(mliq, goodbank, size, racr, nim, Crisisdummy, changeinflation, DiffLibor, changeFedFund, gvkey)

Thanks, this value is 1831. Total observations are 5296, and 5296 - 1831 = 3465, which is the number of observations in my regression output.
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#10

09 Jan 2016, 15:53

Now I'm wondering how to get descriptive statistics for only the observations that are actually used in my regression. In other words: descriptive statistics of my variables for the observations that survived Stata's case-wise deletion. I think I should do something like this: generate a dummy variable that indicates whether any variable is missing, and then run summarize with an if qualifier?

EDIT: I already found a solution by using e(sample) after running my regression.
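
A minimal sketch of that approach, shown on the auto data shipped with Stata rather than on my own data (the variable name insample is a hypothetical choice):

Code:

sysuse auto, clear
regress price length i.rep78
* e(sample) is replaced by the next estimation command,
* so copy it into a variable if it is needed later
generate byte insample = e(sample)
summarize price length rep78 if insample
* estat summarize reports statistics for the estimation sample directly
estat summarize

Both summaries should be based on the same 69 observations that regress reports.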

Last edited by YH jordaan; 09 Jan 2016, 15:59.
Comment
Krishna Bhandari

Join Date: Sep 2015

Posts: 1
#11

01 May 2022, 07:06

esample not found error occurs after running sum function with if (esample) option. Kindly guide what needs to be done?
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#12

01 May 2022, 08:56

Krishna:
welcome to this forum.
As -summarize- stores results in -r()- and not in -e()-, to obtain the number of observations included in your -summarize- code, you can type:

Code:

use "C:\Program Files\Stata17\ado\base\a\auto.dta"

. sum rep78

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       rep78 |         69    3.405797    .9899323          1          5

. di r(N)
69

.

Last edited by Carlo Lazzaro; 01 May 2022, 09:01.

Kind regards,
Carlo
(Stata 19.0)
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#13

01 May 2022, 09:59

esample not found error occurs after running sum function with if (esample) option. Kindly guide what needs to be done?

The option is

Code:

... if e(sample)

Here is an example - the variable rep78 has 5 missing values.

Code:

. sysuse auto, clear
(1978 automobile data)

. regress price length i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      3.76
       Model |   132668930         5  26533786.1   Prob > F        =    0.0048
    Residual |   444128029        63  7049651.25   R-squared       =    0.2300
-------------+----------------------------------   Adj R-squared   =    0.1689
       Total |   576796959        68  8482308.22   Root MSE        =    2655.1

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      length |   65.02221   15.48443     4.20   0.000     34.07904    95.96537
             |
       rep78 |
          2  |   728.5196   2105.194     0.35   0.730    -3478.374    4935.414
          3  |   1539.622   1940.569     0.79   0.431    -2338.295     5417.54
          4  |   1777.926   1980.059     0.90   0.373    -2178.907    5734.759
          5  |     2572.1   2061.701     1.25   0.217    -1547.881     6692.08
             |
       _cons |  -7724.697   3477.005    -2.22   0.030    -14672.94   -776.4562
------------------------------------------------------------------------------

. summarize price

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906

. summarize price if e(sample)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69    6146.043     2912.44       3291      15906

.

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#14

01 May 2022, 16:15

YH Jordaan:
as an aside to the previous helpful hints, the issue can also be explained in the following way: -summarize- adopts available-case analysis (hence, the number of observations can differ for each variable, due to differences in missing values), whereas estimation commands, like -regress-, adopt complete-case analysis (only observations with no missing values in any of the model variables are included in the inferential procedure).
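
As a hedged sketch of how to see which variables drive the difference between the two approaches, again on the auto data (the variable name nmiss is a hypothetical choice):

Code:

sysuse auto, clear
* lists only the variables that have missing values
misstable summarize price length rep78
* number of missing values per observation across the listed variables
egen nmiss = rowmiss(price length rep78)
* observations with nmiss == 0 are the complete cases
tabulate nmiss

Here only rep78 should be listed by -misstable summarize-, and -tabulate- should show 69 complete cases and 5 observations with one missing value.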

Kind regards,
Carlo
(Stata 19.0)
1 like
Comment
