
  • Number of observations in my descriptive statistics (summarize statistics) is higher than the reported observations in my regression output?

    Hi guys,

    I'm still working on my thesis on how bank quality (measured by the dummy variable goodbank, proxied by the banks' bond credit ratings) influences market liquidity and funding liquidity. I have an unbalanced panel data set containing bank-specific (panel) data and macroeconomic time series data (which apply to every bank, of course).

    My question is the following:

    How could it be that the lowest value of # of observations in my descriptive statistics (summarize statistics) is higher than the reported observations in my regression output?

    See below first my summary statistics:

    Code:
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            mliq |      4928    .0590628    .0609607   .0019011   .4217335
        goodbank |      5296    .6752266    .4683343          0          1
            size |      4952      11.043    1.730852   6.774847   15.17113
            racr |      4319    13.81973    2.870637       7.62      44.07
             nim |      4216    3.329013    1.299908      -8.87       15.1
    -------------+--------------------------------------------------------
     Crisisdummy |      5131    .1641006    .3704029          0          1
    changeinfl~n |      5063    .9989563    .3413335       .065        1.8
       DiffLibor |      4836   -.0959344    .4775776  -1.711507   .5692567
    changeFedF~d |      5063   -.0826052    .4537597      -1.66        .57
    And here is my regression output:

    Code:
    . areg mliq goodbank size  racr nim  Crisisdummy  changeinflation  DiffLibor changeFedFund, absorb(gvkey) r
    
    Linear regression, absorbing indicators           Number of obs   =       3465
                                                      F(   8,   3333) =      31.43
                                                      Prob > F        =     0.0000
                                                      R-squared       =     0.8097
                                                      Adj R-squared   =     0.8023
                                                      Root MSE        =     0.0285
    
    ---------------------------------------------------------------------------------
                    |               Robust
               mliq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ----------------+----------------------------------------------------------------
           goodbank |  -.0082468   .0015715    -5.25   0.000     -.011328   -.0051657
               size |   .0109495   .0020169     5.43   0.000     .0069949     .014904
               racr |   .0028888    .000365     7.91   0.000     .0021731    .0036044
                nim |  -.0077708   .0014891    -5.22   0.000    -.0106904   -.0048513
        Crisisdummy |   -.011055   .0022849    -4.84   0.000     -.015535    -.006575
    changeinflation |  -.0030651   .0015295    -2.00   0.045     -.006064   -.0000663
          DiffLibor |  -.0082581   .0018577    -4.45   0.000    -.0119003   -.0046158
      changeFedFund |  -.0039041   .0020787    -1.88   0.060    -.0079798    .0001716
              _cons |  -.0652826    .023603    -2.77   0.006    -.1115603   -.0190048
    ----------------+----------------------------------------------------------------
              gvkey |   absorbed                                     (124 categories)

  • #2
    you asked:
    How could it be that the lowest value of # of observations in my descriptive statistics (summarize statistics) is higher than the reported observations in my regression output?
    However, that is not true in the output you showed - in your descriptive stats each variable has more than 4,000 observations while your regression has 3,465 - so, what is your question?


    • #3
      Stata analytical commands like areg generally do case-wise deletion: if any variable is missing, the observation is omitted from the analysis. Suppose you have 10 observations of y, x1, and x2, but x1 is missing in observation 2 and x2 is missing in observations 8 and 9. Your summary statistics will show 10 observations for y, 9 for x1, and 8 for x2. But in the analysis, observations 2, 8, and 9 will all be omitted, leaving you with just 7 observations.

      This is discussed in the section of the help missing output titled Estimation commands.
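
      The ten-observation example above can be reproduced directly. The sketch below is only an illustration: the variable names y, x1, and x2 follow the example, while the random values and the seed are made up.

      ```stata
      * Toy data: 10 observations, x1 missing in obs 2, x2 missing in obs 8 and 9
      clear
      set obs 10
      set seed 12345
      generate y  = rnormal()
      generate x1 = rnormal()
      generate x2 = rnormal()
      replace x1 = . in 2
      replace x2 = . in 8/9

      * summarize counts nonmissing values separately for each variable
      summarize y x1 x2

      * regress drops every observation with a missing value in any model variable
      regress y x1 x2
      display e(N)
      ```

      The Obs column of summarize should read 10, 9, and 8, while the regression reports 7 observations.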
      Last edited by William Lisowski; 08 Jan 2016, 12:30.


      • #4
        whoops - sorry for the misreading (and I even quoted it!)


        • #5
          Originally posted by William Lisowski
          Stata analytical commands like areg generally do case-wise deletion: if any variable is missing, the observation is omitted from the analysis. Suppose you have 10 observations of y, x1, and x2, but x1 is missing in observation 2 and x2 is missing in observations 8 and 9. Your summary statistics will show 10 observations for y, 9 for x1, and 8 for x2. But in the analysis, observations 2, 8, and 9 will all be omitted, leaving you with just 7 observations.

          This is discussed in the section of the help missing output titled Estimation commands.
          Thanks for your answer, that makes sense. I assume "xtreg" behaves the same way? Or do all commands use this casewise deletion?


          • #6
            Originally posted by Rich Goldstein
            whoops - sorry for the misreading (and I even quoted it!)
            No problem. I am not a native English speaker, so the question could probably have been phrased more clearly.

            Comment


            • #7
              re: your #5 - in Stata (afaik), all commands use casewise deletion (other than pwcorr)
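
              The pwcorr exception can be seen by comparing it with correlate. This is a minimal sketch, assuming the auto dataset that ships with Stata rather than the bank data from this thread; in it, rep78 has 5 missing values.

              ```stata
              sysuse auto, clear

              * correlate uses casewise deletion: one common sample for every pair
              correlate price length rep78
              display r(N)

              * pwcorr uses all observations available for each pair of variables;
              * the obs option prints the pair-specific number of observations
              pwcorr price length rep78, obs
              ```

              Here correlate works on 69 observations throughout, whereas pwcorr uses all 74 for the price/length pair and 69 for any pair involving rep78.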


              • #8
                Try: count if ~missing(mliq, goodbank, size, racr, nim, Crisisdummy, changeinflation, DiffLibor, changeFedFund, gvkey)


                • #9
                  Originally posted by Ariel Karlinsky
                  Try: count if ~missing(mliq, goodbank, size, racr, nim, Crisisdummy, changeinflation, DiffLibor, changeFedFund, gvkey)

                  Thanks, this value is 1831. Total observations are 5296, and 5296 - 1831 = 3465, which is the number of observations in my regression output.
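
                  The counts of complete and incomplete observations are complementary, so either one can be checked against the regression sample. A minimal sketch of that check, assuming the auto dataset shipped with Stata instead of the bank data (the local macro names complete and incomplete are just illustrative):

                  ```stata
                  sysuse auto, clear

                  * observations with no missing value in any listed variable
                  count if !missing(price, length, rep78)
                  local complete = r(N)

                  * observations with at least one missing value
                  count if missing(price, length, rep78)
                  local incomplete = r(N)

                  * the two counts add up to the total number of observations
                  display `complete' + `incomplete'
                  display _N

                  * the complete-case count matches the estimation sample
                  quietly regress price length i.rep78
                  display e(N)
                  ```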


                  • #10
                    Now I'm wondering how to get descriptive statistics for only the observations that are actually used in my regression, in other words the observations that survived Stata's casewise deletion. I think I should do something like this: generate a dummy variable that indicates whether any variable is missing, and then run summarize with an if qualifier?

                    EDIT: I already found a solution by using e(sample) after running my regression.
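
                    Both ideas can be combined by storing e(sample) in a dummy variable, which is handy because e(sample) always refers to the most recent estimation command. A minimal sketch, assuming the auto dataset shipped with Stata; the variable name insample is made up for illustration:

                    ```stata
                    sysuse auto, clear
                    quietly regress price length i.rep78

                    * e(sample) is 1 for observations used in the estimation, 0 otherwise
                    generate byte insample = e(sample)

                    * descriptive statistics restricted to the estimation sample
                    summarize price length rep78 if insample
                    ```

                    With the thread's model, the same pattern would be to run the areg command first and then summarize the model variables with the if qualifier.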
                    Last edited by YH jordaan; 09 Jan 2016, 16:59.


                    • #11
                      An "esample not found" error occurs after running sum with an if (esample) qualifier. Kindly advise what needs to be done.


                      • #12
                        Krishna:
                        welcome to this forum.
                        As -summarize- stores results in -r()- and not in -e()-, to obtain the number of observations used by your -summarize- command, you can type:
                        Code:
                        . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
                        . sum rep78
                        
                            Variable |        Obs        Mean    Std. dev.       Min        Max
                        -------------+---------------------------------------------------------
                               rep78 |         69    3.405797    .9899323          1          5
                        
                        
                        . di r(N)
                        69
                        
                        .
                        Last edited by Carlo Lazzaro; 01 May 2022, 10:01.
                        Kind regards,
                        Carlo
                        (StataNow 18.5)
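
                        To see the r() and e() results side by side, here is a short sketch. It assumes the auto dataset shipped with Stata and a regression chosen only for illustration:

                        ```stata
                        sysuse auto, clear

                        * estimation commands store their results in e()
                        quietly regress price length i.rep78
                        ereturn list
                        display e(N)

                        * summarize is not an estimation command: it stores results in r()
                        * and leaves the e() results of the regression untouched
                        quietly summarize price if e(sample)
                        return list
                        display r(N)
                        ```

                        Because summarize does not replace the e() results, e(sample) remains available as an if qualifier after it.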


                        • #13
                          An "esample not found" error occurs after running sum with an if (esample) qualifier. Kindly advise what needs to be done.
                          The qualifier is
                          Code:
                          ... if e(sample)
                          Here is an example - the variable rep78 has 5 missing values.
                          Code:
                          . sysuse auto, clear
                          (1978 automobile data)
                          
                          . regress price length i.rep78
                          
                                Source |       SS           df       MS      Number of obs   =        69
                          -------------+----------------------------------   F(5, 63)        =      3.76
                                 Model |   132668930         5  26533786.1   Prob > F        =    0.0048
                              Residual |   444128029        63  7049651.25   R-squared       =    0.2300
                          -------------+----------------------------------   Adj R-squared   =    0.1689
                                 Total |   576796959        68  8482308.22   Root MSE        =    2655.1
                          
                          ------------------------------------------------------------------------------
                                 price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                                length |   65.02221   15.48443     4.20   0.000     34.07904    95.96537
                                       |
                                 rep78 |
                                    2  |   728.5196   2105.194     0.35   0.730    -3478.374    4935.414
                                    3  |   1539.622   1940.569     0.79   0.431    -2338.295     5417.54
                                    4  |   1777.926   1980.059     0.90   0.373    -2178.907    5734.759
                                    5  |     2572.1   2061.701     1.25   0.217    -1547.881     6692.08
                                       |
                                 _cons |  -7724.697   3477.005    -2.22   0.030    -14672.94   -776.4562
                          ------------------------------------------------------------------------------
                          
                          . summarize price
                          
                              Variable |        Obs        Mean    Std. dev.       Min        Max
                          -------------+---------------------------------------------------------
                                 price |         74    6165.257    2949.496       3291      15906
                          
                          . summarize price if e(sample)
                          
                              Variable |        Obs        Mean    Std. dev.       Min        Max
                          -------------+---------------------------------------------------------
                                 price |         69    6146.043     2912.44       3291      15906
                          
                          .


                          • #14
                            YH jordaan:
                            as an aside to the previous helpful hints, the issue can also be explained in the following way: -summarize- adopts available-case analysis (hence the number of observations can differ across variables, because each has its own missing values), whereas estimation commands such as -regress- adopt complete-case analysis (only observations with no missing values on any variable in the model are included in the inferential procedure).
                            Kind regards,
                            Carlo
                            (StataNow 18.5)
