Wish for summarize

Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#1


22 Sep 2015, 09:57

Hi,

I was responding to a participant's question earlier, and I just realized something that I thought would be useful to have in the summarize command. As it currently is, when you use it for descriptive statistics of several variables, only the statistics of the last variable are stored in r(), and they're all scalars. I just thought that when you're passing more than one variable in the varlist it would be more useful to have a vector for each statistic that holds the values for all the variables. You could also, as an alternative, create a scalar for each statistic-variable combination, and that would work too. If you do the vectors you could have rownames or colnames, depending on whether you decide to do it as a one-column vector or a one-row vector, where each rowname or colname would be the name of the variable, so that would make the stats easily accessible. For example:

Code:

quiet summarize var1 var2
mat sds = r(sd) // this would now be a matrix not a scalar
di sds[var1,1] // If you decide to make it a single column vector

If you decide to return them all as scalars you could have an abbreviation of the statistic followed by an underscore and then the variable name. For example:

Code:

quiet summarize var1 var2
di r(sd_var1)

I would prefer the vector method but either one would work. This way we don't have to call summarize several times when programming, and could hold the statistics in memory across the program. This is just a thought; maybe there is already another command that does this, and I apologize beforehand for my possible ignorance. The good thing is that this could work in many different versions of Stata, so it could be implemented in older versions with just an update. Of course, the problem is that old code may need revision.

Alfonso Sanchez-Penalver
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

22 Sep 2015, 10:58

Check out -moments- from SSC, which goes some distance in this direction.
daniel klein

Join Date: Mar 2014

Posts: 3850
#3

22 Sep 2015, 11:17

This should be fairly easy to implement yourself. I am a bit reluctant when it comes to changes to built-in commands. I think it would be undesirable to have the same "old" names for returned results now containing a completely different thing, e.g., r(sd) returning a vector instead of a scalar. I guess especially for experienced users/programmers it would be rather hard to get used to such a change. It would further require at least a new (sub)version release of Stata, because if you make such major changes you need to be sure not to break old code, at least under version control. At best I can see summarize returning additional results.

In any case (and especially if you are going to program such a thing) I would strongly recommend using the vector/matrix approach. The r(stat_varname) approach will fail because the name limit of 32 characters applies to r(), too. Thus, if the statistic were mean, the prefix mean_ would use 5 characters, so the maximum length of a variable name would be limited to 27 and the command would fail for any longer variable name.
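A minimal sketch of the do-it-yourself route, offered as an illustration rather than a finished command: loop over the variables, call summarize once per variable, and store the results in a matrix whose row names are the variable names. The dataset (auto), the variable list, and the matrix name stats are assumptions chosen for the example.

```stata
* Illustrative only: dataset, variable list, and matrix name are assumptions
sysuse auto, clear
local vars price mpg weight
local k : word count `vars'

* one row per variable, one column per statistic
matrix stats = J(`k', 2, .)
matrix rownames stats = `vars'
matrix colnames stats = mean sd

local i = 0
foreach v of local vars {
    local ++i
    quietly summarize `v'
    matrix stats[`i', 1] = r(mean)
    matrix stats[`i', 2] = r(sd)
}

matrix list stats

* look up one result by variable name and statistic name
display stats[rownumb(stats, "mpg"), colnumb(stats, "sd")]
```

Because the variable name is a row name here, rather than part of the name of a returned result, no statistic prefix eats into the 32-character limit.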

Best
Daniel
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#4

22 Sep 2015, 14:53

Thanks Nick, I will definitely check it out. Thanks for the comments, Daniel. I agree that it shouldn't be hard to program; I could always go ahead and base most of the code on summarize. I do agree that the vector/matrix idea is better; it was my first thought. If Stata goes ahead and adds this functionality, I agree it would be better to return additional results, maybe under e(), so it doesn't mess up old code.

Alfonso Sanchez-Penalver

Nick Cox

Join Date: Mar 2014
Posts: 35698
#5

23 Sep 2015, 02:38

Headline: It is not at all necessary to repeat summarize, detail yourself to see many results on skewness and kurtosis.

Expanding further on my suggestions:

1. moments (SSC) dates from 2004 and has an option to save results to a matrix. Next time I revisit the code I will add an option to save to a Stata dataset. Saving to variables is not difficult with the existing version and svmat.

2. I'd not trust skewness and kurtosis especially, for various reasons. They can reflect individual extreme values rather than the general shape of a distribution. They are limited by sample size (see http://www.stata-journal.com/sjpdf.h...iclenum=st0204) and they are typically dependent on each other. But with care they can help you spot problems, or rather features of your dataset.

3. L-moments remain underused in many fields of statistical science. For a Stata implementation see lmoments (SSC). For reference, know that a Gaussian and any other symmetric distribution have L-skewness (t_3) of 0 (no surprise there) and that a Gaussian has L-kurtosis (t_4) of 0.123 (3 d.p.). L-kurtosis is less ambiguously a measure of tail weight than is kurtosis.

4. Always, always plot the data too. My multqplot (SJ) lets you look at about 12 variables at a time.

Some examples here

Code:

. webuse nlswork , clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. lmoments , short

-----------------------------------------------------------------------------
                   n = 13452 |        l_1         l_2         t_3         t_4
-----------------------------+-----------------------------------------------
                      NLS ID |   2588.740     862.218      -0.003       0.001
              interview year |     79.117       3.369       0.013      -0.015
                  birth year |     48.107       1.740      -0.033       0.029
         age in current year |     30.203       3.677       0.049       0.051
   1=white, 2=black, 3=other |      1.296       0.214       0.454       0.025
1 if married, spouse present |      0.626       0.234      -0.251      -0.171
          1 if never married |      0.208       0.165       0.584       0.176
     current grade completed |     12.682       1.219       0.126       0.262
       1 if college graduate |      0.189       0.153       0.623       0.234
               1 if not SMSA |      0.284       0.203       0.432      -0.017
           1 if central city |      0.342       0.225       0.317      -0.125
                  1 if south |      0.408       0.242       0.184      -0.208
      industry of employment |      7.842       1.683      -0.045      -0.102
                  occupation |      4.839       1.688       0.266       0.123
                  1 if union |      0.229       0.176       0.543       0.118
  weeks unemployed last year |      2.112       1.939       0.842       0.648
       total work experience |      6.773       2.419       0.210       0.105
        job tenure, in years |      3.448       1.862       0.377       0.184
          usual hours worked |     36.199       4.717      -0.314       0.341
      weeks worked last year |     50.728      11.489      -0.043       0.246
       ln(wage/GNP deflator) |      1.714       0.254       0.040       0.144
-----------------------------------------------------------------------------

. moments, matname(results)

-----------------------------------------------------------------------------
                   n = 13452 |       mean          SD    skewness    kurtosis
-----------------------------+-----------------------------------------------
                      NLS ID |   2588.740    1493.488      -0.010       1.803
              interview year |     79.117       5.894       0.042       1.754
                  birth year |     48.107       3.033      -0.130       1.995
         age in current year |     30.203       6.414       0.199       2.169
   1=white, 2=black, 3=other |      1.296       0.479       1.171       3.048
1 if married, spouse present |      0.626       0.484      -0.520       1.270
          1 if never married |      0.208       0.406       1.438       3.067
     current grade completed |     12.682       2.373       0.110       4.230
       1 if college graduate |      0.189       0.391       1.591       3.531
               1 if not SMSA |      0.284       0.451       0.958       1.918
           1 if central city |      0.342       0.474       0.667       1.445
                  1 if south |      0.408       0.492       0.374       1.140
      industry of employment |      7.842       3.016      -0.083       1.498
                  occupation |      4.839       3.223       1.116       3.609
                  1 if union |      0.229       0.420       1.293       2.671
  weeks unemployed last year |      2.112       7.001       4.469      25.083
       total work experience |      6.773       4.409       0.943       3.390
        job tenure, in years |      3.448       3.770       1.875       6.784
          usual hours worked |     36.199      10.034      -0.808       7.879
      weeks worked last year |     50.728      21.332      -0.185       3.381
       ln(wage/GNP deflator) |      1.714       0.460       0.170       4.100
-----------------------------------------------------------------------------

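* store the row names of the results matrix in a string variable
gen which = ""
replace which = word("`: rownames results'", _n)
* copy the matrix columns into variables results1-results5
svmat results
* rename each new variable after the corresponding matrix column
forval j = 1/5 {
    rename results`j' `: word `j' of `: colnames results''
}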

. ds
idcode    age       nev_mar   not_smsa  ind_code  wks_ue    hours     which     SD
year      race      grade     c_city    occ_code  ttl_exp   wks_work  n         skewness
birth_yr  msp       collgrad  south     union     tenure    ln_wage   mean      kurtosis

Last edited by Nick Cox; 23 Sep 2015, 02:41.


Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#6

23 Sep 2015, 09:42

Hi Nick,

Glad to hear you will think of adding more functionality to moments. I was looking at it yesterday and realized that, because it's byable, adding what I was looking for becomes tricky if the user decides to use by.

I love it when one of my topics piques your interest because I always end up with a list of papers to read. I've downloaded the Hosking (1992) paper you reference in help lmoments and will definitely read yours on the size effects on skewness and kurtosis estimates.

I had already downloaded multqplot some time ago. Let's see when I get some time to check it out. Thanks again for pitching in on this!!!

Alfonso Sanchez-Penalver
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

23 Sep 2015, 10:09

When using by: don't overlook the skew() and kurt() functions in egen, which have been available to users since 1999!
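As a hedged illustration of that point, the sketch below computes group-wise skewness and kurtosis with egen. The auto dataset, the grouping variable foreign, and the new variable names are assumptions made for the example.

```stata
* Illustrative only: dataset and variable names are assumptions
sysuse auto, clear

* skewness and kurtosis of price within each level of foreign
bysort foreign: egen skew_price = skew(price)
bysort foreign: egen kurt_price = kurt(price)

* the new variables are constant within group, so the mean shows each group's value
tabstat skew_price kurt_price, by(foreign) statistics(mean) nototal
```

The results live in variables rather than in r(), so they stay in memory for the rest of the program.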
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#8

23 Sep 2015, 10:33

Yes, I know. I meant that when using by: if you're going to store results in matrices, it just became much more complex, because now you would have a matrix per category. This was related to you revising moments.
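To make the "one matrix per category" point concrete, here is a rough sketch using official commands only. It relies on the save option of tabstat, which leaves the statistics in r(StatTotal); the auto dataset, the grouping variable foreign, and the matrix names stats_0 and stats_1 are assumptions for illustration. Building the matrix name from the level value only works when the levels are nonnegative integers.

```stata
* Illustrative only: dataset, grouping variable, and matrix names are assumptions
sysuse auto, clear

* distinct values of the grouping variable (0 and 1 here)
levelsof foreign, local(levels)

foreach g of local levels {
    * statistics in rows, variables in columns
    quietly tabstat price mpg if foreign == `g', statistics(mean sd) save
    matrix stats_`g' = r(StatTotal)
    matrix list stats_`g'
}
```

With many categories the number of matrices grows accordingly, which is the complication described above.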

Alfonso Sanchez-Penalver
