  • Wish for summarize

    Hi,

    I was responding to a participant's question earlier and realized there is something that would be useful to have in the summarize command. As it currently stands, when you use it for descriptive statistics of several variables, only the statistics of the last variable are stored in r(), and they are all scalars. When more than one variable is passed in the varlist, it would be more useful to have a vector for each statistic that holds the values for all the variables. Alternatively, you could create a scalar for each statistic-variable combination, and that would work too. With vectors you could have rownames or colnames, depending on whether you use a one-column or a one-row vector, where each rowname or colname would be the name of a variable, which would make the statistics easily accessible. For example:
    Code:
    quiet summarize var1 var2
    mat sds = r(sd) // this would now be a matrix not a scalar
    di sds[var1,1] // If you decide to make it a single column vector
    If you decide to return them all as scalars, you could use an abbreviation of the statistic followed by an underscore and then the variable name. For example:
    Code:
    quiet summarize var1 var2
    di r(sd_var1)
    I would prefer the vector method, but either one would work. This way we would not have to call summarize several times when programming, and could hold the statistics in memory across the program. This is just a thought; maybe there is already another command that does this, and I apologize in advance for my possible ignorance. The good thing is that this could work in many different versions of Stata, so it could be implemented in older versions with just an update. Of course, the problem is that old code may need revision.
    Alfonso Sanchez-Penalver

  • #2
    Check out -moments- from SSC, which goes some distance in this direction.



    • #3
      This should be fairly easy to implement yourself. I am a bit reluctant when it comes to changes to built-in commands. I think it would be undesirable to have the same "old" names for returned results now containing a completely different thing, e.g., r(sd) returning a vector instead of a scalar. I guess it would be rather hard to get used to such a change, especially for experienced users and programmers. It would further require at least a new (sub)version release of Stata, because if you make such major changes you need to be sure not to break old code, at least under version control. At best I can see summarize returning additional results.

      In any case (and especially if you are going to program such a thing) I would strongly recommend using the vector/matrix approach. The r(stat_varname) approach will fail because the name limit of 32 characters applies to r(), too. Thus, if the statistic were mean, the prefix mean_ would use 5 characters, the maximum length of a variable name would be limited to 27, and the command would fail for any longer variable name.
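      A minimal sketch of the vector/matrix approach as a user-written wrapper, under stated assumptions: the program name summat, the returned name r(stats), and the choice of five statistics are hypothetical and are not part of official Stata or of any package mentioned in this thread. It only loops over summarize and stacks the results into one matrix whose rows are named after the variables.

      ```stata
      * Hypothetical wrapper (summat is an illustrative name, not an official command)
      capture program drop summat
      program define summat, rclass
          version 12
          syntax varlist(numeric) [if] [in]
          * novarlist: keep observations even if some variables are missing,
          * so each variable is summarized on its own nonmissing values
          marksample touse, novarlist
          tempname stats
          foreach v of local varlist {
              quietly summarize `v' if `touse'
              * append one row per variable: N, mean, sd, min, max
              matrix `stats' = nullmat(`stats') \ (r(N), r(mean), r(sd), r(min), r(max))
          }
          matrix rownames `stats' = `varlist'
          matrix colnames `stats' = N mean sd min max
          return matrix stats = `stats'
      end

      sysuse auto, clear
      summat price mpg weight
      matrix S = r(stats)
      * look up a cell by row and column name
      display S[rownumb(S, "mpg"), colnumb(S, "sd")]
      ```

      Because the results come back under a new name, r(stats), nothing that existing code expects from summarize is changed.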

      Best
      Daniel



      • #4
        Thanks Nick, I will definitely check it out. Thanks for the comments, Daniel. I agree that it shouldn't be hard to program; I could always base most of the code on summarize. I also agree that the vector/matrix idea is better; it was my first thought. If Stata goes ahead and adds this functionality, I agree it would be better to return additional results, maybe under e(), so it doesn't mess up old code.
        Alfonso Sanchez-Penalver



        • #5
          Headline: It is not at all necessary to repeat summarize, detail yourself to see many results on skewness and kurtosis.

          Expanding further on my suggestions:

          1. moments (SSC) dates from 2004 and has an option to save results to a matrix. Next time I revisit the code I will add an option to save to a Stata dataset. Saving to variables is not difficult with the existing version and svmat.

          2. I'd not trust skewness and kurtosis especially, for various reasons. They can reflect individual extreme values rather than the general shape of a distribution. They are limited by sample size (see http://www.stata-journal.com/sjpdf.h...iclenum=st0204) and they are typically dependent on each other. But with care they can help you spot problems, or rather features of your dataset.

          3. L-moments remain underused in many fields of statistical science. For a Stata implementation see lmoments (SSC). For reference, note that a Gaussian and any other symmetric distribution have L-skewness (t_3) of 0 (no surprise there) and that a Gaussian has L-kurtosis (t_4) of 0.123 (3 d.p.). L-kurtosis is less ambiguously a measure of tail weight than is kurtosis.

          4. Always, always plot the data too. My multqplot (SJ) lets you look at about 12 variables at a time.
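          As a quick check on the Gaussian reference value in point 3, the standard closed form for the L-kurtosis of a normal distribution is (30/pi)*arctan(sqrt(2)) - 9. That formula is not given in this post and is brought in here only for illustration; it can be evaluated directly in Stata:

          ```stata
          * L-kurtosis of a Gaussian from the closed form (30/pi)*atan(sqrt(2)) - 9
          scalar t4_gauss = (30/_pi)*atan(sqrt(2)) - 9
          display %5.3f t4_gauss
          ```

          This displays 0.123, matching the value quoted to 3 d.p.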

          Some examples here

          Code:
          . webuse nlswork , clear
          (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
          
          . lmoments , short
          
          -----------------------------------------------------------------------------
                             n = 13452 |        l_1         l_2         t_3         t_4
          -----------------------------+-----------------------------------------------
                                NLS ID |   2588.740     862.218      -0.003       0.001
                        interview year |     79.117       3.369       0.013      -0.015
                            birth year |     48.107       1.740      -0.033       0.029
                   age in current year |     30.203       3.677       0.049       0.051
             1=white, 2=black, 3=other |      1.296       0.214       0.454       0.025
          1 if married, spouse present |      0.626       0.234      -0.251      -0.171
                    1 if never married |      0.208       0.165       0.584       0.176
               current grade completed |     12.682       1.219       0.126       0.262
                 1 if college graduate |      0.189       0.153       0.623       0.234
                         1 if not SMSA |      0.284       0.203       0.432      -0.017
                     1 if central city |      0.342       0.225       0.317      -0.125
                            1 if south |      0.408       0.242       0.184      -0.208
                industry of employment |      7.842       1.683      -0.045      -0.102
                            occupation |      4.839       1.688       0.266       0.123
                            1 if union |      0.229       0.176       0.543       0.118
            weeks unemployed last year |      2.112       1.939       0.842       0.648
                 total work experience |      6.773       2.419       0.210       0.105
                  job tenure, in years |      3.448       1.862       0.377       0.184
                    usual hours worked |     36.199       4.717      -0.314       0.341
                weeks worked last year |     50.728      11.489      -0.043       0.246
                 ln(wage/GNP deflator) |      1.714       0.254       0.040       0.144
          -----------------------------------------------------------------------------
          
          . moments, matname(results)
          
          -----------------------------------------------------------------------------
                             n = 13452 |       mean          SD    skewness    kurtosis
          -----------------------------+-----------------------------------------------
                                NLS ID |   2588.740    1493.488      -0.010       1.803
                        interview year |     79.117       5.894       0.042       1.754
                            birth year |     48.107       3.033      -0.130       1.995
                   age in current year |     30.203       6.414       0.199       2.169
             1=white, 2=black, 3=other |      1.296       0.479       1.171       3.048
          1 if married, spouse present |      0.626       0.484      -0.520       1.270
                    1 if never married |      0.208       0.406       1.438       3.067
               current grade completed |     12.682       2.373       0.110       4.230
                 1 if college graduate |      0.189       0.391       1.591       3.531
                         1 if not SMSA |      0.284       0.451       0.958       1.918
                     1 if central city |      0.342       0.474       0.667       1.445
                            1 if south |      0.408       0.492       0.374       1.140
                industry of employment |      7.842       3.016      -0.083       1.498
                            occupation |      4.839       3.223       1.116       3.609
                            1 if union |      0.229       0.420       1.293       2.671
            weeks unemployed last year |      2.112       7.001       4.469      25.083
                 total work experience |      6.773       4.409       0.943       3.390
                  job tenure, in years |      3.448       3.770       1.875       6.784
                    usual hours worked |     36.199      10.034      -0.808       7.879
                weeks worked last year |     50.728      21.332      -0.185       3.381
                 ln(wage/GNP deflator) |      1.714       0.460       0.170       4.100
          -----------------------------------------------------------------------------
          
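          * label each matrix row with its variable name, then copy the matrix into variables
          gen which = ""
          replace which = word("`: rownames results'", _n)
          svmat results
          * rename results1-results5 to the matrix column names (n, mean, SD, skewness, kurtosis)
          forval j = 1/5 {
              rename results`j' `: word `j' of `: colnames results''
          }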
          
          . ds
          idcode    age       nev_mar   not_smsa  ind_code  wks_ue    hours     which     SD
          year      race      grade     c_city    occ_code  ttl_exp   wks_work  n         skewness
          birth_yr  msp       collgrad  south     union     tenure    ln_wage   mean      kurtosis
          Last edited by Nick Cox; 23 Sep 2015, 02:41.



          • #6
            Hi Nick,

            Glad to hear you will think about adding more functionality to moments. I was looking at it yesterday and realized that, because it is byable, adding what I was looking for becomes tricky if the user decides to use by:.

            I love it when one of my topics piques your interest because I always end up with a list of papers to read. I've downloaded the Hosking (1992) paper you reference in help lmoments and will definitely read yours on the effects of sample size on skewness and kurtosis estimates.

            I had already downloaded multqplot some time ago. Let's see when I get some time to check it out. Thanks again for pitching in on this!!!
            Alfonso Sanchez-Penalver



            • #7
              When using by: don't overlook the skew() and kurt() functions in egen, which have been available to users since 1999!
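              A short illustration of that route, using the auto dataset shipped with Stata; the new variable names skew_price and kurt_price are arbitrary choices for this sketch:

              ```stata
              sysuse auto, clear
              * group-wise skewness and kurtosis of price, stored in every observation of the group
              bysort foreign: egen skew_price = skew(price)
              bysort foreign: egen kurt_price = kurt(price)
              * one row per group, since both variables are constant within group
              tabdisp foreign, cell(skew_price kurt_price)
              ```

              The results live in variables rather than in r() or a matrix, so they stay in the dataset for later use.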



              • #8
                Yes, I know. I meant that when using by:, storing results in matrices becomes much more complex, because you would now have one matrix per category. This was related to you revising moments.
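                To make the "one matrix per category" point concrete, here is a sketch using official tabstat with its by() and save options, which is one existing command that already returns results this way. It is offered only as an illustration of the bookkeeping involved, not as what moments does or would do; the matrix names D, F, and T are arbitrary.

                ```stata
                sysuse auto, clear
                tabstat price mpg, by(foreign) statistics(mean sd) save
                * one saved matrix per category, plus one for the total
                matrix D = r(Stat1)
                matrix F = r(Stat2)
                matrix T = r(StatTotal)
                * the category labels are returned separately
                display "`r(name1)'" " / " "`r(name2)'"
                matrix list D
                ```

                Each matrix has the statistics as rows and the variables as columns, so the calling program has to keep track of which matrix belongs to which category.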
                Alfonso Sanchez-Penalver

