  • How to differentiate "tabulate", "table", "tabstat", "tabdisp"?

    I don't know which one should be used in specific cases.

    Many thanks in advance!

  • #2
    So, it's probably easiest to see with some examples. BTW, I end up using tabulate and tabstat a lot, table less so, and I have to look up the syntax every time I do. I've never used tabdisp.

    So, tabulate is great for counting the number of observations in various categories:

    Code:
    . tabulate year_founded if sample==1 & inrange(year_founded, 1995, 2000)
    
           Year |
        founded |
        (min of |
       founding |
      year from |
        Gerard, |
    NCET, NETS, |
          & VX) |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           1995 |         39        7.56        7.56
           1996 |         68       13.18       20.74
           1997 |         85       16.47       37.21
           1998 |         97       18.80       56.01
           1999 |         89       17.25       73.26
           2000 |        138       26.74      100.00
    ------------+-----------------------------------
          Total |        516      100.00
    
    
    . tabulate year_founded target_success if  sample==1 & inrange(year_founded, 1995, 2000)
    
          Year |
       founded |
       (min of |
      founding |
     year from |
       Gerard, |
         NCET, |  1 if target had IPO
       NETS, & |    or acquisition
           VX) |         0          1 |     Total
    -----------+----------------------+----------
          1995 |        14         25 |        39
          1996 |        40         28 |        68
          1997 |        56         29 |        85
          1998 |        59         38 |        97
          1999 |        64         25 |        89
          2000 |       105         33 |       138
    -----------+----------------------+----------
         Total |       338        178 |       516
    
    
    . tabulate target_status if  sample==1 & inrange(year_founded, 1995, 2000)
    
       IPO, Acquired, |
    etc. Zombie means |
             <=2 emps |      Freq.     Percent        Cum.
    ------------------+-----------------------------------
           1 - Zombie |         44        8.53        8.53
    2 - Going Concern |        230       44.57       53.10
         3 - Acquired |        113       21.90       75.00
              4 - IPO |         65       12.60       87.60
           5 - Failed |         64       12.40      100.00
    ------------------+-----------------------------------
                Total |        516      100.00

    tabstat is a lot like summarize; it just gives you more control over which statistics to include. I use it a lot because I like to see the median.

    Code:
    * Doing it with summarize
    . summ max_emp cum_patents_age3 age_exit if located_in_cluster ==0 & sample==1 & inrange(year_founded, 1995, 2000)
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         max_emp |        277     29.6787    84.97388          1       1060
    cum_patent~3 |        319    1.416928    4.083079          0         41
        age_exit |         98    9.877551    4.899538        -11         21
    
    . summ max_emp cum_patents_age3 age_exit if located_in_cluster ==1 & sample==1 & inrange(year_founded, 1995, 2000)
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         max_emp |        185    114.6703    385.0243          1       3000
    cum_patent~3 |        197    4.137056    11.53241          0         76
        age_exit |        111    7.828829     4.31672          0         20
    
    
    * Doing the same thing with tabstat
    . tabstat max_emp cum_patents_age3 age_exit if  sample==1 & inrange(year_founded, 1995, 2000), stats(n mean p25 median p75 min max) col(stat)
    >  by(located_in_cluster)
    
    Summary for variables: max_emp cum_patents_age3 age_exit
         by categories of: located_in_cluster (1 if startup located in Silicon V, Boston, or San Diego (no matter where started)
    
    located_in_cluster |         N      mean       p25       p50       p75       min       max
    -------------------+----------------------------------------------------------------------
                     0 |       277   29.6787         4        10        25         1      1060
                       |       319  1.416928         0         0         1         0        41
                       |        98  9.877551         7        10        13       -11        21
    -------------------+----------------------------------------------------------------------
                     1 |       185  114.6703         6        21        65         1      3000
                       |       197  4.137056         0         0         2         0        76
                       |       111  7.828829         5         7        11         0        20
    -------------------+----------------------------------------------------------------------
                 Total |       462  63.71212         5        13        35         1      3000
                       |       516  2.455426         0         0         1         0        76
                       |       209  8.789474         5         8        12       -11        21
    ------------------------------------------------------------------------------------------

    table is a little more flexible, but usually requires more typing:

    Code:
    . table located_in_cluster if sample==1 & inrange(year_founded, 1995, 2000), c(n max_emp mean max_emp median max_emp p75 max_emp) row col
    
    ----------------------------------------------------------------------
    1 if      |
    startup   |
    located   |
    in        |
    Silicon   |
    V,        |
    Boston,   |
    or San    |
    Diego (no |
    matter    |
    where     |
    started   |    N(max_emp)  mean(max_emp)   med(max_emp)   p75(max_emp)
    ----------+-----------------------------------------------------------
            0 |           277        29.6787             10             25
            1 |           185       114.6703             21             65
              |
        Total |           462       63.71212             13             35
    ----------------------------------------------------------------------
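
    One caveat, as far as I know: the c() (contents) syntax above is the form used before Stata 17. In Stata 17 and later, table was rewritten around statistic() options, and the older syntax has to be run under version control. A rough sketch of both forms, using the auto data set that ships with Stata rather than my data, so price and foreign are stand-ins for max_emp and located_in_cluster:

    ```stata
    * Assumption: the shipped auto data stand in for the data used above
    sysuse auto, clear

    * Stata 17 or later: one statistic() option per requested statistic
    table (foreign) (), statistic(count price) statistic(mean price) ///
        statistic(median price) statistic(p75 price)

    * Older contents() syntax, run under version control in Stata 17+
    version 16: table foreign, c(n price mean price median price p75 price) row
    ```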

    Hope that helps!



    • #3
      "I've never used tabdisp."
      I suspect that the vast majority of Stata users have never used it either. -tabdisp- is just a command that writes data from a long data set to the Results window in the layout that -table- produces. -table- is, in fact, a wrapper for -tabdisp-: it produces a data set of statistics that is suitable input for -tabdisp-, calls -tabdisp-, restores the original data, and then exits. The "produces a data set of statistics" part is done with -collapse-.
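
      A rough sketch of that pipeline, done by hand. This uses the auto data set that ships with Stata rather than any data from this thread, and mean_price and n are illustrative names; what -table- does internally is more elaborate than this:

      ```stata
      sysuse auto, clear

      * Set the data aside so they can be brought back afterwards
      preserve

      * Build one observation per cell, holding the statistics to display
      collapse (mean) mean_price=price (count) n=price, by(foreign rep78)

      * Write the cell values out in a two-way layout
      tabdisp rep78 foreign, cellvar(mean_price n)

      * Bring back the original data
      restore
      ```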

      I have used -tabdisp- a handful of times in the 24 years I have been using Stata. It comes in handy when the data set is very large and a complicated table is needed, because the calls to -collapse- that -table- uses can be time consuming. In that situation, you can gain appreciable efficiency by generating the statistics for input to -tabdisp- yourself, using -gen-, -egen-, and -keep if _n == 1- commands tailored specifically to your problem. These run much faster than -collapse-, which is burdened with decoding an elaborate syntax and coping with all manner of special situations and problems that might arise in the general case. -tabdisp- then writes the result out in the desired layout.
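
      A minimal sketch of that hand-rolled approach, again on the shipped auto data as a stand-in (mean_price is an illustrative name, and on a data set this small there is no speed difference to see):

      ```stata
      sysuse auto, clear
      preserve

      * Drop observations that would not appear in the table anyway
      drop if missing(rep78)

      * Compute the cell statistic directly, skipping -collapse-
      bysort foreign rep78: egen mean_price = mean(price)

      * Keep one observation per cell
      bysort foreign rep78: keep if _n == 1

      * Let -tabdisp- handle the layout
      tabdisp rep78 foreign, cellvar(mean_price) format(%9.1f)

      restore
      ```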

      In general, though, it's just simpler to use -table-, and the efficiency penalty for doing so is usually ignorable.
