weights in tabstat and table results wildly differ

Ariel Karlinsky

Join Date: Jun 2015
Posts: 491

24 Jan 2018, 03:00

I noticed that when calculating weighted sums, tabstat and table give wildly different results. Code to replicate:

Code:

clear all
sysuse auto

tabstat mpg [aw=weight], s(sum) by(rep78)
table rep78 [aw=weight], c(sum mpg) row

And here are the results, which differ wildly (even the ratio of each level to the total differs):

Code:

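. tabstat mpg [aw=weight], s(sum) by(rep78)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |       sum
---------+----------
       1 |    127980
       2 |    501920
       3 |   1850920
       4 |   1049930
       5 |    668530
---------+----------
   Total |   4199280
--------------------

. table rep78 [aw=weight], c(sum mpg) row

----------------------
Repair    |
Record    |
1978      |   sum(mpg)
----------+-----------
        1 |   41.28387
        2 |   149.6593
        3 |   561.0549
        4 |   365.8293
        5 |   287.8211
          | 
    Total |   1384.974
----------------------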

Any idea what's going on here?

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

24 Jan 2018, 08:17

This is not a bug in the technical sense,* but something that StataCorp should probably change.

If you check out -help table- you will see that it does not support aweights: only fweights, iweights, and pweights. (This also makes sense since -table- calls -collapse-, which also does not support aweights.) In any case, it would be better if -table- gave an error message to that effect rather than misleading output.

-tabstat-, by contrast, does support aweights.

*The definition of bug that I use is that the program produces incorrect output when given valid input. It is poor design for a program to produce apparently valid output when given invalid input, but design deficiencies are not the same thing as bugs.

daniel klein

Join Date: Mar 2014
Posts: 3850
#3

24 Jan 2018, 09:26

It is true that the documentation for table does not list aweights as a supported weight type. However, collapse does allow aweights (they are actually the default) and this is documented. The manual also explains in detail how aweights affect the sum statistic. Here is a clumsy piece of code to demonstrate:

Code:

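sysuse auto , clear

// replicate by hand: rescale the aweights within each rep78 group
// so that they sum to the number of observations in that group
matrix table = J(5, 1, .)
generate aw  = .
generate sum = .
forvalues j = 1/5 {
    preserve
    keep if rep78 == `j'
    // r(N) is the group size, r(sum) the group total of weight
    summarize weight , meanonly
    replace aw = r(N)*weight/r(sum)
    // running sum; the last observation holds the weighted sum for the group
    replace sum = sum(aw*mpg)
    matrix table[`j', 1] = sum[_N]
    restore
}

// compare with table and collapse
table rep78 [aw=weight], c(sum mpg) row
matlist table
collapse (sum) mpg [weight=weight] , by(rep78)
list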

The relevant output:

Code:

. table rep78 [aw=weight], c(sum mpg) row

----------------------
Repair    |
Record    |
1978      |   sum(mpg)
----------+-----------
        1 |   41.28387
        2 |   149.6593
        3 |   561.0549
        4 |   365.8293
        5 |   287.8211
          | 
    Total |   1384.974
----------------------

. matlist table

             |        c1 
-------------+-----------
          r1 |  41.28387 
          r2 |  149.6593 
          r3 |  561.0549 
          r4 |  365.8293 
          r5 |  287.8211 

. collapse (sum) mpg [weight=weight] , by(rep78)
(analytic weights assumed)

. list

     +-----------------+
     | rep78       mpg |
     |-----------------|
  1. |     1   41.2839 |
  2. |     2   149.659 |
  3. |     3   561.055 |
  4. |     4   365.829 |
  5. |     5   287.821 |
     |-----------------|
  6. |     .   103.457 |
     +-----------------+

. 
end of do-file

Best
Daniel

Last edited by daniel klein; 24 Jan 2018, 09:28. Reason: added collapse command

William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

24 Jan 2018, 10:07

I became a bit confused, because neither Clyde's nor Daniel's answer addressed the much larger values reported by tabstat. So I looked at help weights and it tells me

aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; that is, the variance of the jth observation is assumed to be sigma^2/w_j, where w_j are the weights. Typically, the observations represent averages and the weights are the number of elements that gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in your data, when it uses them.

It does not seem to me that tabstat is doing the indicated rescaling. So we seem to have for the three commands:

collapse supports aweights, documents that fact, rescales aweights to calculate sums, and documents that it does the rescaling.

tabstat supports aweights, documents that fact, does not rescale aweights to calculate sums, and does not document how it handles aweights.

table supports aweights, does not document that fact, and rescales aweights to calculate sums.

Essentially, tabstat treats aweights as fweights in this case.

Code:

. tabstat mpg [fw=weight], s(sum) by(rep78)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |       sum
---------+----------
       1 |    127980
       2 |    501920
       3 |   1850920
       4 |   1049930
       5 |    668530
---------+----------
   Total |   4199280
--------------------
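The link between the two sets of numbers can be checked by hand. What follows is only a minimal sketch, assuming the rescaling described in help weights is applied within each rep78 group; the variable names raw_sum, wsum, n, and scaled_sum are made up for illustration.

```stata
sysuse auto, clear
drop if missing(rep78)

// unscaled weighted sum within each group (the figure tabstat reports)
egen double raw_sum = total(weight*mpg), by(rep78)

// group sum of the weights and group size
egen double wsum = total(weight), by(rep78)
egen n = count(mpg), by(rep78)

// rescale so that the weights sum to the group size
// (assumption: this is what table and collapse do)
generate double scaled_sum = raw_sum*n/wsum

// keep one row per group and show both versions side by side
bysort rep78: keep if _n == 1
list rep78 n raw_sum scaled_sum, noobs
```

If that assumption holds, raw_sum should match the tabstat column and scaled_sum the table column. For rep78 == 1, for example, 127980 * 2 / 6200 is about 41.28.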

Last edited by William Lisowski; 24 Jan 2018, 10:11.
daniel klein

Join Date: Mar 2014

Posts: 3850
#5

24 Jan 2018, 11:00

Interesting. It seems that tabstat calls summarize and the latter does rescale weights, but the rescaling does not affect the sum. Probably arguments can be made that the sum should not be affected by rescaling; probably arguments can be made that it should. However, it seems indisputable that the behavior here is inconsistent and poorly documented. Something should be done.

Best
Daniel
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#6

25 Jan 2018, 00:48

From my (limited) experience working with weights (in surveys, etc.) the sum (or total) estimator should of course be affected by weights.
Assume, for example, that we have sampled two individuals, each representing 1,000 individuals. Individual 1's income is 1000 and individual 2's income is 5000.
The total estimator should be 1000*1000 + 1000*5000 = 6,000,000.
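A small sketch of this two-person example, with made-up variable names (income, w, contrib). It computes the total once by hand and once by treating the weights as frequency weights in summarize:

```stata
clear
input income w
1000 1000
5000 1000
end

// by hand: each sampled income times the number of people it represents
generate double contrib = w*income
summarize contrib, meanonly
display %12.0fc r(sum)

// the same total, treating w as frequency weights
summarize income [fw=w], meanonly
display %12.0fc r(sum)
```

Both displays should show 6,000,000. Analytic weights rescaled to sum to N would turn each weight into 1 here, so a rescaled sum would be 6000 instead.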
daniel klein

Join Date: Mar 2014

Posts: 3850
#7

25 Jan 2018, 03:11

The question is not whether the sum should be affected by weights; it should and it is (pretty much in the way suggested in Ariel's income example). The question is whether the sum should be affected by the scale of the weights.

Whether the sum reported by the descriptive command summarize is supposed to be an estimator of the total is yet another, though probably related, question.
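One way to see the dependence on scale is to multiply the weights by a constant and rerun both commands. This is only a sketch: w10 is a hypothetical variable name, and the expected pattern rests on the behavior reported earlier in this thread, not on anything documented for tabstat.

```stata
sysuse auto, clear
generate double w10 = 10*weight   // hypothetical copy of the weights, ten times larger

// tabstat: sums with the original and with the rescaled weights
tabstat mpg [aw=weight], statistics(sum) by(rep78)
tabstat mpg [aw=w10],    statistics(sum) by(rep78)

// collapse: the same comparison
preserve
collapse (sum) mpg [aw=weight], by(rep78)
list
restore

preserve
collapse (sum) mpg [aw=w10], by(rep78)
list
restore
```

If the behavior described above holds, the tabstat sums should be ten times larger with w10, while the collapse sums should not change.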

Best
Daniel

Last edited by daniel klein; 25 Jan 2018, 03:13.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

25 Jan 2018, 06:08

How the sum should be affected by weights depends on the type of weight. Note the definition of aweights quoted in post #4. Those weights need not tell us anything about the number of values, only about the precision of each value. Of course the typical use case cited tells us that the weights are the number of values of which an average is comprised, but this need not be the case. That would be the justification for rescaling the aweights to sum to the number of observations. When the weights do represent the number of values averaged, then it seems to me sum (and count) should be calculated by treating the weights as fweights.

Compared to other packages I have worked with, Stata has a particularly subtle grasp of the different uses to which weights can be put, and a good ability to easily accommodate those different uses, and I find continued reference to the output of help weights to be useful in refreshing my understanding.

Last edited by William Lisowski; 25 Jan 2018, 06:29.