  • weights in tabstat and table results wildly differ

    I noticed that when calculating weighted sums, tabstat and table wildly differ. Code to replicate:
    Code:
    clear all
    sysuse auto
    
    tabstat mpg [aw=weight], s(sum) by(rep78)
    table rep78 [aw=weight], c(sum mpg) row
    And the results, which differ wildly (even the ratio of each level to the total):
    Code:
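    . tabstat mpg [aw=weight], s(sum) by(rep78)
    
    Summary for variables: mpg
         by categories of: rep78 (Repair Record 1978)
    
       rep78 |       sum
    ---------+----------
           1 |    127980
           2 |    501920
           3 |   1850920
           4 |   1049930
           5 |    668530
    ---------+----------
       Total |   4199280
    --------------------
    
    . table rep78 [aw=weight], c(sum mpg) row
    
    ----------------------
    Repair    |
    Record    |
    1978      |   sum(mpg)
    ----------+-----------
            1 |   41.28387
            2 |   149.6593
            3 |   561.0549
            4 |   365.8293
            5 |   287.8211
              | 
        Total |   1384.974
    ----------------------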

    Any idea what's going on here?

  • #2
    This is not a bug in the technical sense,* but something that StataCorp should probably change.

    If you check out -help table- you will see that it does not support aweights: only fweights, iweights, and pweights. (This also makes sense since -table- calls -collapse-, which also does not support aweights.) In any case, it would be better if -table- gave an error message to that effect rather than misleading output.

    -tabstat-, by contrast, does support aweights.

    *The definition of bug that I use is that the program produces incorrect output when given valid input. It is poor design for a program to produce apparently valid output when given invalid input, but design deficiencies are not the same thing as bugs.



    • #3
      It is true that the documentation for table does not list aweights as a supported weight type. However, collapse does allow aweights (they are actually the default) and this is documented. The manual also explains in detail how aweights affect the sum statistic. Here is a clumsy piece of code to demonstrate

      Code:
      sysuse auto , clear
      
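      // replicate by hand: rescale aweights to sum to N within each rep78 group
      matrix table = J(5, 1, .)
      generate aw  = .
      generate sum = .
      forvalues j = 1/5 {
          preserve
          keep if rep78 == `j'
          summarize weight , meanonly
          // normalized weight: group N times this car's share of the group's total weight
          replace aw = r(N)*weight/r(sum)
          // running sum; its last value is the weighted sum for the group
          replace sum = sum(aw*mpg)
          matrix table[`j', 1] = sum[_N]
          restore
      }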
      
      table rep78 [aw=weight], c(sum mpg) row
      matlist table
      collapse (sum) mpg [weight=weight] , by(rep78)
      list
      The relevant output

      Code:
      . table rep78 [aw=weight], c(sum mpg) row
      
      ----------------------
      Repair    |
      Record    |
      1978      |   sum(mpg)
      ----------+-----------
              1 |   41.28387
              2 |   149.6593
              3 |   561.0549
              4 |   365.8293
              5 |   287.8211
                | 
          Total |   1384.974
      ----------------------
      
      . matlist table
      
                   |        c1 
      -------------+-----------
                r1 |  41.28387 
                r2 |  149.6593 
                r3 |  561.0549 
                r4 |  365.8293 
                r5 |  287.8211 
      
      . collapse (sum) mpg [weight=weight] , by(rep78)
      (analytic weights assumed)
      
      . list
      
           +-----------------+
           | rep78       mpg |
           |-----------------|
        1. |     1   41.2839 |
        2. |     2   149.659 |
        3. |     3   561.055 |
        4. |     4   365.829 |
        5. |     5   287.821 |
           |-----------------|
        6. |     .   103.457 |
           +-----------------+
      
      . 
      end of do-file
      Best
      Daniel
      Last edited by daniel klein; 24 Jan 2018, 09:28. Reason: added collapse command



      • #4
        I became a bit confused, because neither Clyde's nor Daniel's answer addressed the much larger values reported by tabstat. So I looked at help weights and it tells me

        aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; that is, the variance of the jth observation is assumed to be sigma^2/w_j, where w_j are the weights. Typically, the observations represent averages and the weights are the number of elements that gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in your data, when it uses them.
        It does not seem to me that tabstat is doing the indicated rescaling. So we seem to have for the three commands:
        1. collapse supports aweights, documents that fact, rescales aweights to calculate sums, and documents that it does the rescaling.
        2. tabstat supports aweights, documents that fact, does not rescale aweights to calculate sums, and does not document how it handles aweights.
        3. table supports aweights, does not document that fact, and rescales aweights to calculate sums.
        Essentially, tabstat treats aweights as fweights in this case.
        Code:
        . tabstat mpg [fw=weight], s(sum) by(rep78)
        
        Summary for variables: mpg
             by categories of: rep78 (Repair Record 1978)
        
           rep78 |       sum
        ---------+----------
               1 |    127980
               2 |    501920
               3 |   1850920
               4 |   1049930
               5 |    668530
        ---------+----------
           Total |   4199280
        --------------------
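
        The two conventions can also be reproduced side by side. What follows is only a rough sketch, assuming (as the hand calculation in #3 suggests) that the rescaling is done within each rep78 group; the names wmpg, raw_sum, sum_w, n, and rescaled_sum are made up for illustration.

        ```stata
        sysuse auto, clear
        drop if missing(rep78)

        * contribution of each car to the unrescaled weighted sum
        generate double wmpg = weight*mpg

        * per group: raw weighted sum, sum of the weights, number of observations
        collapse (sum) raw_sum=wmpg sum_w=weight (count) n=mpg, by(rep78)

        * same sum after rescaling the weights to add up to n within the group
        generate double rescaled_sum = n*raw_sum/sum_w

        list, clean noobs
        ```

        If the reading above is right, raw_sum should line up with the tabstat column and rescaled_sum with the table and collapse results. Put differently, the rescaled sum is just the group size times the weighted mean.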
        Last edited by William Lisowski; 24 Jan 2018, 10:11.



        • #5
          Interesting. It seems that tabstat calls summarize and the latter does rescale weights, but the rescaling does not affect the sum. Probably arguments can be made that the sum should not be affected by rescaling; probably arguments can be made that it should. However, it seems indisputable that the behavior here is inconsistent and poorly documented. Something should be done.
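
          One way to check this is sketched below. It assumes that the sum tabstat reports is the r(sum) that summarize leaves behind, which has not been verified against tabstat's source here; wmpg is a made-up helper variable.

          ```stata
          sysuse auto, clear

          * stored results after a weighted summarize
          summarize mpg [aw=weight]
          display "N = " r(N) "   sum_w = " r(sum_w) "   sum = " r(sum)

          * unrescaled weighted sum computed directly, for comparison
          generate double wmpg = weight*mpg
          summarize wmpg, meanonly
          display "sum of weight*mpg = " r(sum)
          ```

          If the two sums agree while r(N) and r(sum_w) differ, the rescaling never reaches the reported sum.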

          Best
          Daniel



          • #6
            From my (limited) experience working with weights (in surveys, etc.) the sum (or total) estimator should of course be affected by weights.
            Assume for example that we have sampled two individuals, each representing 1,000 individuals. Ind 1's income is 1000, Ind 2's income 5000.
            The total estimator should be 1000*1000 + 1000 * 5000.
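
            As a small sketch, the same arithmetic in Stata, using toy data only (the variable names income, w, and contrib are invented for this example):

            ```stata
            clear
            input income w
            1000 1000
            5000 1000
            end

            * each sampled person stands in for w people
            generate double contrib = w*income
            summarize contrib, meanonly
            display %12.0fc r(sum)
            ```

            This reproduces the 6,000,000 implied by 1000*1000 + 1000*5000.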



            • #7
              The question is not whether the sum should be affected by weights; it should and it is (pretty much in the way suggested in Ariel's income example). The question is whether the sum should be affected by the scale of the weights.
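
              A quick way to probe the scale question is to multiply the weights by a constant and see which results move. This is a sketch only; w10 is a hypothetical variable and the factor of 10 is arbitrary.

              ```stata
              sysuse auto, clear
              generate double w10 = 10*weight   // same weights, ten times the scale

              * tabstat: compare sum and mean under the two scales
              tabstat mpg [aw=weight], statistics(sum mean)
              tabstat mpg [aw=w10],    statistics(sum mean)

              * collapse: compare the sum under the two scales
              preserve
              collapse (sum) mpg [aw=weight]
              list
              restore

              preserve
              collapse (sum) mpg [aw=w10]
              list
              restore
              ```

              Going by the output earlier in the thread, one would expect the tabstat sum to grow tenfold while the mean and the collapse sum stay put.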

              Whether the sum reported by the descriptive command summarize is supposed to be an estimator of the total is yet another, though probably related, question.

              Best
              Daniel
              Last edited by daniel klein; 25 Jan 2018, 03:13.



              • #8
                How the sum should be affected by weights depends on the type of weight. Note the definition of aweights quoted in post #4. Those weights need not tell us anything about the number of values, only about the precision of each value. Of course the typical use case cited tells us that the weights are the number of values of which an average is comprised, but this need not be the case. That would be the justification for rescaling the aweights to sum to the number of observations. When the weights do represent the number of values averaged, then it seems to me sum (and count) should be calculated by treating the weights as fweights.

                Compared to other packages I have worked with, Stata has a particularly subtle grasp of the different uses to which weights can be put, and a good ability to easily accommodate those different uses, and I find continued reference to the output of help weights to be useful in refreshing my understanding.
                Last edited by William Lisowski; 25 Jan 2018, 06:29.

