Creating Sums by group, and calculating averages leaving one observation out

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

Creating Sums by group, and calculating averages leaving one observation out

12 Nov 2018, 20:34

Hi All,

I have data that resembles the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(Individual x Year) 1 32 1990 2 3 1990 3 23 1990 4 3 1990 1 2 1991 2 23 1991 3 2 1991 4 1 1991 end

In the above, I have data by individuals and year. What I wish to do is to i) Calculate the average of the variable x leaving the contribution of Individual i, for each individual, and year, and ii) Calculating the weight of the other observations in the total sum of the variable x, leaving out variable i's contribution.

Part i) is relatively simple.

Code:

by year, egen Total=sum(x) by year Individual, sort: replace Total=Total-x

However, ii) is more convoluted. This is because each individual will be assigned a different weight in total sum, for every other individual, as each individual has her own specific value of Total, disregarding her contribution. For instance, for 1990, individual 1 will have weight of total output that is different with respect to total output disregarding 2, 3 and 4. Similarly, for 2 and so on. Would the joinby command be appropriate? Any guidance is much appreciated.

Best,
CS
Tags: None

David Benson

Join Date: Oct 2018
Posts: 489

12 Nov 2018, 23:17

Hi Chinmay,

I don't know if this is what you meant. I added one year of fake data to yours (and one more individual), to make sure that the code can handle different number of individuals per year.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(individual x year)
1 32 1990
2  3 1990
3 23 1990
4  3 1990
1  2 1991
2 23 1991
3  2 1991
4  1 1991
1 23 1992
2  3 1992
3  5 1992
4  2 1992
5  2 1992
end

Code:

bysort year (individ): gen year_count = _N
label var year_count "Number of individuals that year"
bysort year: gen  avg_with    = total / year_count
bysort year: egen avg_with2   = mean(x)  // just as a double-check. avg_with = avg_with2 (drop 1 of them)
bysort year: gen  avg_without = (total - x) / (year_count - 1)  // calc avg that year *excluding* that individual

label var avg_with "Avg of X (that year), including focal individual"
label var avg_with2 "Same as avg_with, just calc using egen"
label var avg_without "Avg of X (that year) excluding focal individual"

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35438

13 Nov 2018, 00:24

Most of this is covered by a FAQ. https://www.stata.com/support/faqs/d...ng-properties/ Our own Statalist FAQ Advice includes a reminder to look at the FAQs.

The code so far will cover the benign case with no missing values.

Correcting an error in #1 and then building on david Benson's helpful code, the case with possibly some missing values is in essence

Code:

* egen, sum() undocumented since Stata 9  
bysort year: egen total = total(x)
by year : egen year_count = count(x)  
gen  avg_without = (total - cond(missing(x), 0, x)) / (year_count - !missing(x))

although there is a more direct solution using rangestat (SSC)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(individual x year)
1 32 1990
2  3 1990
3 23 1990
4  3 1990
1  2 1991
2 23 1991
3  2 1991
4  1 1991
1 23 1992
2  3 1992
3  5 1992
4  2 1992
5  2 1992
end

rangestat (mean) x, int(year 0 0) excludeself
list, sepby(year)
     +----------------------------------+
     | indivi~l    x   year      x_mean |
     |----------------------------------|
  1. |        1   32   1990   9.6666667 |
  2. |        2    3   1990   19.333333 |
  3. |        3   23   1990   12.666667 |
  4. |        4    3   1990   19.333333 |
     |----------------------------------|
  5. |        1    2   1991   8.6666667 |
  6. |        2   23   1991   1.6666667 |
  7. |        3    2   1991   8.6666667 |
  8. |        4    1   1991           9 |
     |----------------------------------|
  9. |        1   23   1992           3 |
 10. |        2    3   1992           8 |
 11. |        3    5   1992         7.5 |
 12. |        4    2   1992        8.25 |
 13. |        5    2   1992        8.25 |
     +----------------------------------+

However, I am puzzled by

Calculating the weight of the other observations in the total sum of the variable x, leaving out individual i's contribution.

even with the correction in bold.

The proportions of the total of the others will differ according to which individual is left out, so you would seem to need as many extra variables as there are individuals. Is that right?

Last edited by Nick Cox; 13 Nov 2018, 00:26.

Comment

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#4

13 Nov 2018, 07:03

Thanks David Benson and Nick Cox . This is super helpful. Yes you are right Nick. I guess the way it would look like is that for each individual-year cell, there would be N extra variables. This is because this individual will be given a weight corresponding to every other individual left out (including herself left out, this would be just zero). As such, there would be N^2 extra variables each year (where N could vary as well).
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#5

13 Nov 2018, 07:34

What are you going to do with these extra variables?
Comment
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#6

13 Nov 2018, 09:11

Thanks for your response, Nick. The entire procedure is more complicated than that, but I did not wish to inundate Statalist users with the procedure in its entirety. I am performing a variance decomposition by the variable x. Now, for each individual, the variance of x can be expressed in terms of covariances with every other individual (in addition to some other terms, which are not the focus), However, these are not simple covariances. For each individual i, her variance equals the weighted average of the covariances with all other individuals, where the weights correspond to the fraction that each of the other individuals' have in the sum of x, disregarding individual i,. And, I wish to do this for every individual.

As such, my query was first to figure out how the weights would be calculated, and worry about covariances later. I am not sure if there is a procedure that obviates the need for extra variables.

Many thanks,
CS
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#7

13 Nov 2018, 09:20

I don't recognise the recipe here, but that means nothing. I can be as ignorant as anyone else away from home. I imagine it has a name in economics or some other social science. If it's standard, there may be a program.
Comment
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#8

13 Nov 2018, 09:43

Thanks Nick Cox . Are there any FAQs or recommendations you can point towards for generating the weights as above?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35438
#9

13 Nov 2018, 12:07

I just reacted to it as an ugly-looking problem and backed away, so I have no useful thoughts for you.
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#10

13 Nov 2018, 15:39

I don't do variance decomposition, so this is far from my expertise, however, if it is common, you ought to be able to find something for it.

When you began, I thought you were going to do something like a DFBETA (DFBETA after a regression calculates the difference between the regression coefficient calculated for all of the data and the regression coefficient calculated with the particular observation deleted). I thought you would then do something like, "how much does the average change when I omit a particular observation."

In the old Statalist, someone asked the question (see https://www.stata.com/statalist/arch.../msg01138.html)

I would like to do a variance decomposition. Particularly I want to express total variation as a sum of between-group and within-group variation. I was wondering if there are any stata modules to do that.

The answer this person got was:

If it is a continuous variable, you can use -xtreg-;

xtreg y , i(groupvar)

will report the within group and between group variances.

But that would lump the groups together over time, and you want this on a per-year basis.

There is a Stata Journal article "Estimating variance components in Stata" here that might be of use.

I hope some of this might be helpful in pointing you in the right direction.
Comment

Announcement