Calculating average values by groups excluding own value

Markus Volt

Join Date: Mar 2017

Posts: 11
#1

Calculating average values by groups excluding own value

09 May 2018, 10:55

Dear all,

I am using Stata 14.

I would like to calculate the average (mean of the variable "decision") of a specific group (here it is defined as beeing in a specific market "market_id") per frim ("retailer"), thereby excluding the firm itself. My code looks as follows

clear
input str2 retailer byte(market_id retailer_id year decision)
"CO" 10 1 11 1
"CO" 10 1 12 1
"CO" 10 1 13 1
"DL" 10 2 11 1
"DL" 10 2 12 0
"DL" 10 2 13 0
"DL" 10 2 14 0
"DL" 10 2 15 1
"EV" 5 3 11 1
"EV" 5 3 12 0
"EV" 5 3 13 0
"NW" 5 4 11 0
"NW" 5 4 12 0
"NW" 5 4 13 0
"NW" 5 4 14 0
"GY" 10 5 11 0
"GY" 10 5 12 0
"GY" 10 4 13 1
end

sort market_id year
by market_id year: egen peer_decision = mean(decision)

This gives me the average of all firms in a given market per year. Hoiwever, I only want the average of the competitors thereby excluding firm i itself, when calculating the average by firm and group.

Example:

CO is on the same market as DL and GY. Thus I want the average of the decision of DL and GY (by year). Yet, by market_id year: egen peer_decision = mean(decision) inludes CO itself.

I could of course drop CO calculate the value and merge again. Yet, as I have many retailers this might soon become very messi.

I hope my question is clear.

Best
Markus
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35338

09 May 2018, 11:11

Thanks for the data example.

There is a Stata literature on such questions, hinging on the identity

mean of others = total of others / (#observations - 1)

= (sum of all MINUS this value) / (#observations -1)

with modifications for missing values.

But you could jump to rangestat (SSC), which offers the following kind of solution.

Code:

sort market_id year retailer

rangestat decision, interval(year 0 0) by(market_id) excludeself

list, sepby(market_id year)

     +-------------------------------------------------------------+
     | retailer   market~d   retail~d   year   decision   decis~an |
     |-------------------------------------------------------------|
  1. |       EV          5          3     11          1          0 |
  2. |       NW          5          4     11          0          1 |
     |-------------------------------------------------------------|
  3. |       EV          5          3     12          0          0 |
  4. |       NW          5          4     12          0          0 |
     |-------------------------------------------------------------|
  5. |       EV          5          3     13          0          0 |
  6. |       NW          5          4     13          0          0 |
     |-------------------------------------------------------------|
  7. |       NW          5          4     14          0          . |
     |-------------------------------------------------------------|
  8. |       CO         10          1     11          1         .5 |
  9. |       DL         10          2     11          1         .5 |
 10. |       GY         10          5     11          0          1 |
     |-------------------------------------------------------------|
 11. |       CO         10          1     12          1          0 |
 12. |       DL         10          2     12          0         .5 |
 13. |       GY         10          5     12          0         .5 |
     |-------------------------------------------------------------|
 14. |       CO         10          1     13          1         .5 |
 15. |       DL         10          2     13          0          1 |
 16. |       GY         10          4     13          1         .5 |
     |-------------------------------------------------------------|
 17. |       DL         10          2     14          0          . |
     |-------------------------------------------------------------|
 18. |       DL         10          2     15          1          . |
     +-------------------------------------------------------------+

Last edited by Nick Cox; 09 May 2018, 11:14.

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17653
#3

09 May 2018, 11:14

Markus: are you looking for something along the following lines?

Code:

egen flag=group( retailer) . forvalues i = 1(1)4 { bysort market_id (year): egen peer_decision`i' = mean(decision) if flag!=`i' }

PS: in all likelihood superseded by Nick's faster and better reply.

Last edited by Carlo Lazzaro; 09 May 2018, 11:16.

Kind regards,
Carlo
(StataNow 18.5)
1 like
Comment
Markus Volt

Join Date: Mar 2017

Posts: 11
#4

10 May 2018, 10:52

Nick, thanks a lot. The code works perfect, thats what I was looking for.
Comment

Attaullah Shah

Join Date: Aug 2014
Posts: 1667

10 May 2018, 11:57

Here is one more approach to finding statistics excluding the focal observation.

Code:

ssc install asrol

bys market_id year: asrol decision, stat(mean) xf(focal)
     +-------------------------------------------------------------+
     | retailer   market~d   retail~d   year   decision   mean_d~n |
     |-------------------------------------------------------------|
  1. |       NW          5          4     11          0          1 |
  2. |       EV          5          3     11          1          0 |
  3. |       NW          5          4     12          0          0 |
  4. |       EV          5          3     12          0          0 |
  5. |       EV          5          3     13          0          0 |
     |-------------------------------------------------------------|
  6. |       NW          5          4     13          0          0 |
  7. |       NW          5          4     14          0          . |
  8. |       GY         10          5     11          0          1 |
  9. |       DL         10          2     11          1         .5 |
 10. |       CO         10          1     11          1         .5 |
     |-------------------------------------------------------------|
 11. |       DL         10          2     12          0         .5 |
 12. |       CO         10          1     12          1          0 |
 13. |       GY         10          5     12          0         .5 |
 14. |       CO         10          1     13          1         .5 |
 15. |       GY         10          4     13          1         .5 |
     |-------------------------------------------------------------|
 16. |       DL         10          2     13          0          1 |
 17. |       DL         10          2     14          0          . |
 18. |       DL         10          2     15          1          . |
     +-------------------------------------------------------------+

The xf is an abbreviation that I use for "excluding focal". asrol allows excluding focal observation with two flavors. The first one is to exclude only the current observation while the second one is to exclude all observation of the relevant variable if there are similar (duplicate) values of the rangevar elsewhere in the given window. An example will better explain the distinction between the two options. Consider the following data of 5 observations, where X is the variable of interest for which we would like to calculate arithmetic mean and year is the rangevar.

Code:

 Example A:

         asrol X, stat(mean) xf(focal) gen(xfocal)

    Example B:

        asrol X, stat(mean) xf(year) gen(xfyear)
        
          +---------------------------------------+
          | year     X        xfocal       xfyear |
          |---------------------------------------|
          | 2001   100          350           350 |
          | 2002   200          325           325 |
          | 2003   300          300     266.66667 |
          | 2003   400          275     266.66667 |
          | 2004   500          250           250 |
          +---------------------------------------+

Explanation :

In Example A, we invoke the option xf() as xf(focal). asrol generates a new variable xfocal that contains the mean values of the rest of the observations in the given window, excluding the focal observation. Therefore, in the year 2001, xfocal variable has a value of 350, that is the average of the values of X in the years 2002, 2003, 2003, 2004 i.e. (200+300+400+500)/4 = 350. Similarly, the second observation of the xfocal variable is 325, that is (100+300+400+500)/4 = 325. Similar calculations are made when required statistics are estimated in a rolling window.

Example B differs from Example A in the definition of the focal observation(s). In Example B, we invoke the option xf() as xf(year), where year is an existing numeric variable. With this option, the focal observation(s) is(are) defined as the current observation and other observations where the focal observation of the rangevar has duplicates. Our data set has two duplicate values in the rangevar, i.e., year 2003. Therefore, the mean values are calculated as shown below:

Code:

        +-------------------------------------------------------+       
        |       obs 1: (200 + 300 + 400 + 500)/4 = 350          |
        |       obs 2: (100 + 300 + 400 + 500)/4 = 325          |                       
        |       obs 3: (100 + 200 + 500 )     /3 = 266.66667    |               
        |       obs 4: (100 + 200 + 500 )     /3 = 266.66667    |       
        |       obs 5: (100 + 200 + 300 + 400)/4 = 250          |                       
        +-------------------------------------------------------+

Application of example B:
One application is to find statistics overs industries while excluding a given industry itself. An industry will have many duplicates in terms of years as firms in a given industry will share similar years.

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

Comment

Bilal Caliskan

Join Date: Jul 2020

Posts: 18
#6

10 Sep 2021, 14:45

Hello,
I have a dataset including repetitive household members (multiple time observed same household members) and households. I would like to create a variable which is average education of employed members in each households. As I mentioned, my household members were observed multiple times. Please, could you help me to find out exact stata code to finish this process. Also, I attached the codes which I used. But, this gives me mean of all employed observations in household without distinguishing multiple time observation for household members. Thanks in advance.

egen mean_worker_educ=mean(worker_educ) if employment==1 | employment==4, by(hhnum)
Comment

Announcement