Creating a variable containing average of other variable

Thibault Jolie

Join Date: Apr 2014

Posts: 2
#1

Creating a variable containing average of other variable

03 Apr 2014, 08:45

Dear all,

I have a question about something that seems relatively easy to do, but I can't figure out for some reason.

For my research I need to generate a variable which should equal: var 1 / mean(var 2). However, the mean(var2) should only consist of those values which belong to a specific group. Both variables have been sorted into 9 categories. So for example, in the following table:

Var1 Cat1 Var 2 Cat 2

X 3 Y1 2

Y2 3

Y3 3

Y4 1

I would ideally divide X / mean (Y2 & Y3).

Is there anyone who know an efficient fix for this issue?
Thank you for your consideration!

Best regards,
Thibault
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#2

03 Apr 2014, 09:38

I think you want something like this:

levelsof Cat1, local(cats)
gen conditional_mean = .
foreach c of local cats {
egen t = total(Var2/(Cat2 == `c')) if Cat1 == `c' // ADDS UP ALL VALUES OF Var2 WHERE Cat2 == `c'
egen c = total((Cat2 == `c') & !missing(Var2)) if Cat1 == `c' // COUNTS CASES WHERE Cat2 == `c' AND Var2 HAS A NON-MISSING VALUE
replace conditional_mean = t/c if Cat1 == `c'
drop t c
}
gen new_variable = Var1/conditional_mean

The key thing that makes this work is that the expression Var2/Cat2==`c' evaluates to Var2 in those observations where Cat2 = `c', but to missing in other cases.

There may be a shorter way to do this, but this code is relatively self-explanatory, and if you come back to look at it in 6 months you'll probably still be able to figure out what it does.

Hope this helps.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35208
#3

03 Apr 2014, 10:25

There is a review of technique in this territory in

SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ...
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q2/11 SJ 11(2):305--314 (no commands)
reviews techniques for relating values to values in other
observations

Abstract: Many problems in data management center on relating values to values in other observations, either within a dataset as a whole or within groups such as panels. This column reviews some basic Stata techniques helpful for such tasks, including the use of subscripts, summarize, by:, sum(), cond(), and egen. Several techniques exploit the fact that logical expressions yield 1 when true and 0 when false. Dividing by zero to yield missings is revealed as a surprisingly valuable device.

Clyde's code makes use of one of my favourite tricks, but it will, I think, yield results only when Cat1 and Cat2 are identical. I don't think that is what is wanted here. If that guess is right, try omitting the first two if qualifiers within the loop.

Last edited by Nick Cox; 03 Apr 2014, 10:34.
Comment
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#4

03 Apr 2014, 11:11

Out of curiosity, what about

Code:

preserve keep var2 cat2 collapse (mean) cmeanvar2=var2 , by(cat2) ren cat2 cat1 tempfile mean save `mean' restore merge n:1 cat1 using `mean' gen new_var = var1/cmeanvar2

This code obviously has the disadvantage of the need to store a temporary file, but other than that, what are the advantages of the previous approach over this merge approach?

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35208
#5

03 Apr 2014, 11:46

Assuming that the code is equivalent, the advantages and disadvantages start with what is easy and difficult for anyone to think about.

Just as the Swedes supposedly mix two gene pools (risk-willing Vikings and risk-averse Volvos, so to speak), so also many Stata users seek one of two poles, within file solutions based on by: and between file solutions based on merge. by-pilots and merge-mavens, in one terminology.
Comment
Thibault Jolie

Join Date: Apr 2014

Posts: 2
#6

05 Apr 2014, 15:09

Thank you for all of your responses.
However, my problem has actually gotten a little more complicated than what I explained, and so I am now once again stuck.

I now have 3 different conditions, all 3 of which should be combined in order to compare the two variables. These 3 conditions do, however, contain the same responses for both variables.
An example of an observation could be: company X in category 1 belongs to industry A & class A1 in year 2010. This observation should then be matched with the mean of all the companies in category 2 also belonging to industry A & class A1 in year 2010.

To solve this issue, I thought that I could potentially split up the problem into two different loops. In the first loop, I could create a variable that displays the mean for all variables in category 2 per combination of the 3 conditions. Then, in a next loop I was thinking of combining this information for each observation in category 1.

The first loop that I was writing was based on Clyde Schechter and looked like this:

gen conditional_mean =

forval c = 1/9 {
forval i = 1/6 {
forval j = 2009/2012 {
local condition `"cond1== `c' & cond2== `i' & cond3== `j'"'
egen t = total(cat2/`condition')
egen c = total(`condition')
replace conditional_mean = t/c if `condition'
}
}
}

However, the part that I have put in bold is not working.
Is there anyone else who has a suggetion of something I could try? Would there be an efficient solution that combines both of my steps into one loop?

Thank you for your consideration. If there is anything else I need to clarify, please let me know.

Best regards,
Thibault
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#7

06 Apr 2014, 11:38

Thibault,

Macro evaluation is just string substitution. So when Stata sees

local condition `"cond1== `c' & cond2== `i' & cond3== `j'"'
egen t = total(cat2/`condition')

it transforms that into:

egen t = total(cat2/cond1== `c' & cond2== `i' & cond3== `j')

which is clearly not what you want. You need parentheses around your condition for it to work in the -egen t = - statement. I would actually accomplish that by building it right into the definition of local condition, thus:

local condition `"(cond1== `c' & cond2== `i' & cond3== `j')"'

(While you could just put parentheses into the -egen t- statement around `condition', if you move on to something more complicated where they're needed again, you might forget. Putting them right into the definition assures they will never be forgotten--and their presence can never hurt even when they aren't needed, as in the -egen c- statement.)

Hope this helps.
Comment

Var1	Cat1	Var 2	Cat 2
X	3	Y1	2
		Y2	3
		Y3	3
		Y4	1

Announcement

Creating a variable containing average of other variable

Comment

Comment

Comment

Comment

Comment

Comment