Creating conditional mean

Arne VS

Join Date: May 2020

Posts: 11
#1

Creating conditional mean

06 May 2020, 08:19

Hi,

I hope someone can help me out with the following:
Given a data set like this:

HTML Code:

CompanyID Capex Post 1 100 0 1 200 1 2 300 0 2 200 1

where Post equals 0 or 1 if an observation is before or after a certain point in time, respectively.

How do I calculate mean (Capex) only of those companies where a company's mean (Capex) & Post == 1 is bigger than mean (Capex) & Post == 0?
Using the example data, I want mean (Capex) to be 150.

Thanks in advance.

Last edited by Arne VS; 06 May 2020, 08:56.
Tags: None
Lance Erickson

Join Date: Apr 2014

Posts: 144
#2

06 May 2020, 09:50

Does this give you what you want?

Code:

bysort CompanyID (Post): gen diff = Capex[2]-Capex[1] bysort CompanyID: egen mean = mean(Capex) if diff>0
Comment
Arne VS

Join Date: May 2020

Posts: 11
#3

06 May 2020, 10:14

Not quite, I will elaborate.
I need to check first if a company's mean(Capex) for the years where Post==1 is bigger than a company's mean(Capex) for the years where Post==0.
If this is true, then calculate mean(Capex) of all company observations where the above rule is satisfied (so calculate mean for Capex values both Post==0 and Post==1 cases).

HTML Code:

CompanyID Year Capex Post 1 2012 100 0 1 2013 120 0 1 2014 200 1 1 2015 300 1 2 2012 200 0 2 2013 140 0 2 2014 100 1 2 2015 200 1

Using company 1 from the above example: calculate mean capex for company 1 where Post==0: 220/2=110, calculate mean capex for company 1 where Post==1: 500/2=250.
Since mean where Post==1 is bigger than mean where Post==0 for company 1, calculate mean capex for every value of capex for that company: 720/4=180.
Iterate this for all companies and display mean(Capex) of all companies that satisfied the constraint mentioned. I am only interested in the output of the end mean, not the mean per company.
Comment

Lance Erickson

Join Date: Apr 2014
Posts: 144

06 May 2020, 10:34

Arne,

I changed the data for company 2 so that its Post=1 Capex mean would be lower than its Post=0 Capex mean. If I understand what you're asking, this should do it.

Code:

clear
input CompanyID Year Capex Post
1         2012  100   0
1         2013  120   0
1         2014  200   1
1         2015  300   1
2         2012  200   0
2         2013  140   0
2         2014  100   1
2         2015  140   1
end

preserve
    collapse Capex, by(CompanyID Post)
    bysort CompanyID: gen diff = Capex[2]-Capex[1]
    levelsof CompanyID if diff>0, local(comp) separate(,)
restore

bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
egen mean = mean(Capex)

The variable mean is a constant that is the mean of companies that have higher Post=1 than Post=0 means.

Lance

Comment

Arne VS

Join Date: May 2020

Posts: 11
#5

06 May 2020, 10:49

The new variable mean does not seem to show the correct value. Using the provided data, the mean should be 180. 162,5 is the mean which includes values of company 2, which does not satisfy the constraint.
Also, when I try running the code on my full dataset, I get the following error message when running:

HTML Code:

. bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp') expression too long r(130);

Last edited by Arne VS; 06 May 2020, 10:52.
Comment

Lance Erickson

Join Date: Apr 2014
Posts: 144

06 May 2020, 12:10

I can't reproduce the error you are getting. Are you running the full set of code at the same time? The local macro comp needs to still be in memory for the besort command after the restore.

The comp_mean variable contains the value of 180. Do you want that value to be a constant for all observations? This would do that.

Code:

clear
input CompanyID Year Capex Post
1         2012  100   0
1         2013  120   0
1         2014  200   1
1         2015  300   1
2         2012  200   0
2         2013  140   0
2         2014  100   1
2         2015  140   1
end

preserve
    collapse Capex, by(CompanyID Post)
    bysort CompanyID: gen diff = Capex[2]-Capex[1]
    levelsof CompanyID if diff>0, local(comp) separate(,)
restore

bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')
gen mean = comp_mean[1]

Of course, that assumes that the first observation satisfied your criteria.

Comment

Arne VS

Join Date: May 2020

Posts: 11
#7

06 May 2020, 12:51

Originally posted by Lance Erickson View Post

Arne,

I changed the data for company 2 so that its Post=1 Capex mean would be lower than its Post=0 Capex mean. If I understand what you're asking, this should do it.

Code:

clear input CompanyID Year Capex Post 1 2012 100 0 1 2013 120 0 1 2014 200 1 1 2015 300 1 2 2012 200 0 2 2013 140 0 2 2014 100 1 2 2015 140 1 end preserve collapse Capex, by(CompanyID Post) bysort CompanyID: gen diff = Capex[2]-Capex[1] levelsof CompanyID if diff>0, local(comp) separate(,) restore bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp') egen mean = mean(Capex)

The variable mean is a constant that is the mean of companies that have higher Post=1 than Post=0 means.

Lance

Using this set of code, comp_mean is calculated for companies which meet the constraint. This is correct.
Then, I need to have a new mean calculated for all comp_mean.
Basically, mean(comp_mean). But of course, only one comp_mean for a given CompanyID should be used in calculating the new mean, since for some observations, I have only 1 Post==0 observation, while for others I have 2 or more.

HTML Code:

comp_mean mean 180 162.5 180 162.5 180 162.5 180 162.5 162.5 162.5 162.5 162.5

I would pick the comp_mean for the relevant CompanyID and calculate the new mean for those values of comp_mean. Example: comp_mean for CompanyID1 + comp_mean for CompanyIDx... divided by the amount of companies that have a value for comp_mean.
Comment

Lance Erickson

Join Date: Apr 2014
Posts: 144

06 May 2020, 13:08

ok...I think this gets you where you want now...assuming I haven't still misunderstood something. I added another company that satisfies the criterion to make is easier to see if the code does what is intended.

Code:

clear
input CompanyID Year Capex Post
1         2012  100   0
1         2013  120   0
1         2014  200   1
1         2015  300   1
2         2012  200   0
2         2013  140   0
2         2014  100   1
2         2015  140   1
3         2012  100   0
3         2013  120   0
3         2014  500   1
3         2015  800   1
end

preserve
    collapse Capex, by(CompanyID Post)
    bysort CompanyID: gen diff = Capex[2]-Capex[1]
    levelsof CompanyID if diff>0, local(comp) separate(,)
restore

bysort CompanyID: egen comp_mean = mean(Capex) if inlist(CompanyID,`comp')

preserve
    duplicates drop CompanyID, force
    egen mean = mean(comp_mean) if comp_mean!=.
    replace mean = mean[_n-1] if mean==.
    tempfile mean
    save "`mean'"
restore

merge m:1 CompanyID using "`mean'", keepusing(mean)

Although there might be a more efficient way to do it, it seems this should work fine as long as the first company satisfies the criterion.

Comment

Arne VS

Join Date: May 2020

Posts: 11
#9

06 May 2020, 13:13

I would like to thank you for all the help you have provided! This displays the mean I am interested in.
However, for my actual (large) dataset, stata returns an error because there are too many ID's in inlist. Is there any way to get around this?
Comment

Lance Erickson

Join Date: Apr 2014
Posts: 144

#10

06 May 2020, 13:53

The inlist function is limited to 250 arguments, and the name of the variable counts as one of the arguments. One idea would be to loop through sets of companies. I got this to work with the toy dataset and I *think* it will work with your larger dataset.

Code:

clear
input CompanyID Year Capex Post
1         2012  100   0
1         2013  120   0
1         2014  200   1
1         2015  300   1
2         2012  200   0
2         2013  140   0
2         2014  100   1
2         2015  140   1
3         2012  100   0
3         2013  120   0
3         2014  500   1
3         2015  800   1
end

preserve
    collapse Capex, by(CompanyID Post)
    bysort CompanyID: gen diff = Capex[2]-Capex[1]
    local l = 0
    forvalues i = 1/`=ceil(_N/249)' {
        local f = 1+`l'
        if "`i'"=="1" local l = `f'+`=_N-1'
        if "`i'"!="1" local l = `f'+248
        levelsof CompanyID if diff>0 in `f'/`l', local(comp`i') separate(,)
        if "`i'"=="1" local inlist inlist(CompanyID,`comp`i'')
        if "`i'"!="1" local inlist `inlist' | inlist(CompanyID,`comp`i'')
    }
restore

bysort CompanyID: egen comp_mean = mean(Capex) if `inlist'

preserve
    duplicates drop CompanyID, force
    egen mean = mean(comp_mean) if comp_mean!=.
    replace mean = mean[_n-1] if mean==.
    tempfile mean
    save "`mean'"
restore

merge m:1 CompanyID using "`mean'", keepusing(mean)

I have to imagine there is a more efficient way to do this...perhaps someone else will see the thread and have a better solution.

Comment

Arne VS

Join Date: May 2020

Posts: 11
#11

06 May 2020, 14:05

When running this I get the error: "observation numbers out of range r(198)" after the loop.
After the line:

HTML Code:

bysort CompanyID: egen comp_mean = mean(Capex) if `inlist'

I get the same error
"expression too long r(130);"
Comment

Lance Erickson

Join Date: Apr 2014
Posts: 144

#12

06 May 2020, 14:53

I found the problem causing the out of range error. The code below fixes it. To figure it out, I had to create a dataset that was larger. The first part of the code does that.

This approach runs into an internal limit if there are too many companies. The code will work for 550 companies. I didn't fine tune the limit but 600 companies trips the limit. If you have more companies than that, one idea is to break the data into smaller datasets, run the code and then append them back together again. Again, though, I imagine there is a more efficient approach to do this that others might identify for you.

Code:

clear 
set obs 550
gen CompanyID = _n
expand 4
egen Year = seq(), from(2012) to(2015)
gen Capex = round(uniform()*999)
gen Post=Year>=2014

preserve
    collapse Capex, by(CompanyID Post)
    bysort CompanyID: gen diff = Capex[2]-Capex[1]
    local l = 0
    forvalues i = 1/`=ceil(_N/249)' {
        local f = 1+`l'
        local l = `f'+248
        if "`i'"=="`=ceil(_N/249)'" local l = _N
        levelsof CompanyID if diff>0 in `f'/`l', local(comp`i') separate(,)
        if "`i'"=="1" local inlist inlist(CompanyID,`comp`i'')
        if "`i'"!="1" local inlist `inlist' | inlist(CompanyID,`comp`i'')
    }
restore

bysort CompanyID: egen comp_mean = mean(Capex) if `inlist'

preserve
    duplicates drop CompanyID, force
    egen mean = mean(comp_mean) if comp_mean!=.
    replace mean = mean[_n-1] if mean==.
    tempfile mean
    save "`mean'"
restore

merge m:1 CompanyID using "`mean'", keepusing(mean)

Comment

Arne VS

Join Date: May 2020

Posts: 11
#13

06 May 2020, 15:37

Thank you for the help!
My dataset consists of around 3500 companies, so I will try breaking my dataset into several smaller sets and try running the code.
Comment

Announcement