Calculating standard deviation from the previous ten years

Diedrich Friedrich

Join Date: Mar 2015

Posts: 8
#1

Calculating standard deviation from the previous ten years

12 Apr 2015, 04:55

Hi all,

I have the following problem:

I need to create a variable Cashflow volatility which is defined as the standard deviation of cash flow-to-assets (CFTA) from the previous ten fiscal years (e.g. for 2011: 2001,2002,...,2010). The firm is required to have at least 3 observations. (As in my previous post we have fyear as variable for fiscal year and gvkey as firm identifier)

Unfortunately I have no idea how to include the point regarding the minimum observation requirement. I tried to build up an example beneath.

It would be great, if someone helped me.

Thank you very much.

input gvkey CFTA fyear 101 5 2001 101 6 2002 101 7 2003 101 -1 2004 101 3 2005 101 -1 2006 101 1 2007 101 4 2008 101 2 2009 101 3 2010 101 8 2011 102 5 2001 102 -3 2002 102 7 2003 102 -1 2004 102 3 2005 102 -1 2006 102 1 2007 103 7 2001 103 2 2002 104 1 2000 end
Tags: None
Diedrich Friedrich

Join Date: Mar 2015

Posts: 8
#2

14 Apr 2015, 10:36

That's how I would calculate the standard deviation. Is this the correct way?

xtset gvkey fyear
gen movingAvrg = (L1.CFTA+L2.CFTA+L3.CFTA+L4.CFTA+L5.CFTA+L6.CFTA+L 7.CFTA+L8.CFTA+L9.CFTA+L10.CFTA)/10
gen SqrDiff=(CFTA-movingAvrg)^2
gen sigma=sqrt((L1.SqrDiff+L2.SqrDiff+L3.SqrDiff+L4.Sq rDiff+L5.SqrDiff+L6.SqrDiff+L7.SqrDiff+L8.SqrDiff+ L9.SqrDiff+L10.SqrDiff)/10)

Attached Files

Standard Deviation.do (285 Bytes, 1 view)
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29951

14 Apr 2015, 11:19

I don't think it is. Your SqrDiff variable calculates the difference between the observed value of CFTA and a different moving average in each year. I would use a different approach to a rolling standard deviation, relying on the general identity sigma^2 = E(x^2) - (E(x))^2.

Code:

xtset gvkey fyear
by gvkey (fyear), sort: gen running_sum = sum(CFTA)
by gvkey (fyear): gen running_sum_sq = sum(CFTA^2)
by gvkey (fyear): gen running_n = sum(!missing(CFTA))
by gvkey (fyear): gen moving_avrg = (L1.running_sum - L11.running_sum)/(L1.running_n - L11.running_n)
by gvkey (fyear): gen moving_avg_sq = (L1.running_sum_sq - L11.running_sum_sq)/(L1.running_n - L11.running_n)
gen moving_variance = moving_avg_sq - moving_avrg^2
gen sigma = sqrt(moving_variance)

// AND WHEN YOU'RE SURE YOU DON'T NEED THESE ANY MORE FOR ANYTHING ELSE
drop running_sum running_sum_xsq running_n moving_avg_xsq moving_variance

Comment

Diedrich Friedrich

Join Date: Mar 2015

Posts: 8
#4

14 Apr 2015, 14:03

Thank you so much. Just one stupid question on your procedure:
Why do I need "by gvkey (fyear): gen running_n = sum(!missing(CFTA))"? What does this command do?

Furthermore, I would like to get values for sigma, as soon as there are >= 3 years observed for a firm. So if there are 3 years observed for a firm, sigma should reveal the volatility of the previous two years. If there are observed 4 years, sigma should reveal the volatility of the previous 3 years and so on. As soon as 11 years are observed, sigma should reveal the volatility of the previous 10 years. For any further year observed, sigma should reveal the volatility of the previous 10 years as well.

Any idea how to get this?

Last edited by Diedrich Friedrich; 14 Apr 2015, 14:25.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29951

14 Apr 2015, 17:59

The command with sum(!missing(CFTA)) calculates a running total of the number of non-missing values of CFTA observations within gvkey. (Try running it and then look and see.) You need that to get the denominators for the means.

The problem is a bit more complicated as you describe it in the second paragraph of #4. To simplify the code needed to deal with this, I'm going to start by filling in any gaps in your panel.

Code:

xtset gvkey fyear
gen byte original = 1
tsfill // SUPPLY EMPTY OBSERVATIONS TO FILL TIME GAPS

// CALCULATE RUNNING TOTALS 7& counts
by gvkey (fyear), sort: gen running_sum = sum(CFTA)
by gvkey (fyear): gen running_sum_sq = sum(CFTA^2)
by gvkey (fyear): gen running_n = sum(!missing(CFTA))

// CALCULATE STARTING POINTS FOR CALCULATIONS
by gvkey (fyear): gen start_sum = cond(_n >= 12, L11.running_sum, running_sum[1])
by gvkey (fyear): gen start_sum_sq = cond(_n >= 12, L11.running_sum_sq, running_sum_sq[1])
by gvkey (fyear): gen start_n = cond(_n >= 12, L11.running_n, running_n[1])

// CALCULATE MOVING AVERAGES, AND VARIANCE, AND SIGMA
by gvkey (fyear): gen moving_avrg = (L1.running_sum - start_sum)/(L1.running_n - start_n)
by gvkey (fyear): gen moving_avg_sq = (L1.running_sum_sq - start_sum_sq)/(L1.running_n - start_n)
gen moving_variance = moving_avg_sq - moving_avrg^2
gen sigma = sqrt(moving_variance)

//  AND NOW ELIMINATE RESULT IF N Obs < 3
by gvkey (fyear): replace moving_avrg = . if L1.running_n - start_n < 3
by gvkey (fyear): replace sigma = . if L1.running_n - start_n < 3

// AND DROP THE OBSERVATIONS THAT WERE ADDED BY tsfill
drop if original != 1

// AND DROP INTERMEDIATE VARIABLES IF NOT NEEDED ELSEWHERE
drop running_sum running_sum_sq running_n start_sum start_sum_sq start_n variance original

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29951
#6

14 Apr 2015, 19:06

On second thought, my code above for the starting points is not correct because it means that the very first year of data is always excluded from the moving window. Those lines should read:

Code:

// CALCULATE STARTING POINTS FOR CALCULATIONS by gvkey (fyear): gen start_sum = cond(_n >= 12, L11.running_sum, 0) by gvkey (fyear): gen start_sum_sq = cond(_n >= 12, L11.running_sum_sq, 0) by gvkey (fyear): gen start_n = cond(_n >= 12, L11.running_n, 0)

Sorry for any confusion I have caused.
1 like
Comment
Diedrich Friedrich

Join Date: Mar 2015

Posts: 8
#7

16 Apr 2015, 03:18

Thank you so much Clyde!!
Comment
Jens VanStappen

Join Date: Mar 2016

Posts: 3
#8

07 Mar 2016, 03:15

Hi,

I'm currently facing the same issue, namely that I'm calculating the standard deviation of the CFTA variable over the ten previous fiscal years. The only difference with Diedrich's problem is that I'm trying to calculate the average of these standard deviations for an entire industry. My dataset contains a variable named 'IndustrySIC', which is a two digit variable representing the industry in which the firm operates (e.g. 22, 68, ...). More specifically, I'm trying to calculate the industry average of the above calculated standard deviation of CFTA.

For instance for IndustrySIC = 22, I'd like the code to take into account all the calculated standard deviations of CFTA over the ten previous fiscal years and take the average of all these values.

I'm sure it's not so hard to do this, but I'm rather new to Stata. Thank you for your help and let me know in case I need to explain my problem more elaborately. For now, I've added the following line of code to the previously defined one by Mr Schnechter:

CODE:
bys IndustrySIC DataYearFiscal : egen industrysigma = mean(sigma)

Cheers,
Jens

Last edited by Jens VanStappen; 07 Mar 2016, 03:24.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35417
#9

07 Mar 2016, 03:28

There are at least two parts to this, the SD calculation and averaging the SDs.

For the first, many threads here since the post you are replying -- starting with

http://www.statalist.org/forums/foru...egen-functions

-- explain how to use tsegen (SSC) for this purpose.

For the second, I am queasy about averaging SDs. I'd cross-check with taking their root mean square. But perhaps this is allowed witchcraft in your field.
Comment
Jens VanStappen

Join Date: Mar 2016

Posts: 3
#10

07 Mar 2016, 06:17

Hi Nick,

Thank you for your quick response. I've looked at the tsegen function last week. The problem with this function however is that we seem to be unable to impose a minimum number of observations needed to calculate the SD. For instance, if a firm that goes public in 2007, it will start releasing its financials from that year on. If we calculate the SD in 2009, the SD will go from 2007 to 2009 and will not be representative as the financial crisis has a strong impact on its cash flow volatility, which is exactly what we're trying to calculate. With the tsegen, we had the following code:

CODE:
ssc inst tsegen
tsset GlobalCompanyKey DataYearFiscal
tsegen firm_volatility = rowsd(L(1/10).FCFE_netassets)
bys IndustrySIC DataYearFiscal : egen industrysigma = mean(firm_volatility)

Maybe I'm missing some obvious option of this function, but I don't see how to impose a minimum of X (for instance 5) observations to calculate the standard deviation.

As for averaging SDs, I'm following a specific regression model and our thesis promotor agreed we'd follow the model as specified.

Cheers
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35417
#11

07 Mar 2016, 06:34

tsegen is a command, not a function.

As its second author, I am quite familiar with it, but the help does document, and prominently, precisely how to stipulate a required minimum number of values for each computation.

tsegen [type] newvar = fcn(tsvarlist[, min_obs]) [if] [in] [, options]

where fcn is an egen function that expects a varlist. Optionally, results
based on fewer than min_obs non-missing observations are replaced with
missing values.

So wherever min_obs is not satisfied, results will be returned as missing.

What's more, that syntax is illustrated in several of the code examples.
Comment
Jens VanStappen

Join Date: Mar 2016

Posts: 3
#12

07 Mar 2016, 07:22

Thank you Mr. Cox and my apologies for my noob comments, I'm still learning a lot
Comment
Thorben Puls

Join Date: Dec 2016

Posts: 1
#13

02 Dec 2016, 03:13

Hi,

I got a similar problem to Diedrich Friedrich as I was trying to show the cash flow volatility over the previous years. Thanks to your code it worked out.
But I got a new exercise to reduce the previous time horizon to the five previous years.
I thought that it must have been the same code while just adjusting L11 to L6 in the quoted code below:

Originally posted by Clyde Schechter View Post

Code:

// CALCULATE STARTING POINTS FOR CALCULATIONS by gvkey (fyear): gen start_sum = cond(_n >= 12, L11.running_sum, 0) by gvkey (fyear): gen start_sum_sq = cond(_n >= 12, L11.running_sum_sq, 0) by gvkey (fyear): gen start_n = cond(_n >= 12, L11.running_n, 0)

But while trying, stata told me that my time variable has not been set.
I was wondering if you could briefly tell me what I'm doing wrong.

Best regards,
Thorben from Hamburg
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35417
#14

02 Dec 2016, 03:29

Code:

help tsset

or

Code:

help xtset

What is the problem you are tackling? Since this thread started, tsegen and rangestat have been posted on SSC.

It is likely that whatever you want to do is easier with one of those.

First announcements

http://www.statalist.org/forums/foru...egen-functions (26 April 2015) [mentioned in 174 posts]

http://www.statalist.org/forums/foru...s-within-range (30 March 2016) [mentioned in 151 posts]
Comment
Lari Fari

Join Date: Nov 2017

Posts: 4
#15

27 Nov 2017, 07:01

Hi there,

I also want to apply the following code that was proposed by Clyde:
Code: xtset gvkey fyear
by gvkey (fyear), sort: gen running_sum = sum(CFTA)
by gvkey (fyear): gen running_sum_sq = sum(CFTA^2)
by gvkey (fyear): gen running_n = sum(!missing(CFTA))
by gvkey (fyear): gen moving_avrg = (L1.running_sum - L11.running_sum)/(L1.running_n - L11.running_n)
by gvkey (fyear): gen moving_avg_sq = (L1.running_sum_sq - L11.running_sum_sq)/(L1.running_n - L11.running_n)
gen moving_variance = moving_avg_sq - moving_avrg^2 gen sigma = sqrt(moving_variance)

// AND WHEN YOU'RE SURE YOU DON'T NEED THESE ANY MORE FOR ANYTHING ELSE drop running_sum running_sum_xsq running_n moving_avg_xsq moving_variance

But when I enter this it only works untill the generation of running_n. For the rest the results are only missing values generated. What's the error with it?
Comment

Announcement