Coefficient of variation, problem: standard deviation creates many missing values

Yannick Beunderman

Join Date: May 2022

Posts: 17
#1


04 May 2022, 06:15

Hi,

I want to calculate the top management team's (tmt) coefficient of variation of the value AGE_2 (standard deviation/mean). This is on the 'GVKEY Year' level of analysis.

I successfully created the mean age per GVKEY Year (tmt_average_age) with the command:

bysort GVKEY Year: egen tmt_average_age = mean(AGE_2)

Now I want to calculate the standard deviation of tmt_average_age (tmt_sd_age). I used the command:

bysort GVKEY Year: egen tmt_sd_age = sd(tmt_average_age)

This command ran, but it generated 4,332 missing values (screenshot attached).

I don't understand why this command creates missing values since I expect it to just take the standard deviation of the tmt_average_age variable, which is present for each observation already (and thus per GVKEY Year).

The variable created for the standard deviation (tmt_sd_age) presents missing values when there is 1 tmt_members and a zero in case of multiple tmt_members (see screenshots: browser 1, 2 and 3), while I want it to present the standard deviation of the variable tmt_average_age.

Can anyone explain how I can calculate the standard deviation of tmt_average_age correctly?

Thanks in advance!
Yannick Beunderman

Join Date: May 2022

Posts: 17
#2

04 May 2022, 06:22

I want the standard deviation of tmt_average_age per GVKEY Year, at the GVKEY Year level. So 'egen tmt_sd_age = sd(tmt_average_age)' without the bysort prefix is not the solution, since that takes the standard deviation of tmt_average_age over the whole dataset.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#3

04 May 2022, 06:42

Hi Yannick
I think your question is not very clear.
Based on the data you show, you CANNOT estimate the standard deviation of tmt_average_age by GVKEY and Year, because you have only 1 observation per group.
You need at least 2 observations to estimate the SD.
Perhaps you want something else?
HTH
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

04 May 2022, 06:47

egen, sd() uses sample size MINUS 1 in the denominator.

If you present subsamples of size 1 then the mean is well defined as the single value present but the SD is based on a calculation with division by zero at its heart and so you get missing.
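
The singleton case can be reproduced on a toy dataset. This is only a sketch: the data values and the variable names n_age and sd_age are made up for illustration, and it takes the SD of AGE_2 directly.

```stata
* Toy data (hypothetical values): firm 1001 has one executive, firm 1002 has three
clear
input long GVKEY int Year double AGE_2
1001 2010 55
1002 2010 48
1002 2010 52
1002 2010 60
end

* Count of non-missing ages and SD (divisor n - 1) within each GVKEY-Year
bysort GVKEY Year: egen n_age  = count(AGE_2)
bysort GVKEY Year: egen sd_age = sd(AGE_2)

list GVKEY Year AGE_2 n_age sd_age, sepby(GVKEY Year)

* The singleton group has a missing SD; the group of three has a positive SD
assert missing(sd_age) if n_age == 1
assert sd_age > 0 & !missing(sd_age) if n_age == 3
```

The two assert lines stop with an error if the claim about singleton groups does not hold, so the block doubles as a quick check.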

You can insist on using sample size in the denominator of SD by writing your own code or by adjusting what you got.

You need to calculate the count of non-missing values

Code:

bysort GVKEY Year: egen tmt_count_age = count(tmt_average_age)

Then you can adjust the SD estimates by

Code:

gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((count - 1) / count))

You should certainly check that I got the fudge factor and the code correct.

The coefficient of variation works best if SD is to a good approximation a multiple of the mean. That is,

SD is approximately constant * mean

which is why you are calculating

constant = SD / mean

which is called the coefficient of variation, as you know, not that giving it a grand name makes it a bigger deal.

That's pretty much equivalent to saying (in this case) that age is best thought of on a logarithmic scale, which doesn't appeal much as an idea.

Last edited by Nick Cox; 04 May 2022, 06:49.
Yannick Beunderman

Join Date: May 2022

Posts: 17
#5

04 May 2022, 08:14

FernandoRios Thank you for your help. I understand that for data with only 1 observation per group I cannot calculate the SD (in a normal way). However, in my screenshots you can see that I also have multiple observations (tmt_members) per group. For these data Stata only provides a zero instead of the standard deviation. Do you know how to get the standard deviation for the data where I do have more than 1 observation per group (multiple tmt_members per observation) instead of the value zero?
Yannick Beunderman

Join Date: May 2022

Posts: 17
#6

04 May 2022, 08:26

Nick Cox Thank you for your help.
Do I understand you correctly that I can create a SD without division by zero, in your suggestion?
I tried your code.
The second command did not work at first as written:

gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((count - 1) / count))

so I replaced 'count' at the end of the code with 'tmt_sd_age' like

gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((tmt_sd_age -1) / tmt_sd_age))

Is that correct?

This provided me zeros and missing values again as shown in the screenshot attached. I do not know how to interpret this or to calculate the SD of tmt_average_age per GVKEY Year from here. Do you know how to proceed from here? Thank you in advance.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#7

04 May 2022, 08:45

Sorry: count in #4 meant tmt_count_age

The definition of SD was probably covered in your first course on statistics. You can divide sum of squared deviations by sample size MINUS 1 -- the reason being that variance with a divisor of sample size MINUS 1 is an unbiased estimator of population variance -- or by sample size. (Either way, you take square roots as the last step.)

I don't always pay much attention to screenshots given our firm request that people don't post them and because they are often hard to read.

tmt_count_age appears to be the same as your variable members. No harm done there.

Your fix was wrong, however. It needs to be:

Code:

gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((tmt_count_age -1) / tmt_count_age))

This is just algebra. You have

square root of (sum of squared deviations / (sample size MINUS 1)) =: have

and you want

square root of (sum of squared deviations / (sample size)) =: want

so that means (writing sqrt = square root)

want = have x sqrt(sample size MINUS 1) / sqrt(sample size)

= have x sqrt((sample size MINUS 1) / sample size)

which is what my code is intended to give.

The adjustment for missings to zero has to be separate because no multiplication will take missing to zero.
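
Note that tmt_average_age is constant within each GVKEY Year, so its SD within a group of two or more is necessarily zero. If the intended quantity is the spread of AGE_2 itself within each GVKEY Year (an assumption, since the thread does not settle this), the steps above could be sketched as follows. The toy data and the names tmt_n_age, tmt_sd_age_raw, tmt_sd_age_n, tmt_cv_age and ss are hypothetical.

```stata
* Toy data (hypothetical values): one singleton group and one group of three
clear
input long GVKEY int Year double AGE_2
1001 2010 55
1002 2010 48
1002 2010 52
1002 2010 60
end

* Mean, count and SD (divisor n - 1) of AGE_2 within each GVKEY-Year
bysort GVKEY Year: egen double tmt_average_age = mean(AGE_2)
bysort GVKEY Year: egen tmt_n_age              = count(AGE_2)
bysort GVKEY Year: egen double tmt_sd_age_raw  = sd(AGE_2)

* Rescale from divisor (n - 1) to divisor n; singletons are set to 0 separately
gen double tmt_sd_age_n = cond(tmt_n_age == 1, 0, ///
    tmt_sd_age_raw * sqrt((tmt_n_age - 1) / tmt_n_age))

* Coefficient of variation = SD / mean
gen double tmt_cv_age = tmt_sd_age_n / tmt_average_age

* Check the rescaling against a direct calculation with divisor n
bysort GVKEY Year: egen double ss = total((AGE_2 - tmt_average_age)^2)
assert abs(tmt_sd_age_n - sqrt(ss / tmt_n_age)) < 1e-6

list GVKEY Year AGE_2 tmt_n_age tmt_sd_age_n tmt_cv_age, sepby(GVKEY Year)
```

Whether to keep the divisor n - 1 (and accept missing values for singletons) or switch to divisor n (and get zeros) is a judgment call; the sketch shows only the second choice.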