  • Coefficient of variation, problem: standard deviation creates many missing values

    Hi,

    I want to calculate the top management team's (tmt) coefficient of variation of the value AGE_2 (standard deviation/mean). This is on the 'GVKEY Year' level of analysis.

    I successfully created the mean age per GVKEY Year (tmt_average_age) with the command:

    bysort GVKEY Year: egen tmt_average_age = mean(AGE_2)

    Now I want to calculate the standard deviation of tmt_average_age (tmt_sd_age). I used the command:

    bysort GVKEY Year: egen tmt_sd_age = sd(tmt_average_age)

    This command ran, but it generated 4,332 missing values (screenshot attached).

    I don't understand why this command creates missing values since I expect it to just take the standard deviation of the tmt_average_age variable, which is present for each observation already (and thus per GVKEY Year).

    The new variable (tmt_sd_age) is missing when there is 1 tmt_members and zero when there are multiple tmt_members (see screenshots: browser 1, 2 and 3), whereas I want it to hold the standard deviation of the variable tmt_average_age.

    Can anyone explain how I can calculate the standard deviation of tmt_average_age correctly?

    Thanks in advance!

  • #2
    I want the standard deviation of tmt_average_age per GVKEY Year, at the GVKEY Year level. Thus 'egen tmt_sd_age = sd(tmt_average_age)' is not the solution, since it takes the standard deviation of tmt_average_age over the whole dataset.


    • #3
      Hi Yannick
      I think your question is not very clear.
      Based on the data you show, you CANNOT estimate the standard deviation of tmt_average_age by gvkey and year, because you have only 1 observation per group.
      You need at least 2 observations to estimate the SD.
      Perhaps you want something else?
      HTH


      • #4
        egen, sd() uses sample size MINUS 1 in the denominator.

        If you present subsamples of size 1 then the mean is well defined as the single value present but the SD is based on a calculation with division by zero at its heart and so you get missing.
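
        A minimal sketch on made-up data may make both symptoms concrete. The GVKEY values and ages below are hypothetical, and the variable names tmt_sd_of_mean and tmt_cv_age are introduced only for illustration. It also assumes that the quantity wanted is the SD of the members' ages (AGE_2) within each GVKEY Year, which the thread does not confirm. Within a group the mean is a constant, so its SD is zero whenever the group has two or more members.

        ```stata
        * Toy data: hypothetical values, not the poster's dataset
        clear
        input long GVKEY int Year double AGE_2
        1001 2010 50
        1001 2010 60
        1001 2010 70
        1002 2010 55
        end

        * Group mean: constant within each GVKEY Year
        bysort GVKEY Year: egen tmt_average_age = mean(AGE_2)

        * SD of a constant: 0 for the group of three, missing for the singleton
        bysort GVKEY Year: egen tmt_sd_of_mean = sd(tmt_average_age)

        * SD of the raw ages (assumed to be what is wanted):
        * 10 for the group of three, still missing for the singleton
        bysort GVKEY Year: egen tmt_sd_age = sd(AGE_2)

        * Coefficient of variation per GVKEY Year
        gen tmt_cv_age = tmt_sd_age / tmt_average_age

        list, sepby(GVKEY Year)
        ```

        The singleton group stays missing under either call, because the divisor is sample size MINUS 1.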

        You can insist on using sample size in the denominator of SD by writing your own code or adjusting what you got.

        You need to calculate the count of non-missing values

        Code:
        bysort GVKEY Year: egen tmt_count_age = count(tmt_average_age)
        Then you can adjust the SD estimates by

        Code:
        gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((count - 1) / count))
        You should certainly check that I got the fudge factor and the code correct.

        The coefficient of variation works best if SD is to a good approximation a multiple of the mean. That is,

        SD is approximately constant * mean

        which is why you are calculating

        constant = SD / mean

        which is called the coefficient of variation, as you know, not that giving it a grand name makes it a bigger deal.

        That's pretty much equivalent to saying (in this case) that age is best thought of on a logarithmic scale, which doesn't appeal much as an idea.
        Last edited by Nick Cox; 04 May 2022, 07:49.


        • #5
          FernandoRios Thank you for your help. I understand that for data with only 1 observation per group I cannot calculate the SD (in a normal way). However, in my screenshots you can see that I also have multiple observations (tmt_members) per group. For these data Stata only provides a zero instead of the standard deviation. Do you know how to get the standard deviation for the data where I do have more than 1 observation per group (multiple tmt_members per observation) instead of the value zero?


          • #6
            Nick Cox Thank you for your help.
            Do I understand correctly that your suggestion lets me compute an SD without division by zero?
            I tried your code.
            The second command did not work as written:

            gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((count - 1) / count))

            so I replaced 'count' in the code with 'tmt_sd_age':

            gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((tmt_sd_age -1) / tmt_sd_age))

            Is that correct?

            This provided me zeros and missing values again as shown in the screenshot attached. I do not know how to interpret this or to calculate the SD of tmt_average_age per GVKEY Year from here. Do you know how to proceed from here? Thank you in advance.

            • #7
              Sorry: count in #4 meant tmt_count_age

              The definition of SD was probably covered in your first course on statistics. You can divide sum of squared deviations by sample size MINUS 1 -- the reason being that variance with a divisor of sample size MINUS 1 is an unbiased estimator of population variance -- or by sample size. (Either way, you take square roots as the last step.)

              I don't always pay much attention to screenshots given our firm request that people don't post them and because they are often hard to read.

              tmt_count_age appears to be the same as your variable members. No harm done there.

              Your fix was wrong, however. It needs to be:


              Code:
              gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((tmt_count_age -1) / tmt_count_age))

              This is just algebra. You have

              square root of (sum of squared deviations / (sample size MINUS 1)) =: have

              and you want

              square root of (sum of squared deviations / (sample size)) =: want

              so that means (writing sqrt = square root)

              want = have x sqrt(sample size MINUS 1) / sqrt(sample size)

              = have x sqrt((sample size MINUS 1) / sample size)

              which is what my code is intended to give.

              The adjustment for missings to zero has to be separate because no multiplication will take missing to zero.
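
              As a hedged check of the algebra, the sketch below compares the adjusted SD with one computed directly using sample size as the divisor. The data are made up, the helper variables mean_age, ss and sd_direct are hypothetical names, and it assumes the SD is taken over the members' ages (AGE_2) rather than over the group mean.

              ```stata
              * Toy data: hypothetical values, not the poster's dataset
              clear
              input long GVKEY int Year double AGE_2
              1001 2010 50
              1001 2010 60
              1001 2010 70
              1002 2010 55
              end

              * Count of non-missing ages and the usual SD (divisor n - 1)
              bysort GVKEY Year: egen tmt_count_age = count(AGE_2)
              bysort GVKEY Year: egen tmt_sd_age = sd(AGE_2)

              * Adjusted SD (divisor n); singletons are set to 0 separately
              gen tmt_sd_age2 = cond(tmt_count_age == 1, 0, tmt_sd_age * sqrt((tmt_count_age - 1) / tmt_count_age))

              * Direct calculation: square root of (sum of squared deviations / n)
              bysort GVKEY Year: egen double mean_age = mean(AGE_2)
              bysort GVKEY Year: egen double ss = total((AGE_2 - mean_age)^2)
              gen double sd_direct = sqrt(ss / tmt_count_age)

              * The two versions should agree up to rounding
              assert reldif(tmt_sd_age2, sd_direct) < 1e-6
              ```

              If the assert passes silently, the adjustment and the direct calculation agree on this toy example, including the singleton group where both give 0.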
