  • Calculating population standard deviation

    Hi everyone,

    I have a panel dataset with several countries and years. I am trying to calculate the standard deviation of one variable (growth) for each country over three-year windows (i.e. the standard deviation over 1981 to 1983, over 1984 to 1986, and so on). I successfully managed to do this with the following commands:

    Code:
    *Generating 3-year windows
    gen years = 1 + 3 * floor((year - 1)/3)
    
    *Calculating volatility
    bysort id years: egen sd=sd(growth)
    This all works fine. However, with the above command Stata calculates the sample standard deviation, while I want the population standard deviation. I searched Google and the Stata manuals, but I cannot find any command that calculates what I need. Stata simply refers to "standard deviation", without mentioning the distinction between sample and population. I can see that I could achieve my goal by entering the formula manually, but I struggle to believe that Stata doesn't have a command for this.
    Do you have any suggestions?
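
    To see which years the window formula above groups together, a minimal sketch on a toy dataset may help. The nine-year range 1978-1986 is an assumption chosen only for illustration; the `gen years` line is the one from the post.

    ```stata
    * Toy data (hypothetical): one observation per year, 1978-1986
    clear
    set obs 9
    gen year = 1977 + _n

    * Same window formula as in the post: label each year with the
    * first year of its three-year block
    gen years = 1 + 3 * floor((year - 1)/3)

    * Show one block per group of rows
    list year years, sepby(years)
    ```

    With this formula the blocks start in years of the form 3k + 1, so the toy data should fall into 1978-1980, 1981-1983 and 1984-1986.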

    EDIT: Actually, I would also like to hear your opinions on whether I am doing the right thing by looking for the population, rather than sample standard deviation. I am using the data on gdp growth for each 3-year period to calculate the standard deviation in those 3 years (i.e. not to infer the standard deviation of growth over a longer time-span). Do you agree that the population standard deviation is indeed the most correct statistic for this case?

    Thanks in advance,

    Giuseppe
    Last edited by Giuseppe Canzonieri; 30 Mar 2017, 09:20.

  • #2
    There are no built-in functions or -egen- functions that calculate the population standard deviation. In most situations, the sample standard deviation is what is wanted. And it is not hard to transform the sample estimate into the population standard deviation: just multiply by sqrt((N-1)/N). (I'm sure you knew that!)

    As for whether the population standard deviation is indeed the most correct statistic for this situation, that really depends on what you intend to do with it and how that relates to the scientific theory underlying your research. The theory should explicitly or implicitly state which measure is the most germane to what you're doing. If no inferences beyond your sample data are intended, as you say, and the statistic you calculate is supposed to be a pure description of in-sample variation, then, you are correct in preferring the population standard deviation calculation. Then again, in light of the sqrt((N-1)/N) correction factor, you can see that unless N is small it doesn't make a lot of difference.
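
    The correction factor mentioned above follows directly from the two definitions. As a short worked derivation (the symbol \sigma_N for the "population" or descriptive standard deviation is notation assumed here, not used in the thread):

    ```latex
    \[
    s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2},
    \qquad
    \sigma_N = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}
             = s\,\sqrt{\frac{N-1}{N}}
    \]
    \[
    \sqrt{\tfrac{N-1}{N}} \approx 0.8165 \ (N=3), \qquad
    0.9487 \ (N=10), \qquad
    0.9950 \ (N=100)
    \]
    ```

    For three-year windows the factor is about 0.82, so here the choice between the two versions changes the result by roughly 18%.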

    Digression: I have always disliked the terms "sample" and "population" standard deviation. They become confusing because, for example, you are in a situation where what you have is clearly a sample, but the appropriate statistic, it seems, is the "population" standard deviation. I think that it would be better to use different names. In my own mind I think of the "sample" standard deviation as the "inferential" standard deviation and the "population" standard deviation as the "descriptive" standard deviation. I think this terminology would better reflect the fact that the "sample" standard deviation is actually an estimate of the standard deviation in the full population inferred from the distribution observed in the sample, whereas the "population" standard deviation is a pure measure of the variation observed in the sample, with no generalizability to the larger population.


    • #3
      Thank you very much for your answer Clyde. As I said, I was just puzzled that Stata does not include a command for this.
      I followed your suggestions and am now running:

      Code:
      gen years = 1 + 3 * floor((year - 1)/3)
      
      bysort id years: egen sd=sd(growth)
      
      replace sd = sqrt((sd^2)*(2/3))
      This (of course!) works. However, it presents some problems as soon as one of my time periods has gaps. Then the value is not correct anymore. For example:

      HTML Code:
      year    years    growth        sd
      2011    2011    2.801323     .6762154
      2012    2011    1.566219     .6762154
      2013    2011    1.227951     .6762154
      2014    2014    1.887512     .5942119
      2015    2014    2.916718     .5942119
      2016    2014           .     .5942119
      I obtain the correct value for the years 2011-2013, but for the years 2014-2016 (with a gap in 2016) the value is wrong. It should be 0.514603 instead. This of course happens because I am multiplying the sample variance for that period by 2/3, while because of the gap it should be multiplied by just 1/2. Of course, fixing this manually for each period with gaps would be crazy. I guess I should come up with a loop to automate this.
      Any ideas?
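
      The discrepancy for 2014-2016 can be checked by hand from the two non-missing values in the table above. A worked calculation (rounded to four decimals; the subscripted x's are notation assumed here):

      ```latex
      \[
      N = 2:\qquad
      s = \frac{|x_{2015}-x_{2014}|}{\sqrt{2}}
        = \frac{2.916718-1.887512}{\sqrt{2}} \approx 0.7278
      \]
      \[
      s\,\sqrt{\tfrac{2}{3}} \approx 0.5942 \quad\text{(factor for } N=3\text{, the value shown)},
      \qquad
      s\,\sqrt{\tfrac{1}{2}} \approx 0.5146 \quad\text{(factor for } N=2\text{, the value wanted)}
      \]
      ```

      So the factor has to use the number of non-missing observations in each window, not the nominal window length of 3.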


      • #4
        -egen, count()- is your friend. No explicit loop needed.

        Code:
        gen years = 1 + 3 * floor((year - 1)/3)
        
        bysort id years: egen sd=sd(growth)
        by id years: egen N = count(growth) // NUMBER OF NON-MISSING OBS OF GROWTH FOR THIS ID-YEARS
        
        replace sd = sd*sqrt((N-1)/N)
        Note: No need to square sd and then put it inside sqrt(). (No harm from doing it, but unnecessary.)
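
        As a self-contained check, the approach above can be run on the six rows shown in #3. This is only a sketch: the single `id` value of 1 and the variable names `sd_sample` and `sd_pop` are assumptions made for illustration, and the result is stored in a new variable rather than overwriting `sd`.

        ```stata
        * Rows from the example table in #3; id = 1 is hypothetical
        clear
        input id year double growth
        1 2011 2.801323
        1 2012 1.566219
        1 2013 1.227951
        1 2014 1.887512
        1 2015 2.916718
        1 2016 .
        end

        * Three-year windows, as in the thread
        gen years = 1 + 3 * floor((year - 1)/3)

        * Sample SD and count of non-missing growth values per id-window
        bysort id years: egen sd_sample = sd(growth)
        by id years: egen N = count(growth)

        * Rescale the sample SD to the population (descriptive) SD
        gen double sd_pop = sd_sample * sqrt((N - 1)/N)

        list year years growth N sd_pop, sepby(years)
        ```

        The two windows should come out at roughly .6762 and .5146, the values discussed in #3. One caveat worth checking on real data: a window with a single non-missing observation should get a missing `sd_pop`, because the sample SD is undefined for N = 1, whereas the descriptive SD of a single value would be 0.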


        • #5
          This did the job! Thank you very much, Clyde.


          • #6
            Dear Clyde:
            I love, love, love this:

            " Digression: I have always disliked the terms "sample" and "population" standard deviation. They become confusing because, for example, you are in a situation where what you have is clearly a sample, but the appropriate statistic, it seems, is the "population" standard deviation. I think that it would be better to use different names. In my own mind, I think of the "sample" standard deviation as the "inferential" standard deviation and the "population" standard deviation as the "descriptive" standard deviation. I think this terminology would better reflect the fact that the "sample" standard deviation is actually an estimate of the standard deviation in the full population inferred from the distribution observed in the sample, whereas the "population" standard deviation is a pure measure of the variation observed in the sample, with no generalizability to the larger population."

            I've been teaching stats using the terminology you mention. It's been a lonely journey so far because practically everyone else uses incorrect terminology. I'm so glad to find someone who thinks about these concepts the right way.


            • #7
              I missed this thread in 2017 and 2019. I too quite like Clyde Schechter's comments about descriptive vs inferential SDs. But I would add that while s^2 is an unbiased estimator of the population variance, s is not an unbiased estimator of the population SD. However, coming up with an unbiased estimator of the population SD is no simple matter, and for most purposes, the square root of s^2 is good enough. I like this excerpt from the relevant Wikipedia page:

              The use of n − 1 instead of n in the formula for the sample variance is known as Bessel's correction, which corrects the bias in the estimation of the population variance, and some, but not all of the bias in the estimation of the population standard deviation.

              It is not possible to find an estimate of the standard deviation which is unbiased for all population distributions, as the bias depends on the particular distribution. Much of the following relates to estimation assuming a normal distribution.
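
              For the normal case the excerpt refers to, the size of the remaining bias can be written down explicitly. A sketch of the standard result, assuming n independent draws from a normal distribution with standard deviation \sigma (the constant is conventionally called c_4):

              ```latex
              \[
              \mathbb{E}[s] = c_4(n)\,\sigma,
              \qquad
              c_4(n) = \sqrt{\frac{2}{n-1}}\;
                       \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}
              \]
              \[
              c_4(3) = \frac{\Gamma(3/2)}{\Gamma(1)} = \frac{\sqrt{\pi}}{2} \approx 0.8862
              \]
              ```

              Under those assumptions s/c_4(n) is unbiased for \sigma, and with n = 3, as in the three-year windows of this thread, s underestimates \sigma by about 11% on average.
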
              Cheers,
              Bruce
              --
              Bruce Weaver
              Version: Stata/MP 18.5 (Windows)
