  • Standard deviation using "egen"

    Hi,

    I want to create a variable defined as the standard deviation of population over the last 3 years (prior to the current year) for each country. At least 2 observations of population within those three years are required to calculate the standard deviation. That is, if one value of population is missing within the last 3 years, the standard deviation should be calculated from the 2 years in that 3-year period where population was observed. If all three years are observed, the standard deviation should be calculated from those 3 values. If there is only one observation during the 3-year period, the variable should be set to missing.

    In my attached example:
    The standard deviation of country 1 in year 1973 should be calculated as the standard deviation of the values for population in 1972 and 1971 (as only 2 observations are required and the value for 1970 is missing).
    The standard deviation of country 1 in year 1974 should be calculated as the standard deviation of the values for population in 1973, 1972 and 1971.

    I declared the data as panel data with: xtset country year

    I know that there is the command: egen standard_deviation = sd(expression) and that it can be combined with by

    Unfortunately, I don't know how to use this command for my problem.

    Does anyone have an idea how to solve my problem?

    Thanks.

  • #2
    Unfortunately, -egen- won't do this job: it doesn't do moving window calculations.

    The key is to remember the fundamental identity: sigma^2 = E(x^2) - [E(x)]^2. Each of these terms can be calculated using simple functions and -by-.

    See the thread at http://www.statalist.org/forums/foru...ious-ten-years for detailed code on an almost identical problem.
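
    As a rough illustration of that identity applied to the 3-prior-year window in this thread, here is a minimal sketch. It is not the code from the linked thread; the variable names n3, sum3, sumsq3, sd3_n and sd3_n1 are made up for this example, and the toy data are the ones used later in this thread.

    ```stata
    clear
    input country year pop
    1 1971 111111
    1 1972 112222
    1 1973 113333
    1 1974 114444
    1 1975 115555
    end
    xtset country year

    * accumulate count, sum and sum of squares over the 3 prior years
    gen byte   n3     = 0
    gen double sum3   = 0
    gen double sumsq3 = 0
    forvalues k = 1/3 {
        quietly replace n3     = n3 + !missing(L`k'.pop)
        quietly replace sum3   = sum3 + cond(missing(L`k'.pop), 0, L`k'.pop)
        quietly replace sumsq3 = sumsq3 + cond(missing(L`k'.pop), 0, (L`k'.pop)^2)
    }

    * sigma^2 = E(x^2) - [E(x)]^2 with N in the denominator;
    * max(0, .) guards against a tiny negative value from rounding
    gen double sd3_n = sqrt(max(0, sumsq3/n3 - (sum3/n3)^2)) if n3 >= 2

    * rescale to the N-1 denominator if that definition is wanted
    gen double sd3_n1 = sd3_n * sqrt(n3/(n3 - 1))

    list country year pop n3 sd3_n sd3_n1
    ```

    Because the lags come from the time-series operators, a year that is absent from the data simply counts as missing, and any row with fewer than 2 observed prior years is left missing. For 1974 this should give about 907.13 with N and 1111 with N-1 in the denominator.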


    • #3
      Very similar topics have been discussed in the last few days. I wrote a little program that automates this in #4 of this thread.

      The idea is to use the row egen functions to compute statistics on a rolling window. Unfortunately, these functions do not accept a time-series varlist (see help tsvarlist). The workaround is to use tsrevar to create temporary variables and use those instead. This post is somewhat different in that the 3-year window appears to refer to prior years only, so the current observation is not part of the window. I still show how to do the same with my tsrollstat command.

      Code:
      clear
      input country year pop
      1 1971 111111
      1 1972 112222
      1 1973 113333
      1 1974 114444
      1 1975 115555
      end
      
      tsset country year
      
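      * ------------- do it manually -------------
      * tsrevar creates temporary variables holding pop lagged 1, 2 and 3 years
      tsrevar L(1/3).pop
      local rollvars `r(varlist)'
      
      * row-wise SD (N-1 denominator) and count of nonmissing values over the three lags
      egen sdpop3 = rowsd(`rollvars')
      egen n = rownonmiss(`rollvars')
      * keep the SD only when at least 2 of the 3 prior years are observed
      gen mysd = sdpop3 if n > 1
      
      drop `rollvars'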
      
      
      * ---------- use tsrollstat ------------------
      program drop _all
      program define tsrollstat
      
          version 9
          
          syntax varname, ///
              Generate(name) ///
              Window(integer) ///
              Minimum(integer) ///
              STAT(string) ///
              [ ///
              double ///
              ]
          
          local lastp = `window' - 1
      
          tsrevar L(1/`lastp').`varlist'
          local rollingvars `r(varlist)'
          
          qui egen `double' `generate' = row`stat'(`varlist' `rollingvars')
          
          tempvar n
          egen `n' = rownonmiss(`varlist' `rollingvars')
          qui replace `generate' = . if `n' < `minimum'
          
      end
      
      tsrollstat pop, window(3) minimum(2) gen(sdw3) stat(sd)
      gen mysd2 = L.sdw3
      
      list


      • #4
        Sorry, I could not use the posted example dataset, I don't have Stata 14 yet.


        • #5
          Thanks for your answer, Robert. Unfortunately, your code doesn't produce the correct standard deviations. E.g. in your example the standard deviation of population for the years 1971, 1972 and 1973 should be 907.12. Your code produces a result of 785.5956.


          • #6
            Originally posted by Clyde Schechter
            Unfortunately, -egen- won't do this job: it doesn't do moving window calculations.

            The key is to remember the fundamental identity: sigma^2 = E(x^2) - [E(x)]^2. Each of these can be calculated using simple functions and -by-:

            See the thread at http://www.statalist.org/forums/foru...ious-ten-years for detailed code on an almost identical problem.

            I have read this thread before. My problem is that I don't completely understand what you are doing in your code, so I don't know what numbers/code I have to change to obtain what I need.


            • #7
              Well, in that thread, the role of your population variable is played by CFTA, the role of country is played by gvkey, and year is the same. Similarly your window of 3 years parallels the other window of 10 years, and your limit of at least 2 years parallels the other limit of 3 years. In all other respects, the problems are, as far as I can see, entirely the same.

              If you do not understand how -by-, sum(), cond(), missing(), and the L. operators work, I can only say that you will never get very far in Stata without learning those--so now is as good a time as any to crack open the online manual and master these workhorses that you will need over and over again to do this kind of work.
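
              For orientation, here is a small sketch of what each of those tools does on a toy panel. The data and the variable names runpop, nobs, prevpop and prevpop0 are invented for illustration and are not taken from the other thread.

              ```stata
              clear
              input country year pop
              1 1971 111111
              1 1972 .
              1 1973 113333
              2 1971 200
              2 1972 210
              end
              xtset country year

              * by: restarts the calculation within each country;
              * sum() is a running sum that treats missing values as zero
              by country (year), sort: gen double runpop = sum(pop)

              * missing() returns 1 for a missing value, so this is a
              * running count of the years with an observed pop
              by country (year): gen int nobs = sum(!missing(pop))

              * L. is the previous year's value within the panel
              * (missing if that year is absent or its pop is missing)
              gen double prevpop = L.pop

              * cond(test, a, b) returns a when test is true, b otherwise
              gen double prevpop0 = cond(missing(L.pop), 0, L.pop)

              list, sepby(country)
              ```

              Listing the result side by side with pop makes it easy to see how the running totals and lags line up before combining them into a windowed statistic.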


              • #8
                Returning to #5 above. To begin, I think of sigma^2 = E[ (X - E[X])^2 ], and when we calculate it for a finite set of N values, E[X] is sum(X)/N and E[ (X - E[X])^2 ] is sum((X - sum(X)/N)^2)/K, where K is either N or N-1 depending on whether you want the population or the sample variance. Then there's the whole thing with the standard error, which is the standard deviation divided by the square root of N.

                I ran
                Code:
                summarize pop in 1/3
                on Robert's test data and it reported 1111 as the standard deviation. That's the standard deviation, using 2 in the denominator, of the values of pop for 1971, 1972, and 1973: sqrt( ((-1111)^2 + 0^2 + 1111^2)/(3-1) ). 907.13 is the value when the variance is calculated with N in the denominator. 785.5956 is the N-1 standard deviation of the 1971 and 1972 values alone: 1111/sqrt(2), which equals sqrt(3/4)*907.13.
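
                The three numbers under discussion can be checked with a few lines. This is only a sketch: it relies on the r(sd) and r(N) results that summarize leaves behind and rescales the N-1 figure by hand.

                ```stata
                clear
                input country year pop
                1 1971 111111
                1 1972 112222
                1 1973 113333
                end

                * all three years: summarize reports the N-1 version
                quietly summarize pop in 1/3
                display "1971-1973, N-1 denominator: " r(sd)
                display "1971-1973, N denominator:   " r(sd)*sqrt((r(N) - 1)/r(N))

                * only the first two years, N-1 denominator
                quietly summarize pop in 1/2
                display "1971-1972, N-1 denominator: " r(sd)
                ```

                The displayed values should be roughly 1111, 907.13 and 785.60, in that order.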

                What this all means is left as an exercise for the reader.
                Last edited by William Lisowski; 18 Apr 2015, 15:36.


                • #9
                  Sorry, I was away enjoying the Global Citizen 2015 Earth Day event in DC.

                  Upon further reflection, please discard from my code example in #3 the second part (using tsrollstat) because it would not work if the previous period is not in the data.

                  With respect to my code not producing the correct value, you have a fully functional example that can hardly be made simpler. Your original post says:

                  I want to create a variable which is defined as the standard deviation of population over the last 3 years (prior to the current year) for each country.
                  So my code shows a standard deviation of 785.5956 for 1973. That number represents the standard deviation over the 3 previous years (excluding the current year of 1973); since 1970 is not in the data, it is computed from 1971 and 1972 only. A simple way to check the number is to do

                  summarize pop in 1/2

                  For 1974, the standard deviation of the 3 previous years can be calculated using

                  summarize pop in 1/3

                  which yields 1111. Of course you are allowed to use another definition of the standard deviation.



                  • #10
                    A propos of #8, the code I have referred you to uses N, not N-1, in the denominator.


                    • #11
                      Hey man I had the same problem and Stata is making it complicated - just copy paste in excel and solve your problem easily there.
