Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • standard deviation of a variable with missing observations, panel

    HI!

    I have a panel firm level dataset, and would like to inspect variable's standard deviation for the studied period. The issue is that during one year, some firms have missing and some have zeros which hard to justified (a separate issue..). In addition, during one year, most first have missing observations.Hence, I would like to do the following:
    • create stdv for variable employment, ignoring the year of 2011 (id=5, table below)
    • create std. for variable employment, ignoring any missing, no matter the year (id=6, table below)
    • create std. for variable employment, ignoring zeros values from the calculation, no matter the year (id=7, table)
    Code:
     
    year id employment
    2006 5 458
    2007 5 2
    2008 5 1
    2009 5 1
    2010 5 1
    2011 5
    2012 5 1
    2013 5 1
    2014 5 1
    year id employees
    2006 6 0
    2007 6 1
    2008 6 0
    2009 6 2
    2010 6 2
    2011 6
    2012 6
    2013 6
    2014 6
    year id employees
    2006 7 0
    2007 7 1
    2008 7 0
    2009 7 2
    2010 7 2
    2011 7 3
    2012 7 4
    2013 7 6
    2014 7 5
    The command taking into account all three conditions, would be the most welcomed. The code I use is the following:

    bysort id year: egen sd_emp=sd(employees) if year!=2011 & employees!=0 & employees!=.
    However, the variable created contains only missings.. What am I doing wrong?

    Thank you in advance.

  • #2
    In the data you show, year and id uniquely identify observations. So when you try to calculate standard deviations within id-year groups, there is only one observation in each group, and the standard deviation is undefined due to the n-1 in the denominator. It has nothing to do with your particular conditions.

    Did you perhaps mean to calculate the standard deviations just within id? If so, let me first point out that explicitly excluding missing values is unnecessary: that is automatic in the calculation of standard deviations. Exclusion of zeroes, however, does make a difference. So, it would be this:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year id employees)
    2006 5 458
    2007 5   2
    2008 5   1
    2009 5   1
    2010 5   1
    2011 5   .
    2012 5   1
    2013 5   1
    2014 5   1
    2006 6   0
    2007 6   1
    2008 6   0
    2009 6   2
    2010 6   2
    2011 6   .
    2012 6   .
    2013 6   .
    2014 6   .
    2006 7   0
    2007 7   1
    2008 7   0
    2009 7   2
    2010 7   2
    2011 7   3
    2012 7   4
    2013 7   6
    2014 7   5
    end
    
    by id, sort: egen sd_emp = sd(cond(year!=2011 & employees != 0), employees, .)
    In the future, please post example data using the -dataex- command, as I have done above. It makes it much easier for those who want to help you to rapidly create a completely faithful replica of your data and try out code on it. -ssc install dataex- to get the command, and -help dataex- for simple instructions in its use.

    Comment


    • #3
      @Clyde, many thanks! As usual, problem is resolved! Next time I will improve my posts in a suggested manner.
      Best,
      Mina

      Comment


      • #4
        Related to my earlier post, how to i ignore firms, which have three consecutive (or more) consequtive missing and zeros, from the calculation of sd. the earlier (#1) criteria hold too!


        [CODE][
        17386000 2006 .
        17386000 2007 880
        17386000 2008 7
        17386000 2009 0
        17386000 2010 0
        17386000 2011 .
        17386000 2012 .
        17386000 2013 .
        17386000 2014 1
        17552040 2006 109
        17552040 2007 995
        17552040 2008 .
        17552040 2009 .
        17552040 2010 .
        17552040 2011 0
        17552040 2012 0
        17552040 2013 0
        17552040 2014 5
        /CODE]

        Many thanks in advance!

        Comment


        • #5
          So, first I would just do the same calculations as before, and then go back and reset the result to missing for firms with three or more consecutive zeroes.

          Code:
          clear*
          input long id int year int employees
          17386000 2006 .
          17386000 2007 880
          17386000 2008 7
          17386000 2009 0
          17386000 2010 0
          17386000 2011 .
          17386000 2012 .
          17386000 2013 .
          17386000 2014 1
          17552040 2006 109
          17552040 2007 995
          17552040 2008 .
          17552040 2009 .
          17552040 2010 .
          17552040 2011 0
          17552040 2012 0
          17552040 2013 0
          17552040 2014 5
          end
          
          by id, sort: egen sd_emp = sd(cond(year!=2011 & employees != 0), employees, .)
          
          //    NOW IDENTIFY FIRMS WITH THREE CONSECUTIVE ZEROES
          //    A RUN STARTS WITH A ZERO AFTER NON-ZERO, OR VICE VERSA
          by id (year), sort: gen run_num = sum((employees == 0) != (employees[_n-1] == 0))
          by id run_num (year), sort: gen run_length = _N
          //    IDENTIFY RUNS OF 3 OR LONGER THAT CONSIST OF ZEROES
          by id run_num: gen run_to_ignore = (run_length >= 3 & employees == 0)
          //    MARK IDS WITH ANY SUCH RUN
          by id: egen to_exclude = max(run_to_ignore)
          //    "UNCALCULATE" THE SD FOR THESE IDS
          replace sd_emp = . if to_exclude

          Comment


          • #6
            Clyde, thank you!
            This did work quite well. In fact, the code picked up firms with only two consecutive missing value of employment. What I did is the following (the code,earlier suggested by you):

            Code:
            by id, sort: egen to_browse1 = total(employees==0 & employees[_n+1]==0 & employees[_n+2]==0 & employees[_n-1]!=0 & employees[_n+3]!=0 & !missing(employees[_n-1],employees[_n+3]))
            by id, sort: egen to_browse2 = total(employees==0 & employees[_n+1]==0 & employees[_n+2]==0 & employees[_n+3]==0 & employees[_n-1]!=0 & employees[_n+4]!=0 & !missing(employees[_n-1],employees[_n+4]))
            by id, sort: egen to_browse3 = total(employees==0 & employees[_n+1]==0 & employees[_n+2]==0 & employees[_n+3]==0 & employees[_n+4]==0 & employees[_n-1]!=0 & employees[_n+5]!=0 & !missing(employees[_n-1],employees[_n+5]))
            by id, sort: egen to_browse4 = total(employees==0 & employees[_n+1]==0 & employees[_n+2]==0 & employees[_n+3]==0 & employees[_n+4]==0 & employees[_n+5]==0 & employees[_n-1]!=0 & employees[_n+6]!=0 & !missing(employees[_n-1],employees[_n+6]))
            by id, sort: egen to_browse5 = total(employees==0 & employees[_n+1]==0 & employees[_n+2]==0 & employees[_n+3]==0 & employees[_n+4]==0 & employees[_n+5]==0 & employees[_n+6]==0 & employees[_n-1]!=0 & employees[_n+7]!=0 & !missing(employees[_n-1],employees[_n+7]))
            Then I replaced sd_emp with . , in case any to_browse is equal to one. It cumbersome to write the code, but at least it works...

            Comment


            • #7
              In fact, the code picked up firms with only two consecutive missing value of employment.
              Really? I don't see how that would happen. There is a run of two zeroes in the first id in the sample data you posted in #5, and my code does not flag it. I also tried modifying the sample data to include some runs of two zeroes at the beginning or end of the data for an id, and those don't get picked up either. I'm really puzzled where the mistake is and I would appreciate if you would post some example data where this code picks up a run of just two zeroes. I would like to find my mistake and fix it, both for my own education, and to give you a chance to replace your code in #6 with something that is more flexible and less brittle. Thanks.

              Comment


              • #8
                Clyde, I went back to the original dataset, and it seems that the code works. I obviously did something wrong the first time.. Thanks again!

                Comment

                Working...
                X