Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify consecutive observations (based on the year) within a sample

    Dear Stata Users,
    Can you, please help me to resolve the following issue. I need to create the following variables: “pv_1”, “pv_2”, “pv_3”, “pv_4”, “pv_5”. “pv_1” should equal to 1 (for all the observations of a given firm) if a given firm(“gvkey”) has “p_v_decile” equal to 5 for at most 1 year. “pv_2” should equal to 1 (for all the observations of a given firm) if a given firm(“gvkey”) has “p_v_decile” equal to 5 for 2 consecutive years. For “pv_3”, “pv_4” the logic is the same. “pv_5” should equal to 1 (for all the observations of a given firm) if a given firm(“gvkey”) has “p_v_decile” equal to 5 at least 5 consecutive years.
    In the original sample, “p_v_decile” has values from 1 to 5, but for brevity I just present “5” as an example.
    Please, help me with this issue.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 gvkey double fyear float p_v_decile
    "030029" 2012 5
    "030031" 1996 5
    "030034" 1995 5
    "030034" 1997 5
    "030066" 1996 5
    "030087" 1995 5
    "030087" 1996 5
    "030087" 1998 5
    "030087" 1999 5
    "030094" 1995 5
    "030094" 1996 5
    "030094" 1999 5
    "030095" 1995 5
    "030095" 1996 5
    "030095" 1997 5
    "030095" 1998 5
    "030130" 1996 5
    "030130" 1997 5
    "030131" 1996 5
    "030131" 1997 5
    "030131" 1998 5
    "030131" 1999 5
    "030137" 1997 5
    "030137" 1998 5
    "030137" 1999 5
    "030137" 2002 5
    "030137" 2003 5
    "030137" 2004 5
    "030137" 2005 5
    "030152" 2002 5
    "030152" 2003 5
    "030203" 1996 5
    "030203" 1998 5
    "030233" 1996 5
    "030233" 1998 5
    "030233" 1999 5
    "030259" 2004 5
    "030259" 2006 5
    "030259" 2007 5
    "030265" 2000 5
    "030277" 1999 5
    "030312" 1996 5
    "030312" 1998 5
    "030312" 1999 5
    "030312" 2012 5
    "030312" 2013 5
    "030312" 2014 5
    "030312" 2015 5
    "030344" 1996 5
    "030349" 1995 5
    "030349" 1996 5
    "030349" 1997 5
    "030352" 1999 5
    "030377" 1996 5
    "030398" 1996 5
    "030398" 2004 5
    "030398" 2005 5
    "030398" 2006 5
    "030398" 2007 5
    "030398" 2010 5
    "030400" 1998 5
    "030400" 1999 5
    "030422" 1996 5
    "030436" 1995 5
    "030477" 2002 5
    "030530" 1995 5
    "030530" 1997 5
    "030540" 1997 5
    "030545" 1996 5
    "030546" 2000 5
    "030554" 2000 5
    "030554" 2001 5
    "030554" 2002 5
    "030554" 2003 5
    "030576" 1995 5
    "030576" 1996 5
    "030576" 1997 5
    "030576" 1999 5
    "030034" 1998 .
    "030034" 1999 .
    "030087" 2000 .
    "030087" 2001 .
    "030087" 2002 .
    "030087" 2003 .
    "030094" 2000 .
    "030094" 2000 .
    "030094" 2001 .
    "030094" 2001 .
    "030094" 2002 .
    "030094" 2002 .
    "030203" 1999 .
    "030203" 2000 .
    "030312" 2000 .
    "030312" 2001 .
    "030312" 2002 .
    "030312" 2003 .
    "030576" 2000 .
    "030576" 2001 .
    "030576" 2002 .
    "030576" 2003 .
    end

  • #2
    How does this work for you?

    Code:
    sort gvkey fyear p_v_decile, stable
    
    bysort gvkey: egen missing = total(missing(p_v_decile))
    bysort gvkey: gen conseq = _N-missing
    summ conseq, meanonly
    
    forval i = 1/`r(max)'{
        gen pv_`i' = (conseq==`i')
    }
    
    drop missing conseq
    I noted there were instances where there more than 5 consecutive non-missing values for p_v_decile for a given gvkey, so the above adds in those binary vars. Of course you can change that by removing the summ conseq line and replacing the `r(max)' with 5 in the forval loop.
    Last edited by Chris Larkin; 11 Jul 2018, 08:07.

    Comment


    • #3
      There are no rules here for what to with gaps. For example, what happens with

      Code:
      "030034" 1995 5
      "030034" 1997 5
      ?

      Comment


      • #4
        Thank you very much for the replies! So, if there are gaps then it means that there are no consecutive sequnce, i.e. for the firm "030034" there is only one year of being “p_v_decile” equal to 5. If, for example, it happens that then there is a following case as below:

        Code:
         
         "030034" 2000 5 "030034" 2001 5 "030034" 2002 5
        Then this firm is expected to have “p_v_decile” equal to 5 for 3 consecutive years, thus “pv_3” should equal to 1 for this firm.

        Comment


        • #5
          In addition to gaps there are duplicates in your example. This may suggest some technique.


          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str6 gvkey double fyear float p_v_decile
          "030029" 2012 5
          "030031" 1996 5
          "030034" 1995 5
          "030034" 1997 5
          "030066" 1996 5
          "030087" 1995 5
          "030087" 1996 5
          "030087" 1998 5
          "030087" 1999 5
          "030094" 1995 5
          "030094" 1996 5
          "030094" 1999 5
          "030095" 1995 5
          "030095" 1996 5
          "030095" 1997 5
          "030095" 1998 5
          "030130" 1996 5
          "030130" 1997 5
          "030131" 1996 5
          "030131" 1997 5
          "030131" 1998 5
          "030131" 1999 5
          "030137" 1997 5
          "030137" 1998 5
          "030137" 1999 5
          "030137" 2002 5
          "030137" 2003 5
          "030137" 2004 5
          "030137" 2005 5
          "030152" 2002 5
          "030152" 2003 5
          "030203" 1996 5
          "030203" 1998 5
          "030233" 1996 5
          "030233" 1998 5
          "030233" 1999 5
          "030259" 2004 5
          "030259" 2006 5
          "030259" 2007 5
          "030265" 2000 5
          "030277" 1999 5
          "030312" 1996 5
          "030312" 1998 5
          "030312" 1999 5
          "030312" 2012 5
          "030312" 2013 5
          "030312" 2014 5
          "030312" 2015 5
          "030344" 1996 5
          "030349" 1995 5
          "030349" 1996 5
          "030349" 1997 5
          "030352" 1999 5
          "030377" 1996 5
          "030398" 1996 5
          "030398" 2004 5
          "030398" 2005 5
          "030398" 2006 5
          "030398" 2007 5
          "030398" 2010 5
          "030400" 1998 5
          "030400" 1999 5
          "030422" 1996 5
          "030436" 1995 5
          "030477" 2002 5
          "030530" 1995 5
          "030530" 1997 5
          "030540" 1997 5
          "030545" 1996 5
          "030546" 2000 5
          "030554" 2000 5
          "030554" 2001 5
          "030554" 2002 5
          "030554" 2003 5
          "030576" 1995 5
          "030576" 1996 5
          "030576" 1997 5
          "030576" 1999 5
          "030034" 1998 .
          "030034" 1999 .
          "030087" 2000 .
          "030087" 2001 .
          "030087" 2002 .
          "030087" 2003 .
          "030094" 2000 .
          "030094" 2000 .
          "030094" 2001 .
          "030094" 2001 .
          "030094" 2002 .
          "030094" 2002 .
          "030203" 1999 .
          "030203" 2000 .
          "030312" 2000 .
          "030312" 2001 .
          "030312" 2002 .
          "030312" 2003 .
          "030576" 2000 .
          "030576" 2001 .
          "030576" 2002 .
          "030576" 2003 .
          end
          
          egen id = group(gvkey), label 
          duplicates drop 
          tsset id fyear 
          tsfill 
          
          bysort id (fyear) : gen history = strofreal(p_v_decile) if _n == 1 
          by id : replace history = history[_n-1] + strofreal(p_v_decile) if _n > 1 
          by id : replace history = history[_N] 
          
          gen count5 = length(history) - length(subinstr(history, "5", "", .))
          
          gen pv_1 = count5 == 1 
          
          forval j = 2/4 { 
              gen pv_`j' = count5 == `j' & strpos(history, `j'*"5") 
          } 
          
          gen pv_5 = count5 >= 5 & strpos(history, "55555") 
          
          * install from Stata Journal website 
          groups id history count5 pv*  , show(none) sep(0) 
          
          
            +---------------------------------------------------------------------------+
            |     id                history   count5   pv_1   pv_2   pv_3   pv_4   pv_5 |
            |---------------------------------------------------------------------------|
            | 030029                      5        1      1      0      0      0      0 |
            | 030031                      5        1      1      0      0      0      0 |
            | 030034                  5.5..        2      0      0      0      0      0 |
            | 030066                      5        1      1      0      0      0      0 |
            | 030087              55.55....        4      0      0      0      0      0 |
            | 030094               55..5...        3      0      0      0      0      0 |
            | 030095                   5555        4      0      0      0      1      0 |
            | 030130                     55        2      0      1      0      0      0 |
            | 030131                   5555        4      0      0      0      1      0 |
            | 030137              555..5555        7      0      0      0      0      0 |
            | 030152                     55        2      0      1      0      0      0 |
            | 030203                  5.5..        2      0      0      0      0      0 |
            | 030233                   5.55        3      0      0      0      0      0 |
            | 030259                   5.55        3      0      0      0      0      0 |
            | 030265                      5        1      1      0      0      0      0 |
            | 030277                      5        1      1      0      0      0      0 |
            | 030312   5.55............5555        7      0      0      0      0      0 |
            | 030344                      5        1      1      0      0      0      0 |
            | 030349                    555        3      0      0      1      0      0 |
            | 030352                      5        1      1      0      0      0      0 |
            | 030377                      5        1      1      0      0      0      0 |
            | 030398        5.......5555..5        6      0      0      0      0      0 |
            | 030400                     55        2      0      1      0      0      0 |
            | 030422                      5        1      1      0      0      0      0 |
            | 030436                      5        1      1      0      0      0      0 |
            | 030477                      5        1      1      0      0      0      0 |
            | 030530                    5.5        2      0      0      0      0      0 |
            | 030540                      5        1      1      0      0      0      0 |
            | 030545                      5        1      1      0      0      0      0 |
            | 030546                      5        1      1      0      0      0      0 |
            | 030554                   5555        4      0      0      0      1      0 |
            | 030576              555.5....        4      0      0      0      0      0 |
            +---------------------------------------------------------------------------+
          
          .

          Comment


          • #6
            Thank you very much for the reply! The program does almost everything. The only thing is that it does not indicate a number of consecutive years if there is a gap, say in:

            Code:
            "030576" 1995, 1996, 1997, 1999.
            Thus, for this firm pv_3 = 1 (since there are 3 consecutive years, which is more that 1 after the gap).

            Is there is a way to define this condition in

            Code:
            forval j = 2/4 { 
                gen pv_`j' = count5 == `j' & strpos(history, `j'*"5") 
            }
            Please, help me with this issue.

            Comment


            • #7
              If your criterion is just

              Code:
              forval j = 2/4 {      
                   gen pv_`j' = strpos(history, `j'*"5")  
              }
              then that's the code.

              Comment


              • #8
                Below is another way to go. It should be noted that (as noticed by Nick) there are some duplicates (with gvkey and fyear) in your example, which are assumed as unexpected and to be dropped away. Otherwise, further clarification would be needed.
                Code:
                duplicates drop gvkey fyear, force
                
                gen tag5 = 1 if p_v_decile == 5
                bys gvkey (fyear): replace  tag5= tag5[_n-1]+ tag5 if  fyear == fyear[_n-1]+1 & tag5[_n-1] != .
                bys gvkey: egen consecutive5 = max(tag5)
                
                forval i = 1/4 {
                    gen pv_`i' = consecutive5 == `i'
                }
                gen pv_5 = consecutive5>=5
                Last edited by Romalpa Akzo; 16 Jul 2018, 22:07.

                Comment


                • #9
                  Thank you for the help!

                  Comment

                  Working...
                  X