
  • Creating a counter variable around a date (event-window)

    Hi,

I am currently conducting a research project with Stata. I ran into a problem that I wasn't able to solve on my own (after browsing through the commands and the forum), so I decided to ask the question in this community.

    I am working with a panel data set which contains several firms and daily stock prices.
    Each firm has several events over time (profit-warnings on a specific day). I have created a dummy variable called event, which takes the value of 0 on the day of the event and . otherwise.

What I'd like to do is the following:
I want to create an event window of 6 trading days (-2, 3) around each event (where 0 is the day of the event), by firm. As my data only contains trading days (Monday to Friday), each observation before and after the event should count as a "day".
I am mainly struggling because each firm has several events, and I could not figure out how to handle that. I would be grateful for any guidance on the problem. Thank you.
    I have included a part of my data below:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float Company int Date double price byte event
    1 21479        18.78 .
    1 21480        18.57 .
    1 21481        18.35 .
    1 21482        18.68 .
    1 21483         18.7 0
    1 21486        18.92 .
    1 21487        18.69 .
    1 21488        18.74 .
    1 21489        19.22 .
    1 21490        19.52 .
    1 21493        19.17 .
    1 21494        19.13 .
    1 21495        19.29 .
    1 21496        18.94 .
    1 21497        18.94 .
    1 21500         18.7 .
    1 21501         20.2 .
    1 21502        20.08 .
    1 21503         20.2 .
    1 21504        19.86 .
    1 21507        18.99 .
    1 21508        19.17 .
    1 21509        19.25 .
    1 21510        19.33 .
    1 21511        19.49 .
    1 21514        20.02 .
    1 21515        19.83 0
    1 21516        19.64 .
    1 21517        19.52 .
    1 21518        19.36 .
    1 21521        19.28 .
    1 21522        19.11 .
    1 21523        18.89 .
    1 21524         18.5 .
    1 21525        18.69 .
    1 21528        18.46 .
    1 21529        18.49 .
    1 21530        18.69 .
    1 21531        18.66 .
    1 21532        18.18 .
    1 21535        17.89 .
    1 21536        17.51 .
    1 21537         17.7 .
    2 21479 15.048256504 .
    2 21480 14.239803163 .
    2 21481 14.239803163 .
    2 21482 13.881511341 .
    2 21483 13.789641644 .
    2 21486  13.96419407 .
    2 21487 14.276551042 .
    2 21488 14.708338622 .
    2 21489 15.121752262 .
    2 21490 15.011508625 0
    2 21493 14.855330138 .
    2 21494 15.158500141 .
    2 21495 15.434109235 .
    2 21496 15.608661661 .
    2 21497 15.755653177 .
    2 21500 15.709718328 .
    2 21501 15.663783479 .
    2 21502 15.314678628 .
    2 21503 14.653216803 .
    2 21504 14.717525592 .
    2 21507 14.157120435 .
    2 21508 14.046876798 .
    2 21509 14.037689828 .
    2 21510 14.056063767 .
    2 21511 14.175494375 .
    2 21514 14.313298921 .
    2 21515   13.9550071 .
    2 21516  13.93663316 .
    2 21517 14.304111951 .
    2 21518 13.835576493 .
    2 21521 14.395981649 .
    2 21522 13.155740729 .
    2 21523 13.174114668 .
    2 21524 13.256797396 .
    2 21525 13.311919215 0
    2 21528 12.935253454 .
    2 21529 13.284358306 .
    2 21530 13.348667094 .
    2 21531 13.174114668 .
    2 21532 13.229236487 .
    2 21535 13.045497091 .
    2 21536 12.586148602 .
    2 21537 12.898505575 .
    end
    format %td Date




  • #2
    Note:
This is the code I have come up with. Is there a more direct way to achieve this?
What I do not understand is why event[_n--2] creates a data point before my event date. Shouldn't it be after the event?
    Code:
    gen eventwindow=.
    forvalues i=-2/3 {
        
        bysort Company(Date): replace eventwindow=`i' if event[_n-`i']==0
        
        }
    Last edited by David Moeller; 18 Nov 2024, 14:10.



    • #3
      _n--2 is the same as _n+2. If event[_n+2] == 0, that means that the observation two steps later has event == 0. That's the same as saying that the current observation is two steps earlier than the event.
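To see the equivalence concretely, here is a minimal sketch (the variable x and the five-observation setup are purely illustrative, not part of the original data):
Code:
```
clear
set obs 5
gen x = _n
local i = -2
* x[_n-`i'] with i = -2 expands to x[_n--2], i.e. x[_n+2]:
gen lead2 = x[_n-`i']
list x lead2, clean
```
In observation 1, lead2 is 3, that is, x[3]: the value of x two observations ahead.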



      • #4
        Originally posted by Clyde Schechter View Post
        _n--2 is the same as _n+2. If event[_n+2] == 0, that means that the observation two steps later has event == 0. That's the same as saying that the current observation is two steps earlier than the event.
        Thank you for your response. I got a little confused. So the logic of the code should be correct, right?

        Another question I would like to ask, is how I could calculate the returns for each event (over the 6 day period) in this case.

My first intuition would be to group by Company and event date to get a unique event ID, as shown below:

        Code:
        replace event=Date if event==0    
        format event %td
        egen event_id=group(Company event)
        gen ret=log(price[_n]/price[_n-1])*100
However, I fail to understand how I can then calculate the cumulative returns over the 6-day period (-2, 3) for each event under every Company ID.

This is the rest of what I have come up with:

        Code:
        forvalues i=-2/3 {
            
            bysort Company(Date): replace event_id=event_id[_n-`i'] if eventwindow==`i'
            
            }
            
        bysort event_id(eventwindow): egen cum_return=sum(ret) if event_id!=. 
        sort Company Date
        Last edited by David Moeller; 18 Nov 2024, 15:29.



        • #5
          So the logic of the code should be correct, right?
          Yes, it's correct.

          As for the calculation of returns, your code proposes to use a formula that dates back to the days before calculators and computers were readily available. There is no justification for its continued use. And there are two good reasons against using it. First, and most important, it is only an approximation to the correct value, and the approximation gets worse as the actual return gets larger. Second, on a computer or calculator it is slower than the direct calculation. Your calculation of the period return by adding the individual daily returns is yet another approximation, and an even worse one than the first, in that not only does it perform poorly as the return itself grows, but it also deteriorates rapidly as the number of days in the period grows. Now, at least in the example data you show, your returns are pretty small, and your event windows are only 6 days, so you probably wouldn't go wrong to a material extent using the approximate formulas. But there is no advantage to doing it that way, and if there do happen to be some events in the full data set where the returns get large, you would have meaningfully wrong results for them.

          The direct calculation of the return over any time period is 100*(ending price/beginning price - 1).

          Here's some code that applies that to your data:
          Code:
          by Company (Date): gen event_num = ///
              sum(!missing(eventwindow) & missing(eventwindow[_n-1])) ///
              if !missing(eventwindow)
              
          by Company event_num (Date), sort: gen event_window_cum_return = 100*(price[_N]/price[1] - 1) ///
              if !missing(event_num)
          sort Company Date
          Note: My event variable, which I named event_num, is slightly different from yours in that mine restarts at 1 within each Company. This will prove more convenient for your further work with this data as, I assume, you will generally want to deal with each company's events separately and there will be little, if any, call to perform calculations involving events from all of the different companies.




          • #6
            Originally posted by Clyde Schechter View Post
            Yes, it's correct.

            As for the calculation of returns, your code proposes to use a formula that dates back to the days before calculators and computers were readily available. There is no justification for its continued use. And there are two good reasons against using it. First, and most important, it is only an approximation to the correct value, and the approximation gets worse as the actual return gets larger. Second, on a computer or calculator it is slower than the direct calculation. Your calculation of the period return by adding the individual daily returns is yet another approximation, and an even worse one than the first, in that not only does it perform poorly as the return itself grows, but it also deteriorates rapidly as the number of days in the period grows. Now, at least in the example data you show, your returns are pretty small, and your event windows are only 6 days, so you probably wouldn't go wrong to a material extent using the approximate formulas. But there is no advantage to doing it that way, and if there do happen to be some events in the full data set where the returns get large, you would have meaningfully wrong results for them.

            The direct calculation of the return over any time period is 100*(ending price/beginning price - 1).

            Here's some code that applies that to your data:
            Code:
            by Company (Date): gen event_num = ///
            sum(!missing(eventwindow) & missing(eventwindow[_n-1])) ///
            if !missing(eventwindow)
            
            by Company event_num (Date), sort: gen event_window_cum_return = 100*(price[_N]/price[1] - 1) ///
            if !missing(event_num)
            sort Company Date
            Note: My event variable, which I named event_num, is slightly different from yours in that mine restarts at 1 within each Company. This will prove more convenient for your further work with this data as, I assume, you will generally want to deal with each company's events separately and there will be little, if any, call to perform calculations involving events from all of the different companies.

            Thanks again for your help!
I know that calculating cumulative returns by adding them up is mathematically incorrect; however, if I understand the finance literature correctly, some people still do it (e.g. cumulative abnormal returns).

One issue that remains with your approach is what to do if the last price data point in an event window is missing. In that case, the entire return over the event period would be missing as well. This would be unfortunate if the price one day before the end of the event window is available. The same problem could arise if the first price data point were missing.

            Would it be possible to construct a remedy for this?
            I have thought about the following but failed to implement it:

Instead of calculating the return by dividing the last price by the first price and subtracting 1, I would first generate a return variable and then a variable containing the returns plus 1. One could then take the product of these over the event-window period and subtract 1.
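That compounding idea could be sketched roughly as follows, reusing the ret and event_num variables from the earlier posts (a sketch only; note that egen's total() skips missing values, which is what makes this robust to gaps in the return series):
Code:
```
* gross return per day: 1 + r_t/100
gen gross = 1 + ret/100
* product over the window via the sum of logs; total() ignores missing values
by Company event_num (Date), sort: egen sum_ln_gross = total(ln(gross)) if !missing(event_num)
* cumulative return = (product of gross returns - 1) * 100
gen cum_return2 = 100*(exp(sum_ln_gross) - 1)
```
Because days with missing returns contribute nothing to the sum of logs, they are simply skipped rather than making the whole window's return missing.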










            • #7
              One issue that somehow remains with your approach, is what to do if the last price datapoint in an eventwindow is missing. In this case the entire return over the event period would be missing as well. This would be unfortunate if the price one day before the event window end would be available.
              You did not previously mention the possibility that returns could be missing, and the example data did not have any such situations, so I did not write code that would handle this problem.

The code I proposed earlier can be modified to use the last non-missing price in calculating the cumulative return.

              Code:
              by Company (Date), sort: gen event_num = ///
                  sum(!missing(eventwindow) & missing(eventwindow[_n-1])) ///
                  if !missing(eventwindow)
                  
              by Company event_num (Date), sort: gen final_price = price if _n == 1 ///
                  & !missing(event_num)
              by Company event_num (Date): replace final_price = ///
                  cond(missing(price), final_price[_n-1], price) if _n > 1 & !missing(event_num)
              by Company event_num (Date): gen window_cum_return = ///
                  100*(final_price[_N]/price[1] - 1)
              sort Company Date
              I know that calculating cumulative returns (by adding them up) is mathematically incorrect, however if I understand the finance literature correctly, some people are still doing it (e.g. cumulative abnormal returns).
              That may be. I'm not in finance and don't read that literature. But I think cumulative abnormal returns is something different from a cumulative return over a period.



              • #8
                Thank you Clyde Schechter.
I have one more question that has come up in recent days. As I want to carry out a DiD regression with my data, I would like to compare the values of several independent variables in the "pre" period (before a specific date) between the treatment and control groups, e.g. Leverage between Treatment and Control before the specific event.

                I have already created a variable pre which takes 0 before the specific date (and 1 after) and a variable called Treatment which takes a value of 1 for treated firms and 0 for untreated firms. I have seen several papers in the finance field, which show the (mean) outcome of Treatment and Control groups for several variables and then also present the t-statistic including the significance as well as the difference between the means.

                Is there an easy way to compute such a table? And would this command be the correct approach? Thank you once again.
                Code:
ttest Leverage if post==0, by(Treatment)



                • #9
Yes, the -ttest- command you show will do that. It will not organize the results into a table, but it will calculate the group means, restricted to post == 0, and perform the t-tests.

                  If you are using version 18 of Stata you can also do
                  Code:
                  dtable varlist if post == 0, by(Treatment, tests)
which will also do those calculations and provide a nice table of all the results. You just replace varlist by a list of the variables you want to compare. To be sure that Stata treats them appropriately as continuous or discrete variables, I recommend that you prefix all the continuous ones with c. and all the discrete ones with i. in the varlist. (The discrete variables will be contrasted using a chi-square test.)

                  All of that said, if you are doing a DID analysis, comparison of the levels of the variables in the pre-intervention period is not crucial. In fact, the validity of the DID approach is unimpaired even if the pre-intervention level of variables are very different across the groups. What matters most for the validity of the DID approach is comparing the slope of the outcome over time in the pre-period: this is known as the parallel trends assumption.

Also, do you have a reason for doing the t-tests (other than that you have seen others do it)? Statistical tests are for the purpose of testing hypotheses. Does your research question include a hypothesis that these variables have equal means prior to the event? If so, then, of course, proceed. But if not, you are providing an answer to a question that has not been asked. I think a table of descriptive statistics for the variables in both groups before the event is always of interest. But in general, embellishing it with t-statistics and p-values serves no purpose (even though it is widely done).



                  • #10
                    Originally posted by Clyde Schechter View Post
Also, do you have a reason for doing the t-tests (other than that you have seen others do it)? Statistical tests are for the purpose of testing hypotheses. Does your research question include a hypothesis that these variables have equal means prior to the event? If so, then, of course, proceed. But if not, you are providing an answer to a question that has not been asked. I think a table of descriptive statistics for the variables in both groups before the event is always of interest. But in general, embellishing it with t-statistics and p-values serves no purpose (even though it is widely done).
                    Clyde Schechter Thanks for the additional comments. I was just trying to explore the differences between both groups and display them. There was no clear initial hypothesis.

I would like to ask one further question regarding a regression analysis that I would like to conduct. What I would like to do is to regress the returns of each stock on a set of variables during each event, but only in the period (eventwindow) from day -10 to day +1. My goal is to extract the residuals of each regression, and the variance of the residuals for each regression, into new variables (one variable for each). I thought about using the bysort command (bysort ID event_num), however I can't come up with a way to extract the residuals and the variance of the residuals in this case. I have also tried a loop structure (forvalues for each ID), but I failed to take into account each event (event_num) and the day restriction (-10, +1).


I have included sample data below. The sample only includes one stock (for brevity). Each stock can have multiple events (event_num). In this short case, I would like to use the variable portfolio as the independent variable. Note: I excluded any data outside the event range of -41 to +1 days around the event. In my dataset the event_num variable takes a missing value if there is no event. The same holds true for the eventwindow variable.
                    I would appreciate any sort of help. Thanks again!


                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input float ID double return float(eventwindow event_num portfolio)
                    45   1.4621616281367358 -42 1   .57
                    45    .8958130477117835 -41 1   .25
                    45   -.3667245705462359 -40 1  -.18
                    45    .1356063541263082 -39 1   .35
                    45    -.851228477461787 -38 1   .22
                    45  -1.4439024390243942 -37 1 -1.13
                    45   -.3959611958028029 -36 1  -.13
                    45   1.4112502484595524 -35 1   .61
                    45   1.0780086240689868 -34 1   .13
                    45   1.0277292999806111 -33 1   .16
                    45  -.05758157389635535 -32 1   .26
                    45                    . -31 1   .06
                    45    .1344344152102944 -30 1  -.08
                    45   .23014959723819994 -29 1   .22
                    45   -.7271335629544499 -28 1  -.23
                    45   -.5011565150347053 -27 1  -.03
                    45  -.30995738086012514 -26 1   .13
                    45    2.001554605518852 -25 1   .27
                    45  -.17146123071061806 -24 1   .11
                    45   -.8206106870229003 -23 1   .15
                    45    .6542235905330064 -22 1   .05
                    45   .30586885872681435 -21 1   .04
                    45   -.7051648561082474 -20 1     0
                    45    .1727447216890524 -19 1  -.25
                    45    3.851312511975484 -18 1    .3
                    45   -.3690036900369056 -17 1   .15
                    45   -.7037037037037085 -16 1   .52
                    45     .335695635956732 -15 1   .63
                    45    2.416356877323428 -14 1   .24
                    45   -.5263157894736826 -13 1  -.25
                    45   1.0946907498631533 -12 1   .08
                    45                    . -11 1    .1
                    45  -1.4618299945858062 -10 1   .49
                    45   .16483516483515806  -9 1     0
                    45   -.3474126897056093  -8 1   .17
                    45   -.8073394495412802  -7 1  -.22
                    45  -.09248982611913478  -6 1   .16
                    45 -.055545269394558665  -5 1  -.33
                    45  -3.8532789922193373  -4 1   -.8
                    45    2.003853564547205  -3 1   .44
                    45  -.45334340763127107  -2 1  -.02
                    45   1.0815939278937385  -1 1   .56
                    45   -.5819410550028201   0 1  -.18
                    45   1.6616314199395819   1 1   .24
                    45   -1.388628260461629 -42 2  -.08
                    45   1.4081826831589002 -41 2   .22
                    45    2.852317507975236 -40 2   .33
                    45  -.40138660828315786 -39 2   .12
                    45  -1.3738779996336323 -38 2  -.13
                    45   -1.318722139673107 -37 2  -.25
                    45   -.8469791078486784 -36 2  -.13
                    45                    . -35 2   .03
                    45                    . -34 2  -.01
                    45   .37965072133637595 -33 2   -.1
                    45   -.9266263237518948 -32 2  -.04
                    45   1.0116434434052322 -31 2  -.28
                    45                    . -30 2   .01
                    45   -.4157218442932707 -29 2  -.14
                    45   1.3282732447817756 -28 2   .39
                    45   1.4981273408239781 -27 2   .31
                    45    .7380073800737981 -26 2   .05
                    45     2.19780219780219 -25 2   .07
                    45    .7168458781362109 -24 2   .03
                    45                    . -23 2     0
                    45   .35587188612098886 -22 2  -.01
                    45                    . -21 2  -.17
                    45  -.17730496453900962 -20 2  -.19
                    45  -1.0657193605683737 -19 2   .35
                    45    .7181328545780944 -18 2   .48
                    45    .8912655971479502 -17 2   .24
                    45    .3533568904593564 -16 2   .64
                    45   .35211267605634305 -15 2  -.16
                    45     2.28070175438596 -14 2  -.19
                    45  -3.2590051457975964 -13 2   -.4
                    45  -1.9503546099290805 -12 2  -.12
                    45    .5424954792043477 -11 2    .4
                    45   1.0791366906474844 -10 2   .23
                    45   1.4234875444839807  -9 2  -.05
                    45                    .  -8 2   .46
                    45   -2.631578947368421  -7 2   .02
                    45  -1.6216216216216188  -6 2  -.48
                    45  -1.0989010989011014  -5 2  -.21
                    45   -.5555555555555503  -4 2   .17
                    45   1.8621973929236497  -3 2   .71
                    45   -2.925045703839125  -2 2 -1.28
                    45   .18832391713747915  -1 2  -.34
                    45    6.390977443609019   0 2   .34
                    45    .3533568904593564   1 2   .12
                    end
                    Last edited by David Moeller; 11 Dec 2024, 19:01.



                    • #11
                      Assuming your real data set has a large number of firms and events, the following would be a very efficient way to do this:
                      Code:
                      capture program drop one_event
                      program define one_event
                          regress return /*whatever regressors*/ if inrange(eventwindow, -10, 1)
                          predict resid if e(sample), resid
                          egen resid_variance = sd(resid)
                          replace resid_variance = resid_variance^2
                          exit
                      end
                      
                      runby one_event, by(ID event_num)
                      -runby- is written by Robert Picard and me; it is available from SSC.
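If -runby- is not already installed, it can be obtained from SSC in the usual way:
Code:
```
ssc install runby
```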

Naturally, replace /*whatever regressors*/ by the set of variables you are interested in regressing return against.



                      • #12
Clyde Schechter Thank you for your quick response! I will try to implement it later and try to understand it. Is there also a way to do this without using a user-written command?
I have tried the following, but it didn't work out.

                        Code:
gen Residual=.
gen SE_Residual=.
forvalues i=1/45 {
    forvalues y=1/6 {
        reg return portfolio if eventwindow>=-10 & eventwindow<=1 & ID==`i' & event_num==`y'
        predict residual, residuals
        predict se_residual, stdr
        replace Residual=residual if eventwindow>=-10 & eventwindow<=1 & ID==`i' & event_num==`y'
        replace SE_Residual=se_residual if eventwindow>=-10 & eventwindow<=1 & ID==`i' & event_num==`y'
        drop residual
        drop se_residual
    }
}



                        • #13
Clyde Schechter In addition to my previous code: I have tried to implement your code. When I run it, I lose the observations which do not include events; after the code completes, I only retain observations with a non-missing event_num. Ideally I would also like to keep the other observations. Is this possible? Thanks again for your effort and help! I should have included this in the initial dataset I posted; sorry about that. The results do seem to be correct, however.
I also get an error message which states: by-groups with errors = 700. Is this cause for concern?



                          • #14
Your code has the gist of it, but it isn't quite right. It would work fine only on data sets with exactly 45 IDs where every ID has exactly 6 events. But even your own data doesn't meet that last criterion. So you need to actually loop over the events that exist.

                            Code:
                            capture program drop one_event
                            program define one_event
                                capture regress return portfolio if inrange(eventwindow, -10, 1)
                                if inlist(c(rc), 2000, 2001) { // ID WITH NO WINDOW
                                    exit 0
                                }
                                else if c(rc) != 0 {
                                    gen error_code = c(rc)
                                    exit c(rc)
                                }
                                else {
                                    predict resid if e(sample), resid
                                    summ resid
                                    gen resid_variance = r(Var) if e(sample)
                                }
                                exit
                            end
                            
                            runby one_event, by(ID event_num)
The changes from the previous response are the -capture- before -regress-, the checks of c(rc), and the new error_code variable. What this does is allow -regress- to fail without raising an error when there are no, or too few, observations to carry out the requested regression. Instead, program one_event just exits normally, and -runby- does not count it as a group with an error. If, however, some other error were to arise during the execution of the -regress- command (which is extremely uncommon), program one_event will store the Stata error code for that event in a new variable called error_code, and then exit as an error, which -runby- will count as a group with an error in its final table. So, in the unlikely event that the final report from -runby- indicates that there were groups with errors, you will want to check that error_code variable to see what those error codes mean and then try to figure out why those event windows led to trouble. This revised code also will not drop the observations associated with IDs that have no events, because program one_event no longer reports them as errors to -runby-.

I understand the preference for using native Stata commands and tend to prefer that myself. Moreover, if your data set really only has 45 IDs and a maximum of 6 events in each, then the code that uses -forvalues- loops is a reasonable way to go. I was imagining a data set with thousands of IDs and perhaps dozens of events in some of them. In a large data set like that, the -forvalues- loops would prove painfully slow, and -runby- would leave them in the dust. But apparently you are not in that situation, so it is perfectly sensible to stick with native Stata commands for this application.
                            Last edited by Clyde Schechter; 11 Dec 2024, 21:10.



                            • #15
Clyde Schechter Thank you! This code resolved the problem with the error message, and observations aren't dropped anymore. Would this approach also allow me to generate a variable with the R^2 of each regression? In this case I am using multiple independent variables.
I don't think there is a built-in option for this, right? So would the best approach be to predict the fitted values and continue from there?
                              Last edited by David Moeller; 14 Dec 2024, 07:43.
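One possible route, sketched under the assumption that the -runby- setup from #14 is still in use: -regress- stores the R-squared in e(r2), so program one_event could save it directly rather than recomputing anything from fitted values:
Code:
```
* inside one_event, after a successful -regress-:
predict resid if e(sample), resid
summ resid
gen resid_variance = r(Var) if e(sample)
gen r_squared = e(r2) if e(sample)   // e(r2) is left behind by -regress-
```
The e() results from -regress- remain available after -predict- and -summarize-, so this line can simply be added to the existing program body.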
