Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Conditional median in panel dataset

    Dear Statalist Members,

    I am currently trying to compute a conditional median of bond ratings in a panel dataset of announcements and transactions.

    For each announcement (identified by non-missing anndats_act), I want to compute the median rating_filled taking into account each bond (identified by cusip_id) of the announcing firm (by oftic) traded within the last three months. The rating_filled might be updated during the three month period, so only the latest rating_filled should be considered in the median.

    I found a solution that is much too slow for my dataset with over 13 million observations (of which over 50 thousand announcements).

    Code:
    * Create a variable to store the median firm rating
    gen median_rating_filled = .
    
    * Loop through each announcement
    forvalues i = 1/`=_N' {
        if !missing(anndats_act[`i']) {
            local oftic = oftic[`i']
            local anndate = anndats_act[`i']
    
            * Keep only the latest observation per cusip_id within 3 months prior to the announcement. Exclude announcement day because of possible induced rating changes.
            preserve
            keep if oftic == "`oftic'" & inrange(trd_exctn_dt, `anndate'-90, `anndate'-1)
            by cusip_id (trd_exctn_dt), sort: keep if _n == _N
    
            * Calculate the median rating_filled
            egen median_rating_temp = median(rating_filled)
            local median_rating = median_rating_temp[1]
    
            * Store the median rating in the original dataset
            restore
            replace median_rating_filled = `median_rating' in `i'
        }
    }
    My data set (with only the variables relevant to this problem) looks like this:

    Code:
    clear
    input str10 cusip_id float trd_exctn_dt str6 oftic long anndats_act float rating_filled
    "69349LAS7" 22386 "PNC"     . 7
    "693475AX3" 22386 "PNC"     . 7
    "693475AQ8" 22386 "PNC"     . 9
    "693475AK1" 22386 "PNC"     . 9
    "693476BN2" 22386 "PNC"     . 7
    "69349LAG3" 22386 "PNC"     . 7
    "69349LAK4" 22386 "PNC"     . 7
    "693475AW5" 22386 "PNC"     . 7
    "693475AM7" 22386 "PNC"     . 9
    ""          22386 "PNC" 22386 .
    "693475AL9" 22386 "PNC"     . 7
    "69349LAM0" 22386 "PNC"     . 7
    end
    format %td trd_exctn_dt
    format %td anndats_act
    Does anyone know of a solution to increase the efficiency of my code, or even to avoid the generally slow combination of looping and preserve-restore?

    Thank you very much,
    Laura

  • #2
    First, thanks for using -dataex- on your very first Statalist post.

    I don't know if the following code is a correct solution to your problem or not. This is partly because I'm not certain I understand the problem, and mostly because I cannot test it on the example data you did show. (See below for more on that.) But give this a try, and if it is wrong, see if you can adapt it.

    Code:
    //    IDENTIFY 90 DAY WINDOWS PRECEDING ANNOUNCEMENTS
    gsort -trd_exctn_dt anndats_act
    gen ref_date = anndats_act
    replace ref_date = ref_date[_n-1] if missing(ref_date)
    format ref_date %td
    gen byte in_window = inrange(ref_date-trd_exctn_dt, -90, -1)
    
    //    IDENTIFY LAST RATING WITHIN WINDOW FOR EACH CUSIP
    by ref_date cusip_id in_window (trd_exctn_dt) , sort: ///
        egen final_rating = rating_filled[_N] if in_window
        
    //    IDENTIFY MEDIAN FINAL RATING AMONG CUSIPS IN THE WINDOW
    egen byte tag = tag(ref_date oftic cusip_id)
    by ref_date, sort: egen median_rating = median(cond(tag, final_rating, .))
    If this is not useful, I think you can gain considerable speed by making some modifiations of the code of your -loop- into a -program- and then running that program under Robert Picard's -rangerun- command (available from SSC). Be sure to read the -rangerun- help file carefully and probably work the examples given there yourself to get a feel for it before you try applying it to your actual problem.

    If you want assistance in applying -rangerun- for this, please post back with different example data. The example shown has only one observation per cusip_id, and all of the observations are on a single date. Even more problematic, because that single date is also the anndats_act, none of the observations fall within the 90 day lagging window for the announcement! What is needed is an example data set with a range of dates that includes some that are within the 3 month window for an announcement and others that are not, as well as some cusips where rating_filled has been revised during the 3 month window. Also, you need to explain how you want to handle the situation where one announcement falls within the 90 day window of another announcement.

    Comment


    • #3
      Being relatively new to Stata, it took me some time to understand your code, but with only minimal tweaking it worked perfectly!
      Your advice on -rangerun- is also much appreciated, as it looks like it could optimise my current project in many different ways.

      Thank you for your help, Clyde.

      Comment

      Working...
      X