Conditional median in panel dataset

Laura Bode

Join Date: Nov 2024
Posts: 2

Conditional median in panel dataset

11 Nov 2024, 08:56

Dear Statalist Members,

I am currently trying to compute a conditional median of bond ratings in a panel dataset of announcements and transactions.

For each announcement (identified by non-missing anndats_act), I want to compute the median rating_filled taking into account each bond (identified by cusip_id) of the announcing firm (by oftic) traded within the last three months. The rating_filled might be updated during the three month period, so only the latest rating_filled should be considered in the median.

I found a solution that is much too slow for my dataset with over 13 million observations (of which over 50 thousand announcements).

Code:

* Create a variable to store the median firm rating
gen median_rating_filled = .

* Loop through each announcement
forvalues i = 1/`=_N' {
    if !missing(anndats_act[`i']) {
        local oftic = oftic[`i']
        local anndate = anndats_act[`i']

        * Keep only the latest observation per cusip_id within 3 months prior to the announcement. Exclude announcement day because of possible induced rating changes.
        preserve
        keep if oftic == "`oftic'" & inrange(trd_exctn_dt, `anndate'-90, `anndate'-1)
        by cusip_id (trd_exctn_dt), sort: keep if _n == _N

        * Calculate the median rating_filled
        egen median_rating_temp = median(rating_filled)
        local median_rating = median_rating_temp[1]

        * Store the median rating in the original dataset
        restore
        replace median_rating_filled = `median_rating' in `i'
    }
}

My data set (with only the variables relevant to this problem) looks like this:

Code:

clear
input str10 cusip_id float trd_exctn_dt str6 oftic long anndats_act float rating_filled
"69349LAS7" 22386 "PNC"     . 7
"693475AX3" 22386 "PNC"     . 7
"693475AQ8" 22386 "PNC"     . 9
"693475AK1" 22386 "PNC"     . 9
"693476BN2" 22386 "PNC"     . 7
"69349LAG3" 22386 "PNC"     . 7
"69349LAK4" 22386 "PNC"     . 7
"693475AW5" 22386 "PNC"     . 7
"693475AM7" 22386 "PNC"     . 9
""          22386 "PNC" 22386 .
"693475AL9" 22386 "PNC"     . 7
"69349LAM0" 22386 "PNC"     . 7
end
format %td trd_exctn_dt
format %td anndats_act

Does anyone know of a solution to increase the efficiency of my code, or even to avoid the generally slow combination of looping and preserve-restore?

Thank you very much,
Laura

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29453
#2

11 Nov 2024, 11:03

First, thanks for using -dataex- on your very first Statalist post.

I don't know if the following code is a correct solution to your problem or not. This is partly because I'm not certain I understand the problem, and mostly because I cannot test it on the example data you did show. (See below for more on that.) But give this a try, and if it is wrong, see if you can adapt it.

Code:

// IDENTIFY 90 DAY WINDOWS PRECEDING ANNOUNCEMENTS gsort -trd_exctn_dt anndats_act gen ref_date = anndats_act replace ref_date = ref_date[_n-1] if missing(ref_date) format ref_date %td gen byte in_window = inrange(ref_date-trd_exctn_dt, -90, -1) // IDENTIFY LAST RATING WITHIN WINDOW FOR EACH CUSIP by ref_date cusip_id in_window (trd_exctn_dt) , sort: /// egen final_rating = rating_filled[_N] if in_window // IDENTIFY MEDIAN FINAL RATING AMONG CUSIPS IN THE WINDOW egen byte tag = tag(ref_date oftic cusip_id) by ref_date, sort: egen median_rating = median(cond(tag, final_rating, .))

If this is not useful, I think you can gain considerable speed by making some modifiations of the code of your -loop- into a -program- and then running that program under Robert Picard's -rangerun- command (available from SSC). Be sure to read the -rangerun- help file carefully and probably work the examples given there yourself to get a feel for it before you try applying it to your actual problem.

If you want assistance in applying -rangerun- for this, please post back with different example data. The example shown has only one observation per cusip_id, and all of the observations are on a single date. Even more problematic, because that single date is also the anndats_act, none of the observations fall within the 90 day lagging window for the announcement! What is needed is an example data set with a range of dates that includes some that are within the 3 month window for an announcement and others that are not, as well as some cusips where rating_filled has been revised during the 3 month window. Also, you need to explain how you want to handle the situation where one announcement falls within the 90 day window of another announcement.
2 likes
Comment
Laura Bode

Join Date: Nov 2024

Posts: 2
#3

11 Nov 2024, 14:53

Being relatively new to Stata, it took me some time to understand your code, but with only minimal tweaking it worked perfectly!
Your advice on -rangerun- is also much appreciated, as it looks like it could optimise my current project in many different ways.

Thank you for your help, Clyde.
Comment

Announcement

Conditional median in panel dataset

Comment

Comment