Cross Sectional

Giovanni DellErba

Join Date: May 2024

Posts: 3
#1

Cross Sectional

13 Jun 2024, 06:25

Dear all, I would like to ask for your input on the following task. I have a monthly panel dataset of stock (identifier PERMNO) with variable date as a time variable. For each PERMNO and date, I have the following 2 variables: 1) market beta and 2) PVS beta (where PVS is a proxy for market sentiment). My goal is the following: each month, I have to calculate the cross-sectional Pearson product-moment correlation between market beta and PVS beta which I will call ρβ_βΔPVS. After calculating the cross-sectional correlation, I have to divide months in the sample into those with high and low cross-sectional correlation between market beta and PVS beta. High (low) ρβ_βΔPVS months are taken to be those months with values of ρβ_βΔPVS greater than or equal to (less than) the median ρβ_βΔPVS.
I have used the following code, but I am super unsure whether this is the correct way:

Code:
// stocks beta by PERMNO date
use "Beta Monthly new"
format ym %tm

// merge the beta PVS
merge 1:1 PERMNO ym using "bPVS FINAL"
drop if _merge == 1| _merge == 2

// clean dataset
collapse beta betaPVS, by(dateff PERMNO)
drop if beta == . | betaPVS == .

// panel dataset
egen firm = group(PERMNO)
egen time = group(date)
// Sort the data by date
sort date

// Create a variable to store the correlation results
gen ρβ_βΔPVS = .

// Generate a unique identifier for each month
egen month = group(date)

// Get the list of unique month identifiers
levelsof month, local(months)

// Loop over each month to calculate the correlation
foreach m of local months {
* Restrict the data to the current month
preserve
keep if month == `m'

* Check if there are enough observations to calculate correlation
quietly count
if r(N) > 1 {
* Debug: Print the current month and number of observations
di "Processing month: `m' with observations: " r(N)

* Calculate the correlation
corr beta betaPVS, means

* Debug: Print the correlation matrix
matrix list r(C)

* Extract the correlation coefficient
matrix C = r(C)
local corr = C[1,2]

* Restore the full dataset
restore

* Store the result in the dataset
replace ρβ_βΔPVS = `corr' if month == `m'
}
else {
* Debug: Print message if not enough observations
di "Not enough observations for month: `m'"

* Restore the full dataset if not enough observations
restore
}
}

// Check the results
list date ρβ_βΔPVS if !missing(ρβ_βΔPVS)

// Summarize the correlations
summarize ρβ_βΔPVS

// Calculate the median of the corr_beta_ivol
sum ρβ_βΔPVS, detail

// Split the sample based on median correlation
egen median_ρβ_βΔPVS = median(ρβ_βΔPVS)
gen high_low_corr = ρβ_βΔPVS >= median_ρβ_βΔPVS

// clean dataset
collapse high_low_corr, by (dateff PERMNO)

In case it is needed, this is a small portion of the dataset I have
PERMNO date beta betaPVS
10006 31aug1971 .79016378 -.00768161
10014 31aug1971 1.9341393 -.02881953
10057 31aug1971 .97684805 .00235655
10102 31aug1971 1.4654419 .00578056
10137 31aug1971 .80199136 -.00021375
10145 31aug1971 1.2391442 .0222873
10153 31aug1971 1.7417629 -.00638195
10161 31aug1971 .8160666 .01480798
10188 31aug1971 2.2849678 .0101683
10225 31aug1971 .79444095 -.00114427
10233 31aug1971 1.4605849 -.00501463
10241 31aug1971 .74535671 -.00248796
10276 31aug1971 .26219438 -.00930459
10321 31aug1971 1.3446186 -.01120439
10364 31aug1971 1.0343247 -.0062352
10372 31aug1971 .98055205 .01119765
10401 31aug1971 .7315204 .00118681
10460 31aug1971 1.7546463 .01137953
10479 31aug1971 1.1086954 -.01760248
10487 31aug1971 .50537227 -.01441397
10495 31aug1971 1.1820425 -.00525025
10516 31aug1971 1.1805342 -.00638053
10559 31aug1971 .76682362 .00182585
10604 31aug1971 1.3611181 -.00065288
10698 31aug1971 1.0273563 -.00911494
10727 31aug1971 .32257196 -.00793185
10743 31aug1971 1.0391832 .01016556
10751 31aug1971 1.6242846 -.01652368

Thank to everyone in advance for you answer,
Please let me know if something else is needed from my side

Best regards,
Giovanni Dell'Erba
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

13 Jun 2024, 10:36

I see one questionable aspect to your code. -egen month = group(date)-. This is fine if each month is always represented by the same date in the data set. Your example data conforms to this, but it only shows one date anyway, so it conforms in a rather vacuous way. But if your data set contains, in addition to observations with date = 31aug1971, observations with, say, date = 15aug1971, these will be given different value of the variable month that you created, which will then lead to incorrect calculations.

More generally, using -egen, group()- is not a very good way to group dates into months. It will only work under the restrictive conditions on the data referred to in the previous paragraph. And the resulting variable will be a set of consecutive integers that are not informative about what months they actually represent. Better to use Stata internal format monthly date variables for that purpose.

Also, you calculate your median correlation before you collapse the data set. But that means that each month gets multiply counted according to the number of distinct permno's instantiated in its data. Unless every month has exactly the same number of permno's, this will give an incorrect result. The simplest way around this is to only store the correlation in one observation per month.

So, putting this altogether, here's how I would approach this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int permno str9 date float(beta betapvs) 10006 "31aug1971" .7901638 -.00768161 10014 "31aug1971" 1.9341393 -.02881953 10057 "31aug1971" .9768481 .00235655 10102 "31aug1971" 1.465442 .00578056 10137 "31aug1971" .8019913 -.00021375 10145 "31aug1971" 1.2391442 .0222873 10153 "31aug1971" 1.741763 -.00638195 10161 "31aug1971" .8160666 .01480798 10188 "31aug1971" 2.284968 .0101683 10225 "31aug1971" .7944409 -.00114427 10233 "31aug1971" 1.460585 -.00501463 10241 "31aug1971" .7453567 -.00248796 10276 "31aug1971" .26219437 -.00930459 10321 "31aug1971" 1.3446186 -.01120439 10364 "31aug1971" 1.0343246 -.0062352 10372 "31aug1971" .9805521 .01119765 10401 "31aug1971" .7315204 .00118681 10460 "31aug1971" 1.7546463 .01137953 10479 "31aug1971" 1.1086954 -.01760248 10487 "31aug1971" .5053723 -.01441397 10495 "31aug1971" 1.1820425 -.00525025 10516 "31aug1971" 1.1805342 -.00638053 10559 "31aug1971" .7668236 .00182585 10604 "31aug1971" 1.361118 -.00065288 10698 "31aug1971" 1.0273563 -.00911494 10727 "31aug1971" .322572 -.00793185 10743 "31aug1971" 1.0391833 .01016556 10751 "31aug1971" 1.6242846 -.01652368 end // ASSUMING date IS A STRING VARIABLE DO THIS FIRST: gen ddate = daily(date, "DMY") format ddate %td assert missing(ddate) == missing(date) drop date rename ddate date // IF date WAS ALREADY A STATA INTERNAL FORMAT DATE VARIABLE SKIP THE ABOVE AND START HERE gen mdate = mofd(date) format mdate %tm egen selected = tag(mdate) gen correlation = . levelsof mdate, local(months) foreach m of local months { count if mdate == `m' & !missing(beta, betapvs) if r(N) > 1 { corr beta betapvs if mdate == `m' replace correlation = r(rho) if mdate == `m' & selected } } summ correlation, detail local median = r(p50) gen byte high_corr = correlation >= `median' if !missing(correlation)

Notice that to respond to your question, I had to guess whether date is a string variable, because the listing you showed contains no metadata. The code differs depending on this. That is why, to avoid ambiguity, or just getting code that won't work as a response, it is best to use the -dataex- command to show example data, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Announcement

Comment