Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cross Sectional

    Dear all, I would like to ask for your input on the following task. I have a monthly panel dataset of stock (identifier PERMNO) with variable date as a time variable. For each PERMNO and date, I have the following 2 variables: 1) market beta and 2) PVS beta (where PVS is a proxy for market sentiment). My goal is the following: each month, I have to calculate the cross-sectional Pearson product-moment correlation between market beta and PVS beta which I will call ρβ_βΔPVS. After calculating the cross-sectional correlation, I have to divide months in the sample into those with high and low cross-sectional correlation between market beta and PVS beta. High (low) ρβ_βΔPVS months are taken to be those months with values of ρβ_βΔPVS greater than or equal to (less than) the median ρβ_βΔPVS.
    I have used the following code, but I am super unsure whether this is the correct way:

    Code:
    // stocks beta by PERMNO date
    use "Beta Monthly new"
    format ym %tm

    // merge the beta PVS
    merge 1:1 PERMNO ym using "bPVS FINAL"
    drop if _merge == 1| _merge == 2

    // clean dataset
    collapse beta betaPVS, by(dateff PERMNO)
    drop if beta == . | betaPVS == .

    // panel dataset
    egen firm = group(PERMNO)
    egen time = group(date)
    // Sort the data by date
    sort date

    // Create a variable to store the correlation results
    gen ρβ_βΔPVS = .

    // Generate a unique identifier for each month
    egen month = group(date)

    // Get the list of unique month identifiers
    levelsof month, local(months)

    // Loop over each month to calculate the correlation
    foreach m of local months {
    * Restrict the data to the current month
    preserve
    keep if month == `m'

    * Check if there are enough observations to calculate correlation
    quietly count
    if r(N) > 1 {
    * Debug: Print the current month and number of observations
    di "Processing month: `m' with observations: " r(N)

    * Calculate the correlation
    corr beta betaPVS, means

    * Debug: Print the correlation matrix
    matrix list r(C)

    * Extract the correlation coefficient
    matrix C = r(C)
    local corr = C[1,2]

    * Restore the full dataset
    restore

    * Store the result in the dataset
    replace ρβ_βΔPVS = `corr' if month == `m'
    }
    else {
    * Debug: Print message if not enough observations
    di "Not enough observations for month: `m'"

    * Restore the full dataset if not enough observations
    restore
    }
    }


    // Check the results
    list date ρβ_βΔPVS if !missing(ρβ_βΔPVS)

    // Summarize the correlations
    summarize ρβ_βΔPVS

    // Calculate the median of the corr_beta_ivol
    sum ρβ_βΔPVS, detail

    // Split the sample based on median correlation
    egen median_ρβ_βΔPVS = median(ρβ_βΔPVS)
    gen high_low_corr = ρβ_βΔPVS >= median_ρβ_βΔPVS

    // clean dataset
    collapse high_low_corr, by (dateff PERMNO)

    In case it is needed, this is a small portion of the dataset I have
    PERMNO date beta betaPVS
    10006 31aug1971 .79016378 -.00768161
    10014 31aug1971 1.9341393 -.02881953
    10057 31aug1971 .97684805 .00235655
    10102 31aug1971 1.4654419 .00578056
    10137 31aug1971 .80199136 -.00021375
    10145 31aug1971 1.2391442 .0222873
    10153 31aug1971 1.7417629 -.00638195
    10161 31aug1971 .8160666 .01480798
    10188 31aug1971 2.2849678 .0101683
    10225 31aug1971 .79444095 -.00114427
    10233 31aug1971 1.4605849 -.00501463
    10241 31aug1971 .74535671 -.00248796
    10276 31aug1971 .26219438 -.00930459
    10321 31aug1971 1.3446186 -.01120439
    10364 31aug1971 1.0343247 -.0062352
    10372 31aug1971 .98055205 .01119765
    10401 31aug1971 .7315204 .00118681
    10460 31aug1971 1.7546463 .01137953
    10479 31aug1971 1.1086954 -.01760248
    10487 31aug1971 .50537227 -.01441397
    10495 31aug1971 1.1820425 -.00525025
    10516 31aug1971 1.1805342 -.00638053
    10559 31aug1971 .76682362 .00182585
    10604 31aug1971 1.3611181 -.00065288
    10698 31aug1971 1.0273563 -.00911494
    10727 31aug1971 .32257196 -.00793185
    10743 31aug1971 1.0391832 .01016556
    10751 31aug1971 1.6242846 -.01652368

    Thank to everyone in advance for you answer,
    Please let me know if something else is needed from my side


    Best regards,
    Giovanni Dell'Erba

  • #2
    I see one questionable aspect to your code. -egen month = group(date)-. This is fine if each month is always represented by the same date in the data set. Your example data conforms to this, but it only shows one date anyway, so it conforms in a rather vacuous way. But if your data set contains, in addition to observations with date = 31aug1971, observations with, say, date = 15aug1971, these will be given different value of the variable month that you created, which will then lead to incorrect calculations.

    More generally, using -egen, group()- is not a very good way to group dates into months. It will only work under the restrictive conditions on the data referred to in the previous paragraph. And the resulting variable will be a set of consecutive integers that are not informative about what months they actually represent. Better to use Stata internal format monthly date variables for that purpose.

    Also, you calculate your median correlation before you collapse the data set. But that means that each month gets multiply counted according to the number of distinct permno's instantiated in its data. Unless every month has exactly the same number of permno's, this will give an incorrect result. The simplest way around this is to only store the correlation in one observation per month.

    So, putting this altogether, here's how I would approach this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int permno str9 date float(beta betapvs)
    10006 "31aug1971"  .7901638 -.00768161
    10014 "31aug1971" 1.9341393 -.02881953
    10057 "31aug1971"  .9768481  .00235655
    10102 "31aug1971"  1.465442  .00578056
    10137 "31aug1971"  .8019913 -.00021375
    10145 "31aug1971" 1.2391442   .0222873
    10153 "31aug1971"  1.741763 -.00638195
    10161 "31aug1971"  .8160666  .01480798
    10188 "31aug1971"  2.284968   .0101683
    10225 "31aug1971"  .7944409 -.00114427
    10233 "31aug1971"  1.460585 -.00501463
    10241 "31aug1971"  .7453567 -.00248796
    10276 "31aug1971" .26219437 -.00930459
    10321 "31aug1971" 1.3446186 -.01120439
    10364 "31aug1971" 1.0343246  -.0062352
    10372 "31aug1971"  .9805521  .01119765
    10401 "31aug1971"  .7315204  .00118681
    10460 "31aug1971" 1.7546463  .01137953
    10479 "31aug1971" 1.1086954 -.01760248
    10487 "31aug1971"  .5053723 -.01441397
    10495 "31aug1971" 1.1820425 -.00525025
    10516 "31aug1971" 1.1805342 -.00638053
    10559 "31aug1971"  .7668236  .00182585
    10604 "31aug1971"  1.361118 -.00065288
    10698 "31aug1971" 1.0273563 -.00911494
    10727 "31aug1971"   .322572 -.00793185
    10743 "31aug1971" 1.0391833  .01016556
    10751 "31aug1971" 1.6242846 -.01652368
    end
    
    // ASSUMING date IS A STRING VARIABLE DO THIS FIRST:
    gen ddate = daily(date, "DMY")
    format ddate %td
    assert missing(ddate) == missing(date)
    drop date
    rename ddate date
    
    // IF date WAS ALREADY A STATA INTERNAL FORMAT DATE VARIABLE SKIP THE ABOVE AND START HERE
    gen mdate = mofd(date)
    format mdate %tm
    egen selected = tag(mdate)
    
    gen correlation = .
    levelsof mdate, local(months)
    foreach m of local months {
        count if mdate == `m' & !missing(beta, betapvs)
        if r(N) > 1 {
            corr beta betapvs if mdate == `m'
            replace correlation = r(rho) if mdate == `m' & selected
        }
    }
    
    summ correlation, detail
    local median = r(p50)
    gen byte high_corr = correlation >= `median' if !missing(correlation)
    Notice that to respond to your question, I had to guess whether date is a string variable, because the listing you showed contains no metadata. The code differs depending on this. That is why, to avoid ambiguity, or just getting code that won't work as a response, it is best to use the -dataex- command to show example data, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.


    Comment

    Working...
    X