Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging SDC and Compustat Data

    I am trying to merge the mergers and acquisitions data I obtained from SDC database into my data. My company identifiers are 6, 8, and 9 digit CUSIP of firms. However, the CUSIP codes I got in SDC are 6 digit ones ending with a letter (for example: 43657H; 000045; 25406M). How is it possible to convert these into 8 or 9 digit CUSIP codes so that I can merge them into my data. I will appreciate comments and guidelines.

  • #2
    Generally it's easier to use the left 6 digits (characters) from the longer CUSIP to match between the two databases. For example, Google's 6-digit CUSIP is 38259P while it's 9-dgit CUSIP is 38259P50; similarly, IBM's 6-digit CUSIP is 459200, while it's 9-digit CUSIP is 459200101. In Stata you can select the left 6 characters by using the substr command. For example
    Code:
     gen cusip_6 = substr( cusip_9,1,6)
    . Note that the variable cusip_9 must be a string for this to work.

    Comment


    • #3
      Thenks, but the problem is that almost none of my cusip codes in Compustat data include any letters (that is, all 6, 8, or 9 digits are numbers); however, in the SDC data many of the cusip codes include 5 digits and a letter (for example: 25635n; 325m25). How can I handle that? Is there anything I can do???

      Comment


      • #4
        Ali, it would helpful if you could provide some examples. It's not clear to me whether the challenge is that in one case Stata may treat it like a string vs a numeric, or whether you mean they may not match (because some have letters and some don't). You might try spot-checking how many match on the left 5. A CUSIP is just a 6-digit identifier, a 2-digit issue number (for when company issues different stock or bond offerings, and the last digit (in a 9 digit CUSIP) is a checksum.

        Also, Compustat has a CUSIP converter that allows you to convert 8 or 9 digit ones into 6-digit ones (you can read more at http://bit.ly/2EtdsKm. See also http://bit.ly/2Ew17oD

        At one point there was an Excel converter tool for converting CUSIP's as well (that you can search for online).


        Comment


        • #5
          The problem is associated with matching. There is no issue regarding string or numeric. The problem is the CUSIPs in SDC are 6 digit ones, but they include a letter and 5 numbers (for example: 52365b); however, those of COMPUSTAT are all numbers and do not include any letters. This makes matching problematic.

          Comment


          • #6
            So here is a screenshot of a match I did (but probably initially in Excel using VLOOKUP or searching for them by hand) as a doctoral student.
            Variables: acq_nm permno acq_gvkey acq_cusip came from COMPUSTAT
            Variables: acq_nm_sdc and acq_cusip_sdc came from SDC (obviously)

            Note that both acq_cusip and acq_cusip_SDC are 6-character string variables in my Stata file, but when I pasted them to Excel it converted them to numbers and dropped the leading "0"'s. So anything with < 6 numbers is missing leading 000's. For example, ADC Telecommunications CUSIP is really "000886" and ASK's is "001903". I've highlighted in yellow observations where the cusip's differed between databases. Also note that for MCI and C-COR, SDC has a couple of different names for them.

            Click image for larger version

Name:	SDC Cusip.png
Views:	1
Size:	63.3 KB
ID:	1468416

            Comment


            • #7
              Thanks Davıd. So let's consider your last observation. The cusip in SDC is 30342A and the cusip for Compustat is 433578. How have these matched??????? They are not the same, are they?

              Comment


              • #8
                That one (as well as the Computer Assoc vs CA) I had to match manually. So I started with a list of acquirers and then was matching them to financial data in Compustat, patent data, etc. I did a first merge to see how many would be "easy" matches. That got about 80% of the way there. I then collapsed down Compustat to 1 line per firm, and then took my acquirers from SDC that didn't find a match on cusip, and then looked those up manually. I only had 1000 acquirers, so looking up the last 200 by hand was doable.

                A couple of ways to proceed (since if you are going to look at 50,000 obs, the manual method isn't doable)
                1) Try the initial match and see how many matches you get (i.e. convert both to string or destring both of them).

                2) For the unmatched, then try taking the 1st 5 characters of the cusip and matching on that. gen cusip5 = substr(cusip, 1, 5)

                3) I had to merge a list of startups (targets from SDC matched to VentureXpert) to CorpTech data, I used Stata's matchit (SSC) function, which allows for fuzzy matching, to match on company names (which is a pain)

                Code:
                matchit id target_nm using "Using.dta", idusing(CorpTech_id) txtusing(CorpTech_name)  gen(similarity_score)  threshold(0.70)
                4) To clean up the names for the match (to raise match scores for true matches), I also ran some clean up on the names to get rid of "-ADR" or "-CL A" or "-CL B" at the end of names. I think I also got rid of "CORP." "Corp" ", inc." etc at the end when using matchit. Although, based on the similarity of the names in the screenshot, just matching on name (after trying the cusip) might get you pretty close.

                For what it's worth, this is the code I used to clean up (obviously, you could run this as a loop)
                Code:
                replace CorpTech_name = subinstr(CorpTech_name, ", Inc.", "",.)
                replace CorpTech_name = subinstr(CorpTech_name, ", Inc", "",.)
                replace CorpTech_name = subinstr(CorpTech_name, "Corporation", "",.)
                replace CorpTech_name = subinstr(CorpTech_name, "Corp.", "",.)
                replace CorpTech_name = subinstr(CorpTech_name, " Corp", "",.)
                replace CorpTech_name = trim(CorpTech_name)
                replace CorpTech_name = itrim(CorpTech_name)

                Comment


                • #9
                  While this may be a late response, I have experienced something similar when trying to merge Compustat with SDC. Apparently, Compustat uses recent CUSIP while SDC uses historical CUSIP. The best way to overcome this is to download a dataset from CRSP called stocknames which has historical CUSIP. I used WRDS package in Python to grab the dataset stocknames from CRSP and here is how the link to how https://github.com/wharton/wrds

                  Comment


                  • #10
                    To elaborate on my previous post, stocknames from CRSP will have the name of historical CUSIP as ncusip. I'm including an example code for reference and demonstration

                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str8(cusip ncusip)
                    "68391610" "68391610"
                    "36720410" "39040610"
                    "36720410" "29274A10"
                    "36720410" "29274A20"
                    "36720410" "29269V10"
                    "36720410" "29269V10"
                    "36720410" "36720410"
                    "05978R10" "60740110"
                    "05978R10" "83623410"
                    "05978R10" "05978R10"
                    end

                    Comment


                    • #11
                      @Salem Alsanousi Hi Salem, I'm trying to merge SDC M&A data with Compustat Fundamental Dataset by using -cusip year-. However, when I tried to merge stocknames data with Compustat Fundamental Dataset using -permno year-, there are many unmatched observations. Do you possibly know why? Many thanks!

                      Comment

                      Working...
                      X