
  • Merging via CUSIP and missing matches

    Hello, I am trying to merge the Thomson 13F summary file on Ownership with a quarterly Fundamentals Compustat file. The 13F file only comes with CUSIP as an identifier and while I can merge the two datasets by reducing the CUSIP to either 6 or 8 digits, the data seems to miss certain matches where the firms are the same and observations should exist. I also get the message that: "variable cusip was str10, now str12 to accommodate using data's values"

    If I want to use the PERMNO linking table, I face the problem that I do not know how to match them within the right date range. In the linking file, I am provided with 2 variables: the start of the link and the end date. My idea was to duplicate each observation in the linking file on a quarterly basis within the date range. However, I do not know how to do this. But I think that after that, I can match the two files, as CUSIP is unique at a particular point in time.

    Again, the main goal is to bring together the quarterly fundamentals from the Compustat file with the ownership data which is also quarterly.

    Thank you for your help!

    Compustat Data

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 GVKEY long datadate str10 cusip
    "001001" 8673 "000165100"
    "001001" 8765 "000165100"
    "001001" 8856 "000165100"
    "001001" 8947 "000165100"
    "001001" 9039 "000165100"
    "001001" 9131 "000165100"
    "001001" 9221 "000165100"
    "001001" 9312 "000165100"
    "001001" 9404 "000165100"
    "001001" 9496 "000165100"
    "001001" 9586 "000165100"
    "001003" 8765 "000354100"
    "001003" 8856 "000354100"
    "001003" 8947 "000354100"
    "001003" 9039 "000354100"
    "001003" 9131 "000354100"
    "001003" 9251 "000354100"
    "001003" 9343 "000354100"
    "001003" 9435 "000354100"
    "001003" 9527 "000354100"
    end
    format %td datadate
    13F File

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 cusip long rdate
    "00036110" 7395
    "00080010" 7395
    "00102910" 7395
    "00103810" 7395
    "00122010" 7395
    "00168810" 7395
    "00172310" 7395
    "00185510" 7395
    "00202410" 7395
    "00203010" 7395
    "00204010" 7395
    "00205010" 7395
    "00208010" 7395
    "00228010" 7395
    "00244010" 7395
    "00245510" 7395
    "00282410" 7395
    "00368010" 7395
    "00400910" 7395
    "00429510" 7395
    end
    format %td rdate
    Linking Table

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 gvkey str10 cusip long(LINKDT LINKENDDT)
    "001000" "000032102"  3969  6755
    "001001" "000165100"  8663  9708
    "001002" "000352104"  4731  4904
    "001003" "000354100"  8741 10820
    "001004" "000361105"  4497    .e
    "001005" "000370106"  4779  8431
    "001007" "000774109"  5022  6969
    "001007" "000774109"  6970  9037
    "001008" "000775106"  8637  9918
    "001009" "000781104"  8053 13221
    "001010" "00099V004"   761  8945
    "001010" "00099V004" -3532   760
    "001011" "000872309"  8480 13054
    "001012" "000874107"  6605 10955
    "001013" "000886309"  7014 18627
    "001015" "000893107"  8064  9800
    "001016" "001015106"  6543 10343
    "001016" "001015106"  5053  6542
    "001017" "001030105"  4731 13208
    "001018" "001032101"  6970  7733
    end
    format %td LINKDT
    format %td LINKENDDT

  • #2
    I'm not entirely sure what you want to do here. The main source of my confusion is the variable rdate in the 13F file. It is unclear what role, if any, it plays. Does it need to fall within LINKDT and LINKENDDT, just as datadate does in the COMPUSTAT file? Or is it irrelevant? In any case, as none of the cusip values that appear in the 13F example also occur in the linking table example, it isn't possible to truly test code for this process.

    In what follows, I will assume that in order to match, you do want rdate in the 13F file to fall between LINKDT and LINKENDDT.

    While the idea of expanding every observation in the linking data to cover every date between those given will work, it will require a very large amount of memory, and if your files are large to start with, the result may very well exceed the memory available for your process. In any case, there is no need for that, because we can resort to the -rangejoin- command.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 GVKEY long datadate str10 cusip
    "001001" 8673 "000165100"
    "001001" 8765 "000165100"
    "001001" 8856 "000165100"
    "001001" 8947 "000165100"
    "001001" 9039 "000165100"
    "001001" 9131 "000165100"
    "001001" 9221 "000165100"
    "001001" 9312 "000165100"
    "001001" 9404 "000165100"
    "001001" 9496 "000165100"
    "001001" 9586 "000165100"
    "001003" 8765 "000354100"
    "001003" 8856 "000354100"
    "001003" 8947 "000354100"
    "001003" 9039 "000354100"
    "001003" 9131 "000354100"
    "001003" 9251 "000354100"
    "001003" 9343 "000354100"
    "001003" 9435 "000354100"
    "001003" 9527 "000354100"
    end
    format %td datadate
    tempfile compustat
    save `compustat'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 cusip long rdate
    "00036110" 7395
    "00080010" 7395
    "00102910" 7395
    "00103810" 7395
    "00122010" 7395
    "00168810" 7395
    "00172310" 7395
    "00185510" 7395
    "00202410" 7395
    "00203010" 7395
    "00204010" 7395
    "00205010" 7395
    "00208010" 7395
    "00228010" 7395
    "00244010" 7395
    "00245510" 7395
    "00282410" 7395
    "00368010" 7395
    "00400910" 7395
    "00429510" 7395
    end
    format %td rdate
    tempfile 13f
    save `13f'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 gvkey str10 cusip long(LINKDT LINKENDDT)
    "001000" "000032102"  3969  6755
    "001001" "000165100"  8663  9708
    "001002" "000352104"  4731  4904
    "001003" "000354100"  8741 10820
    "001004" "000361105"  4497    .e
    "001005" "000370106"  4779  8431
    "001007" "000774109"  5022  6969
    "001007" "000774109"  6970  9037
    "001008" "000775106"  8637  9918
    "001009" "000781104"  8053 13221
    "001010" "00099V004"   761  8945
    "001010" "00099V004" -3532   760
    "001011" "000872309"  8480 13054
    "001012" "000874107"  6605 10955
    "001013" "000886309"  7014 18627
    "001015" "000893107"  8064  9800
    "001016" "001015106"  6543 10343
    "001016" "001015106"  5053  6542
    "001017" "001030105"  4731 13208
    "001018" "001032101"  6970  7733
    end
    format %td LINKDT
    format %td LINKENDDT
    tempfile linking
    save `linking'
    
    use `linking', clear
    rename gvkey GVKEY
    rangejoin datadate LINKDT LINKENDDT using `compustat', by(GVKEY)
    drop if missing(datadate) // NO MATCHES IN COMPUSTAT
    rename cusip_U cusip_compustat
    
    rangejoin rdate LINKDT LINKENDDT using `13f', by(cusip)
    Notes:
    1. -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
    2. I have stored all of the example data in tempfiles. That is just a convenience for me so I don't have to clean up my hard drive after every Statalist post. You should just use the data in the permanent data files that you already have.

    Comment


    • #3
      the data seems to miss certain matches where the firms are the same and observations should exist.
      "Seems" and "should" are not very definitive. Can you present an observation in your 13F data and an observation in your Compustat data that have the same 8-digit CUSIP and the same rdate but were not matched in the output of the merge? Please again use dataex to present each of the examples, and along with that, present the merge command you used and its output.

      I also get the message that: "variable cusip was str10, now str12 to accommodate using data's values"
      This is not an error. Your Compustat data has cusip stored as str10, and your 13F data has cusip stored as str12, so merge enlarged cusip from 10 to 12 in your Compustat data in case any 11- or 12-character cusip appeared in your 13F data. Such a cusip wouldn't match anything in the Compustat data, of course, but it would be included in the output with _merge==2.
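      As a minimal illustration of that behavior (toy data with a made-up 12-character cusip, not the poster's files), -merge- widens the key to the longer string type:

```stata
* Toy sketch of -merge- widening a string key; cusip values are hypothetical
clear
input str12 cusip
"00016510AB12"
end
tempfile using13f
save `using13f'

clear
input str10 cusip
"000165100"
end
merge 1:1 cusip using `using13f'
* Stata notes "variable cusip was str10, now str12 to accommodate
* using data's values"; the 12-character value matches nothing and
* is kept in the result with _merge==2
```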

      Comment


      • #4
        Thank you both for your answers. rdate represents the filing date, which happens quarterly, and I also have other variables such as the percentage of institutional ownership. Later I want to perform a regression on this while controlling for fundamentals and stock data from Compustat and CRSP. The 13F data was just sorted by date and not by date and cusip.
        In the end, I want a dataset where I have some fundamentals from Compustat, such as market value, on a quarterly basis, and for that same quarter the 13F data such as percentage of institutional ownership. LINKDT and LINKENDDT just serve as a reference for the time period in which one particular cusip is unique to one firm, as the same cusip can be reused later on. Other than to link the datasets, I do not really care about those.


        If I use the code you provided, I receive one dataset; however, the data from the 13F file all turns to missing, so I have no observations from it. The join of the Compustat file and the linking file worked.

        But I think I found the underlying problem. I took a deeper look at the cases where firms have the same cusip but are not matched; I will enter the code and an excerpt down below. The issue is that the dates do not match, as the end-of-quarter dates are different. For the fundamentals data from Compustat, I have a variable that shows the year and the quarter (datacqtr), but I do not have the same for the 13F file. If I had this, I should be able to merge the datasets, and this would resolve the cusip issue from before as well.
        However, I am not sure how to create a variable such as datacqtr for the 13F file. Suggestions would be greatly appreciated.

        Result of:
        Code:
        use "In/13F summary.dta", clear
        rename rdate datadate
        gen ncusip = substr(cusip, 1, 6)
        save "Out/13F summary.dta", replace
        use "In/Fundamentals.dta", clear
        gen ncusip = substr(cusip, 1, 6)
        merge m:m ncusip datadate using "Out/13F summary.dta"

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str6 GVKEY long datadate str6(ncusip datacqtr) byte _merge
        "001004"  7182 "000361" "1979Q3" 1
        "001004"  7273 "000361" "1979Q4" 1
        "001004"  7364 "000361" "1980Q1" 1
        ""        7395 "000361" ""       2
        "001004"  7456 "000361" "1980Q2" 1
        ""        7486 "000361" ""       2
        "001004"  7548 "000361" "1980Q3" 1
        ""        7578 "000361" ""       2
        "001004"  7639 "000361" "1980Q4" 1
        ""        7670 "000361" ""       2
        "001004"  7729 "000361" "1981Q1" 1
        ""        7760 "000361" ""       2
        "001004"  7821 "000361" "1981Q2" 1
        ""        7851 "000361" ""       2
        "001004"  7913 "000361" "1981Q3" 1
        ""        7943 "000361" ""       2
        "001004"  8004 "000361" "1981Q4" 1
        ""        8035 "000361" ""       2
        "001004"  8094 "000361" "1982Q1" 1
        ""        8125 "000361" ""       2
        "001004"  8186 "000361" "1982Q2" 1
        ""        8216 "000361" ""       2
        "001004"  8278 "000361" "1982Q3" 1
        ""        8308 "000361" ""       2
        "001004"  8369 "000361" "1982Q4" 1
        ""        8400 "000361" ""       2
        "001004"  8459 "000361" "1983Q1" 1
        ""        8490 "000361" ""       2
        "001004"  8551 "000361" "1983Q2" 1
        ""        8581 "000361" ""       2
        "001004"  8643 "000361" "1983Q3" 1
        ""        8673 "000361" ""       2
        "001004"  8734 "000361" "1983Q4" 1
        ""        8765 "000361" ""       2
        "001004"  8825 "000361" "1984Q1" 1
        ""        8856 "000361" ""       2
        "001004"  8917 "000361" "1984Q2" 1
        ""        8947 "000361" ""       2
        "001004"  9009 "000361" "1984Q3" 1
        ""        9039 "000361" ""       2
        "001004"  9100 "000361" "1984Q4" 1
        ""        9131 "000361" ""       2
        "001004"  9190 "000361" "1985Q1" 1
        ""        9221 "000361" ""       2
        "001004"  9282 "000361" "1985Q2" 1
        ""        9312 "000361" ""       2
        "001004"  9374 "000361" "1985Q3" 1
        ""        9404 "000361" ""       2
        "001004"  9465 "000361" "1985Q4" 1
        ""        9496 "000361" ""       2
        "001004"  9555 "000361" "1986Q1" 1
        ""        9586 "000361" ""       2
        "001004"  9647 "000361" "1986Q2" 1
        ""        9677 "000361" ""       2
        "001004"  9739 "000361" "1986Q3" 1
        ""        9769 "000361" ""       2
        "001004"  9830 "000361" "1986Q4" 1
        ""        9861 "000361" ""       2
        "001004"  9920 "000361" "1987Q1" 1
        ""        9951 "000361" ""       2
        "001004" 10012 "000361" "1987Q2" 1
        ""       10042 "000361" ""       2
        "001004" 10104 "000361" "1987Q3" 1
        ""       10134 "000361" ""       2
        "001004" 10195 "000361" "1987Q4" 1
        ""       10226 "000361" ""       2
        "001004" 10286 "000361" "1988Q1" 1
        ""       10317 "000361" ""       2
        "001004" 10378 "000361" "1988Q2" 1
        ""       10408 "000361" ""       2
        "001004" 10470 "000361" "1988Q3" 1
        ""       10500 "000361" ""       2
        "001004" 10561 "000361" "1988Q4" 1
        ""       10592 "000361" ""       2
        "001004" 10651 "000361" "1989Q1" 1
        ""       10682 "000361" ""       2
        "001004" 10743 "000361" "1989Q2" 1
        ""       10773 "000361" ""       2
        "001004" 10835 "000361" "1989Q3" 1
        ""       10865 "000361" ""       2
        "001004" 10926 "000361" "1989Q4" 1
        ""       10957 "000361" ""       2
        "001004" 11016 "000361" "1990Q1" 1
        ""       11047 "000361" ""       2
        "001004" 11108 "000361" "1990Q2" 1
        ""       11138 "000361" ""       2
        "001004" 11200 "000361" "1990Q3" 1
        ""       11230 "000361" ""       2
        "001004" 11291 "000361" "1990Q4" 1
        ""       11322 "000361" ""       2
        "001004" 11381 "000361" "1991Q1" 1
        ""       11412 "000361" ""       2
        "001004" 11473 "000361" "1991Q2" 1
        ""       11503 "000361" ""       2
        "001004" 11565 "000361" "1991Q3" 1
        ""       11595 "000361" ""       2
        "001004" 11656 "000361" "1991Q4" 1
        ""       11687 "000361" ""       2
        "001004" 11747 "000361" "1992Q1" 1
        ""       11778 "000361" ""       2
        end
        format %td datadate
        label values _merge _merge
        label def _merge 1 "Master only (1)", modify
        label def _merge 2 "Using only (2)", modify

        Thank you again for your help!

        Comment


        • #5
          If you can match your data by either 6-character CUSIP or 8-character CUSIP, as you say in post #1, then I think you should use the 8-character CUSIP. I have no idea whatever about the characteristics of "the Thomson 13F summary file on Ownership" or the "quarterly Fundamentals Compustat file" (and in fact most Statalist members do not). Are both datasets at the issuer/issue (8-character) level, or at the issuer (6-character) level, or is one of them at the issuer (6-character) level and the other at the issuer/issue (8-character) level?

          More importantly, the use of merge m:m is an incorrect choice. Even the Stata documentation asserts that merge m:m is almost never appropriate.

          The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

          m:m merges

          m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

          Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
          If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

          1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

          2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

          3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

          4. You actually need to append your datasets rather than merge them.

          5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

          With that said, perhaps the following is a starting point if both datasets are at the issuer/issue level.
          Code:
          use "In/13F summary.dta", clear
          rename rdate datadate
          gen ncusip = substr(cusip, 1, 8)
          save "Out/13F summary.dta", replace
          use "In/Fundamentals.dta", clear
          gen ncusip = substr(cusip, 1, 8)
          merge 1:1 ncusip datadate using "Out/13F summary.dta"
          Last edited by William Lisowski; 02 Aug 2022, 06:04.

          Comment


          • #6
            Thank you William for your quick response.

            The CUSIP for the 13F file is in 8-digit format and for the Compustat data it is in 9-digit format. The first 6 digits of the CUSIP refer to the issuer, which is why I used them; digits 7 and 8 refer to the instrument, and digit 9 is a check digit. I do get the error message that my data does not uniquely identify observations in the master data when using 1:1 with the 6-digit CUSIP, but it does work with the 8-digit CUSIP. (I missed a line to drop duplicates, which is why it did not work earlier.)

            But I have the same issue as mentioned in #4. I have the same ncusip (the same 8-digit CUSIP), but the dates do not match, as the quarters are defined differently. I have 2 variables that could be useful: the date in a quarter format such as 1990Q1 for a calendar view, and the same for a fiscal view. Calendar would be preferred, though. I am unsure how to generate a variable in the 13F file that transforms a date such as 31.03.1990 into 1990Q1 for all observations by cusip number. The problem is that it differs for different firms (cusips). How can I incorporate this into the code so I don't mess up the data?

            Again, your help is greatly appreciated!

            This is an excerpt from the 13F file.
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str8 ncusip long datadate
            "00002110" 11687
            "00002110" 11778
            "00002110" 11869
            "00002110" 11961
            "00002110" 12418
            "00002110" 12508
            "00002110" 12873
            "00002110" 12964
            "00002110" 13148
            "00002110" 13239
            "00002110" 13514
            "00002110" 13604
            "00002110" 13695
            "00002T10" 13604
            "00002T10" 13695
            "00002T10" 13787
            "00002T10" 13879
            "00016510"  8673
            "00016510"  8765
            "00016510"  8856
            end
            format %td datadate

            Comment


            • #7
              In your extracts from the 13F file in posts #1 and #6 the quarters all end on March 31, June 30, September 30, or December 31. This suggests that the 13F file reports calendar quarters.
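              One way to confirm that in the full file (a sketch, assuming the 13F data is in memory with its daily date variable, here rdate as in post #1):

```stata
* Sketch: check that every rdate falls on the last day of a calendar
* quarter. dofq() returns the first day of a quarter, so the day before
* the next quarter's start is the current quarter's last day.
gen last_day_of_qtr = dofq(qofd(rdate) + 1) - 1
assert rdate == last_day_of_qtr
drop last_day_of_qtr
```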

              In your extract from the Compustat file in post #1 we see that fiscal quarters are apparently used, and
              Code:
              . list cusip datadate if cusip=="000354100", sep(5)
              
                   +-----------------------+
                   |     cusip    datadate |
                   |-----------------------|
               12. | 000354100   31dec1983 |
               13. | 000354100   31mar1984 |
               14. | 000354100   30jun1984 |
               15. | 000354100   30sep1984 |
               16. | 000354100   31dec1984 |
                   |-----------------------|
               17. | 000354100   30apr1985 |
               18. | 000354100   31jul1985 |
               19. | 000354100   31oct1985 |
               20. | 000354100   31jan1986 |
                   +-----------------------+
              illustrates your problem - apparently this cusip changed its fiscal quarterly reporting date in 1985. And one fiscal quarter encompassed four months.

              Your PERMNO file does not seem to help assign dates to calendar quarters in the Compustat file, because entries like
              Code:
                   +-----------------------------------+
                   |     cusip      LINKDT   LINKENDDT |
                   |-----------------------------------|
               17. | 001015106   30nov1977   26apr1988 |
               18. | 001015106   01nov1973   29nov1977 |
                   +-----------------------------------+
              lead me to think 30nov1977 suggests a fiscal quarter effectively began with December 1977, and 26apr1988 suggests a quarter ended with April 1988, beginning with February 1988, which implies the previous quarter began with November 1987 rather than December 1987.

              So fundamentally you do not have a Stata programming problem, nor a problem linking by CUSIP; you have a problem understanding how to link data reported on two different calendars, one calendar and the other fiscal. For that I cannot offer advice.

              Comment


              • #8
                Thank you for your answer. I gained clarity on what my actual problem is, or rather on how to work around it.
                Since your post, I have tried to merge the Compustat file with the 13F file as before but using the datacqtr variable. However, that does not work, as I would be combining a string and a numeric variable in an illogical way. In the 13F file I have the date stored as DDMMMYYYY. I converted it to YYYYQQ, but I cannot match this with the datacqtr variable from the Compustat file, as that is in a string format. I have already looked at the help file on datetime and searched through the forum but could not manage to make the code work for me.
                Converting the datadate variable from Compustat to quarterly format is not an option, as it would represent the data in a wrong way.

                Below you find the Compustat file with the string variable, and the 13F file with the date variable and the one I already converted. Which one is converted to match the other, and what format they have, does not really matter to me. And don't mind the identifiers GVKEY and CUSIP; they just serve for your orientation.

                But again, ultimately I want a date variable on a quarterly basis that matches between both datasets.

                Thank you so much for your help!

                Compustat:

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str6(GVKEY datacqtr)
                "001001" "1983Q3"
                "001001" "1983Q4"
                "001001" "1984Q1"
                "001001" "1984Q2"
                "001001" "1984Q3"
                "001001" "1984Q4"
                "001001" "1985Q1"
                "001001" "1985Q2"
                "001001" "1985Q3"
                "001001" "1985Q4"
                "001001" "1986Q1"
                "001003" "1983Q4"
                "001003" "1984Q1"
                "001003" "1984Q2"
                "001003" "1984Q3"
                "001003" "1984Q4"
                "001003" "1985Q1"
                "001003" "1985Q2"
                "001003" "1985Q3"
                "001003" "1985Q4"
                end
                13F

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str12 cusip long rdate float dateqtr
                "00002110" 11687 127
                "00002110" 11778 128
                "00002110" 11869 129
                "00002110" 11961 130
                "00002110" 12418 135
                "00002110" 12508 136
                "00002110" 12873 140
                "00002110" 12964 141
                "00002110" 13148 143
                "00002110" 13239 144
                "00002110" 13514 147
                "00002110" 13604 148
                "00002110" 13695 149
                "00002T10" 13604 148
                "00002T10" 13695 149
                "00002T10" 13787 150
                "00002T10" 13879 151
                "00016510"  8673  94
                "00016510"  8765  95
                "00016510"  8856  96
                end
                format %td rdate
                format %tq dateqtr


                Comment


                • #9
                  Code:
                  generate dateqtr = quarterly(datacqtr,"YQ")
                  format dateqtr %tq
                  will convert the string variable datacqtr in your Compustat data into a Stata numeric quarterly date like dateqtr in your 13F data.

                  Stata's "date and time" variables are complicated and there is a lot to learn. If you have not already read the very detailed Chapter 24 (Working with dates and times) of the Stata User's Guide PDF, do so now. If you have, it's time for a refresher. After that, the help datetime documentation will usually be enough to point the way. You can't remember everything; even the most experienced users end up referring to the help datetime documentation or back to the manual for details, as I did in writing the code presented above. But at least you will get a good understanding of the basics and the underlying principles. An investment of time that will be amply repaid.

                  All Stata manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.

                  Comment


                  • #10
                    Thank you William. I messed up with the quarterly part. That helped a lot, thank you for your patience.

                    But, I face the same problem from earlier only in a different disguise.

                    I have a dataset where start (from) and end (thru) dates are provided for a company (GVKEY), indicating time periods when the firm was part of the S&P 500. Before I can merge this dataset with my Fundamentals dataset from earlier via -rangejoin-, I need to clean the dataset. As you can see in the code below, for some GVKEY the date ranges overlap, which prevents Stata from matching the information correctly. The only problem is that one GVKEY can have multiple ranges where some overlap and some do not, e.g. GVKEY = 001045. This is where I have trouble finding the right code: I cannot simply collapse them into one range with the min and max values, as the gaps would then not be reflected.

                    If this is adjusted for, in my understanding I should be able to use -rangejoin-, as the low and high points are within the Fundamentals data and, together with GVKEY and the date, uniquely identify observations.

                    Thanks a lot for dedicating your patience and time to my issue.

                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str6 GVKEY long(from thru)
                    "001013" 12784 16060
                    "001045"  6755 18996
                    "001045" 20079 22025
                    "001045" 20376 22645
                    "001055" 12784 13134
                    "001111" 16791 17722
                    "001161" 21546 22645
                    "001164" 12784 15544
                    "001164" 16425 16811
                    "001213"  6755 15935
                    "001230"  6755 14324
                    "001230" 18997 22645
                    "001254" 12784 13134
                    "001254" 13831 22645
                    "001300"  6755 17580
                    "001300" 22158 22645
                    "001300" 22482 22645
                    "001327" 20305 22645
                    "001356"  6755 19623
                    "001440"  6755 22645
                    "001440" 22270 22645
                    "001447"  8277 22645
                    end
                    format %td from
                    format %td thru

                    Comment


                    • #11
                      You can reduce lists like this to non-overlapping spells using the following algorithm:

                      Code:
                      assert from <= thru & !missing(from, thru)
                      rename (from thru) date=
                      gen long obs_no = _n
                      reshape long date, i(obs_no) j(_j) string
                      by GVKEY (date), sort: gen depth = sum(_j == "from") - sum(_j == "thru")
                      by GVKEY (date): gen episode = sum(depth[_n-1] == 0 | _n == 1)
                      
                      collapse (first) begin = date (last) end = date, by(GVKEY episode)

                      Comment


                      • #12
                        Thank you very much. That worked exactly as it should!
                        When I applied the -rangejoin- code, I received the error:
                        Code:
                        r(198): was expecting a numeric variable, a number, or a system missing value for the interval low: begin_q
                        I transformed the begin and end dates to quarterly levels and use a quarterly date variable in the master data (Fundamentals). I used the following code for rangejoin:
                        Code:
                        rangejoin dateqtr begin_q end_q using "Out/S&P constituents.dta", by(GVKEY)
                        These are the datasets I used for the command:

                        Fundamentals:
                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str6 GVKEY float dateqtr
                        "001045" 73
                        "001045" 74
                        "001045" 75
                        "001045" 76
                        "001045" 77
                        "001045" 78
                        "001045" 79
                        "001045" 80
                        "001045" 81
                        "001045" 82
                        "001045" 83
                        "001045" 84
                        "001045" 85
                        "001045" 86
                        "001045" 87
                        "001045" 88
                        "001045" 89
                        "001045" 90
                        "001045" 91
                        "001045" 92
                        "001045" 93
                        "001045" 94
                        "001045" 95
                        "001045" 96
                        "001045" 97
                        end
                        format %tq dateqtr


                        S&P File

                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str6 GVKEY float(begin_q end_q)
                        "001045"  73 208
                        "001045" 219 247
                        end
                        format %tq begin_q
                        format %tq end_q

                        Thank you again for your help!

                        Last edited by Leo Werner; 05 Aug 2022, 02:43.

                        Comment


                        • #13
                          This is a common problem that trips up users of -rangejoin-. You have reversed the roles of the two data sets. You have the Fundamental data set in active memory and have the S&P data set as -using-. But that's backwards from what -rangejoin- requires. The correct code would be:

                          Code:
                          use "Out/S&P constituents.dta", clear
                          rangejoin dateqtr begin_q end_q using Fundamentals, by(GVKEY)
                          And, unfortunately, the error message isn't very suggestive of what the problem was.

                          Comment


                          • #14
                            Thank you Clyde. The code did work, but it did not have the intended result. Would it be possible to have missing values whenever the S&P constituents dates cannot be matched with a GVKEY, or where there are data gaps? My sample in the Fundamentals data also includes many firms that are not part of the S&P. How can I reflect this in the code? I later want to create a dummy variable for whether a firm was in the S&P or not.

                            Thank you for your help.

                            Comment


                            • #15
                              The way to do that is to -merge- the results you have gotten from the code so far with the original data. That will even get you your indicator for whether a firm was in the S&P or not, because it will correspond to certain values of the _merge variable that -merge- creates.
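                              A minimal sketch of that step; the file and variable names below are assumptions, and it presumes each firm-quarter appears at most once in the -rangejoin- output after the spells were de-overlapped:

```stata
* Suppose the de-overlapped S&P firm-quarters from -rangejoin- were
* saved as "Out/sp500_matched.dta", keyed on GVKEY and dateqtr
use "Out/Fundamentals.dta", clear
merge 1:1 GVKEY dateqtr using "Out/sp500_matched.dta", keep(master match)
gen byte in_sp500 = (_merge == 3)   // 1 = firm-quarter was in the S&P 500
drop _merge
```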

                              Comment
