Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Weekly data with Monthly data

    Dear Statalisters,

    I need to do merge a file containing weekly data to a file containing monthly data (my Master Data set).

    My Master Data set looks like this: (first column is the Date and second column is the unique identifier)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Date str7 Sedol
    14488 "2002587"
    14518 "2002587"
    14549 "2002587"
    14579 "2002587"
    14610 "2002587"
    14641 "2002587"
    14670 "2002587"
    14701 "2002587"
    14731 "2002587"
    14762 "2002587"
    14792 "2002587"
    14823 "2002587"
    14854 "2002587"
    14884 "2002587"
    14915 "2002587"
    14945 "2002587"
    14976 "2002587"
    15007 "2002587"
    15035 "2002587"
    15066 "2002587"
    15096 "2002587"
    15127 "2002587"
    end
    format %tdnn/dd/CCYY Date
    The Data set with the weekly data looks like this: (first column is the Date and second column is the unique identifier)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Date str7 Sedol
    10927 "2002587"
    10934 "2002587"
    10941 "2002587"
    10948 "2002587"
    10955 "2002587"
    10962 "2002587"
    10969 "2002587"
    10976 "2002587"
    10983 "2002587"
    10990 "2002587"
    10997 "2002587"
    11004 "2002587"
    11011 "2002587"
    11018 "2002587"
    11025 "2002587"
    11032 "2002587"
    11039 "2002587"
    11046 "2002587"
    11053 "2002587"
    11060 "2002587"
    11067 "2002587"
    11074 "2002587"
    end
    format %tdnn/dd/CCYY Date
    What I want Stata to do is to attach the weekly data to the monthly data and the number of observations of the first file to stay as it was.

    The code I tried to apply is:

    Code:
    use firstfile, clear
    
    merge m:m Sedol Date using second file
    keep if _merge==1
    drop _merge
    but it doesn't attaches the weekly data correctly to the monthly. Do you know how I can solve this issue?

    Thanks in advance,
    Angelos


  • #2
    I think you need to explain what those weekly and monthly dates mean and how you expect to combine weekly and monthly data.

    It really won't be a m:m merge !

    It looks as if weeks are defined by the Fridays (that end them?).

    Similarly, or rather dissimilarly, it looks as if months are defined by the days that begin them.

    A crude merge will thus only work when the Friday that ends a week is also the first of the month. That's utterly hopeless as a criterion

    However, you can just

    1. crudely: extract mofd() for each daily date and merge on the corresponding monthly dates. It's a 1:m or m:1 merge

    2. finely: use fractions of weeks for weeks that span two distinct months. Not explained here, but more at http://www.stata-journal.com/sjpdf.h...iclenum=dm0065

    In your case, you also need to keep track of panel identifiers.


    Comment


    • #3
      but it doesn't attaches the weekly data correctly to the monthly
      Yes, that is almost always what happens when people use -merge m:m- in Stata. I've been using Stata since 1994 and only once in that time have I found a situation where -merge m:m- was appropriate to use.

      In your situation, the solution is to create an actual Stata internal format monthly date in each file, and then use that in the merge key:

      Code:
      use second, clear
      gen monthly_date = mofd(Date)
      format monthly_date %tm
      tempfile holding
      save `holding'
      
      use first, clear
      gen monthly_date = mofd(Date)
      format monthly_date %tm
      merge 1:m Sedol monthly_date using `holding'
      Note: Not tested on your example data, because your example data doesn't actually contain any observations in the two data sets that come from the same month, so there is nothing to actually -merge- there.

      Do read -help datetime- to learn about the various ways that Stata can represent dates and times. It is a hefty read, and you won't remember it all. Even experts have to refer back to it for details from time to time. But it will familiarize you with the choices that are available to you.

      Comment


      • #4
        Dear both,

        The first data set contains variables such as shares outstanding, total assets etc. The second one contains weekly prices that are used to create specific measures.

        Nick, I read through the link you sent me but this didn't help me.

        Clyde, I converted dates in both to monthly dates and tried the 1:m merge, but Stata responses: "variables dm Sedol do not uniquely identify observations in the master data"

        Comment


        • #5
          but Stata responses: "variables dm Sedol do not uniquely identify observations in the master data"
          Well, I have never known Stata to be wrong about this. Your first data set evidently must contain more than one observation having the same Sedol and monthly date. So, there are two possibilities here:

          1. That date aren't supposed to contain more than one observation for any combination of Sedol and month. (I had assumed this was the case because your example data only contain one observation per month per Sedol.) In this case, there is something wrong with your data. Either it was incorrectly assembled by whoever gave it to you, or you have somehow introduced these spurious observations in the course of data management. You need to work back and identify the source of the observations that shouldn't be there and remove them from your data set. How this will actually play out depends on the history of your data set. A good starting point is to just identify the offending observations by running:
          Code:
          duplicates tag dm Sedol, gen(flag)
          browse if flag
          After looking at those, you can decide how to identify how they might have gotten there.

          2. The data are indeed allowed to contain more than one observation for combinations of Sedol and month. In that case there are two further possibilities:

          2A. You want to associated each observation from the first dataset involving a given Sedol and month with every observation in the second datset that has the same Sedol and month. In that case, replace the -merge- command shown in #2 with:
          Code:
          joinby dm Sedol using second
          This is likely to lead to an extremely large resulting data set and will probably take a long time to run. Be patient. You might want to do some back of the envelope calculations before you do this to make sure that you have enough RAM on your computer to hold the result. For example, if there are three observations in the first data set for May 2017 and Sedol 2002587 and there are 7 such observations in the second data set you will end up with 21 observations in your resulting data set.

          2B. You actually need to associate specific month-Sedol observations in the first data set with specific month-Sedol observations in the second data set. Then there must be some other variable(s) in these two datasets that enable you to pick out which goes with which. The solution in this case is to modify the -merge- command shown in #2 by adding the additional variable(s) to the list of variable in the merge key.

          Comment


          • #6
            I removed the duplicates in the second file:
            Code:
            duplicates drop dm Sedol, force
            and then used joinby to merge the two files as suggested in #5:
            Code:
            joinby dm Sedol using second
            This type of merging seems to be what I want. However instead of having the same number of observations after the merging procedure (since I removed the duplicates), I go from 53118 observations to 51682. I still haven't figured out where this 1436 observations difference comes from.

            Comment


            • #7
              What you did in #6 may be a bad idea. If -duplicates drop- alone won't remove the observations that are duplicates on dm and Sedol, then that means that those observations disagree on some other variables. By using -duplicates drop dm Sedol, force- you have arbitrarily (and, I should add, irreproducibly) selected one of the disagreeing records to retain and dropped the others. If and when you need to re-run this code, you will get different results each time. Moreover, you have thrown away data without knowing what is going on there. Before just arbitrarily selecting one record for each dm Sedol combination, you need to learn about all those conflicting records and either figure out which one is the correct one to keep, or how you might possibly want to combine the values of the conflicting variables for your purposes. But just picking one arbitrarily and dropping the others is a recipe for generating garbage. The only circumstance under which this is a reasonable way to proceed is if the variables that these observations disagree on will not be used at all in subsequent calculations or analysis.

              The second point I want to make is that after you did remove the duplicates in the second file, then the -merge 1:m- command that gave you an error message before should work properly now.

              The reason you have a smaller data set after running -joinby- is that, by default, -joinby-'s results omit any observations that do not match in both data sets. If you want to retain those observations, you have to specify which of the unmatched observations you want to keep using the -unmatched()- option. See -help joinby- and figure out which -unmatched()- option makes sense for your situation.

              Comment


              • #8
                The reason why I used -duplicates drop dm Sedol, force- is because the two main variables that I want to merge have the same values for each year (they are the coefficients of a regression run by year by cusip) so they are also the same for each month. This is why I thought this would be a good idea.

                The -merge m:1- indeed works and it results in the same number of observations as the joinby command.

                When using the unmatched command Stata returns -command unmatched is unrecognized- and when trying to install it -ssc install unmatched- it returns -ssc install: "unmatched" not found at SSC, type.

                Comment


                • #9
                  To expand slightly on what Clyde explained in #7. unmatched() is an option of the joinby command, not a command in itself. See http://www.stata.com/help.cgi?joinby

                  Comment


                  • #10
                    OK, your reasoning in dropping the duplicates sounds fair, although it does seem to contradict your original description of the master data set as having monthly data. I took that to mean that there is only one observation per month for a given Sedol (and your example data exhibits that constraint). To avoid unintended errors down the road, it would be a good idea to -drop- any other variables that you don't know for sure are unvarying within month for every Sedol.

                    However, I cannot explain why your sample size goes down after the -merge- command. By default, -merge- will retain all of the observations from both data sets (unless you specify the -keep()- option to eliminate some). So, if anything, I would expect the number of observations to increase or remain the same. If you can post an example of your data that exhibits this behavior, I would be interested to see it and figure out what's going on.

                    Comment


                    • #11
                      Thank you both for your answers,

                      The sample size decreased after using -joinby- because the date range of the two files was different. The first file had dates until 01/2017 while the second file had dates until 09/2016. It was my mistake, I need to be more careful when having to deal with large datasets.

                      Comment

                      Working...
                      X