Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching and Expanding Data

    Dear Statalist Members,

    I am working with the following dataset:


    quietly{
    clear
    input str9 gvkey year str10 fqenddt bhr a
    31 2005 "2002/3/31" 0.3565015 0.02116004
    31 2005 "2002/6/30" 0.4100328 0.02116004
    31 2005 "2002/9/30" -0.2517873 0.02116004
    31 2005 "2002/12/31" -0.173169 0.02116004
    31 2005 "2003/3/31" 0.0910931 0.02116004
    31 2005 "2003/6/30" 0.011535 0.02116004
    31 2005 "2003/9/30" -0.0306233 0.02116004
    31 2005 "2003/12/31" -0.1218563 0.02116004
    608 2005 "2002/6/30" -0.0563066 0.034307506
    608 2005 "2002/9/30" -0.1087845 0.034307506
    608 2005 "2002/12/31" -0.1365316 0.034307506
    608 2005 "2003/3/31" 0.0479617 0.034307506
    608 2005 "2003/6/30" -0.082426 0.034307506
    608 2005 "2003/9/30" -0.1919279 0.034307506
    608 2005 "2003/12/31" -0.0780903 0.034307506
    end
    gen fqenddt_date = date(fqenddt, "YMD")
    format %tdCCYY-NN-DD fqenddt_date
    drop fqenddt
    rename(fqenddt_date )(fqenddt)
    save "test.dta", replace
    }


    Using the following code, I expand and match the data based on fqenddt:


    use "test.dta", clear
    tempfile raw
    save `raw'
    rename (*) (*2)
    * keep the merging vars same
    rename (fqenddt2)(fqenddt)
    joinby fqenddt using `raw'
    drop if gvkey==gvkey2
    sort gvkey gvkey2 fqenddt
    drop year2
    order gvkey year fqenddt bhr a gvkey2 bhr2 a2



    This produces the desired output, but the result excludes some rows from the original dataset where there are no matches:

    gvkey year fqenddt bhr a gvkey2 bhr2 a2
    31 2005 2002/6/30 0.4100328 0.02116 608 -0.0563066 0.0343075
    31 2005 2002/9/30 -0.2517873 0.02116 608 -0.1087845 0.0343075
    31 2005 2002/12/31 -0.173169 0.02116 608 -0.1365316 0.0343075
    31 2005 2003/3/31 0.0910931 0.02116 608 0.0479617 0.0343075
    31 2005 2003/6/30 0.011535 0.02116 608 -0.082426 0.0343075
    31 2005 2003/9/30 -0.0306233 0.02116 608 -0.1919279 0.0343075
    31 2005 2003/12/31 -0.1218563 0.02116 608 -0.0780903 0.0343075
    608 2005 2002/6/30 -0.0563066 0.0343075 31 0.4100328 0.02116
    608 2005 2002/9/30 -0.1087845 0.0343075 31 -0.2517873 0.02116
    608 2005 2002/12/31 -0.1365316 0.0343075 31 -0.173169 0.02116
    608 2005 2003/3/31 0.0479617 0.0343075 31 0.0910931 0.02116
    608 2005 2003/6/30 -0.082426 0.0343075 31 0.011535 0.02116
    608 2005 2003/9/30 -0.1919279 0.0343075 31 -0.0306233 0.02116
    608 2005 2003/12/31 -0.0780903 0.0343075 31 -0.1218563 0.02116


    The following are the desired output I aim to achieve:

    gvkey year fqenddt bhr a gvkey2 bhr2 a2
    31 2005 2002/3/31 0.3565015 0.02116004 608 0.0343075
    31 2005 2002/6/30 0.4100328 0.02116 608 -0.0563066 0.0343075
    31 2005 2002/9/30 -0.2517873 0.02116 608 -0.1087845 0.0343075
    31 2005 2002/12/31 -0.173169 0.02116 608 -0.1365316 0.0343075
    31 2005 2003/3/31 0.0910931 0.02116 608 0.0479617 0.0343075
    31 2005 2003/6/30 0.011535 0.02116 608 -0.082426 0.0343075
    31 2005 2003/9/30 -0.0306233 0.02116 608 -0.1919279 0.0343075
    31 2005 2003/12/31 -0.1218563 0.02116 608 -0.0780903 0.0343075
    608 2005 2002/6/30 -0.0563066 0.0343075 31 0.4100328 0.02116
    608 2005 2002/9/30 -0.1087845 0.0343075 31 -0.2517873 0.02116
    608 2005 2002/12/31 -0.1365316 0.0343075 31 -0.173169 0.02116
    608 2005 2003/3/31 0.0479617 0.0343075 31 0.0910931 0.02116
    608 2005 2003/6/30 -0.082426 0.0343075 31 0.011535 0.02116
    608 2005 2003/9/30 -0.1919279 0.0343075 31 -0.0306233 0.02116
    608 2005 2003/12/31 -0.0780903 0.0343075 31 -0.1218563 0.02116


    Question:
    Is there a way to modify the approach so that all rows from the original dataset are retained, even if they don’t find matches during the expansion process? I’d like to achieve this without altering the original dataset structure.

    Thank you in advance for your help!


  • #2
    Unmatched entries are singletons. You want to exclude these when dropping observations. In this example, 'gvkey' and 'bhr' identify groups. This may not be true in the larger dataset.

    Code:
    use "test.dta", clear
    tempfile raw
    save `raw'
    rename (*) (*2)
    * keep the merging vars same
    rename (fqenddt2)(fqenddt)
    joinby fqenddt using `raw'
    bys gvkey2 bhr2: drop if gvkey==gvkey2 & _N>1
    sort gvkey gvkey2 fqenddt
    drop year2
    order gvkey year fqenddt bhr a gvkey2 bhr2 a2
    Res.:

    Code:
    . l, sep(0)
    
         +----------------------------------------------------------------------------------+
         | gvkey   year      fqenddt         bhr          a   gvkey2        bhr2         a2 |
         |----------------------------------------------------------------------------------|
      1. |    31   2005   2002-03-31    .3565015     .02116       31    .3565015     .02116 |
      2. |    31   2005   2002-06-30    .4100328     .02116      608   -.0563066   .0343075 |
      3. |    31   2005   2002-09-30   -.2517873     .02116      608   -.1087845   .0343075 |
      4. |    31   2005   2002-12-31    -.173169     .02116      608   -.1365316   .0343075 |
      5. |    31   2005   2003-03-31    .0910931     .02116      608    .0479617   .0343075 |
      6. |    31   2005   2003-06-30     .011535     .02116      608    -.082426   .0343075 |
      7. |    31   2005   2003-09-30   -.0306233     .02116      608   -.1919279   .0343075 |
      8. |    31   2005   2003-12-31   -.1218563     .02116      608   -.0780903   .0343075 |
      9. |   608   2005   2002-06-30   -.0563066   .0343075       31    .4100328     .02116 |
     10. |   608   2005   2002-09-30   -.1087845   .0343075       31   -.2517873     .02116 |
     11. |   608   2005   2002-12-31   -.1365316   .0343075       31    -.173169     .02116 |
     12. |   608   2005   2003-03-31    .0479617   .0343075       31    .0910931     .02116 |
     13. |   608   2005   2003-06-30    -.082426   .0343075       31     .011535     .02116 |
     14. |   608   2005   2003-09-30   -.1919279   .0343075       31   -.0306233     .02116 |
     15. |   608   2005   2003-12-31   -.0780903   .0343075       31   -.1218563     .02116 |
         +----------------------------------------------------------------------------------+

    Comment


    • #3
      Alternatively, do what you are doing and then merge back with the original dataset.

      Comment


      • #4
        Originally posted by Andrew Musau View Post
        Alternatively, do what you are doing and then merge back with the original dataset.
        Dear Andrew,

        Thank you for your response. The issue I am dealing with cannot be categorized using "gvkey2 bhr2." Moreover, running the process after merging takes more time compared to when I modify the original dataset, "test.dta."

        Comment


        • #5
          I do not follow your points. Consider any of the three options below:

          1. Find what variables uniquely identify groups in your dataset and use those. My guess is:

          Code:
          bys gvkey fqenddate: drop if gvkey==gvkey2 & _N>1
          as "fqenddate" is your matching variable in joinby.


          2. Create a dataset with singletons to begin with and append those to your result.

          3. merge the result with your initial dataset.

          If you mean that #3 is time consuming, then #2 (appending) will take no time at all.
          Last edited by Andrew Musau; 03 Dec 2024, 05:55.

          Comment


          • #6
            Dear Andrew,
            Thank you very much for your patient reply. It seems that I may not have expressed my point clearly, so please allow me to restate my question for further clarification.


            quietly{
            clear
            input gvkey year str10 fqenddt IndYear bhr a
            31 2005 "2002/3/31" 525 0.3565015 0.02116
            31 2005 "2002/6/30" 525 0.4100328 0.02116
            31 2005 "2002/9/30" 525 -0.2517873 0.02116
            31 2005 "2002/12/31" 525 -0.173169 0.02116
            31 2005 "2003/3/31" 525 0.0910931 0.02116
            31 2005 "2003/6/30" 525 0.011535 0.02116
            31 2005 "2003/9/30" 525 -0.0306233 0.02116
            31 2005 "2003/12/31" 525 -0.1218563 0.02116
            608 2005 "2002/6/30" 525 -0.0563066 0.0343075
            608 2005 "2002/9/30" 525 -0.1087845 0.0343075
            608 2005 "2002/12/31" 525 -0.1365316 0.0343075
            608 2005 "2003/3/31" 525 0.0479617 0.0343075
            608 2005 "2003/6/30" 525 -0.082426 0.0343075
            608 2005 "2003/9/30" 525 -0.1919279 0.0343075
            608 2005 "2003/12/31" 525 -0.0780903 0.0343075
            end
            gen fqenddt_date = date(fqenddt, "YMD")
            format %tdCCYY-NN-DD fqenddt_date
            drop fqenddt
            rename(fqenddt_date )(fqenddt)
            order gvkey year fqenddt IndYear bhr a
            save "test.dta", replace
            }


            * perform a right join on all companies, where the join is based on matching values of year, fqenddt, and IndYear for each pair of companies.
            use "test.dta", clear
            tempfile raw
            save `raw'
            rename (*) (*2)
            * keep the merging vars same
            rename (IndYear2 fqenddt2)(IndYear fqenddt)
            joinby IndYear fqenddt using `raw'

            drop if gvkey==gvkey2
            sort gvkey gvkey2 fqenddt
            drop year2
            order gvkey year fqenddt IndYear bhr a gvkey2 bhr2 a2











            Last edited by Bob Chen; 03 Dec 2024, 09:11.

            Comment


            • #7
              gvkey year fqenddt IndYear bhr a gvkey2 bhr2 a2
              31 2005 2002/3/31 525 0.3565015 0.02116 608 0.0343075
              31 2005 2002/6/30 525 0.4100328 0.02116 608 -0.05631 0.0343075
              31 2005 2002/9/30 525 -0.2517873 0.02116 608 -0.10878 0.0343075
              31 2005 2002/12/31 525 -0.173169 0.02116 608 -0.13653 0.0343075
              31 2005 2003/3/31 525 0.0910931 0.02116 608 0.047962 0.0343075
              31 2005 2003/6/30 525 0.011535 0.02116 608 -0.08243 0.0343075
              31 2005 2003/9/30 525 -0.0306233 0.02116 608 -0.19193 0.0343075
              31 2005 2003/12/31 525 -0.1218563 0.02116 608 -0.07809 0.0343075
              608 2005 2002/6/30 525 -0.0563066 0.034308 31 0.410033 0.02116
              608 2005 2002/9/30 525 -0.1087845 0.034308 31 -0.25179 0.02116
              608 2005 2002/12/31 525 -0.1365316 0.034308 31 -0.17317 0.02116
              608 2005 2003/3/31 525 0.0479617 0.034308 31 0.091093 0.02116
              608 2005 2003/6/30 525 -0.082426 0.034308 31 0.011535 0.02116
              608 2005 2003/9/30 525 -0.1919279 0.034308 31 -0.03062 0.02116
              608 2005 2003/12/31 525 -0.0780903 0.034308 31 -0.12186 0.02116
              Last edited by Bob Chen; 03 Dec 2024, 09:13.

              Comment

              Working...
              X