Matching and Expanding Data

Bob Chen

Join Date: Jul 2024

Posts: 4
#1

Matching and Expanding Data

03 Dec 2024, 00:28

Dear Statalist Members,

I am working with the following dataset:

quietly{
clear
input str9 gvkey year str10 fqenddt bhr a
31 2005 "2002/3/31" 0.3565015 0.02116004
31 2005 "2002/6/30" 0.4100328 0.02116004
31 2005 "2002/9/30" -0.2517873 0.02116004
31 2005 "2002/12/31" -0.173169 0.02116004
31 2005 "2003/3/31" 0.0910931 0.02116004
31 2005 "2003/6/30" 0.011535 0.02116004
31 2005 "2003/9/30" -0.0306233 0.02116004
31 2005 "2003/12/31" -0.1218563 0.02116004
608 2005 "2002/6/30" -0.0563066 0.034307506
608 2005 "2002/9/30" -0.1087845 0.034307506
608 2005 "2002/12/31" -0.1365316 0.034307506
608 2005 "2003/3/31" 0.0479617 0.034307506
608 2005 "2003/6/30" -0.082426 0.034307506
608 2005 "2003/9/30" -0.1919279 0.034307506
608 2005 "2003/12/31" -0.0780903 0.034307506
end
gen fqenddt_date = date(fqenddt, "YMD")
format %tdCCYY-NN-DD fqenddt_date
drop fqenddt
rename(fqenddt_date )(fqenddt)
save "test.dta", replace
}

Using the following code, I expand and match the data based on fqenddt:

use "test.dta", clear
tempfile raw
save `raw'
rename (*) (*2)
* keep the merging vars same
rename (fqenddt2)(fqenddt)
joinby fqenddt using `raw'
drop if gvkey==gvkey2
sort gvkey gvkey2 fqenddt
drop year2
order gvkey year fqenddt bhr a gvkey2 bhr2 a2

This produces the desired output, but the result excludes some rows from the original dataset where there are no matches:

gvkey year fqenddt bhr a gvkey2 bhr2 a2
31 2005 2002/6/30 0.4100328 0.02116 608 -0.0563066 0.0343075
31 2005 2002/9/30 -0.2517873 0.02116 608 -0.1087845 0.0343075
31 2005 2002/12/31 -0.173169 0.02116 608 -0.1365316 0.0343075
31 2005 2003/3/31 0.0910931 0.02116 608 0.0479617 0.0343075
31 2005 2003/6/30 0.011535 0.02116 608 -0.082426 0.0343075
31 2005 2003/9/30 -0.0306233 0.02116 608 -0.1919279 0.0343075
31 2005 2003/12/31 -0.1218563 0.02116 608 -0.0780903 0.0343075
608 2005 2002/6/30 -0.0563066 0.0343075 31 0.4100328 0.02116
608 2005 2002/9/30 -0.1087845 0.0343075 31 -0.2517873 0.02116
608 2005 2002/12/31 -0.1365316 0.0343075 31 -0.173169 0.02116
608 2005 2003/3/31 0.0479617 0.0343075 31 0.0910931 0.02116
608 2005 2003/6/30 -0.082426 0.0343075 31 0.011535 0.02116
608 2005 2003/9/30 -0.1919279 0.0343075 31 -0.0306233 0.02116
608 2005 2003/12/31 -0.0780903 0.0343075 31 -0.1218563 0.02116

The following are the desired output I aim to achieve:

gvkey year fqenddt bhr a gvkey2 bhr2 a2
31 2005 2002/3/31 0.3565015 0.02116004 608 0.0343075
31 2005 2002/6/30 0.4100328 0.02116 608 -0.0563066 0.0343075
31 2005 2002/9/30 -0.2517873 0.02116 608 -0.1087845 0.0343075
31 2005 2002/12/31 -0.173169 0.02116 608 -0.1365316 0.0343075
31 2005 2003/3/31 0.0910931 0.02116 608 0.0479617 0.0343075
31 2005 2003/6/30 0.011535 0.02116 608 -0.082426 0.0343075
31 2005 2003/9/30 -0.0306233 0.02116 608 -0.1919279 0.0343075
31 2005 2003/12/31 -0.1218563 0.02116 608 -0.0780903 0.0343075
608 2005 2002/6/30 -0.0563066 0.0343075 31 0.4100328 0.02116
608 2005 2002/9/30 -0.1087845 0.0343075 31 -0.2517873 0.02116
608 2005 2002/12/31 -0.1365316 0.0343075 31 -0.173169 0.02116
608 2005 2003/3/31 0.0479617 0.0343075 31 0.0910931 0.02116
608 2005 2003/6/30 -0.082426 0.0343075 31 0.011535 0.02116
608 2005 2003/9/30 -0.1919279 0.0343075 31 -0.0306233 0.02116
608 2005 2003/12/31 -0.0780903 0.0343075 31 -0.1218563 0.02116

Question:
Is there a way to modify the approach so that all rows from the original dataset are retained, even if they don’t find matches during the expansion process? I’d like to achieve this without altering the original dataset structure.

Thank you in advance for your help!
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 9945

03 Dec 2024, 04:40

Unmatched entries are singletons. You want to exclude these when dropping observations. In this example, 'gvkey' and 'bhr' identify groups. This may not be true in the larger dataset.

Code:

use "test.dta", clear
tempfile raw
save `raw'
rename (*) (*2)
* keep the merging vars same
rename (fqenddt2)(fqenddt)
joinby fqenddt using `raw'
bys gvkey2 bhr2: drop if gvkey==gvkey2 & _N>1
sort gvkey gvkey2 fqenddt
drop year2
order gvkey year fqenddt bhr a gvkey2 bhr2 a2

Res.:

Code:

. l, sep(0)

     +----------------------------------------------------------------------------------+
     | gvkey   year      fqenddt         bhr          a   gvkey2        bhr2         a2 |
     |----------------------------------------------------------------------------------|
  1. |    31   2005   2002-03-31    .3565015     .02116       31    .3565015     .02116 |
  2. |    31   2005   2002-06-30    .4100328     .02116      608   -.0563066   .0343075 |
  3. |    31   2005   2002-09-30   -.2517873     .02116      608   -.1087845   .0343075 |
  4. |    31   2005   2002-12-31    -.173169     .02116      608   -.1365316   .0343075 |
  5. |    31   2005   2003-03-31    .0910931     .02116      608    .0479617   .0343075 |
  6. |    31   2005   2003-06-30     .011535     .02116      608    -.082426   .0343075 |
  7. |    31   2005   2003-09-30   -.0306233     .02116      608   -.1919279   .0343075 |
  8. |    31   2005   2003-12-31   -.1218563     .02116      608   -.0780903   .0343075 |
  9. |   608   2005   2002-06-30   -.0563066   .0343075       31    .4100328     .02116 |
 10. |   608   2005   2002-09-30   -.1087845   .0343075       31   -.2517873     .02116 |
 11. |   608   2005   2002-12-31   -.1365316   .0343075       31    -.173169     .02116 |
 12. |   608   2005   2003-03-31    .0479617   .0343075       31    .0910931     .02116 |
 13. |   608   2005   2003-06-30    -.082426   .0343075       31     .011535     .02116 |
 14. |   608   2005   2003-09-30   -.1919279   .0343075       31   -.0306233     .02116 |
 15. |   608   2005   2003-12-31   -.0780903   .0343075       31   -.1218563     .02116 |
     +----------------------------------------------------------------------------------+

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 9945
#3

03 Dec 2024, 04:50

Alternatively, do what you are doing and then merge back with the original dataset.
Comment
Bob Chen

Join Date: Jul 2024

Posts: 4
#4

03 Dec 2024, 05:04

Originally posted by Andrew Musau View Post

Alternatively, do what you are doing and then merge back with the original dataset.

Dear Andrew,

Thank you for your response. The issue I am dealing with cannot be categorized using "gvkey2 bhr2." Moreover, running the process after merging takes more time compared to when I modify the original dataset, "test.dta."
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9945
#5

03 Dec 2024, 05:39

I do not follow your points. Consider any of the three options below:

1. Find what variables uniquely identify groups in your dataset and use those. My guess is:

Code:

bys gvkey fqenddate: drop if gvkey==gvkey2 & _N>1

as "fqenddate" is your matching variable in joinby.

2. Create a dataset with singletons to begin with and append those to your result.

3. merge the result with your initial dataset.

If you mean that #3 is time consuming, then #2 (appending) will take no time at all.

Last edited by Andrew Musau; 03 Dec 2024, 05:55.
Comment
Bob Chen

Join Date: Jul 2024

Posts: 4
#6

03 Dec 2024, 09:04

Dear Andrew,
Thank you very much for your patient reply. It seems that I may not have expressed my point clearly, so please allow me to restate my question for further clarification.

quietly{
clear
input gvkey year str10 fqenddt IndYear bhr a
31 2005 "2002/3/31" 525 0.3565015 0.02116
31 2005 "2002/6/30" 525 0.4100328 0.02116
31 2005 "2002/9/30" 525 -0.2517873 0.02116
31 2005 "2002/12/31" 525 -0.173169 0.02116
31 2005 "2003/3/31" 525 0.0910931 0.02116
31 2005 "2003/6/30" 525 0.011535 0.02116
31 2005 "2003/9/30" 525 -0.0306233 0.02116
31 2005 "2003/12/31" 525 -0.1218563 0.02116
608 2005 "2002/6/30" 525 -0.0563066 0.0343075
608 2005 "2002/9/30" 525 -0.1087845 0.0343075
608 2005 "2002/12/31" 525 -0.1365316 0.0343075
608 2005 "2003/3/31" 525 0.0479617 0.0343075
608 2005 "2003/6/30" 525 -0.082426 0.0343075
608 2005 "2003/9/30" 525 -0.1919279 0.0343075
608 2005 "2003/12/31" 525 -0.0780903 0.0343075
end
gen fqenddt_date = date(fqenddt, "YMD")
format %tdCCYY-NN-DD fqenddt_date
drop fqenddt
rename(fqenddt_date )(fqenddt)
order gvkey year fqenddt IndYear bhr a
save "test.dta", replace
}

* perform a right join on all companies, where the join is based on matching values of year, fqenddt, and IndYear for each pair of companies.
use "test.dta", clear
tempfile raw
save `raw'
rename (*) (*2)
* keep the merging vars same
rename (IndYear2 fqenddt2)(IndYear fqenddt)
joinby IndYear fqenddt using `raw'

drop if gvkey==gvkey2
sort gvkey gvkey2 fqenddt
drop year2
order gvkey year fqenddt IndYear bhr a gvkey2 bhr2 a2

Last edited by Bob Chen; 03 Dec 2024, 09:11.
Comment

Bob Chen

Join Date: Jul 2024
Posts: 4

03 Dec 2024, 09:11

gvkey	year	fqenddt	IndYear	bhr	a	gvkey2	bhr2	a2
31	2005	2002/3/31	525	0.3565015	0.02116	608		0.0343075
31	2005	2002/6/30	525	0.4100328	0.02116	608	-0.05631	0.0343075
31	2005	2002/9/30	525	-0.2517873	0.02116	608	-0.10878	0.0343075
31	2005	2002/12/31	525	-0.173169	0.02116	608	-0.13653	0.0343075
31	2005	2003/3/31	525	0.0910931	0.02116	608	0.047962	0.0343075
31	2005	2003/6/30	525	0.011535	0.02116	608	-0.08243	0.0343075
31	2005	2003/9/30	525	-0.0306233	0.02116	608	-0.19193	0.0343075
31	2005	2003/12/31	525	-0.1218563	0.02116	608	-0.07809	0.0343075
608	2005	2002/6/30	525	-0.0563066	0.034308	31	0.410033	0.02116
608	2005	2002/9/30	525	-0.1087845	0.034308	31	-0.25179	0.02116
608	2005	2002/12/31	525	-0.1365316	0.034308	31	-0.17317	0.02116
608	2005	2003/3/31	525	0.0479617	0.034308	31	0.091093	0.02116
608	2005	2003/6/30	525	-0.082426	0.034308	31	0.011535	0.02116
608	2005	2003/9/30	525	-0.1919279	0.034308	31	-0.03062	0.02116
608	2005	2003/12/31	525	-0.0780903	0.034308	31	-0.12186	0.02116

Last edited by Bob Chen; 03 Dec 2024, 09:13.

Announcement

Matching and Expanding Data

Comment

Comment

Comment

Comment

Comment

Comment