Joining data that has no unique identifier

Clarice Martins

Join Date: Apr 2014

Posts: 22
#1

Joining data that has no unique identifier

20 Apr 2014, 16:26

Dear members,

I am using Stata 12.1 for Mac.

My research requires a lot of data management before getting to the statistical part. And this is where I am stuck.

I have two datasets: (-describe- shown at end of message)

Dataset 1 has the following variables:
- Ticker
- Group
- Year
- this dataset has companies (ticker) which belong to different risk factor groups (group) each year. (The companies can possibly change from a group to another each year.)
- # of obs.: 4082 observations

Dataset 2 has:
- Group
- data (YMD)
- returns
- this dataset has daily returns from 02Jan2000 thru 29Dec2012
- # of obs.: 17712 observations

I have tried a few things:
1) -merge-
But have learned that to use this function i would need a unique identifier. Given that it is inherent to the data that no one field is uniquely identified, I did not find a way to use this command.

2) -joinby-
joinby group using [dataset 2]
result: the command copied data without respect to date/year

3) used these line of code with dataset 1 in the memory:
quietly levelsof year, local(levs)
quietly foreach lev of local levs {
joinby grupo using fipe_ajustes
}
result: Stata stopped the operation and issue the following message:
"sum of expand values exceed 2,147,483,647
The dataset may not contain more than 2,147,483,647 observations."

Here, I am not sure if this 3rd code is working, and if the problem is really size of dataset.

I read the guide for posting, and I hope to have been clear enough.

Thanks,
Clarice

------------------------------------------------

Dataset 1:

. describe

Contains data from stocks_ajustes.dta
obs: 4,082
vars: 3 20 Apr 2014 18:20
size: 81,640
-----------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-----------------------------------------------------------------------------------------------------------------------------
ticker str16 %16s ticker
year int %10.0g year
group str2 %9s
-----------------------------------------------------------------------------------------------------------------------------

Dataset 2:

. describe

Contains data from fipe_ajustes.dta
obs: 17,712
vars: 3 20 Apr 2014 17:55
size: 247,968
-----------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-----------------------------------------------------------------------------------------------------------------------------
grupo str2 %9s grupo
ret_ajust double %10.0g
period float %td
-----------------------------------------------------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
Tags: None
Joe Canner

Join Date: Mar 2014

Posts: 580
#2

20 Apr 2014, 16:55

Clarice,

Before deciding between merge and joinby, we would need a better idea of what you are trying to accomplish (i.e., what variables do you want to match observations between the two data sets? Are there any combinations of variables that uniquely identify observations in either data set?).

From what I can tell so far, it looks like data set 1 has variables group and year which appear to uniquely identify observations, so that's a good start. Data set 2 has variables grupo and data which could be modified to accomplish the merge. Accordingly, this is my best guess as to what you might want to do:

Code:

use fipe_ajustes, clear rename grupo group // Assuming the grupo variable is defining the same thing as the group variable in the other data set gen year=year(period) merge m:1 group year using stocks_ajustes

This will match every observation in data set 2 with the observation in data set 1 that has the same year and the same group. If this is not what you want, we will probably need more information.

Regards,
Joe
Comment
Roberto Liebscher

Join Date: Mar 2014

Posts: 92
#3

20 Apr 2014, 17:01

I assume your return data is captured at the firm level and not on an industry level? And you want to merge some company data with returns data, right? Then you definitely need an identifier variable for your "fipe_ajustes.dta" dataset what seem to be missing. In this case you need to go back to the source of your data and add this variable. Further your second dataset seems to be lacking a time variable (What is the time period where the returns were generated?).

In case you actually want to merge the datasets on the group level you need to rename the variable first. As it stands now in one dataset the variable is called "grupo" and in the other one "group". If you further have a time variable in your second dataset you can easily match one return observation per year group with the data in your first mentioned dataset using the merge command:

merge 1:1 group year using fipe_ajustes.dta

HTH,
Roberto
Comment
Clarice Martins

Join Date: Apr 2014

Posts: 22
#4

21 Apr 2014, 06:45

Thank you all... I believe I need to explain myself better. My apologies.

Nevertheless, before doing that I tried both suggestions and still got the error message:
"variables group year do not uniquely identify observations in the using data
r(459);"
Dataset 1 - stocks_ajustes:
- Data is at the firm level, but the list of firms repeat each year (2000-2012)
- Each year a firm can be placed in a different risk group.
- Variables: ticker, group, year
- Sample data:

ticker group year
PETR4 BH 2000
ABEV3 BL 2000
USIM5 SM 2000
PETR4 BM 2001
ABEV3 BH 2001
USIM5 SM 2001

Dataset 2 - fipe_ajustes
- These are daily returns separated by group.
- There are 6 groups which are the same groups used in Dataset 1
- Variables: group, period (YDM), ret_ajust
- Sample data:

group period ret_ajust
BH 1aug2000 -0.01
BL 1aug2000 -0.02
BM 1aug2000 -0.03
SM 1aug2000 -0.04
BH 2aug2000 0.05
BL 2aug2000 0.02
BM 2aug2000 0.03
SM 2aug2000 0.02
BH 2jan2001 -0.03
BL 2jan2001 0.02
BM 2jan2001 -0.03
SM 2jan2001 0.01
BH 3jan2001 -0.001
BL 3jan2001 0.03
BM 3jan2001 -0.07
SM 3jan2001 0.08

All the different daily returns need to be copied to one master database with the Firms and their corresponding group each year.

For instance in 2000, PETR4 belongs to the group BH, but on year 2001, PETR4 changed to group BM.

So, the final result should like this (sample of desired result):

ticker group year period ret_ajust
PETR4 BH 2000 1aug2000 -0.01 <== from dataset 2, data corresponding to returns for firms in BH category for 01aug2000
PETR4 BH 2000 2aug2000 0.05 <== from dataset 2, data corresponding to returns for firms in BH category for 02aug2000
PETR4 BM 2001 2jan2001 -0.03 <== from dataset 2, data corresponding to returns for firms in BM category for 02jan2001
PETR4 BM 2001 3jan2001 -0.07 <== from dataset 2, data corresponding to returns for firms in BM category for 03jan2001

I hope this clarifies.

Thank you for your comments!

Regards,
Clarice
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#5

21 Apr 2014, 07:36

Clarice,

Thanks; that helps. Based on your clarification, I think joinby might indeed be what is required, but with group and year, not just group:

Code:

use fipe_ajustes, clear rename grupo group gen year=year(period) joinby group year using stocks_ajustes

This may help avoid the error you got previously with joinby. If you still get that error, then you may want to re-think your strategy, or you may have to split your data sets into smaller pieces in order to make it work.

Regards,
Joe
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#6

21 Apr 2014, 08:43

There is more than one possible observation for - group year -. For example, BH 1aug2000 - 0.01 BH 2aug2000 0.05 After extracting the year you end up with: BH 2000 - 0.01 BH 2000 0.05 Stata doesn't know which one to use. Hence those two variables do not uniquely identify observations.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#7

21 Apr 2014, 08:51

Something wrong with formatting and my smartphone. Can't even get line breaks. Sorry.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Clarice Martins

Join Date: Apr 2014

Posts: 22
#8

21 Apr 2014, 10:00

Joe,

Thank you!! It worked perfectly...!!!

I was so close... but, didn't try the use of -joinby- with the two variables. Fantastic!

Thank you all for comments.

Regards,
Clarice
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#9

21 Apr 2014, 10:12

I don't see why Joe Canner's first solution (post #2) doesn't work for you. A simple merge m:1 works in the following example.

Code:

clear all set more off input /// str6 ticker str2 group year PETR4 BH 2000 ABEV3 BL 2000 USIM5 SM 2000 PETR4 BM 2001 ABEV3 BH 2001 USIM5 SM 2001 end tempfile stocks save "`stocks'" clear all input /// str2 group str11 period ret BH 1aug2000 -0.01 BL 1aug2000 -0.02 BM 1aug2000 -0.03 SM 1aug2000 -0.04 BH 2aug2000 0.05 BL 2aug2000 0.02 BM 2aug2000 0.03 SM 2aug2000 0.02 BH 2jan2001 -0.03 BL 2jan2001 0.02 BM 2jan2001 -0.03 SM 2jan2001 0.01 BH 3jan2001 -0.001 BL 3jan2001 0.03 BM 3jan2001 -0.07 SM 3jan2001 0.08 end gen period2 = date(period,"DMY") format period2 %td gen year=year(period2) merge m:1 group year using "`stocks'" list if ticker == "PETR4"

Maybe you were using merge 1:1 which gives the error

variables group year do not uniquely identify observations in the master data
r(459);

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Clarice Martins

Join Date: Apr 2014

Posts: 22
#10

21 Apr 2014, 10:16

Hello Roberto,

sorry, I didn't see your comment. So, I went back to double check.

Well, it seems that, with -joinby- suggeted above by Joe, Stata knew to get all the entries with the year BH 2000 ticker into the correct place.

Here is a snapshot:

ticker group year ret_ajust period
ABCB4 BZ Equity BH 2007 .0076035 02jan2007
ABCB4 BZ Equity BH 2007 -.0107174 03jan2007
ABCB4 BZ Equity BH 2007 -.0040778 04jan2007
ABCB4 BZ Equity BH 2007 -.0327368 05jan2007
ABCB4 BZ Equity BH 2007 .0061015 08jan2007
ABCB4 BZ Equity BH 2007 -.017616 09jan2007
ABCB4 BZ Equity BH 2007 .0110438 10jan2007
ABCB4 BZ Equity BH 2007 .0100308 11jan2007
ABCB4 BZ Equity BH 2007 .0016244 12jan2007
ABCB4 BZ Equity BH 2007 -.0038603 15jan2007
ABCB4 BZ Equity BH 2007 -.0076374 16jan2007
ABCB4 BZ Equity BH 2007 .0071772 17jan2007
ABCB4 BZ Equity BH 2007 .0048582 18jan2007
ABCB4 BZ Equity BH 2007 .01307 19jan2007
ABCB4 BZ Equity BH 2007 .0031272 22jan2007
ABCB4 BZ Equity BH 2007 .0099545 23jan2007
ABCB4 BZ Equity BH 2007 .0106728 24jan2007
ABCB4 BZ Equity BH 2007 .0035398 26jan2007
ABCB4 BZ Equity BH 2007 -.0148391 29jan2007
ABCB4 BZ Equity BH 2007 .0027485 30jan2007
ABCB4 BZ Equity BH 2007 .0082529 31jan2007

It seems to me it is ok, unless I am missing something.

Thanks for the comment.

Regards,
Clarice
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#11

21 Apr 2014, 10:27

Roberto,

I think the disconnect is that Clarice's example didn't reveal that there are multiple entries in the first data set for each group in a given year. The example gives the impression that there is a 1:1 correspondence between firm and group, and that was my assumption when I first suggested merge. However, based on Clarice's explanation and upon further reflection, I realized that there would be multiple firms in a given group in any given year. There are also multiple records per year for a given group (one per day) in the second data set, thus requiring joinby.

All's well that end's well!

Regards,
Joe
Comment
Clarice Martins

Join Date: Apr 2014

Posts: 22
#12

21 Apr 2014, 10:31

Hello, Roberto!
Sorry, it seems I am always missing your comments.

I tried now your suggestion with -merge- :

1) with the sample code you suggested, I am having trouble with "`stocks'"

Code:

. merge m:1 group year using "`stocks'" invalid file specification r(198);

I don't know if you notice (I mentioned on original post) I am using Stata 12.1 for Mac.

Maybe it has to do with the version?

2) I tried -merge- as you suggested, but with the file name in place of "`stocks'"

Code:

. merge m:1 group year using stocks_ajustes variables group year do not uniquely identify observations in the using data r(459);

As you see I still get the error message.

Am I doing something really dumb and not seeing it??

Thanks again!

These exchanges are very valuable to my learning with Stata.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#13

21 Apr 2014, 11:38

Originally posted by Clarice Martins View Post

1) with the sample code you suggested, I am having trouble with "`stocks'"

Code:

. merge m:1 group year using "`stocks'" invalid file specification r(198);

This probably has to do with you running the code by parts. The file name is saved to a tempfile, which is just a local macro. Local macros have local scope so you need to run the complete code for it to work.

Originally posted by Clarice Martins View Post

2) I tried -merge- as you suggested, but with the file name in place of "`stocks'"

Code:

. merge m:1 group year using stocks_ajustes variables group year do not uniquely identify observations in the using data r(459);

This error is slightly different than the one you would get changing the merge m:1 to merge 1:1 in my previous example. You have trouble with the using data while I reported a possible error with the master data.

If I were to modify the using data in my previous example adding a NEW firm:

Code:

input /// str6 ticker str2 group year PETR4 BH 2000 ABEV3 BL 2000 USIM5 SM 2000 PETR4 BM 2001 ABEV3 BH 2001 USIM5 SM 2001 NEW BH 2000 NEW BL 2001 end

and then use merge m:1, then I would get the same error as you. It finally hit me with Joe's clarification in post #11. That speaks of the importance of trying our best when posting example databases. We always want them to be as representative as possible.

Originally posted by Clarice Martins View Post

These exchanges are very valuable to my learning with Stata.

Indeed they are, for many of us. My advice is you keep experimenting with the different versions of merge to get a better understanding.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Clarice Martins

Join Date: Apr 2014

Posts: 22
#14

21 Apr 2014, 12:06

My apologies for the sample not being as representative, I am really, really a beginner and certain things I still don't have "an eye" for it. ( I really thought my sample was so good!)

But, thank you for the advice and... Yes, learned a few nuances that did not catch my eye before. I will indeed keep experimenting away.
Comment

Announcement

Joining data that has no unique identifier

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment