Appending and merging in a single loop?

Ahmed Khan

Join Date: Aug 2018

Posts: 43
#1

Appending and merging in a single loop?

17 Apr 2020, 12:21

Hi, I am stuck in making a loop where I can append and merge in a same loop.

I have the following files which are all 250 MB in size each with identical variable names/type.

1) I have named them as "1.dta", "2.dta", "3.dta" .... "784.dta".

2) I have another file "PEad_ret_test2.dta"

File "1.dta" has only one identifier variable named as "localid" which has a value of "1", File "2.dta" has only one identifier variable named as "localid" which has a value of "2" and so on. Futhermore, the other variable on the basis of which I will be merging is "date".

File "PEad_ret_test2.dta" has all the identifiers i.e., localid from 1-784.

3) I want to merge all the files in (1) with (2) and get a final file "company_N&S.dta"

Now what I want to do is the make a loop in which every time Stata picks up a file like "1.dta", drops certain data and then merges it with "PEad_ret_test2.dta" and then saves it to "company_N&S.dta". Then it should move onto "2.dta", merges it with "PEad_ret_test2.dta" and so on.

I also want to keep a limited number of variables in the final file in order to keep the final size of the file small.

I have tried to write the following code for just "1.dta" and "2.dta" in order to test my loop.

Code:

clear all cd "D:\test" foreach num of numlist 1/2 { use `num'.dta // this step with use the "1.dta" and then "2.dta" in this case. keep if datatype_1 ==1 drop date // dropping this because this is the incorrect date clonevar date = date_stata // this is the correct date merge m:m localid date using "D:\test\PEad_ret_test2.dta", force keep ticker dayofweek folder_number assetcode localid date_stata date keep if _merge==3 save company_N&S.dta, replace }

The problem with this code is that it overwrites the previous file.

For example, by the end of this loop, I only have "2.dta" merged with "PEad_ret_test2.dta". Contents of "1.dta" was not there after the merge.

Is there any way to solve this issue?

Thanks
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

17 Apr 2020, 12:42

Well, of course that's what you have. Each time through the loop you are saving to the same final, company_N&S.dta, which overwrites whatever was there before.

I cannot tell what you actually want to have from what you say in #1. Do you want to end up with 784 files each of which contains the data from one of the original 784 files, merged with PEad_ret_test2? Or do you want a single file that combines all of the 784 files.

If the former, you need to change your -save- command to provide a different filename each time through the loop. One good way would be:

Code:

save company_N&S`num'.dta, replace

If the latter, then you need to incorporate -append-ing into your loop.

Code:

clear all cd "D:\test" tempfild building save `building', emptyok foreach num of numlist 1/2 { use `num'.dta // this step with use the "1.dta" and then "2.dta" in this case. keep if datatype_1 ==1 drop date // dropping this because this is the incorrect date clonevar date = date_stata // this is the correct date merge
m:m
localid date using "D:\test\PEad_ret_test2.dta",
force

keep ticker dayofweek folder_number assetcode localid date_stata date keep if _merge==3 append using `building' save `"`building'"', emptyok } save company_N&S.dta, replace

That said, none of this matters until you fix the two horrendous red flag errors in your -merge- command.

-merge m:m- produces only data salad. Only 1:1, 1:m, or m:1 merges produce usable information. If you think you need to use m:m merging then it just means that you don't understand your data. Either you have sufficient identifying variables to use a valid merge type, or, if you don't you need to use -joinby- or, less likely, -cross-. You should never use -merge m:m-, and when you are tempted to do so you should go back to the drawing boards because it definitely means you are misunderstanding your data.

Next, -force- options are, in general dangerous. In the case of -merge-, what it does is allow you to overlook the incompatibility of data types (string vs numeric) with variables having the same name but different types in the data sets being merged. You can do this, of course, but it results in the data from the using data set being discarded. If that's actually appropriate, then fine. But usually it just means that you have been working with a bunch of data sets that have incompatibilities, and you are ignoring them rather than fixing them. It is seldom appropriate to just discard data wholesale, sight-unseen. Moreover, you explicitly state in your post that your data set are all conformed with regard to variable names and storage types--if that is the case, then you are in good shape and the -force- option is unneeded. But you should remove it in that case: you may find out that your data are not what you think they are! In fact, with 784 data sets, unless you have personally conformed them all, it is most unlikely that there are no inconsistencies among them. Data quality that high is rare in real life.
1 like
Comment
Daniel Shin

Join Date: Mar 2020

Posts: 146
#3

17 Apr 2020, 13:06

Agree with Clyde, using merge m:m should not be used (I'd argue it shouldn't even be an option). Once you have addressed that issue and if you have Stata 16, I would try using frames which might save a lot of disk read/write:

Code:

frame create final_data tempfile temp foreach num of numlist 1/174 { use `num'.dta your fixed code save `temp', replace frame final_data: append using `temp' } frame change final_data
1 like
Comment
Ahmed Khan

Join Date: Aug 2018

Posts: 43
#4

17 Apr 2020, 13:27

Hi Mr. Schechter,

Thanks for the reply.

"do you want a single file that combines all of the 784 files."

I want a single file that combines the 784 files.

You are correct about m:m and force. I have corrected them.

However, when I use the 2nd code that you have provided. It gives me the following error.

file C:\Users\itsah\AppData\Local\Temp\ST_00000001.tmp already exists

Any suggestion who I might correct it?
Comment
Ahmed Khan

Join Date: Aug 2018

Posts: 43
#5

17 Apr 2020, 13:35

Originally posted by Daniel Shin View Post

Agree with Clyde, using merge m:m should not be used (I'd argue it shouldn't even be an option). Once you have addressed that issue and if you have Stata 16, I would try using frames which might save a lot of disk read/write:

Code:

frame create final_data tempfile temp foreach num of numlist 1/174 { use `num'.dta your fixed code save `temp', replace frame final_data: append using `temp' } frame change final_data

Thanks Danial, It worked perfectly.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

17 Apr 2020, 15:06

Re #4. Sorry, my error. Where I wrote -save `"`building'"', emptyok- near the end of the code, that should have been -save `"`building'"', replace-. (But don't change the -emptyok- near the top of the code--that one's correct.)

Another thought that will speed things up a bit, I think. Since you are putting the files all together into one large file, there is no need to keep -merge- inside the loop. You can just run the loop to append all the files, and then just do a single -merge- with the combined file at the end.
2 likes
Comment
Ahmed Khan

Join Date: Aug 2018

Posts: 43
#7

20 Apr 2020, 01:29

Originally posted by Clyde Schechter View Post

Re #4. Sorry, my error. Where I wrote -save `"`building'"', emptyok- near the end of the code, that should have been -save `"`building'"', replace-. (But don't change the -emptyok- near the top of the code--that one's correct.)

Another thought that will speed things up a bit, I think. Since you are putting the files all together into one large file, there is no need to keep -merge- inside the loop. You can just run the loop to append all the files, and then just do a single -merge- with the combined file at the end.

Thank you. It worked perfectly.
Comment

Announcement

Appending and merging in a single loop?

Comment

Comment

Comment

Comment

Comment

Comment