  • Run a command for multiple files that I can't append

    Hello,

    I have 30 files of data with the same variables and format. Each file contains the data for a different time period (for example, years 1, 2, and 3). I want to run a command (for example, a regression) and get a single result for the whole period. However, I cannot append these files because they are extremely large.

    I am looking for another solution, and I have seen some old posts here discussing loops. As I understand it, a loop lets you run a command over multiple files, but it produces a separate result for each file.

    So could you please help me with this problem? Is there any solution that does not require appending these files? Thank you very much!

  • #2
    Please, is there anyone who knows about this?



    • #3
      Sometimes a problem simply exceeds the computational resources in hand and you have to either find a larger-capacity system to work on, or formulate a different plan. But before giving up, here are a few thoughts that may or may not be helpful.

      1. Is the problem that the number of observations (rows) in the combined file would exceed Stata's limit? That's hard to imagine, since, except for Stata/BE, you can have over 1 trillion observations (if your available memory can cope with that). But if that is the problem, and you really have no choice but to use the entire combined data set, then I think you are stuck. You cannot run a regression in pieces and then put the pieces together. You would have to use some other statistical software, and perhaps a different computer, that can handle a problem that large.

      2. More hopefully, the combined data set is too large for you to allocate memory to it, even though it doesn't violate Stata's limit on the number of observations. In that case, the solution is to eliminate from the data sets any variables that you don't need for the regression. When data sets have thousands of variables, it is likely that most of them will play no role: how would you make sense of the results of a regression with thousands of variables anyway? So loop over the files and append them together, but at each step of the way -drop- all the variables that you will not be using in the regression. That may well leave you with a combined data set that fits in memory.
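      A minimal sketch of that loop, assuming the files are named data1.dta through data30.dta and that the regression uses only the (hypothetical) variables y, x1, and x2:

      ```stata
      clear
      // Append the 30 files, keeping only the variables needed
      // for the regression; everything else is dropped on the way in.
      forvalues i = 1/30 {
          append using data`i'.dta, keep(y x1 x2)
      }
      regress y x1 x2
      ```

      The keep() option of -append- discards the unneeded variables before they ever occupy memory, which is what makes the combined file manageable.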

      3. If the number of observations is what's too large, another alternative is to pull a random sample from the data that is small enough to work with. This wouldn't be a full-sample regression, but if the sampling is done so as to respect any hierarchical structure in the data, you will get results that are very close to what the full data set would give you.
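      A sketch of the sampling approach for data with no hierarchical structure, again assuming files data1.dta through data30.dta and hypothetical variables y, x1, and x2 (the 10% sampling fraction is also just an illustration):

      ```stata
      clear
      tempfile building
      save `building', emptyok
      set seed 12345      // for a reproducible sample
      forvalues i = 1/30 {
          use data`i'.dta, clear
          sample 10       // keep a 10% random sample of observations
          append using `building'
          save `building', replace
      }
      regress y x1 x2
      ```

      If the data do have hierarchical structure (e.g., observations nested in firms), you would instead sample whole clusters, for example with -sample 10, count by()- variations or by sampling cluster identifiers first, so the within-cluster structure stays intact.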



      • #4
        Thanks a lot, Clyde. Really appreciate this!
        I will look at these and consider which is the best option!
