  • Working with large datasets, saving the dta files with different names in a loop

    Dear all,

    I am working with several .csv files, each between 700 MB and 2 GB in size. I need to run some commands on each file and save the results separately; otherwise the combined size becomes even larger and Stata crashes on my laptop, so I never see any results. Eventually I want to append these results, but I want to save them in a single .dta file. Here is my code:
    Code:
        global path "E:/Comtradet1"
        cd "$path"    
        foreach file of local datafiles {
            import delimited "`file'" , clear
            
           some commands
            
            save "$path/`file'.dta", replace
    
        }
    I'm not sure whether my save command has the correct format. I could not get any result from running the code above because Stata keeps adding observations and loading data and then stops responding. I tried this with only one csv file in my directory so that I could check whether the code works, and I hoped it would be fast, but something slows it down and I can't figure out what. I even checked whether the data can simply be imported, and even that alone does not finish. Please tell me whether I wrote the code properly.
    Additionally, since this is also a large-data problem, I would appreciate your suggestions on how to handle such a situation. The code works when I don't use a loop and simply import one csv file, run the commands on it, and save it, but that is too manual. I am looking for more automation, which is why I tried the loop above.

    Looking forward to your solutions,
    Shadi

  • #2
    First try importing and saving just one dataset, but in your loop. Does this succeed?
    Code:
        global path "E:/Comtradet1"
        cd "$path"    
        macro list _datafiles
        local datafiles : word 1 of datafiles
        macro list _datafiles
        foreach file of local datafiles {
            import delimited "`file'" , clear
            save "$path/`file'.dta", replace
        }
    If it doesn't succeed, copy the code and all of Stata's output from the Results window and paste it into a new post here. Reading "Stata keeps adding observations and loading stuff and then stops responding" doesn't tell us anything that helps guess what the problem might be, and in particular doesn't allow us to see if Stata told you something you didn't understand to be a report of a problem.

    The code works when I don't use a loop and simply import one csv file and do the commands on it and save it.
    When you import just one csv file, is it the first file in the list in datafiles? And are you using the import delimited command in Stata's command window or a do-file, or did you use Stata's import delimited dialog from the File menu?



    • #3
      Originally posted by William Lisowski View Post
      First try importing and saving just one dataset, but in your loop. Does this succeed?
      Code:
      global path "E:/Comtradet1"
      cd "$path"
       macro list _datafiles
      local datafiles : word 1 of datafiles
      macro list _datafiles
      foreach file of local datafiles {
      import delimited "`file'" , clear
      save "$path/`file'.dta", replace
      }
      If it doesn't succeed, copy the code and all of Stata's output from the Results window and paste it into a new post here. Reading "Stata keeps adding observations and loading stuff and then stops responding" doesn't tell us anything that helps guess what the problem might be, and in particular doesn't allow us to see if Stata told you something you didn't understand to be a report of a problem.



      When you import just one csv file, is it the first file in the list in datafiles? And are you using the import delimited command in Stata's command window or a do-file, or did you use Stata's import delimited dialog from the File menu?
      Thank you for your answer. I have to say that when I posted the topic yesterday I unfortunately left out some lines from my code, but I added your extra commands to both versions and here is what I got:

      1)
      Code:
          global path "E:\Comtradet1"
          cd "$path"    
          macro list _datafiles
          local datafiles : word 1 of datafiles
          macro list _datafiles
         foreach file of local datafiles {
              import delimited "`file'" , clear
              some commands
              save "$path/`file'.dta", replace
          }
      Result:

      . do "C:\Users\Shadi\AppData\Local\Temp\STDaf4_000000.t mp"

      global path "E:\Comtradet1"


      . cd "$path"
      E:\Comtradet1

      . macro list _datafiles
      local macro `datafiles' not found
      r(111);

      end of do-file

      r(111);


      In the second case, which is my actual code, I got a different error:
      2)

      Code:
         
          global path "E:/Comtradet1"
          cd "$path"
          local datafiles: dir "." files "*.csv"
          dir
          macro list _datafiles
          local datafiles : word 1 of datafiles
          macro list _datafiles
          foreach file of local datafiles {
              import delimited "`file'" , clear
              some commands       
              save "$path/`file'.dta", replace
          }
      Result:
      do "C:\Users\Shadi\AppData\Local\Temp\STDaf4_000000.t mp"

      . global path "E:/Comtradet1"
      . cd "$path"
      E:\Comtradet1

      . local datafiles: dir "." files "*.csv"

      . dir
      <dir> 9/06/21 15:15 .
      <dir> 9/06/21 15:15 ..
      1576.3M 5/12/19 19:42 type-C_r-ALL_ps-1998_freq-A_px-S2_pub-20060110_fmt-c
      > sv_ex-20190512.csv

      . macro list _datafiles
      _datafiles: "type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
      > 190512.csv"

      . local datafiles : word 1 of datafiles

      . macro list _datafiles
      _datafiles: datafiles

      . foreach file of local datafiles {
      2. import delimited "`file'" , clear
      3.
      . keep ïclassification year period perioddesc aggregatelevel
      > isleafcode tradeflowcode tradeflow reporter partner commodity commoditycode
      > tradevalueus
      4. keep if partner == "World" & tradeflow == "Export"
      5. drop partner tradeflow tradeflowcode
      6. compress
      7.
      . save "$path/`file'.dta", replace
      . }
      file datafiles.csv not found
      r(601);

      end of do-file

      r(601);



      I was not sure if I should adjust this line of your suggested code:
      Code:
      local datafiles : word 1 of datafiles
      I'm not familiar with this construction, so please let me know, and a brief explanation of what these commands are meant to do would be nice.

      When you import just one csv file, is it the first file in the list in datafiles? And are you using the import delimited command in Stata's command window or a do-file, or did you use Stata's import delimited dialog from the File menu?
      First: No, it's not the first file. To make things faster I made a folder with just two files, the 1998 and 1999 csv files, to test whether the loop works. Overall I have one csv file for each year from 1977 to 2020. For the case where I don't use a loop I used the first file from the complete folder, i.e., 1977, which is also the smallest in size. I hope this is clear. Second: No, I didn't use the menu. Here is the code I used for that:
      Code:
         
      global path "E:/Comtradet"
          import delimited "type-C_r-ALL_ps-1977_freq-A_px-S2_pub-20041207_fmt-csv_ex-20190512", clear
          cd "$path"   
          some commands
          save "$path/1977.dta", replace
      But after reading your question I also tried the same first file that I use in the "two-file" case, and it is again taking too long. The size of this file is 1.53 GB. Is that the reason? What do you suggest?
      Is my loop in general theoretically correct? I mean in the second case (without the commands you suggested):

      Code:
         
          global path "E:/Comtradet1"
          cd "$path"
          local datafiles: dir "." files "*.csv"
          dir
          foreach file of local datafiles {
              import delimited "`file'" , clear
              some commands       
              save "$path/`file'.dta", replace
          }
      Do you see mistakes in my script?
      Looking forward,
      Shadi



      • #4
        When presenting output copied from the Results window, please surround it with code delimiters [CODE] and [/CODE], as you present your code.

        Apologies for my mistake, the line to select just the first datafile should have been
        Code:
        local datafiles : word 1 of `datafiles'
        but that is no longer important, because your test of the first file in the two-file case shows that the problem is not in the loop but in the data.

        It could be that the file is too large. It could also be that there are problems with the data in the file that cause Stata to take longer than it should.
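
        If you want a quick look at the raw file without importing it, one possibility (an untested sketch using Stata's file commands) is to read and display its first few lines, to check the delimiter, quoting, and header row:
        Code:
        file open fh using "`file'", read text
        forvalues k = 1/5 {
            file read fh line
            display `"`macval(line)'"'
        }
        file close fh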

        Try starting smaller: rerun the 1998 file, adding the rowrange() option to your import delimited command. Suppose you know there are 1,000,000 observations in the file. Try importing just the first 1000 and see how long it takes.
        Code:
        import delimited "`file'" , clear rowrange(1:1000)
        Then, use Stata's Data Browser window to look at the data you imported. Does it look correct?

        Then try increasing the rowrange to 1:10000, then 1:100000, and see if the time goes up proportionally.
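
        If it helps to time those runs, here is an untested sketch using Stata's timer commands:
        Code:
        timer clear
        foreach n of numlist 1000 10000 100000 {
            timer on 1
            import delimited "`file'" , clear rowrange(1:`n')
            timer off 1
            display "rows 1:`n'"
            timer list 1
            timer clear 1
        }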

        I see nothing in what you showed in post #3 of the "Stata keeps adding observations and loading stuff and then stops responding" that you described in your first post.

        I do believe that you should reverse the order of your two keep commands, first keeping observations and then keeping variables, since I expect you are reducing the number of observations substantially.
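
        Untested, but using the commands you showed in post #3, the reordered block would look something like this (keeping observations first means partner, tradeflow, and tradeflowcode can simply be omitted from the variable list, so the separate drop is no longer needed):
        Code:
        import delimited "`file'" , clear
        keep if partner == "World" & tradeflow == "Export"
        keep ïclassification year period perioddesc aggregatelevel isleafcode reporter commodity commoditycode tradevalueus
        compress
        save "$path/`file'.dta", replace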

        Added in edit: I am growing convinced that the "some commands" that you downplay are in fact the cause of the problem. I encourage you to return to the spirit of my suggestion in post #2 and leave out everything between the import delimited and save commands and see if those two commands complete in reasonable time.
        Last edited by William Lisowski; 07 Sep 2021, 06:18.



        • #5
          Apologies for my mistake, the line to select just the first datafile should have been
          I adjusted the code like this and got the same error:
          Code:
              global path "E:\Comtrade"
              cd "$path"
              save "all.dta", replace emptyok
              macro list _datafiles
          local datafiles : word 1 of `datafiles'
          macro list _datafiles
              
              *local datafiles: dir "." files "*.csv"
              *dir
              
              foreach file of local datafiles {
                  import delimited "`file'" , clear
                  save "$path/`file'.dta", replace
              }
          Since you did not mention whether I should include the command lines that I had left out of post #1 (the starred ones above), I also ran the code once after including them, but I get the same error. Please tell me whether I should leave those lines out in general:
          Code:
          . do "C:\Users\Shadi\AppData\Local\Temp\STD2f14_000000.tmp"
          
          .         global path "E:\Comtradet"
          .         cd "$path"
          E:\Comtradet
          .         macro list _datafiles
          local macro `datafiles' not found
          r(111);
          
          end of do-file
          
          r(111);
          Thank you for the row limit suggestion. Stata managed to do it for both cases, 1:10000 and 1:100000; it took more time in the second case (4 minutes), but I now have saved dta files in my directory. It also worked with the "some commands". This time I have written them out, so that you can see they are not complicated or problematic:
          Code:
              global path "E:\Comtrade"
              cd "$path"   
              local datafiles: dir "." files "*.csv"
              dir
              foreach file of local datafiles  {
                  import delimited "`file'" , clear rowrange(1:100000)
                  keep ïclassification year period perioddesc aggregatelevel isleafcode tradeflowcode tradeflow reporter partner commodity commoditycode tradevalueus
                  keep if partner == "World" & tradeflow == "Export"
                  drop partner tradeflow tradeflowcode
                  compress
                  save "$path/`file'.dta", replace
              }
          Result
          Code:
          . do "C:\Users\Shadi\AppData\Local\Temp\STD2f14_000000.tmp"
          
          .         global path "E:\Comtrade"
          
          .         *global path "F:\UN Comtrade"
          .         cd "$path"
          E:\Comtrade
          
          .         *save "all.dta", replace emptyok
          .         *macro list _datafiles
          .         *local datafiles : word 1 of `datafiles'
          .         *macro list _datafiles
          .         
          .         local datafiles: dir "." files "*.csv"
          
          .     dir
            <dir>   9/07/21 15:26  .                 
            <dir>   9/07/21 15:26  ..                
           272.7M   5/12/19  8:44  type-C_r-ALL_ps-1977_freq-A_px-S2_pub-20041207_fmt-c
          > sv_ex-20190512.csv
            44.4k   9/07/21 15:26  type-c_r-all_ps-1977_freq-a_px-s2_pub-20041207_fmt-c
          > sv_ex-20190512.csv.dta
           587.7M   5/11/19 20:53  type-C_r-ALL_ps-1978_freq-A_px-S2_pub-20050214_fmt-c
          > sv_ex-20190511.csv
            29.4k   9/07/21 15:26  type-c_r-all_ps-1978_freq-a_px-s2_pub-20050214_fmt-c
          > sv_ex-20190511.csv.dta
           667.1M   5/12/19 22:53  type-C_r-ALL_ps-1979_freq-A_px-S2_pub-20051212_fmt-c
          > sv_ex-20190512.csv
            25.6k   9/07/21 15:26  type-c_r-all_ps-1979_freq-a_px-s2_pub-20051212_fmt-c
          > sv_ex-20190512.csv.dta
           716.1M   5/11/19 21:49  type-C_r-ALL_ps-1980_freq-A_px-S2_pub-20051212_fmt-c
          > sv_ex-20190511.csv
            42.4k   9/07/21 15:26  type-c_r-all_ps-1980_freq-a_px-s2_pub-20051212_fmt-c
          > sv_ex-20190511.csv.dta
           740.6M   5/13/19  6:30  type-C_r-ALL_ps-1981_freq-A_px-S2_pub-20051212_fmt-c
          > sv_ex-20190513.csv
            29.2k   9/07/21 15:26  type-c_r-all_ps-1981_freq-a_px-s2_pub-20051212_fmt-c
          > sv_ex-20190513.csv.dta
           740.4M   5/12/19 11:42  type-C_r-ALL_ps-1982_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            31.9k   9/07/21 15:26  type-c_r-all_ps-1982_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           775.1M   5/13/19 11:09  type-C_r-ALL_ps-1983_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190513.csv
            32.0k   9/07/21 15:26  type-c_r-all_ps-1983_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190513.csv.dta
           764.2M   5/12/19 16:03  type-C_r-ALL_ps-1984_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            34.1k   9/07/21 15:26  type-c_r-all_ps-1984_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           800.3M   5/12/19  0:30  type-C_r-ALL_ps-1985_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            32.3k   9/07/21 15:26  type-c_r-all_ps-1985_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           848.5M   5/12/19 20:56  type-C_r-ALL_ps-1986_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            35.1k   9/07/21 15:26  type-c_r-all_ps-1986_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           867.5M   5/12/19  3:16  type-C_r-ALL_ps-1987_freq-A_px-S2_pub-20051214_fmt-c
          > sv_ex-20190512.csv
            39.0k   9/07/21 15:26  type-c_r-all_ps-1987_freq-a_px-s2_pub-20051214_fmt-c
          > sv_ex-20190512.csv.dta
           897.8M   5/13/19 13:51  type-C_r-ALL_ps-1988_freq-A_px-S2_pub-20051214_fmt-c
          > sv_ex-20190513.csv
            45.8k   9/07/21 15:26  type-c_r-all_ps-1988_freq-a_px-s2_pub-20051214_fmt-c
          > sv_ex-20190513.csv.dta
           968.6M   5/12/19  7:39  type-C_r-ALL_ps-1989_freq-A_px-S2_pub-20051219_fmt-c
          > sv_ex-20190512.csv
            41.1k   9/07/21 15:26  type-c_r-all_ps-1989_freq-a_px-s2_pub-20051219_fmt-c
          > sv_ex-20190512.csv.dta
          1284.8M   5/12/19 10:21  type-C_r-ALL_ps-1994_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190512.csv
            50.4k   9/07/21 15:26  type-c_r-all_ps-1994_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190512.csv.dta
          1380.5M   5/13/19  6:51  type-C_r-ALL_ps-1995_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190513.csv
            46.1k   9/07/21 15:26  type-c_r-all_ps-1995_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190513.csv.dta
          1472.2M   5/12/19  5:20  type-C_r-ALL_ps-1996_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190512.csv
            43.4k   9/07/21 15:26  type-c_r-all_ps-1996_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190512.csv.dta
          1541.4M   5/11/19 21:07  type-C_r-ALL_ps-1997_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190511.csv
            12.4k   9/07/21 15:26  type-c_r-all_ps-1997_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190511.csv.dta
          1576.3M   5/12/19 19:42  type-C_r-ALL_ps-1998_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190512.csv
            12.1k   9/07/21 15:26  type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190512.csv.dta
          1648.0M   5/12/19  1:51  type-C_r-ALL_ps-1999_freq-A_px-S2_pub-20051228_fmt-c
          > sv_ex-20190512.csv
            12.2k   9/07/21 15:26  type-c_r-all_ps-1999_freq-a_px-s2_pub-20051228_fmt-c
          > sv_ex-20190512.csv.dta
          
          .     
          .     foreach file of local datafiles  {
            2.         import delimited "`file'" , clear rowrange(1:100000)
            3.                 keep ïclassification year period perioddesc aggregatelev
          > el isleafcode tradeflowcode tradeflow reporter partner commodity commodityc
          > ode tradevalueus
            4.                 keep if partner == "World" & tradeflow == "Export"
            5.                 drop partner tradeflow tradeflowcode
            6.                 compress
            7.                 save "$path/`file'.dta", replace
            8.         }
          (22 vars, 99,999 obs)
          (97,180 observations deleted)
            commodity is strL now coalesced
            (206,928 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1977_freq-a_px-s2_pub-20041207_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (98,044 observations deleted)
            commodity is strL now coalesced
            (118,714 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1978_freq-a_px-s2_pub-20050214_fmt-csv_ex-20
          > 190511.csv.dta saved
          (22 vars, 99,999 obs)
          (98,157 observations deleted)
            commodity is strL now coalesced
            (104,464 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1979_freq-a_px-s2_pub-20051212_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,490 observations deleted)
            commodity is strL now coalesced
            (165,619 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1980_freq-a_px-s2_pub-20051212_fmt-csv_ex-20
          > 190511.csv.dta saved
          (22 vars, 99,999 obs)
          (97,682 observations deleted)
            commodity is strL now coalesced
            (146,194 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1981_freq-a_px-s2_pub-20051212_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (97,641 observations deleted)
            commodity is strL now coalesced
            (152,949 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1982_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,583 observations deleted)
            commodity is strL now coalesced
            (159,509 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1983_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (97,600 observations deleted)
            commodity is strL now coalesced
            (157,595 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1984_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,719 observations deleted)
            commodity is strL now coalesced
            (144,359 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1985_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,598 observations deleted)
            commodity is strL now coalesced
            (159,689 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1986_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,662 observations deleted)
            commodity is strL now coalesced
            (152,709 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1987_freq-a_px-s2_pub-20051214_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (98,147 observations deleted)
            commodity is strL now coalesced
            (117,413 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1988_freq-a_px-s2_pub-20051214_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (97,953 observations deleted)
            commodity is strL now coalesced
            (139,598 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1989_freq-a_px-s2_pub-20051219_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (98,030 observations deleted)
            commodity is strL now coalesced
            (124,193 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1994_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,883 observations deleted)
            commodity is strL now coalesced
            (142,891 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1995_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (96,742 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1996_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,060 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1997_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190511.csv.dta saved
          (22 vars, 99,999 obs)
          (97,118 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,028 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1999_freq-a_px-s2_pub-20051228_fmt-csv_ex-20
          > 190512.csv.dta saved
          
          .
          end of do-file
          I see nothing in what you showed in post #3 of the "Stata keeps adding observations and loading stuff and then stops responding" that you described in your first post.
          I'm not sure what you mean here. What I meant is that at the bottom of the Data Browser and the Stata command window I see observations piling up, and I can see that Stata is running because the red STOP button that could break the process is shown. But this goes on for several hours, and in the end I only get "not responding", which Windows displays when any program stops responding. Does this answer your question?


          I do believe that you should reverse the order of your two keep commands, first keeping observations and then keeping variables, since I expect you are reducing the number of observations substantially.
          Since it worked above for the limited rows I did not try that yet. But I'll keep that in mind as a possibility.
          So I think we can sum up that the problem is that the size is too large. Or do you have other conclusions in mind? Please let me know what you think I can best do to deal with this. I am also not sure why this is happening, because in another post here: https://www.statalist.org/forums/for...large-csv-file I read that they could import files of GB size too. My laptop has 8 GB of RAM, by the way. So I'm confused about why I have this problem and would appreciate your ideas. Thanks again.

          Looking forward,
          Shadi



          • #6
            So I think we can sum up that the problem is that the size is too large.
            To be precise, it is too large for the code you have written, given the configuration of your laptop. A 1.5 GB CSV may well occupy more than 1.5 GB of memory once it is imported. It is possible that changes to your code will overcome the problem.

            First try this: importing your 1.5GB CSV file, and providing some diagnostic information
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            memory
            import delimited "`file'" , clear
            memory
            describe
            Does that succeed? How long does it take? Please show the code and output from your Results window.

            Next try this: importing CSV and saving a Stata dataset
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            import delimited "`file'" , clear
            save "$path/`file'.dta", replace
            Does that succeed? How long does it take?

            Next try this: importing, reducing the number of observations, and saving
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            import delimited "`file'" , clear
            keep if partner == "World" & tradeflow == "Export"
            save "$path/`file'.dta", replace
            Does that succeed? How long does it take?

            Finally try this: importing, reducing the number of observations, then reducing the number of variables, and saving
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            import delimited "`file'" , clear
            keep if partner == "World" & tradeflow == "Export"
            keep ïclassification year period perioddesc aggregatelevel isleafcode reporter commodity commoditycode tradevalueus
            save "$path/`file'.dta", replace
            Does that succeed? How long does it take?



            • #7
              First try this: importing your 1.5GB CSV file, and providing some diagnostic information

              Code:
               clear all
              global path "E:\Comtrade"
              cd "$path"
              local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
              memory
              import delimited "`file'" , clear
              memory
              describe
              Code:
               do "C:\Users\Shadi\AppData\Local\Temp\STDac8_000000.tmp"
               
              
              . clear all
              
              . global path "E:\Comtrade"
              
              . cd "$path"
              E:\Comtrade
              
              . local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-
              > 20190512.csv
              
              . memory
              
                  Memory usage
                                                           used     allocated
                      --------------------------------------------------------
                      data                                    0    67,108,864
                      strLs                                   0             0
                      --------------------------------------------------------
                      data & strLs                            0    67,108,864

                      --------------------------------------------------------
                      data & strLs                            0    67,108,864
                      var. names, %fmts, ...                  4        65,000
                      overhead                        1,081,344     1,082,136

                      Stata matrices                          0             0
                      ado-files                           3,263         3,263
                      stored results                          0             0

                      Mata matrices                           0             0
                      Mata functions                          0             0

                      set maxvar usage                5,281,738     5,281,738

                      other                               3,231         3,231
                      --------------------------------------------------------
                      grand total                     6,365,976    73,544,232
              
              . import delimited "`file'" , clear
              I had to break the process after 45 minutes, because my laptop slowed down so much that I could not work with it properly. Maybe my laptop could finish it in half a day, but that is not what I am looking for, since this is only one of the many files I have to process. I know you suggested this to get some diagnostics, though, so I hope the result is helpful.

              The same thing happened in the second case:
              Code:
              clear all
              global path "E:\Comtrade"
              cd "$path"
              local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
              import delimited "`file'" , clear
              save "$path/`file'.dta", replace
              Code:
              . do "C:\Users\Shadi\AppData\Local\Temp\STD330_000000.tmp"
              
              . clear all
              
              . global path "E:\Comtrade"
              
              . cd "$path"
              E:\Comtrade
              
              . local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-201905
              > 12.csv
              
              . import delimited "`file'" , clear
              --Break--
              r(1);
              
              end of do-file
              
              --Break--
              r(1);
              Also for the third case:
              Code:
               clear all
              global path "E:\Comtrade"
              cd "$path"
              local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
              import delimited "`file'" , clear
              keep if partner == "World" & tradeflow == "Export"
              keep ïclassification year period perioddesc aggregatelevel isleafcode reporter commodity commoditycode tradevalueus
              save "$path/`file'.dta", replace
              Code:
              . do "C:\Users\Shadi\AppData\Local\Temp\STD330_000000.tmp"
              
              . clear all
              
              . global path "E:\Comtrade"
              
              . cd "$path"
              E:\Comtrade
              
              . local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-201905
              > 12.csv
              
              . import delimited "`file'" , clear
              --Break--
              r(1);
              
              end of do-file
              
              --Break--
              r(1);
              So I get stuck importing the file in the first place and cannot even reach the keep and drop commands. I do believe that the order of the commands and how the code is organized could make a difference, but the problem now sits one level before that, i.e., at importing the huge file. This file has 22 variables and 10,188,935 observations (I opened it with EmEditor). I would share the file here, but I don't have permission to do so under the UN Comtrade download policy for bulk data.
              Do you have any suggestions? Thank you for your effort, it's been very encouraging.

              Looking forward,
              Shadi
              Last edited by Shadi Shad; 08 Sep 2021, 04:44.



              • #8
                Here is an example of how to import a dataset in batches, dropping unwanted observations to reduce the size of the data, and then appending the reduced batches together, in the hope that the resulting dataset will not itself be large enough to bog your machine down.
                Code:
                // generate example data
                
                set seed 666
                set obs 2042
                generate x = runiformint(1,10)
                generate y = rnormal()
                export delimited using gnxl.csv, replace
                
                // create an empty dataset to accumulate observations
                
                clear
                save mydata, emptyok replace
                
                // read data 1000 observations at a time
                
                local batch 1000
                local i = 2
                local j = 1 + `batch'
                while 1 {
                    display "reading rows `i' to `j'"
                    clear
                    import delimited using gnxl.csv, varnames(1) rowrange(`i':`j')
                    if `r(N)'==0 continue, break
                    keep if x==1
                    append using mydata
                    save mydata, replace
                    local i = `i' + `batch'
                    local j = `j' + `batch'
                    }
                display "done"
                describe using mydata
                Code:
                . // generate example data
                . set seed 666
                
                . set obs 2042
                Number of observations (_N) was 0, now 2,042.
                
                . generate x = runiformint(1,10)
                
                . generate y = rnormal()
                
                . export delimited using gnxl.csv, replace
                file gnxl.csv saved
                
                .
                . // create an empty dataset
                . clear
                
                . save mydata, emptyok replace
                (dataset contains 0 observations)
                file mydata.dta saved
                
                .
                . // read data 1000 observations at a time
                .
                . local batch 1000
                
                . local i = 2
                
                . local j = 1 + `batch'
                
                . while 1 {
                  2.     display "reading rows `i' to `j'"
                  3.     clear
                  4.     import delimited using gnxl.csv, varnames(1) rowrange(`i':`j')
                  5.     if `r(N)'==0 continue, break
                  6.     keep if x==1
                  7.     append using mydata
                  8.     save mydata, replace
                  9.     local i = `i' + `batch'
                 10.     local j = `j' + `batch'
                 11.     }
                reading rows 2 to 1001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 1,000 obs)
                (890 observations deleted)
                file mydata.dta saved
                reading rows 1002 to 2001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 1,000 obs)
                (900 observations deleted)
                file mydata.dta saved
                reading rows 2002 to 3001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 42 obs)
                (39 observations deleted)
                file mydata.dta saved
                reading rows 3002 to 4001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 0 obs)
                
                . display "done"
                done
                
                . describe using mydata
                
                Contains data                                
                 Observations:           213                  8 Sep 2021 20:43
                    Variables:             2                  
                ------------------------------------------------------------------------------------------------
                Variable      Storage   Display    Value
                    name         type    format    label      Variable label
                ------------------------------------------------------------------------------------------------
                x               byte    %8.0g                
                y               float   %9.0g                
                ------------------------------------------------------------------------------------------------
                Sorted by:
                
                .
                Added in edit: I haven't tested whether this will work if no observations are matched by the keep command or if there are problems if the total number of data rows is an exact multiple of the batch size.
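
                If those edge cases matter for your files, an untested sketch of a more defensive variant of the loop would be to capture the import, so that a rowrange past the end of the file ends the loop cleanly, and to skip the append when a batch contributes no observations:
                Code:
                while 1 {
                    display "reading rows `i' to `j'"
                    clear
                    capture import delimited using gnxl.csv, varnames(1) rowrange(`i':`j')
                    if _rc continue, break          // rowrange past the end of the file
                    if r(N)==0 continue, break      // nothing left to read
                    keep if x==1
                    if _N > 0 {                     // only append when something survived
                        append using mydata
                        save mydata, replace
                    }
                    local i = `i' + `batch'
                    local j = `j' + `batch'
                }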
                Last edited by William Lisowski; 08 Sep 2021, 18:51.



                • #9
                  Originally posted by William Lisowski View Post
                  Here is an example of how to import a dataset in batches, dropping unwanted observations to reduce the size of the data, and then appending the reduced batches together, in the hope that the resulting dataset will not itself be large enough to bog your machine down. [...]
                  Dear William,

                  Thank you for your ideas. In the end I had to chunk my data, and I used your post as inspiration for my script.

                  Best,
                  Shadi
