  • Working with large datasets, saving the dta files with different names in a loop

    Dear all,

    I am working with several .csv files, each between 700 MB and 2 GB in size. I need to run some commands on each file and save the results separately; otherwise the combined size becomes even larger and Stata crashes on my laptop, so I never see any results. Eventually I want to append these results, but I want to save them in a single .dta file. Here is my code:
    Code:
        global path "E:/Comtradet1"
        cd "$path"    
        foreach file of local datafiles {
            import delimited "`file'" , clear
            
           some commands
            
            save "$path/`file'.dta", replace
    
        }
    I'm not sure whether my save command has the correct format. I could not get any result from running the code above because Stata keeps adding observations and loading data and then stops responding. I tried this with only one csv file in my directory so that I could check whether the code works, and I hoped it would be fast, but something slows it down and I can't figure out what. I even checked whether the data can simply be imported, and even that alone does not finish. Please tell me whether I wrote the code properly.
    Additionally, since this is also a large-data problem, I would appreciate your suggestions on how to handle such a situation. The code works when I don't use a loop and simply import one csv file, run the commands on it, and save it, but that is too manual. I am looking for more automation, which is why I tried the loop above.

    Looking forward to your solutions,
    Shadi

  • #2
    First try importing and saving just one dataset, but in your loop. Does this succeed?
    Code:
        global path "E:/Comtradet1"
        cd "$path"    
        macro list _datafiles
        local datafiles : word 1 of datafiles
        macro list _datafiles
        foreach file of local datafiles {
            import delimited "`file'" , clear
            save "$path/`file'.dta", replace
        }
    If it doesn't succeed, copy the code and all of Stata's output from the Results window and paste it into a new post here. Reading "Stata keeps adding observations and loading stuff and then stops responding" doesn't tell us anything that helps guess what the problem might be, and in particular doesn't allow us to see if Stata told you something you didn't understand to be a report of a problem.

    The code works when I don't use a loop and simply import one csv file and do the commands on it and save it.
    When you import just one csv file, is it the first file in the list in datafiles? And are you using the import delimited command in Stata's command window or a do-file, or did you use Stata's import delimited dialog from the File menu?



    • #3
      Originally posted by William Lisowski View Post
      First try importing and saving just one dataset, but in your loop. Does this succeed?
      Code:
      global path "E:/Comtradet1"
      cd "$path"
       macro list _datafiles
      local datafiles : word 1 of datafiles
      macro list _datafiles
      foreach file of local datafiles {
      import delimited "`file'" , clear
      save "$path/`file'.dta", replace
      }
      If it doesn't succeed, copy the code and all of Stata's output from the Results window and paste it into a new post here. Reading "Stata keeps adding observations and loading stuff and then stops responding" doesn't tell us anything that helps guess what the problem might be, and in particular doesn't allow us to see if Stata told you something you didn't understand to be a report of a problem.



      When you import just one csv file, is it the first file in the list in datafiles? And are you using the import delimited command in Stata's command window or a do-file, or did you use Stata's import delimited dialog from the File menu?
      Thank you for your answer. I have to say that when I posted the topic yesterday I unfortunately left out some lines from my code, but I added your extra commands to both versions and here is what I got:

      1)
      Code:
          global path "E:\Comtradet1"
          cd "$path"    
          macro list _datafiles
          local datafiles : word 1 of datafiles
          macro list _datafiles
         foreach file of local datafiles {
              import delimited "`file'" , clear
              some commands
              save "$path/`file'.dta", replace
          }
      Result:

      . do "C:\Users\Shadi\AppData\Local\Temp\STDaf4_000000.t mp"

      global path "E:\Comtradet1"


      . cd "$path"
      E:\Comtradet1

      . macro list _datafiles
      local macro `datafiles' not found
      r(111);

      end of do-file

      r(111);


      In the second case, which is my actual code, I got a different error:
      2)

      Code:
         
          global path "E:/Comtradet1"
          cd "$path"
          local datafiles: dir "." files "*.csv"
          dir
          macro list _datafiles
          local datafiles : word 1 of datafiles
          macro list _datafiles
          foreach file of local datafiles {
              import delimited "`file'" , clear
              some commands       
              save "$path/`file'.dta", replace
          }
      Result:
      do "C:\Users\Shadi\AppData\Local\Temp\STDaf4_000000.t mp"

      . global path "E:/Comtradet1"
      . cd "$path"
      E:\Comtradet1

      . local datafiles: dir "." files "*.csv"

      . dir
      <dir> 9/06/21 15:15 .
      <dir> 9/06/21 15:15 ..
      1576.3M 5/12/19 19:42 type-C_r-ALL_ps-1998_freq-A_px-S2_pub-20060110_fmt-c
      > sv_ex-20190512.csv

      . macro list _datafiles
      _datafiles: "type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
      > 190512.csv"

      . local datafiles : word 1 of datafiles

      . macro list _datafiles
      _datafiles: datafiles

      . foreach file of local datafiles {
      2. import delimited "`file'" , clear
      3.
      . keep ïclassification year period perioddesc aggregatelevel
      > isleafcode tradeflowcode tradeflow reporter partner commodity commoditycode
      > tradevalueus
      4. keep if partner == "World" & tradeflow == "Export"
      5. drop partner tradeflow tradeflowcode
      6. compress
      7.
      . save "$path/`file'.dta", replace
      . }
      file datafiles.csv not found
      r(601);

      end of do-file

      r(601);



      I was not sure if I should adjust this line of your suggested code:
      Code:
      local datafiles : word 1 of datafiles
      I'm not familiar with this construction, so please let me know, and a brief explanation of what these commands are meant to do would be nice.

      When you import just one csv file, is it the first file in the list in datafiles? And are you using the import delimited command in Stata's command window or a do-file, or did you use Stata's import delimited dialog from the File menu?
      First: No, it's not the first file. To make things faster I made a folder with just two files, the 1998 and 1999 csv files, to test whether the loop works. Overall I have one csv file for each year from 1977 to 2020. For the case where I don't use a loop I used the first file from the complete folder, i.e., 1977, which is also the smallest in size. I hope this is clear. Second: No, I didn't use the menu. Here is the code I used for that:
      Code:
         
      global path "E:/Comtradet"
          import delimited "type-C_r-ALL_ps-1977_freq-A_px-S2_pub-20041207_fmt-csv_ex-20190512", clear
          cd "$path"   
          some commands
          save "$path/1977.dta", replace
      But after reading your question I also tried the same first file that I use in the "two-file" case, and it is again taking too long. The size of this file is 1.53 GB. Is that the reason? What do you suggest?
      Is my loop in general theoretically correct? I mean in the second case (without the commands you suggested):

      Code:
         
          global path "E:/Comtradet1"
          cd "$path"
          local datafiles: dir "." files "*.csv"
          dir
          foreach file of local datafiles {
              import delimited "`file'" , clear
              some commands       
              save "$path/`file'.dta", replace
          }
      Do you see mistakes in my script?
      Looking forward,
      Shadi



      • #4
        When presenting output copied from the Results window, please surround it with code delimiters [CODE] and [/CODE], as you present your code.

        Apologies for my mistake, the line to select just the first datafile should have been
        Code:
        local datafiles : word 1 of `datafiles'
        but that is no longer important, because your test of the first file in the two-file case shows that the problem is not in the loop but in the data.

        It could be that the file is too large. It could also be that there are problems with the data in the file that cause Stata to take longer than it should.
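
        If you want a quick look at the raw file without importing it, one possibility (an untested sketch using Stata's file commands) is to read and display its first few lines, to check the delimiter, quoting, and header row:
        Code:
        file open fh using "`file'", read text
        forvalues k = 1/5 {
            file read fh line
            display `"`macval(line)'"'
        }
        file close fh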

        Try starting smaller: rerun the 1998 file, adding the rowrange() option to your import delimited command. Suppose you know there are 1,000,000 observations in the file. Try importing just the first 1000 and see how long it takes.
        Code:
        import delimited "`file'" , clear rowrange(1:1000)
        Then, use Stata's Data Browser window to look at the data you imported. Does it look correct?

        Then try increasing the rowrange to 1:10000, then 1:100000, and see if the time goes up proportionally.
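
        If it helps to time those runs, here is an untested sketch using Stata's timer commands:
        Code:
        timer clear
        foreach n of numlist 1000 10000 100000 {
            timer on 1
            import delimited "`file'" , clear rowrange(1:`n')
            timer off 1
            display "rows 1:`n'"
            timer list 1
            timer clear 1
        }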

        I see nothing in what you showed in post #3 of the "Stata keeps adding observations and loading stuff and then stops responding" that you described in your first post.

        I do believe that you should reverse the order of your two keep commands, first keeping observations and then keeping variables, since I expect you are reducing the number of observations substantially.
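
        Untested, but using the commands you showed in post #3, the reordered block would look something like this (keeping observations first means partner, tradeflow, and tradeflowcode can simply be omitted from the variable list, so the separate drop is no longer needed):
        Code:
        import delimited "`file'" , clear
        keep if partner == "World" & tradeflow == "Export"
        keep ïclassification year period perioddesc aggregatelevel isleafcode reporter commodity commoditycode tradevalueus
        compress
        save "$path/`file'.dta", replace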

        Added in edit: I am growing convinced that the "some commands" that you downplay are in fact the cause of the problem. I encourage you to return to the spirit of my suggestion in post #2 and leave out everything between the import delimited and save commands and see if those two commands complete in reasonable time.
        Last edited by William Lisowski; 07 Sep 2021, 06:18.



        • #5
          Apologies for my mistake, the line to select just the first datafile should have been
          I adjusted the code like this and got the same error:
          Code:
              global path "E:\Comtrade"
              cd "$path"
              save "all.dta", replace emptyok
              macro list _datafiles
          local datafiles : word 1 of `datafiles'
          macro list _datafiles
              
              *local datafiles: dir "." files "*.csv"
              *dir
              
              foreach file of local datafiles {
                  import delimited "`file'" , clear
                  save "$path/`file'.dta", replace
              }
          Since you did not mention whether I should include the command lines that I had left out of post #1 (the starred ones above), I also ran the code once after including them, but I get the same error. Please tell me whether I should leave those lines out in general:
          Code:
          . do "C:\Users\Shadi\AppData\Local\Temp\STD2f14_000000.tmp"
          
          .         global path "E:\Comtradet"
          .         cd "$path"
          E:\Comtradet
          .         macro list _datafiles
          local macro `datafiles' not found
          r(111);
          
          end of do-file
          
          r(111);
          Thank you for the row limit suggestion. Stata managed to do it for both cases, 1:10000 and 1:100000; it took more time in the second case (4 minutes), but I now have saved dta files in my directory. It also worked with the "some commands". This time I have written them out, so that you can see they are not complicated or problematic:
          Code:
              global path "E:\Comtrade"
              cd "$path"   
              local datafiles: dir "." files "*.csv"
              dir
              foreach file of local datafiles  {
                  import delimited "`file'" , clear rowrange(1:100000)
                  keep ïclassification year period perioddesc aggregatelevel isleafcode tradeflowcode tradeflow reporter partner commodity commoditycode tradevalueus
                  keep if partner == "World" & tradeflow == "Export"
                  drop partner tradeflow tradeflowcode
                  compress
                  save "$path/`file'.dta", replace
              }
          Result
          Code:
          . do "C:\Users\Shadi\AppData\Local\Temp\STD2f14_000000.tmp"
          
          .         global path "E:\Comtrade"
          
          .         *global path "F:\UN Comtrade"
          .         cd "$path"
          E:\Comtrade
          
          .         *save "all.dta", replace emptyok
          .         *macro list _datafiles
          .         *local datafiles : word 1 of `datafiles'
          .         *macro list _datafiles
          .         
          .         local datafiles: dir "." files "*.csv"
          
          .     dir
            <dir>   9/07/21 15:26  .                 
            <dir>   9/07/21 15:26  ..                
           272.7M   5/12/19  8:44  type-C_r-ALL_ps-1977_freq-A_px-S2_pub-20041207_fmt-c
          > sv_ex-20190512.csv
            44.4k   9/07/21 15:26  type-c_r-all_ps-1977_freq-a_px-s2_pub-20041207_fmt-c
          > sv_ex-20190512.csv.dta
           587.7M   5/11/19 20:53  type-C_r-ALL_ps-1978_freq-A_px-S2_pub-20050214_fmt-c
          > sv_ex-20190511.csv
            29.4k   9/07/21 15:26  type-c_r-all_ps-1978_freq-a_px-s2_pub-20050214_fmt-c
          > sv_ex-20190511.csv.dta
           667.1M   5/12/19 22:53  type-C_r-ALL_ps-1979_freq-A_px-S2_pub-20051212_fmt-c
          > sv_ex-20190512.csv
            25.6k   9/07/21 15:26  type-c_r-all_ps-1979_freq-a_px-s2_pub-20051212_fmt-c
          > sv_ex-20190512.csv.dta
           716.1M   5/11/19 21:49  type-C_r-ALL_ps-1980_freq-A_px-S2_pub-20051212_fmt-c
          > sv_ex-20190511.csv
            42.4k   9/07/21 15:26  type-c_r-all_ps-1980_freq-a_px-s2_pub-20051212_fmt-c
          > sv_ex-20190511.csv.dta
           740.6M   5/13/19  6:30  type-C_r-ALL_ps-1981_freq-A_px-S2_pub-20051212_fmt-c
          > sv_ex-20190513.csv
            29.2k   9/07/21 15:26  type-c_r-all_ps-1981_freq-a_px-s2_pub-20051212_fmt-c
          > sv_ex-20190513.csv.dta
           740.4M   5/12/19 11:42  type-C_r-ALL_ps-1982_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            31.9k   9/07/21 15:26  type-c_r-all_ps-1982_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           775.1M   5/13/19 11:09  type-C_r-ALL_ps-1983_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190513.csv
            32.0k   9/07/21 15:26  type-c_r-all_ps-1983_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190513.csv.dta
           764.2M   5/12/19 16:03  type-C_r-ALL_ps-1984_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            34.1k   9/07/21 15:26  type-c_r-all_ps-1984_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           800.3M   5/12/19  0:30  type-C_r-ALL_ps-1985_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            32.3k   9/07/21 15:26  type-c_r-all_ps-1985_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           848.5M   5/12/19 20:56  type-C_r-ALL_ps-1986_freq-A_px-S2_pub-20091029_fmt-c
          > sv_ex-20190512.csv
            35.1k   9/07/21 15:26  type-c_r-all_ps-1986_freq-a_px-s2_pub-20091029_fmt-c
          > sv_ex-20190512.csv.dta
           867.5M   5/12/19  3:16  type-C_r-ALL_ps-1987_freq-A_px-S2_pub-20051214_fmt-c
          > sv_ex-20190512.csv
            39.0k   9/07/21 15:26  type-c_r-all_ps-1987_freq-a_px-s2_pub-20051214_fmt-c
          > sv_ex-20190512.csv.dta
           897.8M   5/13/19 13:51  type-C_r-ALL_ps-1988_freq-A_px-S2_pub-20051214_fmt-c
          > sv_ex-20190513.csv
            45.8k   9/07/21 15:26  type-c_r-all_ps-1988_freq-a_px-s2_pub-20051214_fmt-c
          > sv_ex-20190513.csv.dta
           968.6M   5/12/19  7:39  type-C_r-ALL_ps-1989_freq-A_px-S2_pub-20051219_fmt-c
          > sv_ex-20190512.csv
            41.1k   9/07/21 15:26  type-c_r-all_ps-1989_freq-a_px-s2_pub-20051219_fmt-c
          > sv_ex-20190512.csv.dta
          1284.8M   5/12/19 10:21  type-C_r-ALL_ps-1994_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190512.csv
            50.4k   9/07/21 15:26  type-c_r-all_ps-1994_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190512.csv.dta
          1380.5M   5/13/19  6:51  type-C_r-ALL_ps-1995_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190513.csv
            46.1k   9/07/21 15:26  type-c_r-all_ps-1995_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190513.csv.dta
          1472.2M   5/12/19  5:20  type-C_r-ALL_ps-1996_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190512.csv
            43.4k   9/07/21 15:26  type-c_r-all_ps-1996_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190512.csv.dta
          1541.4M   5/11/19 21:07  type-C_r-ALL_ps-1997_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190511.csv
            12.4k   9/07/21 15:26  type-c_r-all_ps-1997_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190511.csv.dta
          1576.3M   5/12/19 19:42  type-C_r-ALL_ps-1998_freq-A_px-S2_pub-20060110_fmt-c
          > sv_ex-20190512.csv
            12.1k   9/07/21 15:26  type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-c
          > sv_ex-20190512.csv.dta
          1648.0M   5/12/19  1:51  type-C_r-ALL_ps-1999_freq-A_px-S2_pub-20051228_fmt-c
          > sv_ex-20190512.csv
            12.2k   9/07/21 15:26  type-c_r-all_ps-1999_freq-a_px-s2_pub-20051228_fmt-c
          > sv_ex-20190512.csv.dta
          
          .     
          .     foreach file of local datafiles  {
            2.         import delimited "`file'" , clear rowrange(1:100000)
            3.                 keep ïclassification year period perioddesc aggregatelev
          > el isleafcode tradeflowcode tradeflow reporter partner commodity commodityc
          > ode tradevalueus
            4.                 keep if partner == "World" & tradeflow == "Export"
            5.                 drop partner tradeflow tradeflowcode
            6.                 compress
            7.                 save "$path/`file'.dta", replace
            8.         }
          (22 vars, 99,999 obs)
          (97,180 observations deleted)
            commodity is strL now coalesced
            (206,928 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1977_freq-a_px-s2_pub-20041207_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (98,044 observations deleted)
            commodity is strL now coalesced
            (118,714 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1978_freq-a_px-s2_pub-20050214_fmt-csv_ex-20
          > 190511.csv.dta saved
          (22 vars, 99,999 obs)
          (98,157 observations deleted)
            commodity is strL now coalesced
            (104,464 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1979_freq-a_px-s2_pub-20051212_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,490 observations deleted)
            commodity is strL now coalesced
            (165,619 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1980_freq-a_px-s2_pub-20051212_fmt-csv_ex-20
          > 190511.csv.dta saved
          (22 vars, 99,999 obs)
          (97,682 observations deleted)
            commodity is strL now coalesced
            (146,194 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1981_freq-a_px-s2_pub-20051212_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (97,641 observations deleted)
            commodity is strL now coalesced
            (152,949 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1982_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,583 observations deleted)
            commodity is strL now coalesced
            (159,509 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1983_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (97,600 observations deleted)
            commodity is strL now coalesced
            (157,595 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1984_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,719 observations deleted)
            commodity is strL now coalesced
            (144,359 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1985_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,598 observations deleted)
            commodity is strL now coalesced
            (159,689 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1986_freq-a_px-s2_pub-20091029_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,662 observations deleted)
            commodity is strL now coalesced
            (152,709 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1987_freq-a_px-s2_pub-20051214_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (98,147 observations deleted)
            commodity is strL now coalesced
            (117,413 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1988_freq-a_px-s2_pub-20051214_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (97,953 observations deleted)
            commodity is strL now coalesced
            (139,598 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1989_freq-a_px-s2_pub-20051219_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (98,030 observations deleted)
            commodity is strL now coalesced
            (124,193 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1994_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,883 observations deleted)
            commodity is strL now coalesced
            (142,891 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1995_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190513.csv.dta saved
          (22 vars, 99,999 obs)
          (96,742 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1996_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,060 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1997_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190511.csv.dta saved
          (22 vars, 99,999 obs)
          (97,118 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20
          > 190512.csv.dta saved
          (22 vars, 99,999 obs)
          (97,028 observations deleted)
            (0 bytes saved)
          file E:\Comtrade/type-c_r-all_ps-1999_freq-a_px-s2_pub-20051228_fmt-csv_ex-20
          > 190512.csv.dta saved
          
          .
          end of do-file
          I see nothing in what you showed in post #3 of the "Stata keeps adding observations and loading stuff and then stops responding" that you described in your first post.
          I'm not sure what you mean here. What I meant is that at the bottom of the Data Browser and the Stata command window I see observations piling up, and I can see that Stata is running because the red STOP button that could break the process is shown. But this goes on for several hours, and in the end I only get "not responding", which Windows displays when any program stops responding. Does this answer your question?


          I do believe that you should reverse the order of your two keep commands, first keeping observations and then keeping variables, since I expect you are reducing the number of observations substantially.
          Since it worked above for the limited rows I did not try that yet. But I'll keep that in mind as a possibility.
          So I think we can sum up that the problem is that the size is too large. Or do you have other conclusions in mind? Please let me know what you think I can best do to deal with this. I am also not sure why this is happening, because in another post here: https://www.statalist.org/forums/for...large-csv-file I read that they could import files of GB size too. My laptop has 8 GB of RAM, by the way. So I'm confused about why I have this problem and would appreciate your ideas. Thanks again.

          Looking forward,
          Shadi



          • #6
            So I think we can sum up that the problem is that the size is too large.
            To be precise, it is too large for the code you have written, given the configuration of your laptop. A 1.5 GB CSV may well occupy more than 1.5 GB of memory once it is imported. It is possible that changes to your code will overcome the problem.

            First try this: importing your 1.5GB CSV file, and providing some diagnostic information
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            memory
            import delimited "`file'" , clear
            memory
            describe
            Does that succeed? How long does it take? Please show the code and output from your Results window.

            Next try this: importing CSV and saving a Stata dataset
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            import delimited "`file'" , clear
            save "$path/`file'.dta", replace
            Does that succeed? How long does it take?

            Next try this: importing, reducing the number of observations, and saving
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            import delimited "`file'" , clear
            keep if partner == "World" & tradeflow == "Export"
            save "$path/`file'.dta", replace
            Does that succeed? How long does it take?

            Finally try this: importing, reducing the number of observations, then reducing the number of variables, and saving
            Code:
            clear all
            global path "E:\Comtrade"
            cd "$path"
            local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
            import delimited "`file'" , clear
            keep if partner == "World" & tradeflow == "Export"
            keep ïclassification year period perioddesc aggregatelevel isleafcode reporter commodity commoditycode tradevalueus
            save "$path/`file'.dta", replace
            Does that succeed? How long does it take?



            • #7
              First try this: importing your 1.5GB CSV file, and providing some diagnostic information

              Code:
               clear all
              global path "E:\Comtrade"
              cd "$path"
              local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
              memory
              import delimited "`file'" , clear
              memory
              describe
              Code:
               do "C:\Users\Shadi\AppData\Local\Temp\STDac8_000000.tmp"
               
              
              . clear all
              
              . global path "E:\Comtrade"
              
              . cd "$path"
              E:\Comtrade
              
              . local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-
              > 20190512.csv
              
              . memory
              
                  Memory usage
                                                           used     allocated
                      --------------------------------------------------------
                      data                                    0    67,108,864
                      strLs                                   0             0
                      --------------------------------------------------------
                      data & strLs                            0    67,108,864

                      --------------------------------------------------------
                      data & strLs                            0    67,108,864
                      var. names, %fmts, ...                  4        65,000
                      overhead                        1,081,344     1,082,136

                      Stata matrices                          0             0
                      ado-files                           3,263         3,263
                      stored results                          0             0

                      Mata matrices                           0             0
                      Mata functions                          0             0

                      set maxvar usage                5,281,738     5,281,738

                      other                               3,231         3,231
                      --------------------------------------------------------
                      grand total                     6,365,976    73,544,232
              
              . import delimited "`file'" , clear
              I had to break the process after 45 minutes, because my laptop slowed down so much that I could not work with it properly. Maybe my laptop could finish it in half a day, but that is not what I am looking for, since this is only one of the many files I have to process. I know you suggested this to get some diagnostics, though, so I hope the result is helpful.

              The same thing happened in the second case:
              Code:
              clear all
              global path "E:\Comtrade"
              cd "$path"
              local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
              import delimited "`file'" , clear
              save "$path/`file'.dta", replace
              Code:
              . do "C:\Users\Shadi\AppData\Local\Temp\STD330_000000.tmp"
              
              . clear all
              
              . global path "E:\Comtrade"
              
              . cd "$path"
              E:\Comtrade
              
              . local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-201905
              > 12.csv
              
              . import delimited "`file'" , clear
              --Break--
              r(1);
              
              end of do-file
              
              --Break--
              r(1);
              Also for the third case:
              Code:
               clear all
              global path "E:\Comtrade"
              cd "$path"
              local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-20190512.csv
              import delimited "`file'" , clear
              keep if partner == "World" & tradeflow == "Export"
              keep ïclassification year period perioddesc aggregatelevel isleafcode reporter commodity commoditycode tradevalueus
              save "$path/`file'.dta", replace
              Code:
              . do "C:\Users\Shadi\AppData\Local\Temp\STD330_000000.tmp"
              
              . clear all
              
              . global path "E:\Comtrade"
              
              . cd "$path"
              E:\Comtrade
              
              . local file type-c_r-all_ps-1998_freq-a_px-s2_pub-20060110_fmt-csv_ex-201905
              > 12.csv
              
              . import delimited "`file'" , clear
              --Break--
              r(1);
              
              end of do-file
              
              --Break--
              r(1);
              So I get stuck importing the file in the first place and cannot even reach the keep and drop commands. I do believe that the order of the commands and how the code is organized could make a difference, but the problem now sits one level before that, i.e., at importing the huge file. This file has 22 variables and 10,188,935 observations (I opened it with EmEditor). I would share the file here, but I don't have permission to do so under the UN Comtrade download policy for bulk data.
              Do you have any suggestions? Thank you for your effort, it's been very encouraging.

              Looking forward,
              Shadi
              Last edited by Shadi Shad; 08 Sep 2021, 04:44.



              • #8
                Here is an example of how to import a dataset in batches, dropping unwanted observations to reduce the size of the data, and then appending the reduced batches together, in the hope that the resulting dataset will not itself be large enough to bog your machine down.
                Code:
                // generate example data
                
                set seed 666
                set obs 2042
                generate x = runiformint(1,10)
                generate y = rnormal()
                export delimited using gnxl.csv, replace
                
                // create an empty dataset to accumulate observations
                
                clear
                save mydata, emptyok replace
                
                // read data 1000 observations at a time
                
                local batch 1000
                local i = 2
                local j = 1 + `batch'
                while 1 {
                    display "reading rows `i' to `j'"
                    clear
                    import delimited using gnxl.csv, varnames(1) rowrange(`i':`j')
                    if `r(N)'==0 continue, break
                    keep if x==1
                    append using mydata
                    save mydata, replace
                    local i = `i' + `batch'
                    local j = `j' + `batch'
                    }
                display "done"
                describe using mydata
                Code:
                . // generate example data
                . set seed 666
                
                . set obs 2042
                Number of observations (_N) was 0, now 2,042.
                
                . generate x = runiformint(1,10)
                
                . generate y = rnormal()
                
                . export delimited using gnxl.csv, replace
                file gnxl.csv saved
                
                .
                . // create an empty dataset
                . clear
                
                . save mydata, emptyok replace
                (dataset contains 0 observations)
                file mydata.dta saved
                
                .
                . // read data 1000 observations at a time
                .
                . local batch 1000
                
                . local i = 2
                
                . local j = 1 + `batch'
                
                . while 1 {
                  2.     display "reading rows `i' to `j'"
                  3.     clear
                  4.     import delimited using gnxl.csv, varnames(1) rowrange(`i':`j')
                  5.     if `r(N)'==0 continue, break
                  6.     keep if x==1
                  7.     append using mydata
                  8.     save mydata, replace
                  9.     local i = `i' + `batch'
                 10.     local j = `j' + `batch'
                 11.     }
                reading rows 2 to 1001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 1,000 obs)
                (890 observations deleted)
                file mydata.dta saved
                reading rows 1002 to 2001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 1,000 obs)
                (900 observations deleted)
                file mydata.dta saved
                reading rows 2002 to 3001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 42 obs)
                (39 observations deleted)
                file mydata.dta saved
                reading rows 3002 to 4001
                (encoding automatically selected: ISO-8859-1)
                (2 vars, 0 obs)
                
                . display "done"
                done
                
                . describe using mydata
                
                Contains data                                
                 Observations:           213                  8 Sep 2021 20:43
                    Variables:             2                  
                ------------------------------------------------------------------------------------------------
                Variable      Storage   Display    Value
                    name         type    format    label      Variable label
                ------------------------------------------------------------------------------------------------
                x               byte    %8.0g                
                y               float   %9.0g                
                ------------------------------------------------------------------------------------------------
                Sorted by:
                
                .
                Added in edit: I haven't tested whether this will work if no observations are matched by the keep command or if there are problems if the total number of data rows is an exact multiple of the batch size.
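
                If those edge cases matter for your files, an untested sketch of a more defensive variant of the loop would be to capture the import, so that a rowrange past the end of the file ends the loop cleanly, and to skip the append when a batch contributes no observations:
                Code:
                while 1 {
                    display "reading rows `i' to `j'"
                    clear
                    capture import delimited using gnxl.csv, varnames(1) rowrange(`i':`j')
                    if _rc continue, break          // rowrange past the end of the file
                    if r(N)==0 continue, break      // nothing left to read
                    keep if x==1
                    if _N > 0 {                     // only append when something survived
                        append using mydata
                        save mydata, replace
                    }
                    local i = `i' + `batch'
                    local j = `j' + `batch'
                }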
                Last edited by William Lisowski; 08 Sep 2021, 18:51.



                • #9
                  Originally posted by William Lisowski View Post
                  Here is an example of how to import a dataset in batches, dropping unwanted observations to reduce the size of the data, and then appending the reduced batches together, in the hope that the resulting dataset will not itself be large enough to bog your machine down. [...]
                  Dear William,

                  Thank you for your ideas. In the end I had to chunk my data, and I used your post as inspiration for my script.

                  Best,
                  Shadi
