  • #16
    That approach would not work because once you keep observations with form1, you will not find any with form3 as they have already been dropped. Your error was that the pattern is missing a right single quote when you refer to j:
    Code:
    keep if strmatch(filename, "*`j'*.txt")
    If you are going to prune files upfront, you might as well break up the filename into the parts you want from the start. That way you can make sure that all the filenames you want to process match your expectations. You could do something like:
    Code:
    clear all
    filelist, dir("text_files")
     
    * reduce to files with a ".txt" file extension
    keep if strmatch(filename, "*.txt")
    
    * split the file name into parts
    gen s = subinstr(filename,".txt", "", 1)
    split s, parse("_")
    rename (`r(varlist)') (id date form)
    assert !mi(id, date, form)
    
    * reduce to form1 and form3
    keep if inlist(form, "form1", "form3")
    Note that there is a limit of 10 (I think) match strings when using inlist() with strings. If you have more, you can make a separate dataset with the list to use and use merge to reduce the observations to those that match the list.

    Here's an expanded version of the program that handles the extra part variables:
    Code:
    * code to import one text file
    program import_txt
      // move values of interest from variables to locals
      local dsource = dirname
      local fsource = filename
      local id1 = id
      local date1 = date
      local form1 = form
      
      import delimited using `"`dsource'/`fsource'"', clear stringcols(_all) varnames(nonames)
      
      // get the desired info
      keep if strpos(v1,"name:")
      gen name = subinstr(v1,"name:","",1)
    
      // copy over the file's information
      gen sourcefile = `"`fsource'"'
      gen sourcedir  = `"`dsource'"'
      gen id = "`id1'"
      gen date = "`date1'"
      gen form = "`form1'"
    end
    
    runby import_txt, by(dirname filename) verbose



    • #17
      I got the above code to work on specific file types, but that does not get around the 10,000-file limit. For example, if I run

      Code:
      clear all
      filelist, dir("text_files")
      keep if strmatch(filename, "*form1*.txt")
      and there are 100 "form1" files and 20,000 total files in the folder, it collects only about half of the data in the "form1" files (it does not consider the last 10,000 files).

      I have 100,000 files per year and 20 years. This means I need to break the data into 200 different subdirectories.

      I tried your method in post #29 here, but got a "no observations r(2000)" error going through the first loop. It seems the missing values (or whatever is causing the "no observations" errors) cause problems within loops. So it would be best to avoid loops and instead use the runby command (which works wonderfully). The only problem is that filelist can only handle 10,000 files at a time.

      Is there a way to have filelist consider more than 10,000 files in the second line of the above code? I don't need to keep more than 10,000 files in the 3rd line, but need to consider more than 10,000 in the 2nd line.

      Thanks again.



      • #18
        I did not see your post #16 until after I posted my #17. Apologies. It wasn't there when I started writing mine.



        • #19
          Robert,

          Thanks for all the time you spent on replying to my posts. Also, thanks for authoring 'filelist' and 'runby'.

          Your new code in #16 takes care of the form1 and form3 issue, but that was in an effort to avoid having to break up the 2M files into directories of fewer than 10,000. The problem is in the first line of code:

          Code:
          filelist, dir("text_files")
          I can't use loops because they stop when they find "no observations" and I can't use filelist because I have 100,000 files in each of 20 folders. Any ideas? Can you force filelist to consider more than 10,000 files?

          Thanks again,
          Kyle



          • #20
            Alternatively, could I delete all files not of type=form1 | type=form3 using the erase command? The files are backed up in zip format elsewhere on my HD. When I need access to other types I can just erase all existing files and unzip again and start over.

            Something like:

            Code:
            cd "text_files"
            local list : dir . files "*form1*.txt" "*form3*.txt"
            foreach f of local list {
            erase "`f'"
            }
            Once that is done, I would then carry out the above code (post #16). Would that work, assuming that after the erase command ran there were fewer than 10,000 form1 and form3 files remaining?



            • #21
              To clarify, filelist has no limit on the number of files it can handle and will happily scan your whole hard disk. The issue here is that the Mata function dir() has a hard-coded limit of 10,000 files it can return. See this post from 2015 that mentions the limit. What this means is that, in any given directory (ignoring its subdirectories), filelist will only return the first 10,000 files.
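To see which directories are large enough to hit this limit, the files can be counted per directory outside Stata. The following is only a minimal Python sketch and is not part of filelist; the function names `count_files_by_dir` and `dirs_over_limit` are made up for illustration, and the demo uses a tiny temporary directory with an artificially small limit.

```python
import os
import tempfile

def count_files_by_dir(root):
    """Return {directory: number of files directly inside it}."""
    counts = {}
    for dirpath, dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return counts

def dirs_over_limit(root, limit=10000):
    """List directories holding more than `limit` files (subdirectories not counted)."""
    return sorted(d for d, n in count_files_by_dir(root).items() if n > limit)

# Self-contained demo: 5 empty files in one subdirectory, limit lowered to 3.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "batch1"))
    for i in range(5):
        open(os.path.join(root, "batch1", f"file{i}_form1.txt"), "w").close()
    demo_counts = count_files_by_dir(root)
    demo_over = dirs_over_limit(root, limit=3)

print(sorted(demo_counts.values()))
print([os.path.basename(d) for d in demo_over])
```

With the default `limit=10000`, any directory reported by `dirs_over_limit` is one where a single unfiltered listing would be truncated.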

              You can still use the filelist approach as long as you can identify patterns that will reduce the number of files returned in a given directory to below 10,000. The following example will collect all form1 and form3 separately:

              Code:
              clear all
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str5 form
              "form1"
              "form3"
              end
              
              program get_in_parts
                local f = form
                filelist, dir("mega_files") pattern("*`f'*")
              end
              runby get_in_parts, by(form)
              Let me know if this works for you. If not, you can still get there via filelist by removing/renaming/copying files once they have been captured. It shouldn't be too hard to put together code for that but there's no point in trying if you can manage with the above technique.



              • #22
                Robert,

                You da man! That works.

                So now the entire thing would look like (I think the below steps are in the correct order):

                Code:
                clear
                input str5 form
                "form1"
                "form3"
                end
                
                program get_in_parts
                  local f = form
                  filelist, dir("mega_files") pattern("*`f'*")
                end
                runby get_in_parts, by(form)
                
                program import_txt
                  // move values of interest from variables to locals
                  local dsource = dirname
                  local fsource = filename
                  local id1 = id
                  local date1 = date
                  local form1 = form
                  
                  import delimited using `"`dsource'/`fsource'"', clear stringcols(_all) varnames(nonames)
                  
                  // get the desired info
                  keep if strpos(v1,"name:")
                  gen name = subinstr(v1,"name:","",1)
                
                  // copy over the file's information
                  gen sourcefile = `"`fsource'"'
                  gen sourcedir  = `"`dsource'"'
                  gen id = "`id1'"
                  gen date = "`date1'"
                  gen form = "`form1'"
                end
                
                runby import_txt, by(dirname filename) verbose
                
                save import_txt, replace
                That's awesome. Very grateful for the help.




                • #23
                  I am trying to make the above code work over many sub-directories, some of which have more than 10K files in them. To do this, I am trying to incorporate Robert's code here: https://www.statalist.org/forums/for...57#post1317257

                  The tricky part for me is:

                  1) I don't know how saving files works within runby.
                  2) I don't know how to pass the runby output file to the follow-on import_txt program.
                  3) I don't know how to build or store/save the runby output file.

                  I think I only need to manipulate program get_in_parts (lines 7-11).

                  I think the code should be something like:

                  Code:
                  program get_in_parts
                    local g = form
                    filelist, dir("mega_files") pattern("*`g'*")
                    local obs = _N
                    save "myfiles.dta", replace
                  
                  forvalues i=1/`obs' {
                      use "myfiles.dta" in `i', clear
                      local f = dirname + "/" + filename
                      insheet using "`f'", clear
                      gen source = "`f'"
                      save "mydata_`i'.dta", replace
                  }
                  
                  clear
                  forvalues i=1/`obs' {
                      append using "mydata_`i'.dta"
                  }
                  save "mydatacombo.dta", replace
                  
                  end
                  runby get_in_parts, by(form)


                  The code runs with no problems, but results in an empty data set when complete.

                  If I open "mydatacombo.dta", it does NOT have dirname or filename variables like it does when I run it within a specific folder with less than 10K files. It only has a source variable.

                  Any help is very much appreciated.

                  Thanks in advance. The filelist and runby commands are wonderful and have made my life better, easier, and faster. Am grateful for them.



                  • #24
                    Here is what -get_in_parts- is doing:

                    * get a pattern in variable -form-
                    * put all filenames matching this pattern in myfiles.dta
                    * for each filename, import and save in a new dta file
                    * append all the resulting dta files and save as mydatacombo.dta

                    The variables -dirname- and -filename-, which are found in myfiles.dta, should not appear in mydatacombo.dta (unless they also appear in the files you import). It does, however, have a -source- variable, which comes from the line "gen source = "`f'"" just before "save mydata_`i'.dta".

                    There is, however, a problem: -runby- will run -get_in_parts- once for each pattern and save each time to the same file, mydatacombo.dta, replacing the previous contents. In the end, you will have only the data from the last pattern. To make it work, you would have to save to mydatacombo_`g'.dta and then write another loop over the patterns to append all these combo files.

                    Hope this helps

                    Jean-Claude Arbaut

                    PS: if all you want is appending all the files in a specific folder into a single Stata datafile, it could be simpler to write a Python program that writes a Stata do file that does everything (import and append). Let Python retrieve the filenames (as it's very good at that and has no 10K limit), and Stata do the import. That's what I would do, anyway.
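As a rough illustration of that idea, here is a minimal Python sketch that builds the text of a do-file. It is not a tested production script: the helper names (`stata_quote`, `list_files`, `build_do_file`), the two example paths and the output name are assumptions for illustration, and the generated Stata commands simply reuse the import delimited options shown earlier in this thread.

```python
import os

def stata_quote(text):
    """Wrap text in Stata compound double quotes: `"text"'."""
    return '`"' + text + "\"'"

def list_files(root):
    """Return all files under root, recursing into subdirectories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

def build_do_file(paths, out_dta="mydatacombo_all.dta"):
    """Return the text of a do-file that imports every path and appends the results."""
    if not paths:
        raise ValueError("no files to import")
    # Start from an empty temporary dataset and grow it one file at a time.
    lines = ["clear", "tempfile combined", "save `combined', emptyok"]
    for path in paths:
        path = path.replace("\\", "/")  # forward slashes work in Stata on every platform
        lines.append("import delimited using " + stata_quote(path)
                     + ", clear stringcols(_all) varnames(nonames)")
        lines.append("gen source = " + stata_quote(path))
        lines.append("append using `combined'")
        lines.append("save `combined', replace")
    lines.append('save "' + out_dta + '", replace')
    return "\n".join(lines) + "\n"

# Hypothetical file names, for illustration only; in practice pass list_files("mega_files").
do_text = build_do_file(["mega_files/batch1/file1_form1.txt",
                         "mega_files/batch1/file2_form3.txt"])
print(do_text)
```

The printed text can be written to a .do file and run from Stata. Since Python builds the file list, the 10K limit does not come into play.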
                    Last edited by Jean-Claude Arbaut; 29 Apr 2018, 04:20.



                    • #25
                      Jean-Claude - thanks for the reply and clarification. I do not have a great understanding of runby. It is clearer now.

                      I did not know Python could write Stata do files. I am a novice at Python, so I would prefer to keep this in Stata if possible.

                      Is this closer?
                      Code:
                      program get_in_parts
                        local g = form
                        filelist, dir("mega_files") pattern("*`g'*")
                        local obs = _N
                        save "myfiles.dta", replace
                      
                      forvalues i=1/`obs' {
                          use "myfiles.dta" in `i', clear
                          local f = dirname + "/" + filename
                          insheet using "`f'", clear
                          gen source = "`f'"
                          save "mydata_`f'.dta", replace
                      }
                      
                      clear
                      forvalues i=1/`obs' {
                          append using "mydata_`f'.dta"
                      }
                      save "mydatacombo_`g'.dta", replace
                      
                      clear
                      forvalues i=1/`g' {
                          append using "mydatacombo_`g'.dta"
                      }
                      save "mydatacombo_all.dta", replace
                      
                      end
                      runby get_in_parts, by(form)
                      Thanks in advance.



                      • #26
                        You should probably put the last loop outside of -get_in_parts-.

                        Code:
                        program get_in_parts
                            local g = form
                            filelist, dir("mega_files") pattern("*`g'*")
                            local obs = _N
                            save "myfiles.dta", replace
                        
                            forvalues i=1/`obs' {
                                use "myfiles.dta" in `i', clear
                                local f = dirname + "/" + filename
                                insheet using "`f'", clear
                                gen source = "`f'"
                                save "mydata_`i'.dta", replace
                            }
                        
                            clear
                            forvalues i=1/`obs' {
                                append using "mydata_`i'.dta"
                            }
                            save "mydatacombo_`g'.dta", replace
                            glo patterns $patterns `g'
                        end
                        
                        runby get_in_parts, by(form)
                        
                        clear
                        foreach g in $patterns {
                            append using "mydatacombo_`g'.dta"
                        }
                        save "mydatacombo_all.dta", replace
                        Side note: be aware that patterns are not necessarily exclusive. For instance, "file12.csv" is matched by *1*, *2* and *12*, among others.
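The overlap is easy to check with shell-style wildcards. The Python sketch below uses the standard `fnmatch` module as a stand-in for the pattern matching done by filelist (an assumption: the two may differ in details such as bracket expressions); `matching_patterns` is an illustrative helper name.

```python
from fnmatch import fnmatchcase

def matching_patterns(filename, patterns):
    """Return the wildcard patterns that match filename (case-sensitive)."""
    return [p for p in patterns if fnmatchcase(filename, p)]

# One file, several overlapping patterns.
print(matching_patterns("file12.csv", ["*1*", "*2*", "*12*", "*3*"]))
# -> ['*1*', '*2*', '*12*']

# Anchoring the pattern at the end of the name removes the overlap.
print(matching_patterns("file3_form11.txt", ["*_form1.txt", "*_form11.txt"]))
# -> ['*_form11.txt']
```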

                        Another bug I overlooked: save "mydata_`f'.dta", replace is not correct, because f is a pathname, with one or more slashes: Stata would try to save a file "`filename'.dta" in a directory called "mydata_`dirname'", and this directory will likely not exist. However, the following forvalues tells me you really wanted to use `i' instead of `f' in the name. I've made the correction in my answer above.

                        About Python: it can easily write a text file, and you are free to put Stata commands in what you write, so even if it's not spelled out in the Python documentation, of course it can "write a do file" (I have also used Python to write SAS programs in the past, and to prepare SAS formats and Stata labels from raw csv data, and for many other similar tasks). Even if you don't write the full do file with Python, you could still write a list of filenames (one name per row) that you could read in Stata and then use to import the data. But I'll leave this, as it's off-topic here and you prefer to use Stata. Another possibility, still within Stata, would be to write a plugin (either Java or C/C++) that does the job of finding filenames, but that would be a more "advanced" project. However, that would be much more robust (see above for the risk with patterns).

                        For the record, here is a Python program that prints a list of filenames. You can redirect the output to a text file and import this in Stata. That's the basis of several programs I use (here I removed all error checking to make it as simple as possible).

                        Code:
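                        import sys, os
                        
                        def readdir(path):
                            # print every file under path, recursing into subdirectories
                            for name in os.listdir(path):
                                c = os.path.join(path, name)
                                if os.path.isfile(c):
                                    print(c)
                                elif os.path.isdir(c):
                                    readdir(c)
                        
                        # the directory to scan is the first command-line argument
                        readdir(sys.argv[1])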
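The script above expects the directory as its first command-line argument; run without one, `sys.argv[1]` raises an IndexError. As a sketch only, here is a variant that puts back a little of the error checking. The names `list_files` and `main` and the script name listfiles.py are hypothetical.

```python
import os
import sys

def list_files(path):
    """Yield every file under path, recursing into subdirectories."""
    for entry in os.scandir(path):
        if entry.is_file():
            yield entry.path
        elif entry.is_dir():
            yield from list_files(entry.path)

def main(argv):
    """Print one filename per line; return a non-zero code on bad usage."""
    if len(argv) != 2 or not os.path.isdir(argv[1]):
        print("usage: python listfiles.py DIRECTORY", file=sys.stderr)
        return 2
    for name in list_files(argv[1]):
        print(name)
    return 0

if __name__ == "__main__":
    main(sys.argv)
```

It would be called from a terminal as, for example, `python listfiles.py mega_files > filenames.txt`, and the resulting text file can then be imported in Stata.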
                        Last edited by Jean-Claude Arbaut; 29 Apr 2018, 08:45.



                        • #27
                          Jean-Claude, thanks again for the help.

                          A couple more questions.

                          You mention in your side note that patterns are not exclusive. Will this cause a problem in the last loop? What is the problem? And is there a simple way to resolve it (besides writing a plugin)?

                          Also, the last loop fails to run. It gives an error "invalid syntax r(198)". I thought maybe it should be `g' instead of g in the last loop, but that didn't resolve it. Is it because g is defined as a local variable in get_in_parts and as a global later?
                          Last edited by Kyle Smith; 29 Apr 2018, 09:12.



                          • #28
                            If your patterns are not exclusive, then the same file can appear several times in the listings and thus be appended several times. This is probably a problem for you.
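One way around it is to keep each file only once when the listings are combined. The Python sketch below shows the idea on a small made-up list of names (`collect_unique` is an illustrative helper; `fnmatch` stands in for the pattern matching used by filelist). In Stata, a comparable step would presumably be dropping duplicates on dirname and filename before importing.

```python
from fnmatch import fnmatchcase

def collect_unique(filenames, patterns):
    """Match filenames against each pattern in turn, keeping each file only once."""
    seen = set()
    unique = []
    for pattern in patterns:
        for name in filenames:
            if fnmatchcase(name, pattern) and name not in seen:
                seen.add(name)
                unique.append(name)
    return unique

# "*form1*" also matches the form11 file, so without the `seen` check
# that file would be listed twice.
names = ["file1_form1.txt", "file2_form11.txt", "file3_form3.txt"]
print(collect_unique(names, ["*form1*", "*form11*", "*form3*"]))
# -> ['file1_form1.txt', 'file2_form11.txt', 'file3_form3.txt']
```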

                            The syntax error may come from the form of patterns. I should have asked first: what does the form variable look like? In case Stata is not very happy with it (in foreach), you can use numbers instead:

                            Code:
                            program get_in_parts
                                local g = form
                                filelist, dir("mega_files") pattern("*`g'*")
                                local obs = _N
                                save "myfiles.dta", replace
                            
                                forvalues i=1/`obs' {
                                    use "myfiles.dta" in `i', clear
                                    local f = dirname + "/" + filename
                                    insheet using "`f'", clear
                                    gen source = "`f'"
                                    save "mydata_`i'.dta", replace
                                }
                            
                                clear
                                forvalues i=1/`obs' {
                                    append using "mydata_`i'.dta"
                                }
                                glo last=$last+1
                                save "mydatacombo_$last.dta", replace
                            end
                            
                            glo last=0
                            runby get_in_parts, by(form)
                            
                            clear
                            forv i=1/$last {
                                append using "mydatacombo_`i'.dta"
                            }
                            save "mydatacombo_all.dta", replace



                            • #29
                              Jean-Claude, Thanks again. I think we are getting closer.

                              When I run the last loop now I get the error "no variables defined r(111)".

                              If I open one of the mydata_`i' files, the variable source is there. Somehow we are losing that variable in the last loop? Any idea why?



                              ****************

                              If I try your Python code, but add a line at the beginning (after the import command) like:

                              Code:
                              path="C:\\mega_files"
                              I get the following error.

                              Traceback (most recent call last):
                                File "test1.py", line 24, in <module>
                                  readdir(sys.argv[1])
                              IndexError: list index out of range

                              Last edited by Kyle Smith; 29 Apr 2018, 10:24.



                              • #30
                                Kyle Smith, you are reviving a thread that's several months old and you seem lost compared to where it was left with #21 (and #22). The code in #21 will process in one pass as many subdirectories as you have and will create a dataset of millions and millions of file names if that's what you have. The only limitation is that Stata will not return more than 10,000 file names from a single directory (ignoring those in its subdirectories). This is a hard-coded limitation of Stata that still exists in the most up-to-date version of Stata (I just checked).

                                To illustrate the issue, here's code that will create a little over 50K files into a "mega_files" directory within Stata's current directory. The files are split into 3 subdirectories called "batch1", "batch2", and "batch3".
                                Code:
                                clear all
                                set seed 3213
                                set obs 11
                                gen form = _n
                                expand 3
                                bysort form: gen batch = _n
                                expand runiformint(1000,2000)
                                bysort form batch: gen path = "mega_files/batch" + string(batch) + ///
                                    "/file" + string(_n) + "_form" + string(form) + ".txt"
                                
                                cap mkdir mega_files
                                cap mkdir mega_files/batch1
                                cap mkdir mega_files/batch2
                                cap mkdir mega_files/batch3
                                
                                program doit
                                    local fpath = path
                                    save "`fpath'"
                                end
                                runby doit, by(path)
                                It's easy to check whether the 10K limit is biting you: simply check how many files filelist has returned per directory:
                                Code:
                                . filelist, dir("mega_files")
                                Number of files found = 30000
                                
                                . contract dirname
                                
                                . list
                                
                                     +---------------------------+
                                     | dirname             _freq |
                                     |---------------------------|
                                  1. | mega_files/batch1   10000 |
                                  2. | mega_files/batch2   10000 |
                                  3. | mega_files/batch3   10000 |
                                     +---------------------------+
                                The code in #21 offers a workaround for the limitation, provided you spell out a list of patterns that each pick up fewer than 10,000 files in any given directory. In the "mega_files" directory, all files follow a pattern that I can identify. Files end with "_form1.txt", "_form2.txt", ..., "_form11.txt". With this information in hand, I can overcome the 10,000-file limit using:
                                Code:
                                clear all
                                input str11 form
                                "_form1.txt" 
                                "_form2.txt" 
                                "_form3.txt" 
                                "_form4.txt" 
                                "_form5.txt" 
                                "_form6.txt" 
                                "_form7.txt" 
                                "_form8.txt" 
                                "_form9.txt" 
                                "_form10.txt"
                                "_form11.txt"
                                end
                                
                                program get_in_parts
                                  local p = form
                                  filelist, dir("mega_files") pattern("*`p'")
                                  gen form = "`p'"
                                end
                                runby get_in_parts, by(form) verbose
                                
                                contract dirname form
                                assert _freq < 10000
                                
                                list, sepby(dirname)
                                And here are the results:
                                Code:
                                . list, sepby(dirname)
                                
                                     +-----------------------------------------+
                                     | dirname                    form   _freq |
                                     |-----------------------------------------|
                                  1. | mega_files/batch1    _form1.txt    1570 |
                                  2. | mega_files/batch1   _form10.txt    1504 |
                                  3. | mega_files/batch1   _form11.txt    1768 |
                                  4. | mega_files/batch1    _form2.txt    1445 |
                                  5. | mega_files/batch1    _form3.txt    1474 |
                                  6. | mega_files/batch1    _form4.txt    1035 |
                                  7. | mega_files/batch1    _form5.txt    1781 |
                                  8. | mega_files/batch1    _form6.txt    1137 |
                                  9. | mega_files/batch1    _form7.txt    1648 |
                                 10. | mega_files/batch1    _form8.txt    1484 |
                                 11. | mega_files/batch1    _form9.txt    1923 |
                                     |-----------------------------------------|
                                 12. | mega_files/batch2    _form1.txt    1633 |
                                 13. | mega_files/batch2   _form10.txt    1417 |
                                 14. | mega_files/batch2   _form11.txt    1191 |
                                 15. | mega_files/batch2    _form2.txt    1031 |
                                 16. | mega_files/batch2    _form3.txt    1903 |
                                 17. | mega_files/batch2    _form4.txt    1506 |
                                 18. | mega_files/batch2    _form5.txt    1329 |
                                 19. | mega_files/batch2    _form6.txt    1942 |
                                 20. | mega_files/batch2    _form7.txt    1141 |
                                 21. | mega_files/batch2    _form8.txt    1877 |
                                 22. | mega_files/batch2    _form9.txt    1645 |
                                     |-----------------------------------------|
                                 23. | mega_files/batch3    _form1.txt    1963 |
                                 24. | mega_files/batch3   _form10.txt    1540 |
                                 25. | mega_files/batch3   _form11.txt    1408 |
                                 26. | mega_files/batch3    _form2.txt    1981 |
                                 27. | mega_files/batch3    _form3.txt    1180 |
                                 28. | mega_files/batch3    _form4.txt    1330 |
                                 29. | mega_files/batch3    _form5.txt    1225 |
                                 30. | mega_files/batch3    _form6.txt    1437 |
                                 31. | mega_files/batch3    _form7.txt    1188 |
                                 32. | mega_files/batch3    _form8.txt    1979 |
                                 33. | mega_files/batch3    _form9.txt    1665 |
                                     +-----------------------------------------+
                                
                                .
                                The code you posted in #23 refers to a post of mine that predates runby. What you posted in #23 and in subsequent posts to fix the issue does not make sense to me. Your task is first to make a dataset of all the files you want to process. That follows the format in #21 and the example above. Only once you are satisfied that the list is complete and that there are no remaining issues with the 10K limit should you proceed to import content from each file, as you do in #22.

                                If I'm missing something, please clarify what the issue is.

