
  • #31
    Robert - Thanks a lot for the help. I was confused. It is clearer now.

    Quick question. If the forms are organized by year and month, and I am sure there will NOT be more than 10K forms within a given month (but there may be more than 10K within a given year), is there an easy way to write a loop instead of writing out all the possible year*month combinations in the first step?

    Specifically:

    Code:
    clear all
    input str18 form
    "200501*_form1.txt"
    "200502*_form1.txt"
    "200503*_form1.txt"
    "200504*_form1.txt"
    "200505*_form1.txt"
    "200506*_form1.txt"
    "200507*_form1.txt"
    "200508*_form1.txt"
    "200509*_form1.txt"
    "200510*_form1.txt"
    "200511*_form1.txt"
    "200512*_form1.txt"
    "200601*_form1.txt"
    end
    Thanks in advance.



    • #32
      This should work if the format is consistent with the example you posted:

      Code:
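      clear
      set obs 10                     // one observation per year
      gen year = 2000 + _n           // years 2001 through 2010
      expand 12                      // 12 copies of each year
      bysort year: gen month = _n    // months 1 through 12 within each year
      // build patterns such as "200501*_form1.txt"
      gen pattern = string(year) + string(month,"%02.0f") + "*_form1.txt"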
      Last edited by Robert Picard; 29 Apr 2018, 12:42.
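
      To feed these patterns to filelist one month at a time, the runby approach used elsewhere in this thread could be adapted along the following lines. This is only a sketch. It assumes that runby and filelist (both from SSC) are installed, that the files sit under a directory named "mega_files" as in the other examples in this thread, and that the pattern variable created above is in memory. The program name get_by_month is made up for illustration.

      Code:
      program get_by_month
        // copy the pattern of the current by-group into a local
        local p = pattern
        // list only the files that match this year-month pattern
        filelist, dir("mega_files") pattern("`p'")
      end
      runby get_by_month, by(pattern)

      Each call then searches for a single year-month pattern, which is the point of splitting the search by month.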



      • #33
        Thank you very much for your time today Robert.

        You da man.



        • #34
          You could also drop into Mata to get the lists of files to iterate over if the number of items is a concern.



          • #35
            wbuchanan, how does your suggestion relate to this thread? Care to elaborate?



            • #36
              Before this thread fades away, and in the interest of users who end up here via a search engine: wbuchanan is mistaken. The 10,000-file limit is hardcoded in Mata's dir() function, so "dropping" into Mata is not going to help a user who has more than 10,000 files in a single directory. See this 2015 post that confirms the issue. Until StataCorp decides to address it, the workaround is to split the files into sub-directories or to search for specific patterns, as I have explained in this thread.
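
              A quick way to check whether a given pattern is running into that ceiling is to count the matches directly in Mata. This is only a sketch: the directory name "mega_files" and the pattern are borrowed from examples elsewhere in this thread, and reading a count of exactly 10,000 as a sign of truncation is an assumption based on the limit described above. Note that dir() looks at a single directory and does not descend into sub-directories.

              Code:
              mata:
              // dir(directory, file type, pattern) returns a column vector of matching names
              files = dir("mega_files", "files", "200501*_form1.txt")
              // display the number of matches
              rows(files)
              end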



              • #37
                I am running code very similar to #21 in this thread (I just cut and pasted that code here so readers don't have to flip back to see it) and it is causing Stata to crash. If I run it on only a couple of files it works fine, but if I run it on many files (some of which are quite large), it crashes.

                runby and filelist are from SSC.
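
                For readers who do not have them yet, both can be installed from within Stata:

                Code:
                ssc install runby
                ssc install filelist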

                Code:
                clear
                input str5 form
                "form1"
                "form3"
                end
                
                program get_in_parts
                  local f = form
                  filelist, dir("mega_files") pattern("*`f'*")
                end
                runby get_in_parts, by(form)
                
                program import_txt
                  // move values of interest from variables to locals
                  local dsource = dirname
                  local fsource = filename
                  local id1 = id
                  local date1 = date
                  local form1 = form
                  
                  import delimited using `"`dsource'/`fsource'"', clear stringcols(_all) varnames(nonames)
                  
                  // get the desired info
                  keep if strpos(v1,"name:")
                  gen name = subinstr(v1,"name:","",1)
                
                  // copy over the file's information
                  gen sourcefile = `"`fsource'"'
                  gen sourcedir  = `"`dsource'"'
                  gen id = "`id1'"
                  gen date = "`date1'"
                  gen form = "`form1'"
                end
                
                runby import_txt, by(dirname filename) 
                
                save import_txt, replace
                I asked a CompSci friend and he said it might be crashing because "the program is adding (indexing) all of the files to memory. When your memory cannot hold a very large number of files, the OS will terminate the program." I suspect the friend is right because the program works fine if I run it on a small number of files.

                Is there a way to open/close an input file (one at a time) and then append an output file after each iteration so Stata doesn't have to store all the files in memory (if that is what it is doing)?
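
                To make the question concrete, the kind of loop I have in mind might look like the sketch below. It is untested against the real files and rests on several assumptions: the dataset name file_index is hypothetical, the pattern "*form1*" covers only one of the two forms, every input file has at least one column (so that v1 exists), and only the name and source information are kept. Whether it avoids the crash is not something this thread establishes, and re-reading the growing output file on every pass will get slow with many files.

                Code:
                // build the list of files once and keep it on disk
                filelist, dir("mega_files") pattern("*form1*")
                save file_index, replace
                local n = _N

                forvalues i = 1/`n' {
                  // load only the i-th row of the file list
                  use in `i' using file_index, clear
                  local dsource = dirname
                  local fsource = filename

                  import delimited using `"`dsource'/`fsource'"', clear stringcols(_all) varnames(nonames)

                  // keep only the lines of interest
                  keep if strpos(v1,"name:")
                  gen name = subinstr(v1,"name:","",1)
                  gen sourcefile = `"`fsource'"'
                  gen sourcedir  = `"`dsource'"'
                  keep name sourcefile sourcedir

                  // add what has been collected so far, then write it back to disk
                  if `i' > 1 append using import_txt
                  save import_txt, replace
                }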

                Thanks in advance for the help. Big thanks to Robert Picard and Clyde Schechter for writing these very useful commands (runby and filelist).

