  • Append multiple files based on variable conditions

    Hi,

    I have a lot of datasets to append, but Stata runs out of memory because the files are large and contain a lot of variables. I tried the following, but it seems like the 'keep' function does not allow these options:

    Code:
    local f : dir . files "*.dta"
    
     display as result `"`f'"'
    
     append using `f' if year>2009
    I just need two variables (IncomeGroup and HBP) and only for the years after 2009.

    Please let me know if this is possible to do.

    Thank you.

  • #2
    Try this:

    Code:
    clear
    tempfile building
    save `building', emptyok
    local files: dir . files "*.dta"
    foreach f of local files {
        use IncomeGroup HBP year if year > 2009
        drop year
        append using `building'
        save `"`building'", replace
    }



    • #3
      In reply to Clyde Schechter (#2):
      Thanks Clyde. It seems this is not working. I used the following options (using raw code names as variables are not renamed yet):

      Code:
      clear
      tempfile building
      save `building', emptyok
      local files: dir . files "*.dta"
      foreach f of local files {
          use v012 v024 v007 if v007 > 2009
          drop v007
          append using `building'
          save `"`building'", replace
      }
      The error is:

      invalid 'v024'
      r(198);

      When I replace v024 with another variable, it shows the same error for that variable, and so on. And when I keep only the year variable, it shows "is not a valid command name":

      Code:
      clear
      tempfile building
      save `building', emptyok
      local files: dir . files "*.dta"
      foreach f of local files {
          use   v007 if v007 > 2009
          drop v007
          append using `building'
          save `"`building'", replace
      }



      • #4
        The command -use v012 v024 v007 if v007 > 2009- is invalid syntax. The error message is, unfortunately, misleading, as v024 is not the problem. The problem is that you have not specified any file name in the -use- command. It should be
        Code:
        use v012 v024 v007 if v007 > 2009 using `f'
        That, evidently, was my original mistake, which you just copied. Sorry about that. File-management code like this is tricky because it is difficult to set up a good test for it, and it is easy to make mistakes in code that goes untested. My apologies.



        • #5
          In reply to Clyde Schechter (#4):
          Thanks so much Clyde! This is a great help already. It seems like the errors are happening due to the messiness of the data. It keeps showing different errors on different variable names and values.

          In light of some past lessons, I think this is happening because certain variables are string in one dataset and numeric in another. I tried using the 'capture' function, which I believe forces a command to run to the end even if one of the datasets doesn't meet the conditions.

          I will try this method for other projects and will keep updating this post. I think this is an important concern for those who face 'system refuses to provide more memory' issues.


          Code:
          tempfile building
          save `building', emptyok
          local files: dir . files "*.dta"
          foreach f of local files {
              capture use vs_* h1* if v012<20 using `f'
              append using `building'
              save `"`building'"', replace
          }



          • #6
            I tried using the 'capture' function which I believe forces to run a command till the end even if a dataset in between doesn't meet the conditions.
            I wouldn't do that. Error messages are not something to be avoided. They are welcome warnings of trouble ahead. Except when they are just syntax errors, they are warning you that your data are not appropriate to the commands you are issuing. Suppressing their effects with -capture- just sweeps the problems under the rug. The likely outcome is that your final file will just be a big pile of garbage. -capture- should not be used unless the error condition being suppressed is simply expected and unimportant. Even then, it should be used with caution.

            First, with apologies, let's attend to another error I originally made, that you copied. The -use- command needs to have the -clear- option added to it: -use vs_* h1* if v012 < 20 using `f', clear-. (DON'T use -capture-.)

            So do that first and see if that eliminates the errors.

            If not, the next thing to wonder is where the incompatibilities among the files being combined are. Install the -precombine- command, by Mark Chatfield, from Stata Journal (http://www.stata-journal.com/software/sj15-3). Run it on all those files to find out where you have variables that are string in some files and numeric in others. Then you have to actually look at the files and make a decision whether to convert the strings to numeric (and, if so, how), or convert the numeric ones to string. But, ultimately, you need all the variables to have the same storage type in all the files. After you've made your fixes, run -precombine- again to make sure you fixed everything and didn't create any new problems. Then try again.
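
            As a rough complement to -precombine-, the storage type of a single suspect variable can be spot-checked across files by hand. The sketch below is only an illustration: it assumes every file in the current directory contains v024 (used here purely as an example name from this thread) and has at least one observation. -use- will stop with an error on any file where that is not the case, which is itself informative.

            Code:
            local files : dir . files "*.dta"
            foreach f of local files {
                * read one observation of one variable, which is cheap even for large files
                use v024 in 1 using `"`f'"', clear
                * store the storage type (byte, int, float, str8, ...) in a local macro
                local t : type v024
                display as text `"`f'"' as result "  `t'"
            }

            Files whose type starts with "str" are the ones that will clash with files where the variable is numeric.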





            • #7
              In reply to Clyde Schechter (#6):


              Thanks Clyde! I was never really sure what -capture- was doing in the background, and it actually didn't do the job in most cases. I now use it only when there is a desperate need to continue a process.

              Regarding the use of -clear-, it is keeping only the last dataset and not appending the rest.

              The -precombine- command is a wonderful tool; thank you for sharing it. I tried it, and the output is so long that most of it scrolls out of view at the top. Do you by chance know a way to force Stata to show the full result of a process, or to capture it to the clipboard?
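
              One possible way to keep long output, offered as a sketch and not as something suggested in the replies: open a plain-text log before running the command and close it afterwards, then read the log in any text editor. The file name combine_check.log is hypothetical, and -describe using- merely stands in for whatever command produces the long output.

              Code:
              * write everything shown in the Results window to a text file
              log using combine_check.log, text replace
              local files : dir . files "*.dta"
              foreach f of local files {
                  * list the variables and storage types in each file without loading it
                  describe using `"`f'"'
              }
              log close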



              • #8
                Regarding the use of -clear-, it is keeping only the last dataset and not appending the rest.
                That should not be. Please post the full exact code you are using.

                That said, you can probably omit the -clear- after all. Because the data in memory are saved at the end of the loop, -use- will not object to overwriting it.
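
                For reference, the corrections from this thread can be put together as in the sketch below. It assumes the variable names from #3 (v012, v024, v007) and that every .dta file in the current directory should be included; it has not been run against the poster's data. The -clear- option is kept on -use-, although, as noted above, it is probably not needed.

                Code:
                clear
                tempfile building
                * start from an empty dataset so the first append has something to add to
                save `building', emptyok
                local files : dir . files "*.dta"
                foreach f of local files {
                    * load only the needed variables and observations from the current file
                    use v012 v024 v007 if v007 > 2009 using `"`f'"', clear
                    drop v007
                    * add everything accumulated so far, then save the running total
                    append using `building'
                    save `"`building'"', replace
                }

                When the loop finishes, the combined data are in memory and in the temporary file. Because temporary files are erased when the do-file or session ends, a final -save- under a permanent name is still needed.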

