Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Appending multiple datasets in loop

    Hi there,

    I am using Stata 18. Using the code below I am trying to use a loop to append individual datasets for each region in each year (for each outcome/season/age) into one dataset per year. However, not all years have datasets for 9 all regions. As a result, Stata doesn't save a combined dataset for that year. Because I've used 'Capture' the loop runs to the end. Is there a way to get Stata to save each year's data when there aren't datasets for every region?

    Thank you in advance,

    Mel

    Code:
     Combine the regions for each year into one dataset
    cd "datasets\age"
    foreach outcome in anx cvd dep eatdis psy rti selfharm sleep{
        foreach season in Spring Autumn{
                foreach year in 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019{
                    foreach age in 0 1{
        capture{
            use "`outcome'_`season'_`year'_region1_age`age'.dta", clear
            foreach region in region2 region3 region4 region5 region6 region7 region8 region9{
            append using "`outcome'_`season'_`year'_`region'_age`age'.dta"
            }
            save "datasets\age\regions_combined/`outcome'_`season'_`year'_allregions_age`age'.dta", replace        
        }        
    }    
    }
    }
    }
    Last edited by Mel de Lange; 16 Jan 2025, 09:37.

  • #2
    Code:
    foreach year 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
    should be
    Code:
    foreach year in 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
    and would be even better as
    Code:
    forval year = 2008/2019
    Can we trust the rest of your code, because problems could easily arise from small mistakes like that?

    Comment


    • #3
      It sounds like what you are saying is that there is a problem with this part:

      Code:
      use "`outcome'_`season'_`year'_region1_age`age'.dta", clear
      foreach region in region2 region3 region4 region5 region6 region7 region8 region9{
          append using "`outcome'_`season'_`year'_`region'_age`age'.dta"
      }
      Where if one of the regions 2-9 doesn't exist, the file you try to append doesn't exist, and there is an error that causes the loop to exit prematurely? Is there always a file for region 1? You can deal with the 2-9 case by putting capture before the append command only instead of capturing the whole block.

      Code:
      use "`outcome'_`season'_`year'_region1_age`age'.dta", clear
      foreach region in region2 region3 region4 region5 region6 region7 region8 region9{
          capture append using "`outcome'_`season'_`year'_`region'_age`age'.dta"
      }
      If region 1 can also be missing, you need something a little more complicated. Something along these lines should work.

      Code:
      local first = 1
      foreach region in region1 region2 region3 region4 region5 region6 region7 region8 region9{
          if `first' {
              capture use "`outcome'_`season'_`year'_`region'_age`age'.dta", clear
              if !_rc {
                  local first = 0
              }
          }
          else {
              capture append using "`outcome'_`season'_`year'_`region'_age`age'.dta"
          }
      }
      Edit: crossed with #2
      Last edited by Daniel Schaefer; 16 Jan 2025, 09:57.

      Comment


      • #4
        Sorry Nick - That wasn't the issue, that was just me making a mistake editing the code I'd copied onto here. I've corrected the code above to show the code I ran. Sorry again.

        Comment


        • #5
          Mel de Lange You have to use the region local instead of hard coding region1 in the use command for #3 to work. I've edited that post to include that correction.

          Comment


          • #6
            Daniel Schaefer Thank you so much - your code works and I now have a dataset for each year so you definitely solved the question that I asked.

            However, in achieving that, I've realised that there isn't always a dataset for each age, but running this code creates a dataset (of the data for the previous age category) for each year for the non-existent age category!

            Do I need to apply 'capture' somewhere around this bit of code?
            Code:
            foreach age in 0 1{
            Just for reference, there are datasets for every outcome, season and year so it is just age that is the issue now.....

            Thank you again - your help is massively appreciated by a PhD student who's never had to do anything this complicated before!

            Comment


            • #7
              I think this approach can be improved in two ways. First, there is no need to treat the first file (whichever it is) separately. -append- will work just fine if you start with an empty data set and then bring in all of the files to be appended in one fell swoop. Second, rather than -capture-ing the calls to missing files, it is, in my opinion, simpler to first generate a list of the existing filenames and use that. So:
              Code:
              cd "datasets\age"
              foreach outcome in anx cvd dep eatdis psy rti selfharm sleep {
                  foreach season in Spring Autumn {
                      foreach year in 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 {
                          foreach age in 0 1 {
                              local filenames: dir "." files "`outcome'_`season'_`year'_region*_age`age'.dta"
                              clear
                              append using `filenames'
                              save "datasets\age\regions_combined/`outcome'_`season'_`year'_allregions_age`age'.dta", replace  
                          }    
                      }
                  }
              }

              Comment


              • #8
                I'm surprised you say the code in #3 creates new data of any kind, since it should only append together files that exist; it should not create new observations. If a file for age==0 or age==1 is missing, the file should be skipped and no changes should be made to the dataset in memory. The code should work for missing age files as well as it works for missing region files. What does the "extra" data look like? Do the extra observations appear as missing values?

                Comment


                • #9
                  I think my response in #7 crossed with #6, or, at least, I didn't notice and read #6 before posting.

                  Like Daniel Schaefer, I don't see why the code in #3 would produce a file for a non-existent age group. In fact, it should halt with an error message at -use "`outcome'_`season'_`year'_region1_age`age'.dt a", clear- since that file will not exist.

                  Be that as it may, if you want to try the approach shown in #7, a modification of it will work properly for non-existent age groups (i.e. not create a file for them):
                  Code:
                  cd "datasets\age"
                  foreach outcome in anx cvd dep eatdis psy rti selfharm sleep {
                      foreach season in Spring Autumn {
                          foreach year in 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 {
                              foreach age in 0 1 {
                                  local filenames: dir "." files "`outcome'_`season'_`year'_region*_age`age'.dta"
                                  clear
                                  if `"`filenames'"' != "" {
                                      append using `filenames'
                                      save "datasets\age\regions_combined/`outcome'_`season'_`year'_allregions_age`age'.dta", replace
                                  }  
                              }    
                          }
                      }
                  }
                  In fact, with this modification, the code will properly handle a non-existent value for any of outcome, season, year, age, or region.

                  Comment


                  • #10

                    Daniel Schaefer it is very odd. I noticed it when I ran the code for sex rather than age (code below). There are 3 categories of sex and very few regions/years have data for the third sex category (it is actually the people with sex missing) so I should only end up with a few files for the 3rd category. Instead I have a sex3 file for every year. The sex3 files created that shouldn't have been are exactly the same as the sex2 files i.e. anx_Autumn_2008_allregions_sex3 is an exact replica of anx_Autumn_2008_allregions_sex2. The files have a variable for sexcategory and it is 2 even though the filename is 3! So it seems like when it finds there isn't a category 3 it saves what is in memory but then carries on again correctly after that?!

                    Code:
                    cd "datasets\sex"
                    clear
                    foreach outcome in anx cvd dep eatdis psy rti selfharm sleep{
                        foreach season in Spring Autumn{
                                foreach year in 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019{
                                    foreach sex in 1 2 3{
                                        
                    local first = 1
                    foreach region in region1 region2 region3 region4 region5 region6 region7 region8 region9{
                        if `first' {
                            capture use "`outcome'_`season'_`year'_`region'_sex`sex'.dta", clear
                            if !_rc {
                                local first = 0
                            }
                        }
                        else {
                            capture append using "`outcome'_`season'_`year'_`region'_sex`sex'.dta"
                        }
                    }
                    save "datasets\sex\regions_combined/`outcome'_`season'_`year'_allregions_sex`sex'.dta", replace        
                        }        
                    }    
                    }
                    }

                    Comment


                    • #11
                      @Clyde Schecter Thank you - your code has worked. I now have the files I need.

                      Daniel Schaefer Thank you for your help too.

                      Comment


                      • #12
                        Clyde Schechter please note that when I refer to the code in three, I am referring to the last codeblock which captures the use command on error. I do agree that your code in #7 is an improvement since in general using capture risks capturing an error other than the intended error.

                        Mel de Lange Notice your original code in #1 does not include anything about sex. It definitely helps to know all of the relevant details!

                        Notice your "save" command is inside of the loop over sex, so you will write a new sex file every time. You only clear the dataset when you successfully load the first sex file. So what happens is, if there is no third sex, the second file gets left in memory, you skip past the command that loads and clears the data, then you save what you have in memory to the sex 3 file.

                        I think Clydes code from #9 is what you want since it accounts for that by checking to see if there are any files to combine before trying to save anything.

                        Comment


                        • #13
                          Daniel Schaefer Yes, sorry. I was looking at an earlier code block in #3 and so I didn't recognize that, in the end, that -use- command got -capture-d.

                          Comment


                          • #14
                            @Daniel Schaefer Thank you - That all makes sense. Thank you for taking the time to explain what I had wrong!

                            Comment

                            Working...
                            X