Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Automating merge vs. append

    My program is for working with published tables from American Community Survey, accessible via American Fact Finder. One of the first steps of the program is to append multiple years (each from its own file) from a single table ID. The files may be, for example:

    Code:
    acs_10_1yr_b06009_with_ann.csv.dta
    acs_11_1yr_b06009_with_ann.csv.dta
    acs_12_1yr_b06009_with_ann.csv.dta
    (The program creates the .dta files from downloaded .csv files.)

    I would like to combine different tables such as the following. The file names will be different at various times the program is run.

    Code:
    acs_10_1yr_b06009_with_ann.csv.dta
    acs_11_1yr_b06009_with_ann.csv.dta
    acs_12_1yr_b06009_with_ann.csv.dta
    acs_10_1yr_b07009_with_ann.csv.dta
    acs_11_1yr_b07009_with_ann.csv.dta
    acs_12_1yr_b07009_with_ann.csv.dta
    (Note the different table numbers of B06009 and B07009.)

    The following code works but creates extra rows in the data with corresponding missing values:

    Code:
    local datafiles : dir . files  "*.dta"
    append using `datafiles', force
    I have altered the variable names before this step to avoid having the same variable names in B06009 data and B07009 data, making the use of append possible. The identifying variable, geoid2, has not been changed and could be used as a merge variable.

    The question is, how can I put the filenames into groups to first be appended to each other? The appended files would then be merged. In the above set of files, I would append all of the B06009's together, likewise the B07009's, then merge the two resultant files.

    Luckily, the filenames have the same format, though the length of the table name may change. I already have the existing code to get at the table name:

    Code:
    local datafiles : dir . files  "*.csv.dta"
    
    foreach file in `datafiles' {
    local filename_truncated = subinstr(`"`file'"', `"_with_ann.csv"', "", .) local table_name = substr("`filename_truncated'", 12, .)
    }
    Then, what I would like to do is use the resulting `table_name' to group the files. I'm imagining some sort of matrix with each file name and a group number. Or perhaps I could store the `table_name' in a data characteristic or global macro. Note that I am using version 11.2.

    Thanks

  • #2
    Originally posted by Kevin McCaffrey View Post
    The following code works but creates extra rows in the data with corresponding missing values...
    You are using the force option which can result in missing values if trying to append string and numeric types for a same variable. Are these not the missings you mention?

    Originally posted by Kevin McCaffrey View Post
    how can I put the filenames into groups to first be appended to each other? The appended files would then be merged. In the above set of files, I would append all of the B06009's together, likewise the B07009's, ...
    How about something like:

    Code:
    forvalues i = 6/7 {
        
        clear all
        set obs 0
        
        local myfiles : dir . files "*b0`i'009*.dta"
        display `"`myfiles'"'
    
        append using `myfiles'
        save "series`i'.dta", replace
        
    }
    
    use series6.dta
    list in 1/10
    
    use series7.dta
    list in 1/10
    You should:

    1. Read the FAQ carefully.

    2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

    3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

    4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

    Comment


    • #3
      As a couple of footnotes:

      1. I see no merge here. If append is the command envisaged, merge terminology can confuse.

      2. fs (SSC) is a convenient wrapper for the dir functionality here. It also generalizes that functionality slightly.

      Comment


      • #4
        Thanks Roberto, though I need code that is more general. Below is some working code I wrote today that handles the append and merge combination that I refer to in my post, but it seems there should be a shorter way. I would think that combining datasets with both append and merge is something that comes up frequently, so I was hoping someone had a clean way of doing this. My guess is that there's a better way to group like files that can be appended. Any suggestions?

        In R, I would use an array approach, and I might try that next time in Stata (relevant posts). For others' reference, I was glad that a key word search took me to this quick introduction to the extended macro function list. More info is on the help page for macro lists.


        Code:
        **THIS SECTION APPENDS YEARLY DATASETS BASED ON THE SAME ACS TABLE
        
        
        local datafiles : dir . files  "*.csv.dta"
        display `datafiles'
        
        *Determine the ACS tables involved
        foreach active_file in `datafiles' {
            
            display "Active file: `active_file'"
            local filename_truncated = subinstr(`"`active_file'"', `"_with_ann.csv.dta"', "", .)
            local table_name = substr("`filename_truncated'", 12, .)
            display "Current table: `table_name'"
            local table_list `table_list' `table_name'
            display "Table list: `table_list'"
            
        }    
        
        local tables : list uniq table_list
        display "Unique Tables: `tables'"
        
        *Append the yearly files for each ACS table
        display "Now checking filenames against the table list"
         
         foreach active_table in `tables' {
         
            clear
            display "Active Table: `active_table'"
        
            foreach active_file in `datafiles' {
                
                display "Active file: `active_file'"
                local filename_truncated = subinstr(`"`active_file'"', `"_with_ann.csv.dta"', "", .)
                local file_table_name = substr("`filename_truncated'", 12, .)
                display "Current file based on table: `file_table_name'"
                
                if `: list active_table == file_table_name' {
                    append using `active_file', force
                    save `active_table', replace
                }
            }
        }
            
        
        
        **THIS SECTION MERGES THE APPENDED DATASETS (MULTIPLE ACS TABLES GO INTO ONE .DTA FILE)
        
        
        *Remove the separate yearly files, which are intermediary and not needed, so they don't get pulled into the merge
        shell del *.csv.dta
        
        clear
        
        local datafiles : dir . files  "*.dta"
        local n : word count `datafiles'
        local n_1 : word 1 of `datafiles'
        
        use `n_1'
        
        forvalues i = 2/`n' {
        
            local using_file : word `i' of `datafiles'
            display "Merging using: `using_file'"
            merge 1:1 geoid2 year using `using_file', nogenerate
        } 
        
        save appended_and_merged_data

        Comment


        • #5
          You could save some lines nesting the string functions and changing other bits. This is one example up to the part where the appending goes on, but more are possible:

          Code:
          *----- save names of all files of interest -----
          
          /*
          Assume a directory with all types of files (pdf, jpg, etc.) 
          and three dta files to be analysed:
          
          sometext1234_b06009_some1234.dta" 
          sometexttext_b06009_sometext.dta" 
          sometexttext_b070095_sometext.dta"
          */
          
          local files : dir . files "sometext*"
          
          *----- create list of common parts of file names (e.g. b06009) -----
          
          foreach f of local files {    
              local files2 `files2' `=reverse(substr(reverse(substr("`f'", 14, .)), 14, .))'
              display "`files2'"
          }
          
          local filesu : list uniq files2
          
          *----- loop through files with common parts and append -----
          
          foreach f of local filesu {
          
              clear all
              
              local myfiles : dir . files "*`f'*.dta"
              display `"`myfiles'"'
              
              append using `myfiles'
              save "series`f'.dta"
          }
          You should:

          1. Read the FAQ carefully.

          2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

          3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

          4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

          Comment


          • #6
            OK, thanks for the suggestion.

            Comment

            Working...
            X