Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge multiple csv files

    Dear all,

    I have a set of csv files named firm1, firm2, ..., firmn. Each file contains a specific date range (for instance, firm1 may have dates from Jan 1, 2000 to Dec 31, 2015 and firm2 may have dates from Apr 1, 2008 to Dec 31, 2020) and data for four variables with the same name in each file (var1, var2, var3 and var4). I want to merge all these csv files into one with dates from Jan 1, 2000 to Dec 31, 2020 (max range), keep observations according to their dates, and for the four variables to add as termination the name of the firm (e.g., var1firm1, var1firm2, var2firm1, var2firm2 and so on).

    Thanks in advance for your help.

  • #2
    Step 1 is to convert each of your csv files to Stata datasets.
    Code:
    foreach f in firm1 firm2 firm3 firm4 {
        import delimited `f'.csv // add any options you need
        generate string40 firm = "`f'"
        save `f', replace
    }
    Step 2 is to append your data rather than merge it, creating a panel data structure which almost certainly is what you want to use for your analyses (see the Stata Longitudinal Data/Panel Data Reference Manual included in your Stata installation and accessible from Stata's Help menu).
    Code:
    append using firm1 firm2 firm3 firm4
    save allfirms, replace
    Step 3 is if you insist on the structure you described, you can use the reshape command to transform your data from the long layout that this yields to the wide layout you describe.

    Comment


    • #3
      William Lisowski, thaks for the response. I've tried this code:
      HTML Code:
      local myfilelist : dir . files "*.csv"
      foreach file of local myfilelist {
      drop _all
      import delimited using "`file'"
      local outfile = subinstr("`file'",".csv","",.)
      save "`outfile'", replace
      }
      The idea is that each file contains variables with same name, thus any
      HTML Code:
      append
      or
      HTML Code:
      merge
      command does not work because the name of variables should be different. A more difficult approach will be to rename all variables manually and then merge files, but I have over 500 firms and this will take a while. I do not know whether I can loop over them and change their names. Thanks.

      Comment


      • #4
        The variables have the same name in each file because the measure the same thing, right, just for different firms? So why shouldn't the datasets be appended?

        If you have height and weight data for 100 children for 5 years, you don't name the variables AliceHeight AliceWeight BobHeight BobWeight ... or Height2001 Weight2001 Height2002 Weight2002 ... . You have height and weight, and you have name and year to identify the individual.

        By adding the variable I called firm in my step 1, you will know for each observation which firm on what date the four variables represent.

        If you are familiar with Excel Pivot Tables, the data organization is similar. You have four numbers - your four variables - and then you have for each set of numbers two descriptors - the firm and the date.

        The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of data like yours rather than a wide layout of the same data. You should try to achieve what you need with the data organized as step 2 produces it, and seek the help of Statalist in doing so. The sort of problems you will encounter trying to use your data with 500 firms x 4 variables per form = 2000 similarly named variables will be difficult.

        And, I indicated in Step 3 that if you insist, long data can be transformed into wide data using the rehsape command. I just didn't spell out the code for doing that.

        Comment


        • #5
          William Lisowski, yes, you're right, I did't pay attention to the structure of the data after appending (I did not work before with the command), and the command
          HTML Code:
          generate string40 firm = "`f'"
          did not work initially, only after removing
          HTML Code:
          string40
          . Now it's perfect. Thanks a lot!

          Comment


          • #6
            My apologies, to generate a string variable it should have been
            Code:
            generate str40 firm = "`f'"
            Another example of Stata's inability to discern what the programmer meant, leading to an error message that, while technically correct, doesn't really help diagnose the problem.
            Code:
            . generate string40 firm = "`f'"
            too many variables specified
            r(103);

            Comment

            Working...
            X