Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge and append excel files

    Hi
    I want to transform a set of around 1,500 Excel files (all in a similar format as the ones attached) into a single Stata file. The Stata file will have
    (1) the first column with the "Company Name" (in Cell B1 in each Excel file),
    (2) the second column with "Company Ticker" (in the Excel file name before the "_"),
    (3) the third column with "Date" (in the Excel file name after the "_").
    (4) some columns (one for each nationality) with the percentage of investors per nationality (the percentage of each investor is in column C and the nationality of that investor is in column K). These columns should be the sum of investors' percentage for each nationality.

    my data in excel files are something like this
    Company Energizer Holdings Inc
    RIC ENR
    Download Date 25-Feb-2024
    Shareholders Report
    # Investor Name % Outstanding Position (M) Position Change (M) Turnover Investor Type Investor Sub-Type Equity Assets ($, M) Investment Style City Country/Region
    31-May-2016 31-May-2016 31-May-2016 31-May-2016
    1 JP Morgan Asset Management 8.21% 5.08 0 Low Investment Managers Investment Advisor 6,58,717.11 GARP New York United States
    2 The Vanguard Group, Inc. 7.29% 4.51 0 Low Investment Managers Investment Advisor/Hedge Fund 65,91,312.17 Index Malvern United States
    3 The London Company of Virginia, LLC 7.22% 4.46 0 Low Investment Managers Investment Advisor 18,623.15 Core Value Richmond United States
    4 BlackRock Institutional Trust Company, N.A. 6.74% 4.17 0 Low Investment Managers Investment Advisor 35,20,976.98 Index San Francisco United States
    5 Gabelli Funds, LLC 3.70% 2.29 0 Low Investment Managers Investment Advisor/Hedge Fund 24,307.34 Core Value Rye United States
    6 Ceredex Value Advisors LLC 3.40% 2.10 0 Moderate Investment Managers Investment Advisor 5,759.02 Deep Value Orlando United States
    7 Westwood Management Corp. (Texas) 3.17% 1.96 0 Low Investment Managers Investment Advisor 12,055.33 Core Value Dallas United States
    8 State Street Global Advisors (US) 2.82% 1.74 0 Low Investment Managers Investment Advisor/Hedge Fund 21,58,203.83 Index Boston United States
    9 Janus Henderson Investors 2.73% 1.69 0 Low Investment Managers Investment Advisor/Hedge Fund 2,07,004.05 Core Growth London United Kingdom
    and my desired output would be like this
    Company Name Date Country name 1 Country name 2 ….
    First File Name before _ First File Name after _ Total of coloumn C of country name 1 in first excel file Total of coloumn C of country name 2 in first excel file …..
    Second File Name before _ second File Name after _ Total of coloumn C of country name 1 in second excel file Total of coloumn C of country name 2 in second excel file ….
    ….

  • #2
    Welcome to Statalist.

    To import Excel file, on the Stata main screen, go to "File" > "Import" and then choose the xlsx file option. A new panel will appear, and you'll have to choose the options that work for you. Generally, may want to check "Import first row as variable names". Then you can click OK to execute it. Check if the file was correctly imported to Stata.

    It also appears that your Excel data is not purely rectangular, if that's the case, you can use the "Cell Range" in the panel to decide what to cut off. Depending if you need that few row in Excel or not, this can be a more complicated process than it's described here. But there aren't enough details for me to tell. And I have to admit I don't see the relationship between the Excel data and the wanted format.

    Once you have done that, you should be able to see a Stata command logged under "History", you should see something like: import excel "C:\WhateverFolder\YourExcelFileName.xlsx", sheet("SomeTabName") firstrow. That is the command for this import work. Using that line and some clever integration with other commands such as foreach or program define, it is possible to loop it for the other 1,499 files. However, there are not enough details in #1 for us to help, you may have to come back with a more specific description once you are able to successfully import one file.

    Check the FAQ, and read carefully section 12.
    Last edited by Ken Chui; 19 Mar 2024, 08:20.

    Comment


    • #3
      Hey Ashish,

      to follow up on what Ken said: Ideally you have all the excel files in one folder on your computer/cloud. Then you can easily create a local that contains all file names and loop over this local. The changes in the variables you mention you would do at the point in the code I specified. As Ken said we would need more information and a data example for these changes, however.

      Code:
      * Define global for the path (alternatively set cd:)
      global path_excel "C:\Users\...\excel_files"
      
      * Get local with all 1,500 file names (note that only files ending with .xlsx will be considered)
      local filelist: dir "$path_excel" files "*.xlsx"
      
      * Use a loop to import and append
      local i = 0
      foreach file of local filelist {
          local i = `i'+1
          import excel using "$path_excel/`file'", firstrow sheet("..") clear
      
          * Now do some stuff and tranformations to your variables
          ....
      
          * Now save the first file and append the others
          if `i' ==1 {
              save "$path_excel\dataset.dta", replace
          }
          else {
              append using "$path_excel\dataset.dta"
              save "$path_excel\dataset.dta", replace
          }    
      }

      Best,
      Sebastian

      Comment


      • #4
        Thank you Ken and Sebastian, I got your point. but what I need is different from this. in my data files first few lines are description not data there. only cell B1 I want to put this value in my stata data file as first column, what should my stata data have...
        (1) the first column with the "Company Name" (in Cell B1 in each Excel file),
        (2) the second column with "Company Ticker" (in the Excel file name before the "_"),
        (3) the third column with "Date" (in the Excel file name after the "_").

        my data in each file started from row 6 in which there are variable names and from row 8 onwards the data. row 1 to 5 and 7 also I don't want to gather in this. row one only used to gather the info in cell B1. and then I write a code to find the unique country names in files to gather them and put them into columns in my stata data look like this
        Company Name Date Country name 1 Country name 2 ….
        First File Name before _ First File Name after _ Total of coloumn C of country name 1 in first excel file Total of coloumn C of country name 2 in first excel file …..
        Second File Name before _ second File Name after _ Total of coloumn C of country name 1 in second excel file Total of coloumn C of country name 2 in second excel file ….
        ….
        here is my code to find the unique country names


        clear all
        set more off

        // Specify the folder containing Excel files
        local folder "C:/Users/raoas/Desktop/Data"

        // Get a list of all Excel files in the folder
        local files : dir "`folder'" files "*.xlsx"

        // Initialize an empty list for unique country names
        local unique_countries ""

        // Loop through each Excel file
        foreach file of local files {
        // Import the Excel file
        import excel using "`folder'/`file'", clear

        // Extract unique country names from column L (assuming 8th row onwards)
        levelsof L if _n > 7, local(country_list)

        // Loop through each country in the country list
        foreach country of local country_list {
        // Check if the country is not already in the list of unique countries
        if strpos("`unique_countries'", "`country'") == 0 {
        // Append the country to the list of unique countries
        local unique_countries "`unique_countries' `country'"
        }
        }
        }

        // Display the final list of unique countries
        di "Unique countries across all files: `unique_countries'"

        after this code the list of unique countries I found I want to put them on column 3 onwards.

        in our stata data each row contains the data from one file, means there are 1500 rows in our stata data. first column contains the company name form cell B1, second column contains the excel file name before _ and third column contains the excel file name after _ the columns in which the unique country names are there the data should be total of column C against the country in each file.

        Comment

        Working...
        X