Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge mulitple .csv files

    I am using Stata 16.1 MP to merge six .csv files on neonatal, infant, and child mortality. I would like to create a ADM1 year panel data set that contains probability and count measures for neonatal, infant, and child mortality.

    I have used the following code which successfully creates the .dta files and then merges the files. However, when I do the merge there are 31,500 observations and 14 variables. The study population is correct (N=31,500), but the resulting data set should contain more variables. I think the problem is that the variable names are the same for each of the data sets and therefore I'm not merging in the new information (child and neonatal mortality).

    Is it possible to rename the variables within the macro so the variables will successfully merge at the ADM1 and year level?


    Code:
    local myfilelist : dir . files "*.csv"
    foreach file of local myfilelist {
    drop _all
    insheet using "`file'"
    local outfile = subinstr("`file'",".csv","",.)
    save "`outfile'", replace
    }
    
    
    local myfilelist : dir . files "*.dta"
    use ihme_lmics_u5m_2000_2017_d_infant_adm1_y2019m10d16
    
    foreach f of local myfilelist {
    merge 1:1 adm1_code year using `f'
    drop _merge
    save master2, replace
    }
    In order to provide a sample of the data I appended the data sets and pasted the sample below.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str32 adm0_name int adm0_code str43 adm1_name long adm1_code int year byte age_group_id str8 age_group_name byte sex_id str4 sex byte measure_id str20 measure float(mean lower upper)
    "Afghanistan" 2 "Badakhshan" 1002 2003 28 "Infant"   3 "Both" 27 "Probability of death"  .11054444  .09099275 .13221398
    "Afghanistan" 2 "Badakhshan" 1002 2003  1 "Under 5"  3 "Both" 27 "Probability of death"   .1530938  .12572843 .18161544
    "Afghanistan" 2 "Badakhshan" 1002 2010 28 "Infant"   3 "Both" 27 "Probability of death"   .0794504  .06524893  .0949191
    "Afghanistan" 2 "Badakhshan" 1002 2010  1 "Under 5"  3 "Both"  1 "Deaths"                 6222.563   4976.263  7709.781
    "Afghanistan" 2 "Badakhshan" 1002 2007 28 "Infant"   3 "Both" 27 "Probability of death"  .08842167  .07211453 .10599828
    "Afghanistan" 2 "Badakhshan" 1002 2017  1 "Under 5"  3 "Both" 27 "Probability of death"  .08431107  .06200083 .10956819
    "Afghanistan" 2 "Badakhshan" 1002 2005 28 "Infant"   3 "Both" 27 "Probability of death"  .09914187  .08018186 .11967482
    "Afghanistan" 2 "Badakhshan" 1002 2016 42 "Neonatal" 3 "Both" 27 "Probability of death" .034245655 .025212256 .04473999
    "Afghanistan" 2 "Badakhshan" 1002 2004 28 "Infant"   3 "Both" 27 "Probability of death"  .10478496  .08530268 .12553692
    "Afghanistan" 2 "Badakhshan" 1002 2017 28 "Infant"   3 "Both"  1 "Deaths"                 4840.812  3561.0266  6388.955
    "Afghanistan" 2 "Badakhshan" 1002 2013 28 "Infant"   3 "Both"  1 "Deaths"                 4897.643   3793.688  6178.554
    
    end


  • #2
    I am going to recommend that you start, not by trying to combine your datasets with the merge command, but rather by using the append command, as you apparently did to prepare your example data.

    From the appended data, you can then use the reshape command to reorganize it from the "long" layout in which it appears to a "wide" layout, if that is indeed what you need.

    You appear to have longitudinal data, where data has been collected from the same ADM1 entities ("panels") at successive points in time. Stata's tools for analyzing longitudinal data are found in the Stata Longitudinal-Data/Panel-Data Reference Manual PDF included in your Stata installation and accessible from Stata's Help menu. There you will see that the required data organization is in the "long" layout produced by append, with separate observations for each combination of panel and time. To combine your data with merge will result a "wide" layout, with repeated measures of the same value (e.g. mortality) in variables with different names reflecting, perhaps, which of the 6 original datasets it arose from.

    The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. You should try to achieve what you need with the data organized in a long layout, and seek the help of Statalist in doing so.

    Comment


    • #3
      Carrie Dolan ,

      use the rename command with new syntax to rename a bulk of variables:
      Code:
      sysuse auto , clear
      rename * *_#, addnumber(1)
      In your case you'd probably place it in the loop after the insheet command, introduce a file counter f=1,2...6 also in the loop referring to it in the addnumber() option . If these are years it is logical to start with the first year, rather than 1.
      To be able to merge the files later, perhaps un-rename the id variable back to its original name.

      Best, Sergiy

      Comment


      • #4
        Thanks for the suggestions.

        The data is already in the panel structure I needed for analysis (ADM1 year panel). I was trying to merge in infant and child mortality data to the neonatal dataset. The problem was that the variable names were the same across the data sets. For example, neonatal mortality was called 'mean' in infant, child, and neonatal data sets.

        I was trying to build/improve my knowledge of macros/loops by fixing the variable naming problem in the code. In the end I solved it by just opening the original .csv files and renaming the variables and running my usual code.

        For transparency in the thread, below is how I appended multiple .csv files

        Code:
        tempfile building
        save `building', emptyok
        
        local myfilelist : dir . files"*.csv"
        foreach file of local myfilelist {
        drop _all
        insheet using `"`file'"'
        local outfile = subinstr("`file'",".csv","",.)
        append using "`building'"
        save `building', replace
        }
        And this is the code I used to convert the .csv to .dta files and then merge them together (after I renamed the variable names to be distinct in each .csv)

        Code:
        local myfilelist : dir . files "*.csv"
        foreach file of local myfilelist {
        drop _all
        insheet using "`file'"
        local outfile = subinstr("`file'",".csv","",.)
        save "`outfile'", replace
        }
        
        
        local myfilelist : dir . files "*.dta"
        use ihme_lmics_u5m_2000_2017_d_infant_adm1_y2019m10d16
        
        foreach f of local myfilelist {
        merge 1:1 adm1_code year using `f'
        drop _merge
        save master2, replace
        }

        Comment

        Working...
        X