Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge and sum several dataset

    Hello World !

    I am working with several (almost 100) database, which all have the same configuration.
    I have data for each country of the EU from jan2000 to dec2019 (some of them have missing values, which are written ":").
    Since all my database are on the same period & frequency, and since they all have the same data names, I would like to know how to merge them all in one file (that is the sum of all values for each country).

    Illustration:

    Database 1
    m y AUSTRIA BELGIUM (and LUXBG -> 1998) BULGARIA CYPRUS CZECHIA (CS->1992) GERMANY (incl DD from 1991) DENMARK ESTONIA
    1 2000 : : : : : 37328 148550 :
    2 2000 : : : : : 38461 185092 67
    3 2000 : : : : : : 250220 :
    4 2000 : : : : : 54536 181326 :
    5 2000 : : : : : 150693 370858 :
    6 2000 : : : : : 371919 1067027 :
    7 2000 : : : : : 575746 1157469 56
    8 2000 : : : : : 501796 1219928 47
    9 2000 : : : : : 478317 903141 57
    10 2000 : : : : : 569846 1190308 235
    11 2000 : : : : : 541034 1096933 443
    12 2000 : : : : : 807289 566551 388

    Database 2
    m y AUSTRIA BELGIUM (and LUXBG -> 1998) BULGARIA CYPRUS CZECHIA (CS->1992) GERMANY (incl DD from 1991) DENMARK ESTONIA
    1 2000 : : : : : 418 : 243
    2 2000 : : 465 : : : : 366
    3 2000 : : 38 : : : : :
    4 2000 : : 49 : : : : :
    5 2000 : : : : : : : :
    6 2000 : : : : : : : :
    7 2000 : : : : : 64 581 : 54
    8 2000 : : 91 : : : : 299
    9 2000 : : : : : : : 366
    10 2000 : : 147 : : : : 553
    11 2000 : : : : : 15 646 : 249
    12 2000 : : : : : 65 956 22 545 2 021

    Database 3
    m y AUSTRIA BELGIUM (and LUXBG -> 1998) BULGARIA CYPRUS CZECHIA (CS->1992) GERMANY (incl DD from 1991) DENMARK ESTONIA
    1 2000 : : 30 : : 3 954 918 : 1 252
    2 2000 : : : : : 638 856 83 229 818
    3 2000 : 36 212 : : : 2 184 877 136 363 126
    4 2000 : 17 648 : : : 1 097 291 91 441 585
    5 2000 : : : : : 432 368 235 023 456
    6 2000 : : 100 : : 493 020 471 125 196
    7 2000 : : : : : 2 828 800 304 038 148
    8 2000 : : 96 : : 2 440 080 453 412 :
    9 2000 : 20 655 : : : 2 355 697 565 428 :
    10 2000 : : : : : 4 057 272 747 554 241
    11 2000 : : 232 : : 2 707 837 471 183 :
    12 2000 : : : : : 3 514 004 353 058 344

    and so on.

    What I want is a database that is merging all and do the sum of all, or, in this example, these 3 database (under the name database4, for example):


    Database 4
    m y AUSTRIA BELGIUM (and LUXBG -> 1998) BULGARIA CYPRUS CZECHIA (CS->1992) GERMANY (incl DD from 1991) DENMARK ESTONIA
    1 2000 0 0 30 0 0 3992664 148550 1495
    2 2000 0 0 465 0 0 677317 268321 1251
    3 2000 0 36212 38 0 0 2184877 386583 126
    4 2000 0 17648 49 0 0 1151827 272767 585
    5 2000 0 0 0 0 0 583061 605881 456
    6 2000 0 0 100 0 0 864939 1538152 196
    7 2000 0 0 0 0 0 3469127 1461507 258
    8 2000 0 0 187 0 0 2941876 1673340 346
    9 2000 0 20655 0 0 0 2834014 1468569 423
    10 2000 0 0 147 0 0 4627118 1937862 1029
    11 2000 0 0 232 0 0 3264517 1568116 692
    12 2000 0 0 0 0 0 4387249 942154 2753


    I hope that I have correctly explained my problem.

    I know that I can't keep these ":" and that weird variable names are a problem so I will first of all run something like that for each database:

    Code:
    // Gen date from "y" and "m" columns
    {gen edate = ym(y, m)
    format edate %tm
    drop y m
    rename edate date
    }
    
    // rename 
    {rename belgium~1998 belgium
    rename czechia~1992 czechia
    rename germany~1991 germany
    //and so on ...
    }
    
    // replace ":" by "." for STATA to understand that these are missing values
    { replace austria = "." if austria == ":"
    replace belgium = "." if belgium == ":"
    replace bulgaria = "." if bulgaria == ":"
    //and so on ...
    }
    Thank you for your help !

  • #2
    Once you have done the cleaning you describe in #1, the steps will be as follows:

    1. Destring all the country variables: you can't add strings, they have to be numeric. -destring austria-estonia, replace-

    2. Write a loop to append all the data sets together. The code will look something like this:
    Code:
    clear
    
    forvalues j = 1/100 {
        append using data_set_`j'
    }
    At the end of this loop, the appended data will be in memory and ready for your next steps. If your data set names are not indexed by numbers 1 through 100, then you will need to construct a local macro containing a list of all the filenames, and do a -foreach- loop that iterates over that instead of -forvalues-. Also, in case you run into unexpected things in the data, it might make sense to add to that code some commands that include an additional variable identifying the source file from which each observation comes.

    3. -collapse (sum) austria-estonia, by(date)-
    But you might want to run some sanity checks on the data before you do this -collapse-.

    A few other points:

    A. You are better off replacing ":" by "" (empty string). The "." observations will be treated by -destring- as non-numeric material and you won't get the numeric variable you need.

    B. If you have nearly 100 data sets it is, in fact, unlikely that they all have the same configuration. So don't be surprised if you run into problems trying to append them all. It is likely that some of the data sets will be non-conforming to the overall pattern and you will have to individually manage those data sets to bring them into line so that they can be correctly appended.

    Comment


    • #3
      Hello Clyde,

      Thank you very much for your fast answer !!!

      I managed to obtain the same results as in my example. Final code at the end of this post, in "TO CONCLUDE".
      I have however a few questions.

      1) "type mismatch"

      When I run this for values in "database 1" :

      Code:
      replace austria = "" if austria == ":"
      replace belgium = "" if belgium == ":"
      replace bulgaria = "" if bulgaria == ":"
      replace cyprus = "" if cyprus == ":"
      replace czechia = "" if czechia == ":"
      replace germany = "" if germany == ":"
      replace denmark = "" if denmark == ":"
      replace estonia = "" if estonia == ":"
      I have this error for denmark :

      Code:
      type mismatch
      r(109);
      I guessed that it is because there is no ":" in the data. So I tried adding some ":", it worked. However, is there a way to avoid this error ? Just for the code to look "clean" when I run it =)


      2) Variable was byte

      Once I did :

      Code:
      gen edate = ym(y, m)
      format edate %tm
      drop y m
      rename edate date
      
      rename belgium~1998 belgium
      rename czechia~1992 czechia
      rename germany~1991 germany
      
      replace austria = "" if austria == ":"
      replace belgium = "" if belgium == ":"
      replace bulgaria = "" if bulgaria == ":"
      replace cyprus = "" if cyprus == ":"
      replace czechia = "" if czechia == ":"
      replace germany = "" if germany == ":"
      replace denmark = "" if denmark == ":"
      replace estonia = "" if estonia == ":"
      
      destring austria-estonia, replace

      on my three files, that I saved in my working directory as :
      data_set_1
      data_set_2
      data_set_3

      I ran :
      Code:
       
       forvalues j = 1/3 {     append using data_set_`j' }
      And I got this :

      Code:
      (note: variable bulgaria was byte, now int to accommodate
             using data's values)
      (note: variable belgium was byte, now long to accommodate
             using data's values)
      Is it something to worry about ?

      ( I am using dataset from my first post )

      3) Sanity check
      What do you mean when you say that I shall do "sanity check" before I do the -collapse- ?

      --

      TO CONCLUDE:

      My final code looks like this :

      Code:
      gen edate = ym(y, m)
      format edate %tm
      drop y m
      rename edate date
      
      rename belgium~1998 belgium
      rename czechia~1992 czechia
      rename germany~1991 germany
      
      replace austria = "" if austria == ":"
      replace belgium = "" if belgium == ":"
      replace bulgaria = "" if bulgaria == ":"
      replace cyprus = "" if cyprus == ":"
      replace czechia = "" if czechia == ":"
      replace germany = "" if germany == ":"
      replace denmark = "" if denmark == ":"
      replace estonia = "" if estonia == ":"
      
      destring austria-estonia, replace
      cd "write the path to your working directory where data_set files are located"
      
      clear
      forvalues j = 1/3 {
          append using data_set_`j'
      }
      
      collapse (sum) austria-estonia, by(date)
      And I obtain the same results that those from "Database 4" in my first post:
      date austria belgium bulgaria cyprus czechia germany denmark estonia
      2000m1 0 0 30 0 0 3992664 148550 1495
      2000m2 0 0 465 0 0 677317 268321 1251
      2000m3 0 36212 38 0 0 2184877 386583 126
      2000m4 0 17648 49 0 0 1151827 272767 585
      2000m5 0 0 0 0 0 583061 605881 456
      2000m6 0 0 100 0 0 864939 1.50E+06 196
      2000m7 0 0 0 0 0 3469127 1.50E+06 258
      2000m8 0 0 187 0 0 2941876 1.70E+06 346
      2000m9 0 20655 0 0 0 2834014 1.50E+06 423
      2000m10 0 0 147 0 0 4627118 1.90E+06 1029
      2000m11 0 0 232 0 0 3264517 1.60E+06 692
      2000m12 0 0 0 0 0 4387249 942154 2753

      Comment


      • #4
        Concerning question 1, you can both abbreviate the code and avoid the -type mismatch- error by doing the following.

        Code:
        foreach v of varlist austria-estonia {
            capture confirm numeric var `v'
            if c(rc) == 7 {
                replace `v' = "" if `v' == ":"
                destring `v', replace
           }
           else if c(rc) != 0 {
                display as error "Unexpected error: `v'"
           }
        }
        This code wil test each variable to see if it is already numeric, and skip over those. It will destring the non-numeric ones.

        Concerning #2, this is probably nothing to be concerned about. In some of the files, the value of the variable mentioned was, in all observations, small enough to fit into a single byte. Then in a subsequent file, values too large for single-byte storage were found, and so the storage type of the variable was increased to accommodate the larger values. This sort of thing happens frequently when appending data sets. It is only a problem if your understanding of the data is that the values should always be small enough to fit into a single byte. If that is the case, then your data contains wrong values and you need to find out why and fix it.

        Concerning #3, I mean before doing collapse run some descriptive statistics on the data, or maybe some tabulations or cross tabulations and make sure that the data look OK. My experience is that when you work with a large number of data sets the probability that one or more of them will contain bad data is high. When you collapse the data by taking sums, that won't necessarily be obvious in the summed results. So it makes sense to check the data in the individual sets first and make sure, at least, that the values are within appropriate ranges. Also, if there are certain relationships that should always be true (e.g. maybe the values for the Estonia variable should always be smaller than those for Germany, you should check that with an -assert- statement Again, the errors won't necessarily be obvious, or readily detectable in the summed data. So check the pre-collapse data first.

        Comment

        Working...
        X