Merge and sum several dataset

Morad Bali

Join Date: Apr 2016
Posts: 101

Merge and sum several dataset

08 Mar 2020, 09:03

Hello World !

I am working with several (almost 100) database, which all have the same configuration.
I have data for each country of the EU from jan2000 to dec2019 (some of them have missing values, which are written ":").
Since all my database are on the same period & frequency, and since they all have the same data names, I would like to know how to merge them all in one file (that is the sum of all values for each country).

Illustration:

Database 1

m	y	AUSTRIA	BELGIUM (and LUXBG -> 1998)	BULGARIA	CYPRUS	CZECHIA (CS->1992)	GERMANY (incl DD from 1991)	DENMARK	ESTONIA
1	2000	:	:	:	:	:	37328	148550	:
2	2000	:	:	:	:	:	38461	185092	67
3	2000	:	:	:	:	:	:	250220	:
4	2000	:	:	:	:	:	54536	181326	:
5	2000	:	:	:	:	:	150693	370858	:
6	2000	:	:	:	:	:	371919	1067027	:
7	2000	:	:	:	:	:	575746	1157469	56
8	2000	:	:	:	:	:	501796	1219928	47
9	2000	:	:	:	:	:	478317	903141	57
10	2000	:	:	:	:	:	569846	1190308	235
11	2000	:	:	:	:	:	541034	1096933	443
12	2000	:	:	:	:	:	807289	566551	388

Database 2

m	y	AUSTRIA	BELGIUM (and LUXBG -> 1998)	BULGARIA	CYPRUS	CZECHIA (CS->1992)	GERMANY (incl DD from 1991)	DENMARK	ESTONIA
1	2000	:	:	:	:	:	418	:	243
2	2000	:	:	465	:	:	:	:	366
3	2000	:	:	38	:	:	:	:	:
4	2000	:	:	49	:	:	:	:	:
5	2000	:	:	:	:	:	:	:	:
6	2000	:	:	:	:	:	:	:	:
7	2000	:	:	:	:	:	64 581	:	54
8	2000	:	:	91	:	:	:	:	299
9	2000	:	:	:	:	:	:	:	366
10	2000	:	:	147	:	:	:	:	553
11	2000	:	:	:	:	:	15 646	:	249
12	2000	:	:	:	:	:	65 956	22 545	2 021

Database 3

m	y	AUSTRIA	BELGIUM (and LUXBG -> 1998)	BULGARIA	CYPRUS	CZECHIA (CS->1992)	GERMANY (incl DD from 1991)	DENMARK	ESTONIA
1	2000	:	:	30	:	:	3 954 918	:	1 252
2	2000	:	:	:	:	:	638 856	83 229	818
3	2000	:	36 212	:	:	:	2 184 877	136 363	126
4	2000	:	17 648	:	:	:	1 097 291	91 441	585
5	2000	:	:	:	:	:	432 368	235 023	456
6	2000	:	:	100	:	:	493 020	471 125	196
7	2000	:	:	:	:	:	2 828 800	304 038	148
8	2000	:	:	96	:	:	2 440 080	453 412	:
9	2000	:	20 655	:	:	:	2 355 697	565 428	:
10	2000	:	:	:	:	:	4 057 272	747 554	241
11	2000	:	:	232	:	:	2 707 837	471 183	:
12	2000	:	:	:	:	:	3 514 004	353 058	344

and so on.

What I want is a database that is merging all and do the sum of all, or, in this example, these 3 database (under the name database4, for example):

Database 4

m	y	AUSTRIA	BELGIUM (and LUXBG -> 1998)	BULGARIA	CYPRUS	CZECHIA (CS->1992)	GERMANY (incl DD from 1991)	DENMARK	ESTONIA
1	2000	0	0	30	0	0	3992664	148550	1495
2	2000	0	0	465	0	0	677317	268321	1251
3	2000	0	36212	38	0	0	2184877	386583	126
4	2000	0	17648	49	0	0	1151827	272767	585
5	2000	0	0	0	0	0	583061	605881	456
6	2000	0	0	100	0	0	864939	1538152	196
7	2000	0	0	0	0	0	3469127	1461507	258
8	2000	0	0	187	0	0	2941876	1673340	346
9	2000	0	20655	0	0	0	2834014	1468569	423
10	2000	0	0	147	0	0	4627118	1937862	1029
11	2000	0	0	232	0	0	3264517	1568116	692
12	2000	0	0	0	0	0	4387249	942154	2753

I hope that I have correctly explained my problem.

I know that I can't keep these ":" and that weird variable names are a problem so I will first of all run something like that for each database:

Code:

// Gen date from "y" and "m" columns
{gen edate = ym(y, m)
format edate %tm
drop y m
rename edate date
}

// rename 
{rename belgium~1998 belgium
rename czechia~1992 czechia
rename germany~1991 germany
//and so on ...
}

// replace ":" by "." for STATA to understand that these are missing values
{ replace austria = "." if austria == ":"
replace belgium = "." if belgium == ":"
replace bulgaria = "." if bulgaria == ":"
//and so on ...
}

Thank you for your help !

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

08 Mar 2020, 11:33

Once you have done the cleaning you describe in #1, the steps will be as follows:

1. Destring all the country variables: you can't add strings, they have to be numeric. -destring austria-estonia, replace-

2. Write a loop to append all the data sets together. The code will look something like this:

Code:

clear forvalues j = 1/100 { append using data_set_`j' }

At the end of this loop, the appended data will be in memory and ready for your next steps. If your data set names are not indexed by numbers 1 through 100, then you will need to construct a local macro containing a list of all the filenames, and do a -foreach- loop that iterates over that instead of -forvalues-. Also, in case you run into unexpected things in the data, it might make sense to add to that code some commands that include an additional variable identifying the source file from which each observation comes.

3. -collapse (sum) austria-estonia, by(date)-
But you might want to run some sanity checks on the data before you do this -collapse-.

A few other points:

A. You are better off replacing ":" by "" (empty string). The "." observations will be treated by -destring- as non-numeric material and you won't get the numeric variable you need.

B. If you have nearly 100 data sets it is, in fact, unlikely that they all have the same configuration. So don't be surprised if you run into problems trying to append them all. It is likely that some of the data sets will be non-conforming to the overall pattern and you will have to individually manage those data sets to bring them into line so that they can be correctly appended.
Comment

Morad Bali

Join Date: Apr 2016
Posts: 101

09 Mar 2020, 04:06

Hello Clyde,

Thank you very much for your fast answer !!!

I managed to obtain the same results as in my example. Final code at the end of this post, in "TO CONCLUDE".
I have however a few questions.

1) "type mismatch"

When I run this for values in "database 1" :

Code:

replace austria = "" if austria == ":"
replace belgium = "" if belgium == ":"
replace bulgaria = "" if bulgaria == ":"
replace cyprus = "" if cyprus == ":"
replace czechia = "" if czechia == ":"
replace germany = "" if germany == ":"
replace denmark = "" if denmark == ":"
replace estonia = "" if estonia == ":"

I have this error for denmark :

Code:

type mismatch
r(109);

I guessed that it is because there is no ":" in the data. So I tried adding some ":", it worked. However, is there a way to avoid this error ? Just for the code to look "clean" when I run it =)

2) Variable was byte

Once I did :

Code:

gen edate = ym(y, m)
format edate %tm
drop y m
rename edate date

rename belgium~1998 belgium
rename czechia~1992 czechia
rename germany~1991 germany

replace austria = "" if austria == ":"
replace belgium = "" if belgium == ":"
replace bulgaria = "" if bulgaria == ":"
replace cyprus = "" if cyprus == ":"
replace czechia = "" if czechia == ":"
replace germany = "" if germany == ":"
replace denmark = "" if denmark == ":"
replace estonia = "" if estonia == ":"

destring austria-estonia, replace

on my three files, that I saved in my working directory as :
data_set_1
data_set_2
data_set_3

I ran :

Code:

 
 forvalues j = 1/3 {     append using data_set_`j' }

And I got this :

Code:

(note: variable bulgaria was byte, now int to accommodate
       using data's values)
(note: variable belgium was byte, now long to accommodate
       using data's values)

Is it something to worry about ?

( I am using dataset from my first post )

3) Sanity check
What do you mean when you say that I shall do "sanity check" before I do the -collapse- ?

--

TO CONCLUDE:

My final code looks like this :

Code:

gen edate = ym(y, m)
format edate %tm
drop y m
rename edate date

rename belgium~1998 belgium
rename czechia~1992 czechia
rename germany~1991 germany

replace austria = "" if austria == ":"
replace belgium = "" if belgium == ":"
replace bulgaria = "" if bulgaria == ":"
replace cyprus = "" if cyprus == ":"
replace czechia = "" if czechia == ":"
replace germany = "" if germany == ":"
replace denmark = "" if denmark == ":"
replace estonia = "" if estonia == ":"

destring austria-estonia, replace
cd "write the path to your working directory where data_set files are located"

clear
forvalues j = 1/3 {
    append using data_set_`j'
}

collapse (sum) austria-estonia, by(date)

And I obtain the same results that those from "Database 4" in my first post:

date	austria	belgium	bulgaria	cyprus	czechia	germany	denmark	estonia
2000m1	0	0	30	0	0	3992664	148550	1495
2000m2	0	0	465	0	0	677317	268321	1251
2000m3	0	36212	38	0	0	2184877	386583	126
2000m4	0	17648	49	0	0	1151827	272767	585
2000m5	0	0	0	0	0	583061	605881	456
2000m6	0	0	100	0	0	864939	1.50E+06	196
2000m7	0	0	0	0	0	3469127	1.50E+06	258
2000m8	0	0	187	0	0	2941876	1.70E+06	346
2000m9	0	20655	0	0	0	2834014	1.50E+06	423
2000m10	0	0	147	0	0	4627118	1.90E+06	1029
2000m11	0	0	232	0	0	3264517	1.60E+06	692
2000m12	0	0	0	0	0	4387249	942154	2753

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

10 Mar 2020, 11:14

Concerning question 1, you can both abbreviate the code and avoid the -type mismatch- error by doing the following.

Code:

foreach v of varlist austria-estonia { capture confirm numeric var `v' if c(rc) == 7 { replace `v' = "" if `v' == ":" destring `v', replace } else if c(rc) != 0 { display as error "Unexpected error: `v'" } }

This code wil test each variable to see if it is already numeric, and skip over those. It will destring the non-numeric ones.

Concerning #2, this is probably nothing to be concerned about. In some of the files, the value of the variable mentioned was, in all observations, small enough to fit into a single byte. Then in a subsequent file, values too large for single-byte storage were found, and so the storage type of the variable was increased to accommodate the larger values. This sort of thing happens frequently when appending data sets. It is only a problem if your understanding of the data is that the values should always be small enough to fit into a single byte. If that is the case, then your data contains wrong values and you need to find out why and fix it.

Concerning #3, I mean before doing collapse run some descriptive statistics on the data, or maybe some tabulations or cross tabulations and make sure that the data look OK. My experience is that when you work with a large number of data sets the probability that one or more of them will contain bad data is high. When you collapse the data by taking sums, that won't necessarily be obvious in the summed results. So it makes sense to check the data in the individual sets first and make sure, at least, that the values are within appropriate ranges. Also, if there are certain relationships that should always be true (e.g. maybe the values for the Estonia variable should always be smaller than those for Germany, you should check that with an -assert- statement Again, the errors won't necessarily be obvious, or readily detectable in the summed data. So check the pre-collapse data first.
Comment

Announcement

Merge and sum several dataset

Comment

Comment

Comment