Appending Datasets

Patrick Donkor

Join Date: Nov 2022
Posts: 35

Appending Datasets

12 Feb 2024, 10:57

Good day Everyone,
Please I have two cross sectional datasets, namely df15 and df16. All the datasets have the same variables (Banks, Country, Year, etc) but slightly different observations (Banks). Almost all the observations (banks) are the same just that there are very few banks in one dataset that is not in the other dataset. df15 and df16 have 1552 and 1634 observations respectively. It is important to state that the Year variable for df15 dataset is filled with only '2015' values throughout while the Year variable for df16 dataset is also filled with only '2016' values throughout. Now the problem is, whenever I append these two datasets, the total observations increases from approximately 1000 (i.e. df15 has 1552 and df16 has 1634) to 3186 which means stata summed the two datasets to get that 3186 total observations. Upon investigation of the data editor in stata, I realized that after I appended the datasets, the second dataset, df16 started from 1553 observation after the last observation (1552) of df15 dataset ended. In view of this problem, how can I accurately append these two datasets without just doubling the observations?
Kindly find the attached, which are extract from the two stata datasets:

The first dataset df15 is as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str119 Banks str25 Country double(Interest_Revenue Interest_Expense) float Year
"BNP PARIBAS"                                                        "France"      22553000 22553000 2015
"CREDIT AGRICOLE SA"                                                 "France"      11558000 11558000 2015
"BANCO SANTANDER SA"                                                 "Spain"       32812000 32812000 2015
"SOCIETE GENERALE"                                                   "France"       9306000  9306000 2015
"DEUTSCHE BANK AG"                                                   "Germany"     15881000 15881000 2015
"CREDIT MUTUEL (COMBINED - IFRS)"                                    "France"       7075000  7075000 2015
"INTESA SANPAOLO"                                                    "Italy"        9132000  9132000 2015
"ING BANK NV"                                                        "Netherlands" 12744000 12744000 2015
"BPCE SA"                                                            "France"       2841000  2841000 2015
"UNICREDIT SPA"                                                      "Italy"       10664004 10664004 2015
"LA BANQUE POSTALE"                                                  "France"       3124903  3124903 2015
"CAIXABANK, S.A."                                                    "Spain"        4352650  4352650 2015
"BANQUE FEDERATIVE DU CREDIT MUTUEL"                                 "France"       3830000  3830000 2015
"BANCO BILBAO VIZCAYA ARGENTARIA SA"                                 "Spain"       16022000 16022000 2015
"COOPERATIEVE RABOBANK U.A."                                         "Netherlands"  9139000  9139000 2015
"DZ BANK AG DEUTSCHE ZENTRAL-GENOSSENSCHAFTSBANK, FRANKFURT AM MAIN" "Germany"      2755000  2755000 2015
"CREDIT AGRICOLE CORPORATE AND INVESTMENT BANK SA"                   "France"       1898000  1898000 2015
"NATIXIS SA"                                                         "France"       2370000  2370000 2015
"COMMERZBANK AG"                                                     "Germany"      5727000  5727000 2015
"ABN AMRO BANK NV"                                                   "Netherlands"  6077000  6077000 2015
end

The second dataset df16 is also as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str119 Banks str25 Country double(Interest_Revenue Interest_Expense) float Year
"BNP PARIBAS"                                                        "France"      22376000 22376000 2016
"DEUTSCHER SPARKASSEN-UND GIROVERBAND EV (COMBINED)"                 "Germany"     31166000 31166000 2016
"CREDIT AGRICOLE SA"                                                 "France"      11739000 11739000 2016
"BANCO SANTANDER SA"                                                 "Spain"       31089000 31089000 2016
"SOCIETE GENERALE"                                                   "France"       9467000  9467000 2016
"DEUTSCHE BANK AG"                                                   "Germany"     14707000 14707000 2016
"CREDIT MUTUEL (COMBINED - IFRS)"                                    "France"       6899000  6899000 2016
"INTESA SANPAOLO"                                                    "Italy"        8598000  8598000 2016
"ING BANK NV"                                                        "Netherlands" 13317000 13317000 2016
"BPCE SA"                                                            "France"       2996000  2996000 2016
"UNICREDIT SPA"                                                      "Italy"       10307011 10307011 2016
"LA BANQUE POSTALE"                                                  "France"       2827871  2827871 2016
"CAIXABANK, S.A."                                                    "Spain"        4156856  4156856 2016
"BANQUE FEDERATIVE DU CREDIT MUTUEL"                                 "France"       3981000  3981000 2016
"BANCO BILBAO VIZCAYA ARGENTARIA SA"                                 "Spain"       17060000 17060000 2016
"COOPERATIEVE RABOBANK U.A."                                         "Netherlands"  8743000  8743000 2016
"DZ BANK AG DEUTSCHE ZENTRAL-GENOSSENSCHAFTSBANK, FRANKFURT AM MAIN" "Germany"      2547000  2547000 2016
"CREDIT AGRICOLE CORPORATE AND INVESTMENT BANK SA"                   "France"       2833000  2833000 2016
"NATIXIS SA"                                                         "France"       2654000  2654000 2016
"COMMERZBANK AG"                                                     "Germany"      4164000  4164000 2016
end

The codes I used are as follows:

Code:

use df15.dta, clear
append using df16

Please how can I accurately append these two datasets without doubling the observations? Thanks in advance.

Last edited by Patrick Donkor; 12 Feb 2024, 11:07.

Tags: append, cross sectional data, merge

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

12 Feb 2024, 11:50

I don't understand what you want. It seems to me that -append- is doing exactly what it is supposed to do and is combining the data sets correctly. You refer to "doubling the observations" but in the combined data set there are no duplicate observations at all, not even when ignoring the Year variable. If you remove any of the observations from that combined data set you will lose information.

So what do you actually want the resulting combination to look like?
Comment
Patrick Donkor

Join Date: Nov 2022

Posts: 35
#3

12 Feb 2024, 12:39

Hello Clyde,
Thanks for your response, I'm new regarding the use of stata. Since both datasets have the same observations, I thought when I combine them, I will have the total observations to be equal to the number of observations in either df15 or df16 dataset, but the total number of observations I get looks like stata sum up the two datasets. For instance in the original data, df15 has 1552 and df16 has 1634, when I append them, the total observations (banks) increase to 3186, meanwhile the number of banks in the data that I'm working with is approximately 1000.
Anyways, please in your combined data, did you get the total observations to be 20 or 40?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

12 Feb 2024, 13:03

When I -append-ed those two data examples, the number of observations in the combined result was 40, as it should be. I don't understand how you could possibly expect to get anything smaller. If some of the observations in the data sets were duplicates of each other, you could, of course, drop the duplicates. But there are no duplicates at all in the example, even disregarding the year variable. Every observation in each data set is contributing new information and any reduction in the data set will make it incomplete and inaccurate.

Now, it sounds like you want to reduce to one observation per bank. But how would you do that? You have different information about those banks in the d15 and d16 data sets. If you discard either of them, your data set will be incomplete and, well, just plain wrong. You need to retain all those observations. It ain't broke, and I don't understand why (or how) you want to fix it.
Comment
Patrick Donkor

Join Date: Nov 2022

Posts: 35
#5

12 Feb 2024, 13:36

Hello Clyde,
Ok I perfectly understand your point, thanks so much for the clarification
Comment

Announcement

Appending Datasets

Comment

Comment

Comment

Comment