merging 2 rounds of WBES to create a short panel

Maira Fama

Join Date: Oct 2024

Posts: 4
#1

merging 2 rounds of WBES to create a short panel

07 Oct 2024, 12:10

Hey everyone,
I need your help on how to merge the 2 waves of the WB's Enterprise survey for The Gambia. The waves in question are the 2018 and 2023. the 2023 contains 74 firms that were part of the 2018 dataset. So i want to be able to construct a short panel where firms from 2018 will be merged with those in the 2023. However, my main problem has been that the ID and IDSTD are different in the datasets. But i was able to get the ISIC rev 3 codes (mprod) and the fact that a variable stratificationpanelcode has been defined such that all firms for the year of the wave are coded fresh and if from a previous round are coded panel. So i am able to identify the 74 firms even without matching by simply observing this variable. I have tried using a one to one merge by generating a merge key in both datasets which uses the mprod and stratificationpanelcode to construct it vis gen merge_key_2018 = mprod + Stratificationpanelcode*10000.

After doing this in both datasets. then I merge using
merge 1:1 merge_key_2018 using "C:\Users\User\Desktop\dataset_2023_with_keys. dta" , keepusing(merge_key_2023)
This keeps returning variable merge_key_2018 not found r(111);

Even though the variable is in my dataset.

Is there something that i am missing that could explain what is happening?

Last edited by Maira Fama; 07 Oct 2024, 12:30.
Tags: Clyde Schechter, data, fixed effects, panel, Suggestion
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

07 Oct 2024, 13:00

There are several possibilities here, and since you don't show the full sequence of code, nor any example data, it is impossible to know which of the possibilities actually occurred. The generic causes of this situation are in the following list:
You successfully created variable merge_key_2018 in both data sets, but you didn't -save- one or both data sets after doing so.

The creation of merge_key_2018 was unsuccessful in one or both data sets. This could happen if, for example, one of the data sets doesn't have both variables mprod and Stratificationpanelcode. (This includes the possibility that it has equivalent variables with different names, and remember that "different" includes spelling errors or capitalization differences.)

Between the time you successfully created merge_key_2018 in both data sets and the time you ran the -merge-, other code was executed that either -drop-ped or -rename-d that variable in one or both data sets.

If this information does not help you track down the source of the problem, please post back and show:
All of the code used from the first creation of merge_key_2018 through the -merge- command, including both of those and everything in between.

Example data from both original data sets. Be sure your examples includes the variables mprod and Stratificationpanelcode, as well as any other variables mentioned in the code that you are showing in response to 1 (other than merge_key_2018 itself).

And be sure to use the -dataex- command to show the examples. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment
Maira Fama

Join Date: Oct 2024

Posts: 4
#3

07 Oct 2024, 14:25

These were the codes
use dataset_2018.dta, clear
gen mprod_clean = floor(mprod)
gen merge_key_2018 = mprod_clean + stratificationpanelcode_2018 * 10000

use dataset_2023.dta, clear
gen mprod_clean = floor(mprod)
gen mprod_clean = floor(mprod)
gen merge_key_2023 = mprod_clean + stratificationpanelcode * 10000
use dataset_2018.dta, clear
merge 1:1 mprod_clean stratificationpanelcode using dataset_2023.dta

and it returned: file dataset_2023.dta not found
r(601);
to fix that I redo using
merge 1:1 mprod_clean stratificationpanelcode using "C:\Users\User\Desktop\dataset_2023.dta"
This time around the command probably worked by the choice of identifiers were not excellent so it returned
variables mprod_clean stratificationpanelcode do not uniquely
identify observations in the master data
r(459);

That was the first strategy I used but it did not work. So i extended and
gen merge_key_2018 = mprod_clean + stratificationpanelcode * 10000
drop if mprod== -9
gen year = 2018
save "C:\Users\USER\Desktop\twenty18_clean.dta", replace
use "C:\Users\USER\Desktop\twenty23_clean.dta"
gen year = 2023
gen merge_key_2023 = mprod_clean + stratificationpanelcode * 10000
use "C:\Users\USER\Desktop\twenty18_clean.dta", clear
merge 1:1 merge_key_2018 using "C:\Users\User\Desktop\twenty23_clean.dta", keepusing(merge_key_2023)

variable merge_key_2018 not found
from the 2018 data

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long idstd int(id mprod) float mprod_clean byte stratificationpanelcode float(merge_key_2018 year) 626763 273 1512 1512 0 1512 2018 626717 57 1512 1512 0 1512 2018 626810 490 1514 1514 0 1514 2018 626820 533 1514 1514 0 1514 2018 626769 297 1531 1531 0 1531 2018 626794 422 1531 1531 0 1531 2018 626751 222 1531 1531 0 1531 2018 626761 262 1541 1541 0 1541 2018 626846 634 1554 1554 0 1554 2018 626825 554 1810 1810 0 1810 2018 end label values idstd IDSTD label values id ID label values mprod D1A2 label values stratificationpanelcode StratificationPanelCode label def StratificationPanelCode 0 "Fresh", modify

from the 2023 data

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long idstd int(id mprod_code3) float mprod_clean byte stratificationpanelcode float(merge_key_2023 year) 1163818 2 6309 6309 1 16309 2023 1163819 3 1512 1512 1 11512 2023 1163820 4 1512 1512 1 11512 2023 1163821 5 1549 1549 1 11549 2023 1163822 8 5220 5220 1 15220 2023 1163823 11 1554 1554 1 11554 2023 1163824 12 1810 1810 1 11810 2023 1163825 14 1810 1810 1 11810 2023 1163826 16 1810 1810 1 11810 2023 1163827 18 1810 1810 1 11810 2023 end label values idstd IDSTD label values id ID label values mprod_code3 D1A2 label values stratificationpanelcode StratificationPanelCode label def StratificationPanelCode 1 "Panel", modify
Comment
Maira Fama

Join Date: Oct 2024

Posts: 4
#4

07 Oct 2024, 14:31

Please note that the difference in the data files was sorted. and had no role in the problems. Also I repeated the commands to generate merge key. I used the second commands instead of the first.

if you want me to specify more variables with the dataex command I can. I just shared the variables I thought were good sources for merging the datasets

Last edited by Maira Fama; 07 Oct 2024, 14:46.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

07 Oct 2024, 15:23

Well, I don't have a solution to your problem, but I can point out a few things that are clearly going wrong.

The first is that in one data set you call your new id variable merge_key_2018 and in the other you call it merge_key_2023. But the -merge- command requires that the merge key variable(s) have exactly the same name in both data sets. So your -merge- command is going to fail with these variable names, as whichever name you choose, it will not be found in one of the data sets. This, of course, is easy to fix by just renaming one or both variables so that they have the same name in each data set.

But you have a much bigger substantive problem. The combination of mprod_clean and stratificationpanelcode does not form a suitable identifier in either data set. For example, in the first data set, the combination of mprod = 1512 and stratificationpanelcode is associated with idstd 626763 and id 273 in the first observation, but the same combination of mprod and stratificationpanelcode is also associated with idstd 626717 and id 57 in the second observation. Similar problems arise for mprod values 1514, and 1531 (worse, 3 different people). Similarly in the 23 data set the combination of mprod_clean = 1810 and stsratificationpanelcode = 1 is associated with four different people, and mprod_clean = 1512 and stratificationpanelcode 1 is associated with two different people. So mprod_clean and stratificationpanelcode are not sufficient information to identify individuals in these data sets. N.B. It is not a matter of finding some more complicated way to combine them than 10000*stratificationpanelcode+mprod_clean. It is that the two variables, no matter how you use them, do not distinguish individuals in the data sets.

You will need to find one or more other variables that either by themselves, or in combination with some of those you already have, distinguish individuals.
Comment
Maira Fama

Join Date: Oct 2024

Posts: 4
#6

09 Oct 2024, 15:31

Thank you. It proved impossible for me to generate a key that is suitable to combine the datasets. So I have moved to using them separately instead of generating a short panel from them.
Comment

Announcement

merging 2 rounds of WBES to create a short panel

Comment

Comment

Comment

Comment

Comment