Merging two datasets with specific dates of start and end of period within a year

Mario Ferri

Join Date: Jul 2019
Posts: 190

Merging two datasets with specific dates of start and end of period within a year

03 Feb 2020, 07:23

I am running on panel project where I have to merge and match two dataset . My first dataset is a set of years and specific dates of start and end of period within a year and the second yearly country observations. Observations are yearly. In need to match the yearly observations of my second dataset to the longest duration in days starting from January 1st of each year i my first dataset . Below I paste some parts of the datasets in order to help you understand better. You can note that in some cases more than one period occurs. Fot those periods I need to mach to the longest duration as already said from the start of each year separately. Imagine that I need to do that for fifty years for around 45 countries and you get an idea how difficult and time consuming is doing that in excel in really Any ideas?

Dataset 1

Country OECD ts dm dm1 start of period End of period

Finland	oecd member	eu member	2006			24/6/03	19/4/07	Finland
Finland	oecd member	eu member	2007			24/6/03	19/4/07	Finland
Finland	oecd member	eu member	2007	39159	200703	19/4/07	22/6/10	Finland
Finland	oecd member	eu member	2008			19/4/07	22/6/10	Finland
Finland	oecd member	eu member	2009			19/4/07	22/6/10	Finland
Finland	oecd member	eu member	2010			19/4/07	22/6/10	Finland
Finland	oecd member	eu member	2010			22/6/10	22/6/11	Finland
Finland	oecd member	eu member	2011			22/6/10	22/6/11	Finland
Finland	oecd member	eu member	2011	40650	201104	22/6/11	25/3/14	Finland
Finland	oecd member	eu member	2012			22/6/11	25/3/14	Finland
Finland	oecd member	eu member	2013			22/6/11	25/3/14	Finland
Finland	oecd member	eu member	2014			22/6/11	25/3/14	Finland
Finland	oecd member	eu member	2014			25/3/14	24/6/14	Finland
Finland	oecd member	eu member	2014			24/6/14	26/9/14	Finland
Finland	oecd member	eu member	2014		97	26/9/14	29/5/15	Finland
Finland	oecd member	eu member	2015	42113	201504	29/5/15	31/12/18	Finland
Finland	oecd member	eu member	2015			29/5/15	31/12/18	Finland
Finland	oecd member	eu member	2016			29/5/15	31/12/18	Finland

Dataset 2

Country Country symbol year year code var1 var2 var3

Finland	FIN	2006	YR2006	0,623874	2,002701	8,384
Finland	FIN	2007	YR2007	1,566664	2,006751	7,719
Finland	FIN	2008	YR2008	2,510666	2,010769	6,854
Finland	FIN	2009	YR2009	4,065954	2,017309	6,369
Finland	FIN	2010	YR2010	-9,2E-07	2	8,249
Finland	FIN	2011	YR2011	1,184135	2,005112	8,394
Finland	FIN	2012	YR2012	3,416808	2,014591	7,781
Finland	FIN	2013	YR2013	2,808336	2,012028	7,689
Finland	FIN	2014	YR2014	1,478286	2,006373	8,193
Finland	FIN	2015	YR2015	1,041196	2,004498	8,663

Tags: None

Mario Ferri

Join Date: Jul 2019

Posts: 190
#2

03 Feb 2020, 11:46

Ok some people told that can be done by creating a panel data set from my first dataset using expand and by creating a new time indicator variable. Then make a m:1 merge.
Anyone one on this And what's the command?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#3

03 Feb 2020, 15:30

The tableaus you have shown to illustrate your data do not come from Stata data sets: they can't because variable names cannot contain blank spaces.

If you have not yet imported your data into Stata, then it is premature to ask for help with code (unless you need help with code for the importation itself). If you do have Stata data sets, then, as the FAQ (which everyone is asked to read before pasting) say, this kind of data tableau is not particularly helpful to those who might try to work out some code for you. Instead, you should use the -dataex- command to post your example data.
In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

That said, I also do not understand what you mean by

If you post back with usable examples of your data "
match the yearly observations of my second dataset to the longest duration in days starting from January 1st of each year i my first dataset
." When you post back, point out examples of some observations that do match and some observations that don't, as well as trying to explain more clearly what you need.
Comment

Mario Ferri

Join Date: Jul 2019
Posts: 190

03 Feb 2020, 17:01

Originally posted by Clyde Schechter View Post

The tableaus you have shown to illustrate your data do not come from Stata data sets: they can't because variable names cannot contain blank spaces.

If you have not yet imported your data into Stata, then it is premature to ask for help with code (unless you need help with code for the importation itself). If you do have Stata data sets, then, as the FAQ (which everyone is asked to read before pasting) say, this kind of data tableau is not particularly helpful to those who might try to work out some code for you. Instead, you should use the -dataex- command to post your example data.
In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

That said, I also do not understand what you mean by

If you post back with usable examples of your data "

match the yearly observations of my second dataset to the longest duration in days starting from January 1st of each year i my first dataset

." When you post back, point out examples of some observations that do match and some observations that don't, as well as trying to explain more clearly what you need.

Thanks for the reply.I am new to the forums.

Below I paste some part of the data .I have to merge these two dataset.As you may noticed there are multiple dates for some years in the first dataset. I need to match the yearly observations of my second dataset to the longest duration in days starting from January 1st of each year in my first dataset .For example take the year 1990 . You will notice that in that year there are two entries of 1990. The longest duration in days in that year is the second entry 4/4/1990 to 27/12/1991 ( or in my case of interest to 31/12/1990, end of year). I would like to merge the observation of my second dataset for 1990 to that date for 1990 in my first dataset(the longest duration)

Consider that this is panel for fifty years for around 45 countries and you get an idea .

Some people suggested me to create a panel date set from my first dataset using expand and by creating a new time indicator variable. Than make a m:1 merge. but did not get how

I hope now is clear and appreciate any help you can provide.

Daset1

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990     .      . 3 6 "Australia" 1987  7 22 1 1 3 3
"Australia" 1990     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1994     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1995     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1997     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify

Dataset 2

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11  5
"Australia" "AUS" 1991 "YR1991"  8 10
"Australia" "AUS" 1992 "YR1992"  3  1
"Australia" "AUS" 1993 "YR1993"  5  2
"Australia" "AUS" 1994 "YR1994"  6 11
"Australia" "AUS" 1995 "YR1995" 10  8
"Australia" "AUS" 1996 "YR1996"  7  9
"Australia" "AUS" 1997 "YR1997"  1  7
"Australia" "AUS" 1998 "YR1998"  2  6
"Australia" "AUS" 1999 "YR1999"  4  4
"Australia" "AUS" 2000 "YR2000"  9  3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29948

03 Feb 2020, 17:35

Alright, this is somewhat much, but we still have a way to go.

Before we can even approach the matching you want, the date variables in the first data set and the rate variables in the second are useless as they are. What you have there are value-labeled integers masqsuerading as dates and rates. This is a sure recipe for disaster when you try to do any calculations with them. So before we can even think about how these datasets might match up, we have to change the dates to real date variables, and the rates to real numbers.

Here is the code to do that much:

Code:

// FOR DATASET 1
clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990     .      . 3 6 "Australia" 1987  7 22 1 1 3 3
"Australia" 1990     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1994     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1995     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1997     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify
frame rename default dataset1

//  CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen duration = end - start

Code:

//  FOR DATASET 2

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11  5
"Australia" "AUS" 1991 "YR1991"  8 10
"Australia" "AUS" 1992 "YR1992"  3  1
"Australia" "AUS" 1993 "YR1993"  5  2
"Australia" "AUS" 1994 "YR1994"  6 11
"Australia" "AUS" 1995 "YR1995" 10  8
"Australia" "AUS" 1996 "YR1996"  7  9
"Australia" "AUS" 1997 "YR1997"  1  7
"Australia" "AUS" 1998 "YR1998"  2  6
"Australia" "AUS" 1999 "YR1999"  4  4
"Australia" "AUS" 2000 "YR2000"  9  3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

//  CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
    decode `v', gen(_`v')
    drop `v'
    rename _`v' `v'
    destring `v', dpcomma replace
}

But from that point on, I'm still confused about what you want. You say:

I need to match the yearly observations of my second dataset to the longest duration in days starting from January 1st of each year in my first dataset .For example take the year 1990 . You will notice that in that year there are two entries of 1990. The longest duration in days in that year is the second entry 4/4/1990 to 27/12/1991 ( or in my case of interest to 31/12/1990, end of year). I would like to merge the observation of my second dataset for 1990 to that date for 1990 in my first dataset(the longest duration)

But the data are different from what you describe. There are indeed two observations for year 1990 (I assume here by year you are referring to the variable called ts, not the one called startyear. If you mean startyear, it's still not what you say it is.) But eh duration of the first observation is 987 days, whereas the second is only 632, so it should be the first observation you want to merge with. You also refer to your "interest in 31/12/1990" but that date doesn't appear anywhere in the data at all. What does 31/12/1990 have to do with your desired merge process? You also refer to starting from January 1, but, again, I don't understand where that comes in at all.

I also have another question that you don't address: does the country (or ccode) play any role in the merge? Do you want to match up only observations referring to the same country? Or can an observation from Australia be acceptably matched with an observation from Tanzania provided the dates meet your criterion?

As an aside, you also have variables called edate and date. When I apply date formats to see what they are, they are all in the very remote future. So I think there is something wrong with those values.

Comment

Mario Ferri

Join Date: Jul 2019
Posts: 190

03 Feb 2020, 21:37

Originally posted by Clyde Schechter View Post

Code:

// FOR DATASET 1
clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990 . . 3 6 "Australia" 1987 7 22 1 1 3 3
"Australia" 1990 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1994 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1995 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1997 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000 . . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify
frame rename default dataset1

// CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen duration = end - start

Code:

// FOR DATASET 2

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11 5
"Australia" "AUS" 1991 "YR1991" 8 10
"Australia" "AUS" 1992 "YR1992" 3 1
"Australia" "AUS" 1993 "YR1993" 5 2
"Australia" "AUS" 1994 "YR1994" 6 11
"Australia" "AUS" 1995 "YR1995" 10 8
"Australia" "AUS" 1996 "YR1996" 7 9
"Australia" "AUS" 1997 "YR1997" 1 7
"Australia" "AUS" 1998 "YR1998" 2 6
"Australia" "AUS" 1999 "YR1999" 4 4
"Australia" "AUS" 2000 "YR2000" 9 3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

// CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
decode `v', gen(_`v')
drop `v'
rename _`v' `v'
destring `v', dpcomma replace
}

But from that point on, I'm still confused about what you want. You say:

But the data are different from what you describe. There are indeed two observations for year 1990 (I assume here by year you are referring to the variable called ts, not the one called startyear. If you mean startyear, it's still not what you say it is.) But eh duration of the first observation is 987 days, whereas the second is only 632, so it should be the first observation you want to merge with. You also refer to your "interest in 31/12/1990" but that date doesn't appear anywhere in the data at all. What does 31/12/1990 have to do with your desired merge process? You also refer to starting from January 1, but, again, I don't understand where that comes in at all.

I also have another question that you don't address: does the country (or ccode) play any role in the merge? Do you want to match up only observations referring to the same country? Or can an observation from Australia be acceptably matched with an observation from Tanzania provided the dates meet your criterion?

As an aside, you also have variables called edate and date. When I apply date formats to see what they are, they are all in the very remote future. So I think there is something wrong with those values.[/FONT][/COLOR][/LEFT]

Thanks for the input . By ts and Year I refer to time (year). The references are from their original sources where they were downloaded from.

The dataset 1 refers to the start day and end day of the actual dates of governments in office while dataset2 refers to some sort of indexes(I provide only 2 but there are around 16 more of them).Consider that these data are only a small part of a panel for multiple years for multiple countries. I consider each year and country separately according to the actual duration in days of each incumbent in office for that particular year (365 or 366 days in a year depending on the leap year). I used in the explanation above the January to December 31 in order to make you better understand. They do not appear directly in the data. .I hope you get the point.

I want to match up only observations referring to the same country for that year (1990 to 1990 etc) without mixing other counties (Tanzania in your example) to the longest duration of a government in office in that year. In some years more than one governments occurred .There are indeed two observations for year 1990 or for the years with more than one observation since there was a change in government and was coded as such. And that's a key issue The duration in office in previous years is unrelated.

The ccode is crucial as I want to match up only observations referring to the same country.

date stands for election date (YYYYMM) and edate for election date. It should be edate 13mar1993 and date 199303 for example for election taken place in on 13mar1993 . I guess something went wrong in excel. (Probably they were transformed to excel calendar form). I will probably need to create a dummy later on for the analysis based on those dates.

I hope now is clear .Any assistance is appreciated

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29948

03 Feb 2020, 22:36

OK, now I think I understand it. If I have it right, the following will do it:

Code:

// FOR DATASET 1
clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990     .      . 3 6 "Australia" 1987  7 22 1 1 3 3
"Australia" 1990     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1994     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1995     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1997     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify
frame rename default dataset1

//  CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

//  KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
//  COUNTRY AND YEAR
by country ts (in_year_duration), sort: keep if _n == _N

frame create dataset2
frame change dataset2
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11  5
"Australia" "AUS" 1991 "YR1991"  8 10
"Australia" "AUS" 1992 "YR1992"  3  1
"Australia" "AUS" 1993 "YR1993"  5  2
"Australia" "AUS" 1994 "YR1994"  6 11
"Australia" "AUS" 1995 "YR1995" 10  8
"Australia" "AUS" 1996 "YR1996"  7  9
"Australia" "AUS" 1997 "YR1997"  1  7
"Australia" "AUS" 1998 "YR1998"  2  6
"Australia" "AUS" 1999 "YR1999"  4  4
"Australia" "AUS" 2000 "YR2000"  9  3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

//  CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
    decode `v', gen(_`v')
    drop `v'
    rename _`v' `v'
    destring `v', dpcomma replace
}
isid Country year

//    LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
frlink 1:1 Country year, frame(dataset1 country ts)
frget _all, from(dataset1)
drop dataset1

At the end of this code, the data in frame dataset2 is the merged result you asked for.

There is still one point of unclarity in your request: you do not say what you want to do if there are two (or more) governments that are tied for longest in year duration. You did not indicate which one among those should be selected to receive the match. The code above makes the selection at random (and that means that if this situation occurs in your data, it may be handled differently if you re-run the code). Perhaps this situation never arises. If it does, and you have some rule for breaking the tie, post back and that can be built into the code.

Turning to the variables date and edate, I see what date is now. This tyhpe of dat3e coding is not useful in Stata. You can sort things into chronological order with it, but you cannot compare it to real Stata date variables, nor do any calculations with it. The easiest way to make it a real Stata date variable would be:

Code:

gen mdate = ym(int(date/100), mod(date, 100))
assert missing(mdate) == missing(date)
format mdate %tm
order mdate, before(date)
drop date
rename mdate date

I don't know what to say about edate. I have no sense of what those numbers are. If I treat them as Stata internal format dates, they are in the year 2053, which can't be right. There is no obvious way to interpret the numbers 34041 and 36071 as human-readable dates. So I'll give up on that, but if you need to do anything with this variable, you need to pursue it. You won't get far at all with edate as it stands.

Comment

Mario Ferri

Join Date: Jul 2019
Posts: 190

04 Feb 2020, 10:58

Originally posted by Clyde Schechter View Post

OK, now I think I understand it. If I have it right, the following will do it:

Code:

// FOR DATASET 1
clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990 . . 3 6 "Australia" 1987 7 22 1 1 3 3
"Australia" 1990 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1994 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1995 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1997 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000 . . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify
frame rename default dataset1

// CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

// KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
// COUNTRY AND YEAR
by country ts (in_year_duration), sort: keep if _n == _N

frame create dataset2
frame change dataset2
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11 5
"Australia" "AUS" 1991 "YR1991" 8 10
"Australia" "AUS" 1992 "YR1992" 3 1
"Australia" "AUS" 1993 "YR1993" 5 2
"Australia" "AUS" 1994 "YR1994" 6 11
"Australia" "AUS" 1995 "YR1995" 10 8
"Australia" "AUS" 1996 "YR1996" 7 9
"Australia" "AUS" 1997 "YR1997" 1 7
"Australia" "AUS" 1998 "YR1998" 2 6
"Australia" "AUS" 1999 "YR1999" 4 4
"Australia" "AUS" 2000 "YR2000" 9 3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

// CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
decode `v', gen(_`v')
drop `v'
rename _`v' `v'
destring `v', dpcomma replace
}
isid Country year

// LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
frlink 1:1 Country year, frame(dataset1 country ts)
frget _all, from(dataset1)
drop dataset1

Code:

gen mdate = ym(int(date/100), mod(date, 100))
assert missing(mdate) == missing(date)
format mdate %tm
order mdate, before(date)
drop date
rename mdate date

Thank you a wholeheartedly for the code and work! I really appreciate it and mostly your time spent on my request.

To start with we can drop edate entirely. No need to go further with it.

The government that should receive the match is the government with the longest duration.The are case in my data that there have been multiple governments within a year Fortunately, the situation where there are two (or more) governments that are tied for longest in year duration never arises(same number of days)

I have run the code and reached up to the point of the usage of the frame feature. This is a feature for Stata 16 and above, so a google search suggests. I am running Stata 15.1 and frame it's not available for this version . I am not able to see the finale results after the merging if that is right or not. So without asking much, is there another way to bypass the frame feature in order to use the code for lower version of stata?.

Also, probably it's just me and my abilities on stata but, I did not see the use dataset2 path line in the your code and where that should enter in the code.

Many thanks

Mario

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29948

04 Feb 2020, 11:14

Yes, the Forum FAQ, which everyone is expected to read before posting, advises people that if they are not using the current version of Stata (which, presently, is 16) they should state what version they are using. Here's a rewrite of the code that does not require frames, and uses tempfiles instead.

Code:

// FOR DATASET 1
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990     .      . 3 6 "Australia" 1987  7 22 1 1 3 3
"Australia" 1990     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1994     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1995     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1997     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify

//  CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

//  KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
//  COUNTRY AND YEAR
by country ts (in_year_duration), sort: keep if _n == _N
tempfile dataset1
save `dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11  5
"Australia" "AUS" 1991 "YR1991"  8 10
"Australia" "AUS" 1992 "YR1992"  3  1
"Australia" "AUS" 1993 "YR1993"  5  2
"Australia" "AUS" 1994 "YR1994"  6 11
"Australia" "AUS" 1995 "YR1995" 10  8
"Australia" "AUS" 1996 "YR1996"  7  9
"Australia" "AUS" 1997 "YR1997"  1  7
"Australia" "AUS" 1998 "YR1998"  2  6
"Australia" "AUS" 1999 "YR1999"  4  4
"Australia" "AUS" 2000 "YR2000"  9  3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

//  CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
    decode `v', gen(_`v')
    drop `v'
    rename _`v' `v'
    destring `v', dpcomma replace
}
isid Country year
rename Country country
rename year ts
tempfile dataset2
save `dataset2'

//    LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
use `dataset1', clear
merge 1:1 country ts using `dataset2'

This produces the same results as the original code (although the order of the variables in the data set comes out different.)

To be clear: where I have put the -dataex- output, you would simply -use- the corresponding real data set you have. You will still need to make those transformations of the dates, etc., and you will then need to save the results in new tempfiles `datset1' and `dataset2' as shown in this code.

Comment

Mario Ferri

Join Date: Jul 2019
Posts: 190

#10

04 Feb 2020, 12:33

Originally posted by Clyde Schechter View Post

Code:

// FOR DATASET 1
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990 . . 3 6 "Australia" 1987 7 22 1 1 3 3
"Australia" 1990 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1994 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1995 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1997 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000 . . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify

// CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

// KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
// COUNTRY AND YEAR
by country ts (in_year_duration), sort: keep if _n == _N
tempfile dataset1
save `dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11 5
"Australia" "AUS" 1991 "YR1991" 8 10
"Australia" "AUS" 1992 "YR1992" 3 1
"Australia" "AUS" 1993 "YR1993" 5 2
"Australia" "AUS" 1994 "YR1994" 6 11
"Australia" "AUS" 1995 "YR1995" 10 8
"Australia" "AUS" 1996 "YR1996" 7 9
"Australia" "AUS" 1997 "YR1997" 1 7
"Australia" "AUS" 1998 "YR1998" 2 6
"Australia" "AUS" 1999 "YR1999" 4 4
"Australia" "AUS" 2000 "YR2000" 9 3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

// CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
decode `v', gen(_`v')
drop `v'
rename _`v' `v'
destring `v', dpcomma replace
}
isid Country year
rename Country country
rename year ts
tempfile dataset2
save `dataset2'

// LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
use `dataset1', clear
merge 1:1 country ts using `dataset2'

Thank you very much! The code works extremely well ! I am indebted to you and cannot thank you enough!

Probably I was not clear enough and I apologize for that . My fault. I want to match to the longest duration for that year but the data if the are multiple observation in dataset1 in a year ( pr gr, dates etc) should not be dropped. Pr gr etc are variables that contain observation-values for all governments no matter their duration and by dropping them I loose important information.
They will have to remain even they do not correspond to neither observation to dataset2 . I will have to run a panel afterwards and if those are dropped that will make me loose important information. Actually it modifies seriously the concept of the research project.

For example in year 1990 where multiple governments occurred there are two 1990 year entries(or more in other cases). The second government had the longest duration. Data should be merged to the second government without dropping the data for the first government.

In sort, to make it clearer, the final database should look like database1 merging the data of database2 only to the longest duration government in database1 Probably an option to keep only those data after the merge command
I hope now is better understood.

Mario

Last edited by Mario Ferri; 04 Feb 2020, 13:01.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29948

#11

04 Feb 2020, 13:48

That's a simple change. Before reducing dataset1 to those observations that have the longest duration, we save a copy of the original data. Then we continue as before, and, at the very end, merge back the original data.

Code:

// FOR DATASET 1
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990     .      . 3 6 "Australia" 1987  7 22 1 1 3 3
"Australia" 1990     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 6 5 "Australia" 1990  4  4 1 1 2 2
"Australia" 1991     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993     .      . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1994     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1995     .      . 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993  3 24 1 1 1 1
"Australia" 1996     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1997     .      . 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996  3 11 3 3 4 4
"Australia" 1998     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000     .      . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify

//  CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

//  KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
//  COUNTRY AND YEAR
gen long obs_no = _n
tempfile original_dataset1
save `original_dataset1'
by country ts (in_year_duration), sort: keep if _n == _N
tempfile reduced_dataset1
save `reduced_dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11  5
"Australia" "AUS" 1991 "YR1991"  8 10
"Australia" "AUS" 1992 "YR1992"  3  1
"Australia" "AUS" 1993 "YR1993"  5  2
"Australia" "AUS" 1994 "YR1994"  6 11
"Australia" "AUS" 1995 "YR1995" 10  8
"Australia" "AUS" 1996 "YR1996"  7  9
"Australia" "AUS" 1997 "YR1997"  1  7
"Australia" "AUS" 1998 "YR1998"  2  6
"Australia" "AUS" 1999 "YR1999"  4  4
"Australia" "AUS" 2000 "YR2000"  9  3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

//  CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
    decode `v', gen(_`v')
    drop `v'
    rename _`v' `v'
    destring `v', dpcomma replace
}
isid Country year
rename Country country
rename year ts
tempfile dataset2
save `dataset2'

//    LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
use `reduced_dataset1', clear
merge 1:1 country ts using `dataset2', nogenerate
merge 1:1 obs_no using `original_dataset1', assert(match using) nogenerate
drop obs_no

Changes from the previous code are shown in bold face.

Comment

Mario Ferri

Join Date: Jul 2019

Posts: 190
#12

04 Feb 2020, 19:02

Dear Clyde,

Thank you for your tremendous help writing the code! . Without your diligence, I would not have been able to solve the issue and to write the code .
I would like to express my deep gratitude to you for your help ! .

Best,

Mario Ferri
Comment

Mario Ferri

Join Date: Jul 2019
Posts: 190

#13

05 Feb 2020, 10:09

Originally posted by Clyde Schechter View Post

Code:

// FOR DATASET 1
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990 . . 3 6 "Australia" 1987 7 22 1 1 3 3
"Australia" 1990 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1994 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1995 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1997 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000 . . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify

// CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

// KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
// COUNTRY AND YEAR
gen long obs_no = _n
tempfile original_dataset1
save `original_dataset1'
by country ts (in_year_duration), sort: keep if _n == _N
tempfile reduced_dataset1
save `reduced_dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11 5
"Australia" "AUS" 1991 "YR1991" 8 10
"Australia" "AUS" 1992 "YR1992" 3 1
"Australia" "AUS" 1993 "YR1993" 5 2
"Australia" "AUS" 1994 "YR1994" 6 11
"Australia" "AUS" 1995 "YR1995" 10 8
"Australia" "AUS" 1996 "YR1996" 7 9
"Australia" "AUS" 1997 "YR1997" 1 7
"Australia" "AUS" 1998 "YR1998" 2 6
"Australia" "AUS" 1999 "YR1999" 4 4
"Australia" "AUS" 2000 "YR2000" 9 3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

// CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
decode `v', gen(_`v')
drop `v'
rename _`v' `v'
destring `v', dpcomma replace
}
isid Country year
rename Country country
rename year ts
tempfile dataset2
save `dataset2'

// LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
use `reduced_dataset1', clear
merge 1:1 country ts using `dataset2', nogenerate
merge 1:1 obs_no using `original_dataset1', assert(match using) nogenerate
drop obs_no

Changes from the previous code are shown in bold face.

Dear Clyde,

Thank you for your tremendous help writing the code! . Without your diligence, I would not have been able to solve the issue and to write the code .
I would like to express my deep gratitude to you for your help ! .

Best,

Mario Ferri

Comment

Mario Ferri

Join Date: Jul 2019
Posts: 190

#14

06 Feb 2020, 19:00

Originally posted by Clyde Schechter View Post

Code:

// FOR DATASET 1
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 ccode int ts long(edate date start end) str9 country int startyear byte(startmonth startday tap1 tap2) long(pr gr)
"Australia" 1990 . . 3 6 "Australia" 1987 7 22 1 1 3 3
"Australia" 1990 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 6 5 "Australia" 1990 4 4 1 1 2 2
"Australia" 1991 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1992 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 . . 5 3 "Australia" 1991 12 27 1 1 2 2
"Australia" 1993 34041 199303 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1994 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1995 . . 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 35126 199603 4 1 "Australia" 1993 3 24 1 1 1 1
"Australia" 1996 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1997 . . 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 36071 199810 1 2 "Australia" 1996 3 11 3 3 4 4
"Australia" 1998 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 1999 . . 2 4 "Australia" 1998 10 21 2 2 5 5
"Australia" 2000 . . 2 4 "Australia" 1998 10 21 2 2 5 5
end
label values start start
label def start 1 "11/3/96", modify
label def start 2 "21/10/98", modify
label def start 3 "22/7/87", modify
label def start 4 "24/3/93", modify
label def start 5 "27/12/91", modify
label def start 6 "4/4/90", modify
label values end end
label def end 1 "11/3/96", modify
label def end 2 "21/10/98", modify
label def end 3 "24/3/93", modify
label def end 4 "26/11/01", modify
label def end 5 "27/12/91", modify
label def end 6 "4/4/90", modify
label values pr pr
label def pr 1 "-0,165", modify
label def pr 2 "-14,9", modify
label def pr 3 "-4,5", modify
label def pr 4 "22,593", modify
label def pr 5 "48,458", modify
label values gr gr
label def gr 1 "-0,165", modify
label def gr 2 "-14,9", modify
label def gr 3 "-4,5", modify
label def gr 4 "22,593", modify
label def gr 5 "48,458", modify

// CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)
format start %td
decode end, gen(_end)
drop end
gen end = daily(_end, "DMY", 2020)
assert missing(end) == missing(_end)
format end %td
drop _end startmonth startday startyear // NO LONGER NEEDED
gen in_year_duration = min(end, mdy(12, 31, ts)) - max(start, mdy(1, 1, ts))
assert !missing(in_year_duration)

// KEEP THE OBSERVATION WITH LONGEST IN YEAR DURATION FOR EACH
// COUNTRY AND YEAR
gen long obs_no = _n
tempfile original_dataset1
save `original_dataset1'
by country ts (in_year_duration), sort: keep if _n == _N
tempfile reduced_dataset1
save `reduced_dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 Country str3 Symbol int year str6 YearSymbol long(rate1 rate2)
"Australia" "AUS" 1990 "YR1990" 11 5
"Australia" "AUS" 1991 "YR1991" 8 10
"Australia" "AUS" 1992 "YR1992" 3 1
"Australia" "AUS" 1993 "YR1993" 5 2
"Australia" "AUS" 1994 "YR1994" 6 11
"Australia" "AUS" 1995 "YR1995" 10 8
"Australia" "AUS" 1996 "YR1996" 7 9
"Australia" "AUS" 1997 "YR1997" 1 7
"Australia" "AUS" 1998 "YR1998" 2 6
"Australia" "AUS" 1999 "YR1999" 4 4
"Australia" "AUS" 2000 "YR2000" 9 3
end
label values rate1 rate1
label def rate1 1 "0,224887556", modify
label def rate1 2 "0,86013463", modify
label def rate1 3 "1,012231126", modify
label def rate1 4 "1,483129403", modify
label def rate1 5 "1,753653445", modify
label def rate1 6 "1,969634797", modify
label def rate1 7 "2,615384615", modify
label def rate1 8 "3,17667537", modify
label def rate1 9 "4,457435148", modify
label def rate1 10 "4,6277666", modify
label def rate1 11 "7,333021952", modify
label values rate2 rate2
label def rate2 1 "10,72879982", modify
label def rate2 2 "10,87380028", modify
label def rate2 3 "6,282599926", modify
label def rate2 4 "6,872300148", modify
label def rate2 5 "6,926000118", modify
label def rate2 6 "7,675600052", modify
label def rate2 7 "8,362199783", modify
label def rate2 8 "8,46930027", modify
label def rate2 9 "8,506199837", modify
label def rate2 10 "9,579199791", modify
label def rate2 11 "9,718700409", modify

// CONVERT THE RATE VARIABLES TO ACTUAL NUMERIC RATES, NOT VALUE LABELED INTEGERS
foreach v of varlist rate1 rate2 {
decode `v', gen(_`v')
drop `v'
rename _`v' `v'
destring `v', dpcomma replace
}
isid Country year
rename Country country
rename year ts
tempfile dataset2
save `dataset2'

// LINK THE TWO DATA SETS AND BRING THE DATASET 1 VARIABLES OVER
use `reduced_dataset1', clear
merge 1:1 country ts using `dataset2', nogenerate
merge 1:1 obs_no using `original_dataset1', assert(match using) nogenerate
drop obs_no

Changes from the previous code are shown in bold face.

Dear Clyde,

I am trying to implement the code on my main final datasets (after some changes imported to the first ones) and while the first time the code worked, when I am now trying to run it in the first part of the code when I reach the assert command(in bold) :

Code:

//  CREATE USABLE DATE VARIABLES
drop start
gen start = mdy(startmonth, startday, startyear)
assert missing(start) == missing(startmonth, startday, startyear)

It displays :

Code:

2 contradictions in 1,735 observations
assertion is false
r(9);

I would appreciate any help you can provide

thank you

Mario Ferri

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#15

06 Feb 2020, 19:27

Well, what that command does is verify that the combination of startmonth, startday, and startyear forms a valid date, or else one of them is missing. So the implication is that somewhere in your new data you have some combination of startmonth, startday and startyear that are not valid as a date, e.g. perhaps an April 31 or a month that isn't between 1 and 12.

It's easy enough to chase down the problem:

Code:

browse if missing(start) != missing(startmonth, startday, startyear)

will show you the two offending observations. Look at startmonth, startday and startyear there and you'll be able to see the problem. Then you have to fix it.
Comment

Announcement

Merging two datasets with specific dates of start and end of period within a year

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment