Merging two datasets with specific dates of start and end of period within a year

Mario Ferri

Join Date: Jul 2019

Posts: 190
#31

13 Feb 2020, 17:51

One more thing, hopefully the last. I am trying to declare the dataset to be a panel and I am getting an error:

Code:

xtset ID ts
repeated time values within panel

ID stands for the country ID and ts for the year.

That means I have duplicates. I am aware that ts is duplicated for some years, given the nature of the dataset. Is there any way to solve this while keeping the multiple observations per year when that occurs?
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#32

16 Feb 2020, 12:44

No. You can only do -xtset panelvar timevar- if panelvar and timevar uniquely identify the observations.

However, most likely you will be fine if you change your code to just -xtset ID- and omit the time variable. Stata will then not care about the duplicate values of ts. The time variable is not required for -xtset-, and in your situation, is not allowed. The only thing you lose by not specifying the time variable is the ability to use time series operators like lag and lead, or first difference, etc., and the ability to use models with autoregressive structure. But those things are, in any case, undefinable in the presence of duplicate values of ts, so your data would simply be unsuitable for those analyses in any case.
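
A minimal sketch of the point above, using a made-up three-row dataset. ID and ts are the variable names from this thread; x and all the values are hypothetical. It confirms the duplicates, shows that declaring both variables fails, and then declares the panel variable alone:

```stata
* Toy data: ID 1 has two observations in 2012 (hypothetical values)
clear
input ID ts x
1 2012 5
1 2012 7
1 2013 9
end

* Report how many ID-ts combinations occur more than once
duplicates report ID ts

* With duplicates, declaring both panel and time variables fails
capture noisily xtset ID ts
display "return code from -xtset ID ts-: " _rc

* Declaring the panel variable alone is accepted
xtset ID
```

After -xtset ID-, panel estimators that need only the panel variable can run, but the time series operators (L., F., D.) remain unavailable.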
Mario Ferri

Join Date: Jul 2019

Posts: 190
#33

16 Feb 2020, 16:41

Thanks once again for your input. I have opened another thread for this question here:
https://www.statalist.org/forums/for...panel-settings

Some people there advised fooling Stata in some way, but it seems I would be getting dubious results.
My concern is that I will actually have to use time series operators, so I am caught in a vicious circle.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#34

16 Feb 2020, 17:06

You CAN'T do that with this kind of data. And any attempt to "trick" Stata into it will lead to nothing more than producing garbage results.

Let's be clear why this is so. You want to use time series operators. Let's be concrete for the sake of discussion and assume you want to use lags. (It doesn't matter, the same argument will apply to leads and differences.) Suppose you have an ID that has an observation with ts = 2013 and two observations with 2012. When you speak of the lagged value of some variable x, what should it be in 2013? There are two values of x in 2012 corresponding to the two different observations. Which one is the lag? If there is actually a way to choose one, then the correct solution is to -drop- the observations that do not serve as lags. If there is no way to choose between the two, then lag is simply undefinable and either choice would be wrong.

So if you really need time series operators, then you need a different data set, one that is compatible with time series operators.
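
A sketch of the first case described above, where there is a defensible way to choose one observation per ID and year. The rule used here (keep the observation with the longest in-year duration) is only an assumption for illustration, and the data values are hypothetical. Whether any such rule is appropriate is a substantive question, not a coding one:

```stata
* Toy data: two observations for ID 1 in 2012 (hypothetical values)
clear
input ID ts x in_year_duration
1 2012 5 100
1 2012 7 265
1 2013 9 365
end

* Assumed rule, for illustration only: within each ID-year keep the
* observation with the longest in-year duration (ties are broken arbitrarily)
bysort ID ts (in_year_duration): keep if _n == _N

* ID and ts now uniquely identify observations, so the lag is well defined
isid ID ts
xtset ID ts
gen lag_x = L.x
list ID ts x lag_x
```

If no rule of this kind can be justified, the lag stays undefined and this approach should not be used.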

Last edited by Clyde Schechter; 16 Feb 2020, 17:08.
Mario Ferri

Join Date: Jul 2019

Posts: 190
#35

03 Mar 2020, 17:27

I am getting back to you on this thread after giving it some thought. To remind you of the project: I had to merge two datasets. One is a dataset of political indicators and the other contains standard yearly macroeconomic variables.

Since there may be more than one government in a year, the merge was done according to the longest government duration. The problem is that when I tried to xtset the data as a panel I got the error "repeated time values within panel". That means I have duplicates. I am aware that ts is duplicated for some years, given the nature of the dataset: in years with multiple governments, there is more than one observation of the political variables.
Since I cannot proceed with this kind of dataset, I will most likely have to change it, and I would like your further help.

For the political indicators pm and gv, I would like to create a single value per year in years with multiple governments. Each should be a duration-weighted average: the sum over governments of the indicator's value times that government's actual duration in days within the year, divided by the number of days in the year (365, or 366 in a leap year). The in-year duration in days has already been calculated in the earlier step.

As an example, take 1996 for Australia, or any other year with multiple governments.

The pm value for the year should be the pm value of the first government times its in_year_duration divided by the length of the year in days, plus the pm value of the second government times its in_year_duration divided by the length of the year in days (and so on if more governments occurred).

Similar calculations have to be done for the gv variable as well.
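
A sketch of this arithmetic for the two Australia 1996 rows in the example data in this post (pm = -0.165 with in_year_duration = 70, and pm = 22.593 with in_year_duration = 295). The scalar names are made up for illustration. Note that the two durations add up to 365 although 1996 has 366 days, so dividing by the length of the year and dividing by the days actually covered give slightly different answers (roughly 18.18 versus 18.23):

```stata
* Australia 1996: pm = -0.165 for 70 days, pm = 22.593 for 295 days
* Divide by the number of days in the (leap) year
scalar pm_year = (-0.165*70 + 22.593*295)/366
* Divide by the number of days actually covered by the two governments
scalar pm_obs  = (-0.165*70 + 22.593*295)/(70 + 295)

display "pm, dividing by 366 days:            " pm_year
display "pm, dividing by the 365 days covered: " pm_obs
```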

You will also notice two other variables in the dataset, t1 and t2. Both take integer values from 1 to 6. Because these two variables also need to be collapsed to a single value, I would like to create a dummy for each value they take, for a total of 12 dummy variables. For example, one dummy takes the value 1 when t1 = 1 and 0 otherwise, another takes the value 1 when t1 = 2 and 0 otherwise, and so on. The same goes for t2.

After all of these have been created, they need to be collapsed to a single observation for each year. I hope my aim is clear, and I would once again appreciate any help you can provide. The data below were created by the merge program you wrote for me. Note that this is a panel for a large number of countries.
Thank you very much in advance.

Best regards,

Mario Ferri

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(t1 t2) float(pm gv start) int time float(end in_year_duration) long obs_no float(growth1 ID) str15 ccode long(ts edate date) str48 country int startyear byte(startmonth startday) float growth2
1 1 -4.5 -4.5 10064 11051 11051 93 1 . 1 "Australia" 1990 . . "Australia" 1987 7 22 .
1 1 -14.9 -14.9 11051 11683 11683 271 2 -.25998396 1 "Australia" 1990 . . "Australia" 1990 4 4 .8315406
1 1 -14.9 -14.9 11683 12136 12136 4 4 . 1 "Australia" 1991 . . "Australia" 1991 12 27 .
1 1 -14.9 -14.9 11051 11683 11683 360 3 1.3334774 1 "Australia" 1991 . . "Australia" 1990 4 4 2.1083677
1 1 -14.9 -14.9 11683 12136 12136 365 5 3.2923484 1 "Australia" 1992 . . "Australia" 1991 12 27 3.4185865
1 1 -14.9 -14.9 11683 12136 12136 82 6 . 1 "Australia" 1993 . . "Australia" 1991 12 27 .
1 1 -.165 -.165 12136 13219 13219 282 7 3.715279 1 "Australia" 1993 34041 199303 "Australia" 1993 3 24 2.6728065
1 1 -.165 -.165 12136 13219 13219 364 8 1.6375794 1 "Australia" 1994 . . "Australia" 1993 3 24 1.2805543
1 1 -.165 -.165 12136 13219 13219 364 9 -.9972533 1 "Australia" 1995 . . "Australia" 1993 3 24 -.9632453
1 1 -.165 -.165 12136 13219 13219 70 10 . 1 "Australia" 1996 35126 199603 "Australia" 1993 3 24 .
3 3 22.593 22.593 13219 14173 14173 295 11 2.770518 1 "Australia" 1996 . . "Australia" 1996 3 11 3.154481
3 3 22.593 22.593 13219 14173 14173 364 12 3.889898 1 "Australia" 1997 . . "Australia" 1996 3 11 3.9895246
2 2 48.458 48.458 14173 -21220 15305 71 14 . 1 "Australia" 1998 . . "Australia" 1998 10 21 .
3 3 22.593 22.593 13219 14173 14173 293 13 2.9572344 1 "Australia" 1998 36071 199810 "Australia" 1996 3 11 3.7896104
2 2 48.458 48.458 14173 -21220 15305 364 15 2.561527 1 "Australia" 1999 . . "Australia" 1998 10 21 2.339495
2 2 48.458 48.458 14173 -21220 15305 365 16 .3891381 1 "Australia" 2000 . . "Australia" 1998 10 21 .50802076
2 2 33.333 30.905804 15305 -20155 16370 35 18 . 1 "Australia" 2001 . . "Australia" 2001 11 26 .
2 2 48.458 48.458 14173 -21220 15305 329 17 1.1405312 1 "Australia" 2001 37205 200111 "Australia" 1998 10 21 3.1164474
2 2 33.333 30.905804 15305 -20155 16370 364 19 2.158647 1 "Australia" 2002 . . "Australia" 2001 11 26 2.3763344
end
format %td start
format %td end

Last edited by Mario Ferri; 03 Mar 2020, 17:47.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#36

03 Mar 2020, 19:10

Code:

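* One indicator variable per observed level of t1 and t2 (it1_1, it1_2, ...)
foreach v of varlist t1 t2 {
    levelsof `v', local(levels)
    foreach l of local levels {
        gen i`v'_`l' = `l'.`v'
    }
}

* Duration-weighted average of each indicator, and of pm and gv, within country-year
foreach v of varlist it* pm gv {
    by country ts, sort: egen numerator = total(in_year_duration*`v')
    by country ts: egen denominator = total(in_year_duration)
    gen weighted_`v' = numerator/denominator
    drop numerator denominator
}

* Reduce to one observation per country per year
collapse (max) growth1 growth2 (first) ID weighted_*, by(country ts)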

Note: When you reduce to one observation per country per year, the variables that differed within country, such as start and end, are no longer meaningful at this level of aggregation. So they do not appear in the -collapse- command, nor in the result. If I have overlooked some variable that needs to be brought along, and if you are confident that it is the same regardless of which government is in power during that year, then just add it to the -collapse- command in an appropriate way.
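
An alternative sketch, not the code above: -collapse- accepts analytic weights, so a duration-weighted mean of pm and gv can also be computed in a single step. The three rows below are copied from the Australia example data in #35 (only the variables needed here). As in the code above, the weights are normalized by the total in_year_duration within each country-year, not by 365 or 366:

```stata
* Three rows from the example data: Australia 1996 (two governments) and 1997
clear
input str15 country ts pm gv in_year_duration
"Australia" 1996 -.165 -.165 70
"Australia" 1996 22.593 22.593 295
"Australia" 1997 22.593 22.593 364
end

* Weighted mean within country-year, weights = days in office during the year
collapse (mean) pm gv [aweight = in_year_duration], by(country ts)

* country and ts now uniquely identify the observations
isid country ts
list
```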
Mario Ferri

Join Date: Jul 2019

Posts: 190
#37

04 Mar 2020, 13:48

Thanks once again, Clyde.

One last thing, I hope. I need to create an indicator showing the number of governments in a year.

Call this indicator gvn: it takes the value 1 if only one government occurred in the year, 2 if two governments occurred in the year, and so on. It will later be tabulated to generate dummies.

Finally, I need a dummy showing whether more than one government occurred in a year. A government-change dummy, in other words.

All of these must then be collapsed to a single value for each year, according to the collapse command shown above.

Last edited by Mario Ferri; 04 Mar 2020, 13:50.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#38

04 Mar 2020, 13:59

So, take the code in #36 and stop just before the -collapse- command. Then:

Code:

gen byte n_governments = 1
collapse (max) growth1 growth2 (count) n_governments (first) ID weighted_*, by(country ts)
gen byte more_than_one_govt = (n_governments > 1)

You wrote: "Which later will be tabulated generating dummies."

Why? There is seldom any need to do this in Stata. If you are going to do some kind of regression and you want to include the number of governments as a discrete predictor, there is no need to create indicator ("dummy") variables for this purpose. Use factor-variable notation instead. (-help fvvarlist-)

Code:

regression_command outcome ...i.n_governments …

Stata will create "virtual" indicator variables and use them in the regression.
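
A self-contained sketch of factor-variable notation on simulated data. The variable outcome, the seed, and the data-generating process are all invented for illustration; only n_governments is a name from this thread:

```stata
* Simulated data: n_governments takes the values 1, 2, 3 (hypothetical)
clear
set obs 60
set seed 12345
gen byte n_governments = 1 + mod(_n, 3)
gen outcome = rnormal() + 0.5*n_governments

* i. makes Stata treat n_governments as categorical; no dummies are created by hand
regress outcome i.n_governments

* Joint test of the two indicator coefficients
testparm i.n_governments
```

With three levels, the regression reports two coefficients, each relative to the omitted base level (n_governments = 1).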
Mario Ferri

Join Date: Jul 2019

Posts: 190
#39

04 Mar 2020, 15:51

Thank you very much once more. Looking at the data created, I noticed the following.
The t1 and t2 variables are categorical, taking a different value for each type of government (1 is for blue, 2 is for yellow, etc.). For years with a single government there is no problem, but in years with multiple governments the values change within the year. I am going to run some kind of regression and I want to include t1 and t2 as discrete predictors that take a single value per year. So how can I include them in the data after the collapse while respecting the values they took in the years where they took different values? One thought was to tabulate them and generate a dummy for each value. But that would capture only the effect of each single value (yellow, blue, etc.) and not the entire discrete predictor.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#40

04 Mar 2020, 16:06

Without knowing what t1 and t2 actually are and how they are supposed to be related to whatever outcomes you are modeling, I can't suggest how you might deal with this. (That is not to say that I necessarily could do so with that knowledge, only that it is impossible without it.) The proper representation of a variable is usually a matter of properly understanding the science of the situation--which is out of my area in this case.
Mario Ferri

Join Date: Jul 2019

Posts: 190
#41

04 Mar 2020, 17:10

t1 and t2 are categorical variables, indicators of some sort, taking a different value for each type of the indicator. I have sent you a private message with more details.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#42

04 Mar 2020, 17:29

Well, without any commitment on my part as to whether this approach makes sense from a scientific perspective, one possibility would be to weight them according to the in-year duration of each government. In fact, the code in #36 does that: that is what the it* variables are. So if you just add -(mean) it*- to the -collapse- command you will get that.
Mario Ferri

Join Date: Jul 2019

Posts: 190
#43

06 Mar 2020, 12:55

I have thought about this and have a way to move forward, and I would very much like your help with the code.

One version of this would be to take the categorical t1 and t2 variables and create 3 dichotomous variables. Then I want to collapse these variables to the country-year level, at which point weighting them by the percentage of the calendar year makes sense. I have PMed you with more details.
Mario Ferri

Join Date: Jul 2019

Posts: 190
#44

06 Mar 2020, 15:54

Since my previous post was too general to support writing any code, I am updating it.

I need to take the categorical t1 and t2 variables (each has 6 types of animal groups, coded 1-6) and create 3 dichotomous variables. One version of this would be to include a leadership variable (weak leadership = 0), an animal group variable (single animal = 0), and an old animal variable (non-old animal = 0) in the analysis. Then I want to collapse these variables to the country-year level, at which point weighting them by the percentage of the calendar year makes sense.

Here is the meaning of the values 1-6 of the t1 and t2 variables:

1. Single animal: one animal becomes leader and takes group leadership.

2. Minimal animal group: all participating animals are necessary to sustain a group leadership.

3. Surplus animal group: animal groups that exceed the minimal-leadership criterion.

4. Single animal, weak leadership: the animal leading the group is not accepted by most of the group members.

5. Multi-animal weak group leadership: the animals leading the group are not accepted by most of the group members.

6. Old animal: the old animal is not intended to lead the group seriously, but is only minding the shop temporarily.

Many thanks

Mario Ferri
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#45

06 Mar 2020, 16:18

The relationship between the 6 categories and the three dichotomous variables is not entirely clear. It looks to me like this:

6-levels   weak v strong   single v multiple   non-old v old
1          1               0                   ?
2          1               1                   ?
3          1               1                   ?
4          0               0                   ?
5          0               1                   ?
6          ?               ?                   1

In the above table, ? signifies that from the descriptions given of the 6 levels and the three dichotomies, the dichotomy is indeterminate, indeed not applicable, in that level. Note also that the original levels 2 and 3 are not distinguishable in the new three dichotomy classification. Note also that the third dichotomy only takes on values 1 and ?, which is problematic. Perhaps there are aspects of levels 1 through 5 that qualify for a determinate value of old vs non-old that I am overlooking?

Perhaps I am misunderstanding the 6-levels and the three dichotomies--and feel free to correct my interpretation where it is wrong.

Anyway, taking this at face value, you could code this as follows:

Code:

label define weak_strong 0 "Weak" 1 "Strong" .n "N/A"
forvalues i = 1/2 {
    gen weak_strong`i':weak_strong = inlist(t`i', 1, 2, 3)
    replace weak_strong`i' = .n if t`i' == 6
}

label define single_multiple 0 "Single" 1 "Multiple" .n "N/A"
forvalues i = 1/2 {
    gen single_multiple`i':single_multiple = inlist(t`i', 2, 3, 5)
    replace single_multiple`i' = .n if t`i' == 6
}

label define old_non_old 0 "Non-Old" 1 "Old" .n "N/A"
forvalues i = 1/2 {
    gen old_non_old`i':old_non_old = (t`i' == 6)
    mvencode old_non_old`i', mv(.n = 0)
}

Note: code not tested. Beware of typos or other errors.
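
A sketch of how one of these dichotomies might then be collapsed to the country-year level with duration weights, as requested in #44. It uses only the t1 and in_year_duration values of the two Australia 1996 rows from #35, and it relies on -collapse- with analytic weights, which is an assumption about the intended weighting, not something settled in this thread:

```stata
* Australia 1996: t1 = 1 for 70 days, then t1 = 3 for 295 days
clear
input str15 country ts t1 in_year_duration
"Australia" 1996 1 70
"Australia" 1996 3 295
end

* Same coding as above: levels 2, 3, 5 are "multiple", level 6 is not applicable
gen single_multiple1 = inlist(t1, 2, 3, 5)
replace single_multiple1 = .n if t1 == 6

* Duration-weighted share of the year spent under a "multiple" government
collapse (mean) single_multiple1 [aweight = in_year_duration], by(country ts)
list
```

The result is the fraction 295/365 of the covered days. Observations coded .n are left out of the weighted mean, which may or may not be the desired treatment of level 6.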