Panel data with multiple observations per year

Kleopatra Koulikidou

Join Date: Jun 2014

Posts: 19
#1

Panel data with multiple observations per year

02 Dec 2015, 06:45

Dear Statalisters,

I am having a dataset with unbalanced panel data on class action suits, as stated below. The issue here is that some of the companies are having more than one records per year, but those records have different dates (see company_id=C), so they are not per se the same regarding the actual event date. The reason that a company may have multiple records is that it can receive more than one class action lawsuits within a year (which differ in date). The rest of the variables (e.g. Var1) represent a scoring for every class action suit, so it differs per company and time, but I also include other variables such as ASSETS that is the same for every company within the same year.

So, my main question is on how to treat this dataset.
Of course, if I am going to declare the dataset as a panel (xtset company_id year) I will get an error. In case that I'll type "xtset company_id " disregarding the time variable, would that solve my problem?
Further, I do not know if it helps, but my dataset's maximum number of multiple observations within a year for the same company is 5. Maybe creating a new set of variables equal to this number could be the solution?

Dropping the duplicate observations is not an option as my dataset is already quite limited.

company_id year event_date Var1 Assets etc.
A 2010 1/10/2010 55 5600
A 2012 5/30/2012 40 5550
A 2013 8/22/2013 38 5650
B 2011 7/12/2011 99 335
C 2009 3/11/2009 29 4300
C 2014 2/08/2014 35 6000
C 2014 6/24/2014 35 6000

I would really appreciate your help and thoughts.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30188
#2

02 Dec 2015, 09:08

It depends on what you want to try to do with this data. For many analyses, all you need to accomplish with -xtset- is identify the panels, and for that -xtset company_id- is enough. Sometimes, however, you really need to specify the time variable; as when you want to do regressions with autoregressive residual structure, or analyses that use lags, leads, or differences. In that case, I might do -xtset company_id event_date-. (But as your event_dates are pretty irregularly spaced, this wouldn't really work well: most lags, leads, and differences would just be missing, and you certainly can't do autoregressive analyses with these dates.) Another possibility is to set up consecutive event numbers within companies:

Code:

by company_id (event_date), sort: gen int event_num = _n

and then you could -xtset company_id event_num-. With this specification you would be able to make effective use of lag, lead, and difference operators.

And there is an altogether different approach that might work for you. It appears that Var1 and Assets are actually constant within company_id for any given year. So, if this is true for all of your "etc." variables, then depending on where you're going with this, you might want to reduce the data set to a single observation per company_id year pair, with a variable showing how many events occurred, and using the common values of the other covariates:

Code:

by company_id year (Var1), sort: assert Var1[1] == Var1[_N] by company_id year (Assets), sort: assert Assets[1] == Assets[_N] // INSERT SIMILAR -assert- STATEMENTS HERE FOR ALL OF THE etc VARIABLES // IF THERE ARE MANY OF THEM, THEN USE A -foreach- LOOP collapse (count) n_events = event_date (first) Var1 Assets etc_variables, by(company_id year)
1 like
Comment
Kleopatra Koulikidou

Join Date: Jun 2014

Posts: 19
#3

02 Dec 2015, 23:59

First of all I would like to thank you for your valuable feedback. Further, I would like to inform you that I will not use any lags, leads, difference nor an autoregressive residual structure. I suppose that your first suggestion "xtset company_id event_num" might solve the problem of multiple observations per year and I will certainly try it.
Regarding now your second suggestion, I would like to pinpoint that yes indeed Assets are constant within a company in a given year , but Var1 is not (mistakenly I have typed the same Var1 value for company C in year 2014. Please consider this observation to have a Var1 value of "39").

In addition except of your suggestion on using "xtset company_id event_num", might using pooled OLS instead of panel data solved my problem?

I would like to thank you on again for your time and efforts.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17748
#4

03 Dec 2015, 03:15

Kleopatra:
I don't think that pooled OLS can ease your problem, as it is simply a (often less efficient) way to do panel data analysis.

Kind regards,
Carlo
(Stata 19.0)
Comment

Alexandra Contreras

Join Date: Nov 2018
Posts: 3

19 Nov 2018, 05:25

Dear Statalisters,

I am facing a similar issue. My database is an unbalanced panel data on spectrum auctions (a sample is stated below). The issue here is that some of the Regulation Authorities auction spectrum only some years and others no, the dates are differents in every country. SO what I have is that in some years I have no observations but for others, and for specific operators I could have more than one spectrum allocation. It is important to me to know the different spectrum bands and prices paid for each auction even though they could have been performed in the same year to the same operator. The rest of the variables (e.g. Population, GDP, ARPS; Revenues) are the same for the company at an specific year. However, for the model estimation I will have to use some lags of some variables, and therefore I dont know if this repeated observations for the same years is going to difficult the estimation.

COUNTRIES	OPERATOR	ID	YEAR	Population	Suscriptions per Operator	Voice Revenues	Market Share by operator	Implemented Technology	Spectrum band width (MHz)	Award Metod: AUC(auction), Beauty Contest (BC), Grant (G), Direct Award (DA), First come first served (FCFS), Hybrid (HYB), Reallocation (REALL), Other	License Fee
Argentina	Telecom Personal	1	2006	38.971	8.425	920	0,28
Argentina	Telecom Personal	1	2007	39.360	10.666	990	0,27
Argentina	Telecom Personal	1	2008	39.750	12.564	1.146	0,28
Argentina	Telecom Personal	1	2009	40.130	14.475	1.021	0,30
Argentina	Telecom Personal	1	2010	40.520	16.333	1.489	0,31
Argentina	Telecom Personal	1	2011	40.900	18.193	1.653	0,33
Argentina	Telecom Personal	1	2012	41.733	18.975	1.650	0,33
Argentina	Telecom Personal	1	2013	42.203	20.088	1.054	0,32
Argentina	Telecom Personal	1	2014	42.670	19.585	742	0,31	Neutral: 4G LTE	30	AUCT	355
Argentina	Telecom Personal	1	2014	42.670	19.585	742	0,31	Neutral	5	AUCT	5
Argentina	Telecom Personal	1	2014	42.670	19.585	742	0,31	Neutral	10	AUCT	6
Argentina	Telecom Personal	1	2014	42.670	19.585	742	0,31	Neutral	20	AUCT	247
Argentina	Telecom Personal	1	2014	42.670	19.585	742	0,31	Neutral	8	AUCT	45
Argentina	Telecom Personal	1	2015	43.130	19.656	784	0,31
Argentina	Telecom Personal	1	2016	43.590	19.921	555	0,31
Argentina	Telecom Personal	1	2017	44.045	18.980	493	0,31	Neutral: 4G LTE	40	DA	56
Argentina	Telecom Personal	1	2017	44.045	18.980	493	0,31	Neutral: 4G LTE	30	DA	33
Argentina	Movistar	2	2006	38.971	11.059	1.272	0,36
Argentina	Movistar	2	2007	39.360	13.630	1.247	0,35
Argentina	Movistar	2	2008	39.750	14.829	1.408	0,34
Argentina	Movistar	2	2009	40.130	15.931	1.434	0,33
Argentina	Movistar	2	2010	40.520	16.149	1.472	0,31
Argentina	Movistar	2	2011	40.900	16.767	1.523	0,30
Argentina	Movistar	2	2012	41.733	17.604	1.531	0,30
Argentina	Movistar	2	2013	42.203	19.955	1.331	0,32
Argentina	Movistar	2	2014	42.670	20.022	1.006	0,32	Neutral	20	AUCT	$ 209,14
Argentina	Movistar	2	2015	43.130	20.389	1.113	0,32	Neutral	20	AUCT	$ 218,67
Argentina	Movistar	2	2016	43.590	20.765	747	0,33
Argentina	Movistar	2	2017	44.045	19.256	665	0,31	Neutral: 4G LTE	30	DA	$ 33,00

Sorry, I wasnt able to copy from Stata, the data didnt llok very organized.

Then I tried following Clyde suggestion in this forum:
by Operator (year), sort: gen int event_num = _n But I dont really understand how this works. Then, I tried the following (copied from the stata panel): xtset Operator event_num string variables not allowed in varlist; Operator is a string variable r(109); . xtset year event_num repeated time values within panel r(451); . xtset IdOp event_num repeated time values within panel r(451); It seems that if it is a string variable I can not do this, so thats why I tried woth IdOp(Operator identificator) but is not working because of the years repetition. any comment will be very helpful for me. Thanks a lor, Alexandra

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30188
#6

19 Nov 2018, 09:52

Code:

by Operator (year), sort: gen int event_num = _n

will not work in the situation you describe. If you have multiple observations for the same Operator in the same year, then this code will just assign event_num to those observations in an arbitrary and irreproducible order. If you then use lags and leads, etc., you will get different results every time you run the analyses. I do not know what the context from which you took that code is. The idea behind it is to order the observations chronologically first and then assign the observations consecutive numbers. That gets followed by an -xtset- command with event_num as the "time" variable. That makes "lagged" values refer to the immediately preceding value, even if that value doesn't happen to be exactly one year (or month, or whatever the time unit) before. It sounds like you need to do something like that, but you need to establish the actual order in which these events occurred. If all you know is that some Operator did a bunch of auctions in the same year and you don't know what order they were done in, then there is no possibility of defining lags.

As for the problem with string variable, you need a numeric value that corresponds to the operator. You can do that either with the -encode- command (see -help encode-) or with -egen nOperator = group(Operator)-.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
1 like
Comment

Ofir Gefen

Join Date: Sep 2017
Posts: 19

24 Nov 2018, 04:03

Hello,

I am also having a problem that is somewhat similar to the other two individuals, but one that the past comments cannot address.

Here is my dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(gvkey datadate) int sics1
1001  8765 5812
1001  9131 5812
1001  9496 5812
1003  8400 5021
1003  8765 5021
1003  9131 5021
1003  9527 5021
1003  9892 5932
1003 10257 5712
1003 10623 5712
1003 10988 5712
1004  6360 4582
1004  6360 5088
1004  6725 4582
1004  6725 5088
1004  7090 4582
1004  7090 5088
1004  7456 4582
1004  7456 5084
1004  7456 5088
end
format %d datadate

------------------ copy up to and including the previous line ------------------

Listed 20 out of 466746 observations

As you can see, it's an unbalanced panel data, but here also, the problem is that some GVKEYs have 2 or more different SIC codes in the same year (database is our time variable). I would like to create a panel dataset that stata can understand through xtset, where gvkey has only one observation per date. I marked in red an instance that this is not the case.

At the end, I hope my dataset will look like so:

gvkey	datadate	sic1_1	sic1_2	sic1_3	sic1_4
1	2010	123	456	.	.
1	2011	123	456	.	.
1	2012	123	.	.	.
2	2010	876	.	.	.
2	2011	876	.	.	.
2	2012	876	.	.	.
2	2013	999	.	.	.
2	2014	999	.	.	.
2	2015	999	000	.	.
2	2016	444	333	.	.
3	2010	444	333	.	.
3	2011	444	333	.	.
3	2012	444	333	.	.
3	2013	444	333	.	.

Meaning, I simply want to make the sic variable that is now expressed per id variable per year, to be expressed in different columns, when it's relevant and be left as missing when it is not.

I had a similar issue in the past and I ended up renaming the variables and merging, and re-merging the dataset with itself, to create the additional columns, but I think there must a simpler solution.

Will appreciate any helpful comments.

Thank you.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30188

24 Nov 2018, 10:11

I don't understand how this helps you with your -xtset- problem. But there is, indeed, a simple way to get the layout you are asking for:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(gvkey datadate) int sics1
1001  8765 5812
1001  9131 5812
1001  9496 5812
1003  8400 5021
1003  8765 5021
1003  9131 5021
1003  9527 5021
1003  9892 5932
1003 10257 5712
1003 10623 5712
1003 10988 5712
1004  6360 4582
1004  6360 5088
1004  6725 4582
1004  6725 5088
1004  7090 4582
1004  7090 5088
1004  7456 4582
1004  7456 5084
1004  7456 5088
end
format %d datadate

by gvkey datadate, sort: gen _j = _n
rename sics1 sics1_
reshape wide sics1_, i(gvkey datadate) j(_j)

-reshape- is one of the most useful data management commands in Stata. Do read both -help reshape- and the section of the PDF manuals linked therein. It takes a bit of practice to "get the feel" of how it works. But once you learn it, you will find it endlessly helpful.

Comment

Ofir Gefen

Join Date: Sep 2017

Posts: 19
#9

25 Nov 2018, 18:32

Clyde this is great, thank you very much. It's exactly what I need.

I will read the reshape command carefully.

Thank you.
Comment
Vita Zhukova

Join Date: Mar 2019

Posts: 2
#10

28 Mar 2019, 05:34

Dear Statalisters,

I am working with the experimental data that I obtained from the laboratory.

The experiment lasts 21 experimental rounds with 48 individuals. Each individual (denoted by the variable id) in each experimental round (denoted by the variable period) takes 11 decisions, such that I have a balanced panel data with multiple observations per id per period.

Then, I am not able to declare the data set in Stata 13.1 as a panel data using xtset id period because of multiple observations per id per period.

Can you help me please on that matter ? Before running a regression analysis, what is the way of declaring the structure of my data?

Will appreciate your comments.
Thank you
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17748
#11

28 Mar 2019, 08:42

Vita:
welcome to this forum.
If you do not have genuine duplicates (ie, errors in data entry) and do not plan to use time-series related commands such as lags and leads, you can -xtset- yout data with -panelid- only.

Kind regards,
Carlo
(Stata 19.0)
Comment
Vita Zhukova

Join Date: Mar 2019

Posts: 2
#12

01 Apr 2019, 10:24

Hello, Carlo, thank you so much for your reply, will use your suggestion in my analysis. However, I would like to give more details on the data I am working with.

Althougth my concern is not to investigate the effect of time on the decisions taken by individuals (dependent variable), I would need to check for its possible influence.

Time dimention of my experiment allows me to change some parameters of the game played by participants. Hence, as far as the game is repeated, even with different parameters in each round, the individual might undertand better the game and refine her behavior in terms of decicions and, hence, reveal a kind of convergent/divergent/oscilating behavior in her decisions.
Therefore, individual's decisions might change across periods due to the fact of the repetition of the game (period variable ) and also due to the changes in the parameters (parameter variable).

Then, on the one hand, I believe I need to control for time in the regression analysis. But, on the other hand, assuming I declare the data as a panel data (that I do not know how to do it yet), will I obtain a biased estimation of the coefficients of the regression model because of including in the same regression a time regressor and parameter regressor, which are both highly correlated?

Kind regards,
Vita
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17748
#13

01 Apr 2019, 11:16

Vita:
thanks for clarifications.
However, you seem to have multiple dependent variables per patient plus predictors.
This design cannot be accomodated by panel data regression models, that consider one regressand only.
Maybe you can take a look at -mvreg-.

Kind regards,
Carlo
(Stata 19.0)
Comment
Joan Marti

Join Date: Jul 2020

Posts: 119
#14

14 Oct 2020, 18:27

Dear all,

I have a similar question to the above - my dataset is similar to Vita's (multiple individuals for each country for each year) and I do not intend to use lags or leads. How do I use the "panelid" command suggested by Carlo? I have not been able to find the commands that I need to use in order to create the panel without xtset.

Thank you in advance,

Joan
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4457
#15

14 Oct 2020, 18:39

Originally posted by Joan Marti View Post

. . . my dataset is . . . multiple individuals for each country for each year . . . How do I use the "panelid" command suggested by Carlo?

Code:

xtset country_id
Comment

Announcement