
  • Unbalanced dataset

    Hello Stata users,
    I started using Stata recently and at the moment I find myself in a dilemma. My dataset has millions of observations and about 30 variables. Before running my regressions, I tried to balance the dataset using the worker identifier variable (worker_id) and the time variable (year). The same worker can appear in several years, although it is normal for a worker not to be present for the complete time span considered and to appear only in some years (from 2010 to 2018).
    When I execute the command "xtset worker_id year", the error "repeated time values within panel - r(451);" appears. How can I solve this problem?
    Thank you for your attention.
    Best regards!

    Kate

  • #2
    "repeated time values within panel - r(451);"
    It means that you have at least one observation that duplicates another on the combination of worker_id and year. Either this is accidental, in which case the resolution is simple, or you have misunderstood the structure of your data.

    Code:
    duplicates tag worker_id year, g(dup)
    list worker_id year if dup, sepby(worker_id year)
    If the former, the fix is:

    Code:
    duplicates drop worker_id year, force
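
    Before using -force-, a quick way to check whether the surplus observations really are perfect duplicates is to compare the duplicate counts on all variables with those on the key alone, for example:

    Code:
    * duplicates on every variable vs. duplicates on the key only
    duplicates report
    duplicates report worker_id year

    If both reports show the same number of surplus observations, every duplicate of worker_id and year is identical on all variables, and dropping the copies is harmless.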



    • #3
      Before running my regressions, I tried to balance the dataset using the worker identifier variable (worker_id) and the time variable (year).
      It is likely that in your attempt to balance the data set, you mistakenly created surplus observations for some or all combinations of worker_id and year.

      But why are you doing that anyway? Regression commands do not, in general, require balanced data sets. Are you using some specific command that does? If so, what is it? Are you sure it doesn't work with unbalanced data sets?

      Probably the simplest solution to your problem is to just go back to the original unbalanced dataset, do your -xtset worker_id year- and then proceed with your regressions. In the unlikely event you really do have to have a balanced data set for your particular analysis, rather than filling things in "by hand" or with homebrew code, use the -fillin worker_id year- command: Stata won't create any surplus observations that would bother -xtset-.
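
      A minimal sketch of the -fillin- approach:

      Code:
      fillin worker_id year
      * _fillin == 1 marks the observations that -fillin- created
      tab _fillin
      xtset worker_id year

      -fillin- adds one observation for every worker_id/year combination that was missing, with all other variables set to missing, and flags the added observations in the new variable _fillin.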

      If you try -xtset worker_id year- on your original unbalanced data and Stata gives you that repeated time values within panel message, then it's a different problem: the surplus observations were there from the start. In that case, there is something wrong with that data set and you need to fix that. The first step is to inspect the offending observations:

      Code:
      duplicates tag worker_id year, gen(flag)
      browse if flag
      will show them to you. Perhaps your data set is in fact not yearly but quarterly or monthly, or something like that. In that case, do your -xtset- command with a time variable that marks the actual time unit of the data set.

      If that's not the source of the surplus observations, then you need to find out why they are there. If this is a data set that you, or somebody working with you, created, then you (or that other person) should review the data management that created it, find out how those surplus observations got in, and fix that. If they were there in the original data supplied by an external source, then you should contact the supplier to clarify what is going on.
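
      For example, if the data turned out to be monthly, the -xtset- might look like this (the month variable here is hypothetical):

      Code:
      * sketch, assuming a hypothetical month variable (1-12) exists alongside year
      generate mdate = ym(year, month)
      format mdate %tm
      xtset worker_id mdate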

      Added: Crossed with #2.

      I disagree strongly with the proposal to resolve accidental duplicates with -duplicates drop worker_id year, force-. This is ok only if all of the surplus observations agree with each other on every variable you will be using. If not, then they are inconsistent, and -duplicates drop worker_id year, force- will just arbitrarily pick one of them to keep, not necessarily the correct one (if, indeed, any of them are correct!) Worse yet, if you rerun the same code, it won't necessarily keep the same ones each time, so your end results will not be reproducible.

      If the surplus observations disagree, then you need to figure out the most sensible way to resolve the discrepancies, and then pick the one that is correct (or most likely to be correct if you can't be sure, or closest to correct) or, depending on the variables and the context, combine the conflicting observations into a single one by, say, averaging, or taking the chronologically first (or last) or the largest (or smallest), or the median, or some other percentile, or something more complicated.
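
      Concretely, the combining step described above could look like this sketch (wage and firm_id are placeholder variable names, not from the original post):

      Code:
      * resolve conflicting duplicates: average numeric values,
      * keep the first non-missing value of other variables
      collapse (mean) wage (firstnm) firm_id, by(worker_id year)
      xtset worker_id year

      Note that -collapse- replaces the data in memory with one observation per worker_id/year group, so run it on a copy of the data (or bracket it with -preserve-/-restore-) if the original observations are still needed.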
      Last edited by Clyde Schechter; 16 Jun 2021, 16:11.



      • #4
        Thank you very much for your answer Mr. Andrew! I already applied these commands (which make absolute sense), but when I try "xtset worker_id year" again, I still get the message that the panel variable is unbalanced... Do you have any suggestions? Thank you!

        Code:
        panel variable: worker_id (unbalanced)
        time variable: year, 2010 to 2018, but with gaps
        delta: 1 unit



        • #5
          Originally posted by Kate Isabella View Post
          Thank you very much for your answer Mr. Andrew! I already applied these commands (which make absolute sense), but when I try "xtset worker_id year" again, I still get the message that the panel variable is unbalanced... Do you have any suggestions? Thank you!


          panel variable: worker_id (unbalanced)
          time variable: year, 2010 to 2018, but with gaps
          delta: 1 unit
          If you absolutely need a balanced dataset, then install xtbalance2 from SSC. Otherwise, see Clyde's comments in #3.

          Code:
          ssc install xtbalance2, replace
          help xtbalance2

          I disagree strongly with the proposal to resolve accidental duplicates with -duplicates drop worker_id year, force-. This is ok only if all of the surplus observations agree with each other on every variable you will be using. If not, then they are inconsistent, and -duplicates drop worker_id year, force- will just arbitrarily pick one of them to keep, not necessarily the correct one (if, indeed, any of them are correct!) Worse yet, if you rerun the same code, it won't necessarily keep the same ones each time, so your end results will not be reproducible. If the surplus observations disagree, then you need to figure out the most sensible way to resolve the discrepancies, and then pick the one that is correct (or most likely to be correct if you can't be sure, or closest to correct) or, depending on the variables and the context, combine the conflicting observations into a single one by, say, averaging, or taking the chronologically first (or last) or the largest (or smallest), or the median, or some other percentile, or something more complicated.

          Clyde, agreed! Accidental to me implies that these are perfect duplicates. If they are not, then it goes to my second point that the structure of the data is not well understood.



          • #6
            Thank you very much for your answer Mr. Clyde! The reason I was trying to balance the data set was that I got some dubious results from the regressions in its current state. It's very likely that the difficulty I'm experiencing is actually due to the data that was given to me from the start... Since I still don't have much experience using Stata, I wasn't sure whether I was the one proceeding the wrong way. Thank you for all the explanations you provided. I'll see what can be done.
            Best regards!
            Last edited by Kate Isabella; 16 Jun 2021, 17:21.
