Why and when is panel data preferable to pooled cross-sectional?

Kosmas Yeo

Join Date: Sep 2018

Posts: 20
#1

12 Mar 2019, 19:30

Not sure if non-Stata econometric questions are welcome here, but I've encountered an issue at work. I'm dealing with pooled cross-sectional data (randomly sampled in each time period), and I'm wondering how the analysis works compared to panel data. What can I do with panel data that I can't do with pooled cross-sectional data? What limits on interpretation does pooled cross-sectional data have that panel data would overcome? I'd also welcome any sources to read on this. Thanks
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

13 Mar 2019, 01:24

Kosmas:
if the sample of your pooled cross-sectional data changes from year to year, your data are not suitable for panel data regression, as it requires that repeated measures are taken on the same sample at equally spaced time intervals.
That said, assuming that your data are suitable for panel data regression, whenever there's evidence of panel-wise effects, pooled regression is outperformed by panel data regression.
For more details, Stata users can refer to https://www.stata.com/bookstore/micr...metrics-stata/.

Kind regards,
Carlo
(Stata 19.0)
Kosmas Yeo

Join Date: Sep 2018

Posts: 20
#3

13 Mar 2019, 08:19

Thanks Carlo. That's a useful source. But I guess my real question is: what can a panel regression tell us that standard regressions on pooled cross-sectional data can't? We know that panel data is advantageous in terms of degrees of freedom, variation, and identifying individual-level changes, but are there any actual conclusions that panel data can reveal that cross-sections pooled over time can't? For example, if I wanted to compare how height affects weight over time, what advantage does panel data give me in that analysis?

Phrased another way: if money/effort was no object, should I always collect my data in panel form? Or is there any reason I'd be indifferent between panel and pooled cross-sectional when time/money is not a factor?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#4

13 Mar 2019, 08:34

Kosmas:
the source I quoted in my previous reply clarifies all the issues you're interested in.
See also: https://stats.stackexchange.com/ques...and-panel-data.
I fail to follow your last statement: if you have a longitudinal study, you should collect both -panelid- and -timevar- (in addition to regressand and predictors). Then, you can test whether pooled regression outperforms panel data (or, as it more often occurs, the other way round), but the approach you follow in data collection is the same.

Kind regards,
Carlo
(Stata 19.0)
Kosmas Yeo

Join Date: Sep 2018

Posts: 20
#5

13 Mar 2019, 09:15

If I have a longitudinal study, I should collect it in panel format. But the realities of budgeting projects don't make that feasible. I'm trying to get a sense of when I should push back and invest the funds in panel data collection, and when, due to budgetary constraints, it is "good enough" to resample at each time period. Maintaining records and finding the original respondents during the second/third wave of data collection is costly.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#6

13 Mar 2019, 10:52

Kosmas:
your concern seems more related to a good balance between available research funds and results that you can reasonably get from a given statistical method than to a pros/cons list related to pooled regression vs panel data regression.
That said, I fail to see the advantage (even from the budget holder's viewpoint) of sampling from scratch each year instead of following up the same sample over a given time span (but my opinion is obviously influenced by my personal research experience, which leans towards longitudinal studies and panel data regression).

Last edited by Carlo Lazzaro; 13 Mar 2019, 11:34.

Kind regards,
Carlo
(Stata 19.0)
Kosmas Yeo

Join Date: Sep 2018

Posts: 20
#7

13 Mar 2019, 11:18

> results that you can reasonably get from a given statistical method

This is the crux of my question. I want to know what results panel-based methods can tell us that non-panel methods cannot. So as an example, let's say we want to see how a college degree impacts income. My understanding is that pooled cross-sectional non-panel data can tell us how income differs between college graduates vs non-graduates, while panel data and methods can tell us how much one's income increases if they get a college degree.

However, aren't both these interpretations de facto the same? If hypothetically all time-invariant parameters are controlled for in the pooled OLS, then wouldn't that equal the panel regression? And as such, the only "benefit" of having a panel data over pooled cross-sectional is the greater ability to address omitted variable bias. In interpretation, they are the same. No?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#8

13 Mar 2019, 12:45

Kosmas:
as we know, things are different if we talk about fixed effect or random effect specification.
Recycling your example about income (as regressand) and college degree (as unique predictor), and limiting what follows to -xtreg-: the -fe- specification focuses on within variation and highlights the changes in income over time within the same panel. Unfortunately, because a college degree is a time-invariant predictor, no coefficient will be reported for it, as the -fe- machinery wipes it out. Conversely, -re- can do the trick, as it also draws on income variation between the panels that make up the dataset.
By the way, your example is a strong case for instrumental variable regression, as individual ability is correlated with both college degree and income.
If you actually have random effects and the -re- specification is the way to go, pooled -regress- cannot capture them (or, if -fe- is the right specification for your data, you can -regress- with -i.panelid- and obtain the same point estimates as with -xtreg, fe-, but the standard errors will be biased vs -xtreg, fe-).
Finally, the -fe- specification allows for a sort of weak endogeneity, in that it allows the panel-wise effect to be correlated with the vector of predictors (which is an apparent violation of OLS requirements), whereas -re- does not.
Hence, the discussion covers many different facets but, as a general rule, pooled OLS cannot outperform -xtreg- if you have a genuine panel dataset.
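To make the "wiping out" point concrete, here is a minimal sketch in plain Python (not Stata) of the within transformation that a fixed-effects estimator relies on. The toy data, the variable names and the helper functions are all hypothetical and chosen only for illustration; this is not how -xtreg, fe- is implemented internally.

```python
from statistics import mean

def within_demean(values, ids):
    """Subtract each unit's own mean from its observations (within transformation)."""
    groups = {}
    for v, i in zip(values, ids):
        groups.setdefault(i, []).append(v)
    unit_mean = {i: mean(vs) for i, vs in groups.items()}
    return [v - unit_mean[i] for v, i in zip(values, ids)]

def slope(x, y):
    """OLS slope of y on x with an intercept; None if x has no variation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return None  # the regressor carries no information, so no coefficient
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

# Hypothetical toy panel: 3 people, each observed in 2 periods.
person  = [1, 1, 2, 2, 3, 3]
college = [1, 1, 0, 0, 1, 1]          # never changes within a person
hours   = [40, 44, 30, 36, 38, 40]    # changes within a person
income  = [50, 54, 30, 36, 48, 50]

college_w = within_demean(college, person)
hours_w = within_demean(hours, person)
income_w = within_demean(income, person)

pooled_college = slope(college, income)   # uses differences between people
fe_college = slope(college_w, income_w)   # None: demeaned college is all zeros
fe_hours = slope(hours_w, income_w)       # identified from within-person changes

print("demeaned college:", college_w)
print("pooled slope on college:", pooled_college)
print("within slope on college:", fe_college)
print("within slope on hours:", fe_hours)
```

After demeaning, the time-invariant regressor has no variation left, so the within estimator has nothing to work with, while a regressor that changes within a person still yields a slope.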

Kind regards,
Carlo
(Stata 19.0)
Kosmas Yeo

Join Date: Sep 2018

Posts: 20
#9

13 Mar 2019, 13:49

I think there was a communication error. I understand what you have said, but that wasn't the crux of my question. Let me try once more to rephrase.

Assume we have two datasets. Dataset_A is structured as a panel over two rounds (it follows the same people over two rounds). Dataset_B is structured as a cross-section pooled over two time periods (respondents are resampled in each time period). If I run the standard pooled OLS model: INCOME ~ COLLEGE + CONTROL, how will the interpretation differ between the two datasets?
daniel klein

Join Date: Mar 2014

Posts: 3848
#10

13 Mar 2019, 13:49

This is perhaps quite an oversimplification, but probably the biggest advantage of panel data is the ability to control for observed and unobserved confounders that do not vary within panels. In more general terms, panel data allows us to differentiate between within-panel and between-panel variation. Therefore, panel data is much better suited for (counterfactual) causal analysis than repeated cross-sections. As pointed out by Carlo, to use this huge advantage of panel data, we need to apply the corresponding estimators; using a "classical" random-effects specification basically means throwing away the advantage of panel data.
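A small simulation may help illustrate the point about unobserved, panel-invariant confounders. The sketch below is plain Python rather than Stata, and everything in it is assumed for illustration: the data-generating process, the true effect of 2.0, the "ability" confounder and the helper names are hypothetical.

```python
import random
from statistics import mean

def slope(x, y):
    """OLS slope of y on x with an intercept (single regressor)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def demean_by_unit(values, ids):
    """Within transformation: subtract each unit's own mean."""
    groups = {}
    for v, i in zip(values, ids):
        groups.setdefault(i, []).append(v)
    unit_mean = {i: mean(vs) for i, vs in groups.items()}
    return [v - unit_mean[i] for v, i in zip(values, ids)]

rng = random.Random(12345)            # fixed seed so the run is repeatable
N_UNITS, N_PERIODS = 500, 4
TRUE_EFFECT = 2.0                     # assumed causal effect of x on y

ids, x, y = [], [], []
for i in range(N_UNITS):
    ability = rng.gauss(0, 1)         # unobserved and constant within a unit
    for _ in range(N_PERIODS):
        x_it = ability + rng.gauss(0, 1)   # regressor is correlated with ability
        y_it = TRUE_EFFECT * x_it + 3.0 * ability + rng.gauss(0, 1)
        ids.append(i)
        x.append(x_it)
        y.append(y_it)

# Pooled OLS ignores who is who, so ability ends up in the error term.
pooled = slope(x, y)
# The within (fixed-effects) estimator removes anything constant per unit.
fixed_effects = slope(demean_by_unit(x, ids), demean_by_unit(y, ids))

print(f"true effect:    {TRUE_EFFECT}")
print(f"pooled OLS:     {pooled:.3f}")
print(f"within (FE):    {fixed_effects:.3f}")
```

Under these assumptions the pooled slope absorbs part of the ability effect and overstates the true effect, whereas the within estimator stays close to it. With repeated cross-sections the demeaning step is not available, because each unit is observed only once.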

One of the disadvantages of panel data, aside from financial considerations concerning the collection, is that repeatedly surveying the same sample cannot tell you much about (more descriptive) developments over time in the population; this problem gets larger with selective panel attrition.

As so often, the question which kind of data is best (suited) can only be judged against the precise research question.

Best
Daniel
Kosmas Yeo

Join Date: Sep 2018

Posts: 20
#11

13 Mar 2019, 14:05

Thanks Daniel. That more-or-less addresses my question. I now understand that running a pooled OLS on panel is equal to running a pooled OLS on cross-sectional pooled, since we are "throwing away" the advantage of panel using this method.

If I may ask a follow-up: you write, "In more general terms, panel data allows us to differentiate between within-panel and between-panel variation." What does this mean in practice when interpreting regression results?

Suppose I have a model INCOME = 20*COLLEGE + B*T + B*CONTROLS, where COLLEGE is a binary. Aren't the interpretations the same regardless of data structure?

RE model on Panel: Those who went to college have on average 20% more income than those who did not go to college, controlling for observable confounders and unobservable time-invariant confounders.
OLS model on Pooled: Those who went to college have on average 20% more income than those who did not go to college, controlling for observable confounders.

So is the only difference in interpretation the "unobservable time-invariant confounders" part? What language in the interpretation would capture the "within-panel variation" that a panel data structure adds?

Last edited by Kosmas Yeo; 13 Mar 2019, 14:15.
daniel klein

Join Date: Mar 2014

Posts: 3848
#12

13 Mar 2019, 14:37

Originally posted by Kosmas Yeo

Thanks Daniel. That more-or-less addresses my question. I now understand that running a pooled OLS on panel is equal to running a pooled OLS on cross-sectional pooled, since we are "throwing away" the advantage of panel using this method.

Almost, but not quite exactly. When you have repeated cross-sections there is no within-panel/unit variation at all (unless you happen to sample some units repeatedly by chance). The estimates that you are getting are based on between-panel/unit variation. Pooled OLS on panel data is basically an equally weighted average of within- and between-panel/unit variation; RE is an "optimally" weighted average of the two sources of variance, and FE uses only the within-panel/unit variance.
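For the single-regressor case, the split into within and between variation can be checked numerically. The sketch below is plain Python, with a hypothetical data-generating process and hypothetical helper names; it assumes one regressor plus an intercept. In that case the pooled slope equals the within and between slopes weighted by their shares of the total variation in the regressor, and a repeated cross-section (each unit seen once) has a within share of zero.

```python
import random
from statistics import mean

def sums_of_squares(x, y):
    """Return (Sxx, Sxy) computed around the sample means."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, sxy

def decompose(x, y, ids):
    """Pooled, within and between slopes, plus the within share of Sxx."""
    xs, ys = {}, {}
    for a, b, i in zip(x, y, ids):
        xs.setdefault(i, []).append(a)
        ys.setdefault(i, []).append(b)
    xbar = {i: mean(v) for i, v in xs.items()}
    ybar = {i: mean(v) for i, v in ys.items()}
    # Between part: unit mean repeated for every observation of that unit.
    x_between = [xbar[i] for i in ids]
    y_between = [ybar[i] for i in ids]
    # Within part: deviation of each observation from its unit mean.
    x_within = [a - xbar[i] for a, i in zip(x, ids)]
    y_within = [b - ybar[i] for b, i in zip(y, ids)]
    sxx_t, sxy_t = sums_of_squares(x, y)
    sxx_w, sxy_w = sums_of_squares(x_within, y_within)
    sxx_b, sxy_b = sums_of_squares(x_between, y_between)
    return {
        "pooled": sxy_t / sxx_t,
        "within": sxy_w / sxx_w if sxx_w > 0 else None,
        "between": sxy_b / sxx_b,
        "within_share": sxx_w / (sxx_w + sxx_b),
    }

def simulate(n_units, n_periods, rng):
    """Hypothetical data: a unit effect drives both x and y."""
    x, y, ids = [], [], []
    for i in range(n_units):
        unit_effect = rng.gauss(0, 1)
        for _ in range(n_periods):
            x_it = unit_effect + rng.gauss(0, 1)
            y_it = 2.0 * x_it + 3.0 * unit_effect + rng.gauss(0, 1)
            x.append(x_it)
            y.append(y_it)
            ids.append(i)
    return x, y, ids

rng = random.Random(2019)
panel = decompose(*simulate(200, 3, rng))        # same units, 3 waves
repeated_cs = decompose(*simulate(600, 1, rng))  # every unit observed once

w = panel["within_share"]
reconstructed = w * panel["within"] + (1 - w) * panel["between"]

print("panel:", panel)
print("pooled rebuilt from within/between:", reconstructed)
print("repeated cross-section:", repeated_cs)
```

In the repeated cross-section the within slope is undefined and the pooled slope coincides with the between slope, which is the sense in which such data rest entirely on between-unit variation.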

Originally posted by Kosmas Yeo

Suppose I have a model INCOME = 20*COLLEGE + B*T + B*CONTROLS, where COLLEGE is a binary. Aren't the interpretations the same regardless of data structure?

Again, data structure alone does not make the difference. Also, "interpretation" is not really a statistical term. Different estimators make different assumptions.

Originally posted by Kosmas Yeo

RE model on Panel: Those who went to college have on average 20% more income than those who did not go to college, controlling for observable confounders and unobservable time-invariant confounders.

No! Only FE, that is, fixed effects, controls for unobserved (within-panel/unit invariant) confounders; RE does not! I am also not sure about the percent interpretation, but that is a different topic.

Originally posted by Kosmas Yeo

So is the only difference in interpretation the "unobservable time-invariant confounders" part?

Yep; but this might be quite an important difference. Think in terms of motivation, basic intelligence levels, etc. that (i) are very likely to affect both the probability of graduating from college and income levels, (ii) are very unlikely to be observed, and (iii) are hopefully more or less invariant within panels/units.

As Carlo has pointed out, unfortunately, educational levels such as college graduation are pretty much stable over time, so estimating their (causal) effect on anything will be quite tricky. As Carlo also suggests, perhaps you can find a suitable instrument; distance to college is sometimes used, I believe.

Best
Daniel