Regression Analyses for Repeated Cross-sectional dataset

Emeka Dim

Join Date: Jan 2018

Posts: 44
#1

12 Jan 2025, 04:41

Good day everyone,

I am running an analysis on a repeated cross-sectional dataset. The data were collected in 9 rounds in one country, each round drawing on different parts of the population in a different year (from 1999 to 2021). I merged these datasets, keeping the variables that were measured consistently over time, to create one dataset covering all time periods. I added some country-level variables that align with the time periods, and I intend to run cross-level interactions (i.e., country-level variables interacted with individual-level variables). My dependent variable is bribery (bribe2), and I coded the year of data collection as rd. I added country-level variables such as cpi (Corruption Perception Index), gdp_pc, and free_ind_r (Freedom Index). My focal independent variable at the individual level is the handling of corruption (handle_corruption2). I interacted free_ind_r and handle_corruption2 to predict bribe2. The result below is an example of an analysis I did.

Please, I would like to know whether this analysis works or whether you see any issues with it. I would also like to know whether "reg" is enough for the analysis or whether I have to use "xtreg."

Also, and more importantly, should I use "vce(cluster rd)," or can I run the analysis without the clustered standard error?

I saw that this type of analysis is possible in a chapter at https://dam.ukdataservice.ac.uk/medi...geovertime.pdf. Please see pages 9 and 10.

(MY SINCERE APOLOGIES IF THE STATA RESULTS CONTRAVENE THE RULES OF THIS FORUM; I DO NOT KNOW HOW TO ATTACH RESULTS ON THIS FORUM YET).

Thank you.

Last edited by Emeka Dim; 12 Jan 2025, 04:43.
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#2

12 Jan 2025, 08:36

I think the main limitation of that approach is that it does not account for any secular effect. You hint at the idea in raising the question of using -vce(cluster rd)-, but because you have only 9 years, clustering the standard errors is likely to worsen, rather than improve, the accuracy of the standard errors. While there is no hard and fast rule about how many clusters are needed to use clustered standard errors, I think everyone will agree that 9 is not sufficient.

If this were my model, I would just add i.rd to the model to adjust for secular effect. (Or, if you have reason to believe that the secular effect will be a linear trend over time, you can use c.rd for the purpose.)
Emeka Dim

Join Date: Jan 2018

Posts: 44
#3

12 Jan 2025, 11:46

Thank you for your reply. I ask this question to ensure that my regression models are not faulty. Following your suggestion, I ran this model:
reg bribe2 i.urban resp_age i.educ i.female i.discuss c.cpi c.elec_free c.gdp_pc c.eco_ind c.rd c.free_ind_r##i.handle_corruption2

One of the concerns I have with adding c.rd to the model is that it is highly correlated (r2 = 0.9) with gdp_pc and cpi, which are among the other country-level variables I am interested in interacting with handle_corruption2 (the individual-level variable). I can leave rd in the model if the high correlation would not be a problem, or if there is a test I can run to ensure that adding c.rd would not distort my model or give me the wrong estimates. The collin command I ran for this model showed that rd and gdp_pc had VIF scores of more than 20.
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#4

12 Jan 2025, 12:16

Well, I had suggested using c.rd if you have reason to believe that the secular trend in bribe2 would be linear in time. That's a substantive question, and this is not an area where I have much knowledge. FWIW, my intuition is that a linear relationship between rates of bribery and time would be surprising, and I probably would have gone with i.rd. But, again, it's out of my area of expertise, and I defer to your judgment on it.

Now, regarding your concern about the correlation between rd and gdp_pc, I don't see any issue here. Even if the VIF were 20,000, that correlation seems just irrelevant for the purpose of your study. As I understand it, your goal is to investigate the relationship among handle_corruption2, free_ind_r, their interaction, and the outcome bribe2. So take a look at the standard errors for those coefficients. They are nice and small: your estimates of those parameters are clearly quite sharp. So, no matter how much colinearity there is in other variables, you don't have a colinearity problem. I recommend you get your hands on a copy of Arthur Goldberger's textbook, A Course in Econometrics. He has an extremely well-written chapter there discussing multicolinearity and why it is a bogus issue in most circumstances, and is an unsolvable problem in the few situations where it actually matters. Fortunately, in your case, it's a bogus issue. Move on.

MY SINCERE APOLOGIES IF THE STATA RESULTS CONTRAVENE THE RULES OF THIS FORUM;

We don't have rules, and even if we did, there is nobody to enforce them. We do have an FAQ, which you should read, that gives excellent advice on how to show results and how to show example data. That advice is based on optimizing the user experience for everybody. While the way you presented your results does not accord with that advice, the results are readable. The advice to avoid attaching results files exists because they are often unreadable here on the Forum website. But all that results need to be is readable, so there is no issue. The situation for example data, which you did not show and which is not needed to solve this problem, is narrower: it will hardly ever be the case that deviating from the advice to use the -dataex- program proves helpful to those who would choose to respond to your question. So keep that in mind for future problems where you might need to show example data.

But, again, this is all pragmatic: the FAQ gives you advice that, if followed, will lead to good results. If you deviate from it, then you may find that you get no response, or a very delayed response, and a reminder to consult the FAQ before posting, because nobody in a position to respond can read the results or use the example data you showed. But sometimes, as in this thread, things work out fine notwithstanding.
Emeka Dim

Join Date: Jan 2018

Posts: 44
#5

12 Jan 2025, 15:34

Thank you so much for the advice. My continent (Africa) rarely has panel data, so the repeated cross-sectional data is the closest I have to longitudinal data, and I am hoping to make the most of it.

For the rd variable, I use c.rd rather than i.rd. If I were to use i.rd, my models would not be able to produce any margin plots, so I assume the years to be c.rd so that I can control for the year of data collection without any disruptions for my margin plots.

Given that I would not need to use the clustered standard errors, would it be fine to use vce(robust)? Would the vce(robust) make any difference? Or is there a situation in which vce(robust) should be used in the type of model I am running? I ran my model with and without vce(robust), and the AIC/BIC comparison scores showed no difference. Please, what do you think?
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#6

12 Jan 2025, 16:42

If I were to use i.rd, my models would not be able to produce any margin plots

??? This should not happen. Even given that this does happen, that is no justification for choosing a different model if you have no reason to believe it is a better approximation to the real-world data generating process.

Piecing together a few stray things mentioned earlier in the thread, I think I may know what is causing that to happen. rd is a year variable. All of your data is in a single country. And your model contains a variable gdp_pc, which I will speculate is per capita gross domestic product. If this is true, then i.rd will be exactly colinear with gdp_pc and gdp_pc will be dropped from the analysis. You will then not be able to calculate -margins- for gdp_pc, nor for the rd indicators. But the solution here is straightforward: eliminate gdp_pc from the model. You are not testing gdp_pc in this model, you are testing handle_corruption2 and free_ind_r (and their interaction). gdp_pc is included only as a covariate, to adjust for its extraneous effect on bribe2. So you don't need any results for gdp_pc at all. But if rd is in the model, that gdp_pc effect is already automatically adjusted for by i.rd itself. And, in fact, i.rd does even better than that: its inclusion automatically adjusts for any year-level variable that might have an effect on bribe2, whether included in the model or not, and even if you aren't even aware of its existence!

In setting up and interpreting regression models it is important to be clear about what variables are included because they are variables whose effects are being estimated/tested, distinguishing them from those that are included to adjust for extraneous effects ("control variables".) The results you get for the latter should be ignored (unless they are so obviously unreasonable that they lead you to believe you have miscoded your model or there is a problem in the data.) The values of the results for these "control" variables are irrelevant to your goal of testing/estimating the focal effects of your study. And, if you have more than one variable that can effectively adjust for the same effect as a "control" variable but provides additional benefits, as is the case for rd here vis-a-vis gdp_pc, don't hesitate to omit the less useful variable from the analysis.

Now, what if gdp_pc is actually not just a covariate here, but something whose effects you really have a research interest in? Well, in that case, you are in serious trouble because your data design precludes doing a really good test of it, due to the concurrent effect of time, with which it is exactly colinear. To test or estimate the gdp_pc effect, you would need a two-way fixed effects design that included both multiple years and multiple countries (or other geographic entities whose GDPs are measured); that design would break the colinearity between gdp_pc and rd. Your existing design does let you get an estimate of a gdp_pc effect, but that effect is likely confounded by other year-level effects that you don't have measures for in your data (and maybe don't even know exist). So your estimate of the gdp_pc effect may well be biased due to unobserved confounding (omitted variable bias).
Emeka Dim

Join Date: Jan 2018

Posts: 44
#7

12 Jan 2025, 18:51

Thank you for your response.

The idea is to run different models that examine the interaction of GDP per capita (gdp_pc) and the Freedom Index (free_ind_r) with the handling of corruption, with bribery as the outcome variable. I took gdp_pc out of the initial model as you suggested and ran the model with i.rd and with free_ind_r interacted with handle_corruption2. However, Stata would not produce a marginsplot from these results. The correlation between free_ind_r and rd is 0.57, which is moderate.
The model is: reg bribe2 i.partisan i.urban resp_age i.educ i.female i.handle_crime i.handle_corruption2 i.rd c.free_ind_r##handle_corruption2 and I got the result below:

However, I tried the xtreg command. I set the data at xtset rd and ran the model without the rd as I assumed that the xtreg accounts for the years. This model was able to produce margin plots.
The model is: xtreg bribe2 i.partisan i.urban resp_age i.educ i.female i.handle_crime i.handle_corruption2 c.free_ind_r##handle_corruption2

I have heard of an arma or arima analysis, but I do not want to assume that I need to use that type of analysis for my dataset, which is only 9 rounds.

Please, at this point, I just need a way forward. I would appreciate any solutions you can suggest.

Thank you.
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#8

12 Jan 2025, 20:21

In the -regress- model of #7, free_ind_r is omitted. You didn't show the output preceding the regression table, but I'm pretty sure that in there it says that free_ind_r is omitted due to colinearity. As this did not happen in the -regress- model of #1 that excluded rd, it is likely that the colinearity is between free_ind_r and rd. That would mean, in other words, that free_ind_r is a year-level variable that does not vary among respondents from the same survey round. Is that the case? You describe free_ind_r as a "Freedom Index." Is that, like gdp per capita, a variable that is year-level and perhaps varies among countries, but since you have only one country in the data, it is purely a year level variable here? If so, that explains what is going wrong here. And it is the same as the situation before with gdp_pc. You cannot estimate an effect for such a variable in a model with i.rd included. If estimating/testing that effect is part of the research goal, you simply cannot do that with your existing data design. If that variable is present only as a covariate, then you can just drop it from the model altogether, as i.rd will adjust for its effects and more.
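A quick way to check this, as a minimal sketch using the variable names that appear in this thread: if a variable is purely year-level, its minimum and maximum will coincide within every survey round.

```stata
* Summarize the candidate year-level variables within each survey round.
* If min equals max in every round, the variable is constant within rd
* and is therefore exactly collinear with the i.rd indicators.
tabstat free_ind_r gdp_pc cpi, by(rd) statistics(min max)
```

Any variable that passes this check cannot have its own main effect estimated alongside i.rd in single-country data.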

Turning to your -xtreg- model, what you have there is a random-effects model. It does overcome the colinearity issue. But I'm afraid that doesn't get you off the hook. First, notice that at the bottom of the output, we have the variance component sigma_u is 0: no surprise there because free_ind_r absorbs the variance at that level and leaves none for the random effects. Moreover, just as 9 levels is not enough for use of cluster robust standard errors, it is usually not enough to use as a random effect variable either. In this case, because there actually isn't any unexplained rd-level variation in the model (because free_ind_r has proxied it), the small number of levels doesn't matter. But what it also means is that this isn't really an -xtreg- regression--because the rd level has collapsed. It's just a regress model, with rd proxied by free_ind_r (which is entered as a continuous linear variable) in an -xtreg, re- costume. But, since you entered free_ind_r as continuous, it is a weak proxy for time, which means that the secular effects are only partially and inaccurately being modeled.

I know you are looking for a way forward. Sometimes the way forward begins by taking a couple of steps back. The research questions you seem to be asking simply cannot be answered from this data set. So I suggest you reconsider your overall approach. Can you get a better data set, specifically, one that includes respondents from more than one country? If so, analyzing that with a two-way fixed effects regression will enable you to probe those questions.

If no such data set is feasible to obtain, then I suggest you revise your research goals, restricting them to questions that can be answered in this data set. The restriction you face with this data is that you cannot estimate or test the effect of any variable that is constant for all respondents in the same wave of the survey, or, rather, you can only do so with a model that produces a result that is confounded by the passage of time and cannot be truly attributed to the variable itself. So you need to restrict your research goals to questions primarily about respondent-level variables only. When I say that they must be primarily about respondent-level variables, the wiggle room here is that you can investigate whether a person level effect is modified by an rd-level variable: this can be done by including person_level_variable and person_level_variable#i.year_level_variable terms (N.B. #, not ##), along with i.rd but without an i.year_level_variable term.
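A minimal sketch of that specification, assuming the variable names used earlier in the thread and taking free_ind_r as the year-level variable. Because free_ind_r was entered as continuous in the earlier models, the sketch uses c. rather than i. for it; that choice is an assumption, not something settled in the thread.

```stata
* Person-level main effect, round indicators, and the interaction of the
* person-level variable with the year-level variable.
* Note # (interaction only), not ##: no main-effect term for free_ind_r.
regress bribe2 i.partisan i.urban resp_age i.educ i.female ///
    i.handle_corruption2 i.rd i.handle_corruption2#c.free_ind_r

* Joint test of the interaction terms that Stata retains.
testparm i.handle_corruption2#c.free_ind_r
```

Since free_ind_r is constant within each round, expect Stata to flag one redundant term as omitted. Check the output to see which term was dropped before interpreting the remaining interaction coefficients.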

I'm pretty sure this is not what you wanted to hear (read), but it's what I have to tell you.
Emeka Dim

Join Date: Jan 2018

Posts: 44
#9

12 Jan 2025, 20:35

Thank you. You are right about the composition of the dataset. free_ind_r is a year-level variable (the same figure for every respondent in a given survey round).

I ran the model you suggested (i.e., reg bribe2 i.partisan i.urban resp_age i.female i.dem_sup i.handle_corruption2##i.rd) and I got the result below:

I am guessing this is what I can work with. I am also guessing that I cannot run rd as a curvilinear variable.
Please, with this type of model, do I need to cluster the standard error or add robust standard errors (i.e. vce(robust))?

You asked if I could get another dataset. I have another country that was captured around the same time this current country I am using was surveyed. I could combine this country and another country (which is politically different) for a comparative analysis. Would that make sense?

Thank you.

Last edited by Emeka Dim; 12 Jan 2025, 21:23.
Emeka Dim

Join Date: Jan 2018

Posts: 44
#10

13 Jan 2025, 03:45

Hi, please what do you think of my last post?

Thank you.

Last edited by Emeka Dim; 13 Jan 2025, 04:16.
Nick Cox

Join Date: Mar 2014

Posts: 35358
#11

13 Jan 2025, 03:59

Clyde Schechter is based in California and is unlikely to respond for several hours.

In #4 he explained about our FAQ Advice. There is also advice there about bumping: https://www.statalist.org/forums/help#adviceextras #1
Emeka Dim

Join Date: Jan 2018

Posts: 44
#12

13 Jan 2025, 04:16

Thank you for the information. My initial last post did not contain much of the information I posted. I had revised it later and I felt he had moved on from the post.
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#13

13 Jan 2025, 08:39

What you show in #9 looks better, and examines the effect of handle_corruption2 on bribe2, and the extent of its variation over time. I agree that a curvilinear model of rd does not look at all promising here. As for robust standard errors, yes, you can use them. But, as you still have only 9 levels of rd, that is insufficient for -vce(cluster rd)-.
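For what it's worth, here is a minimal sketch of the #9 model with robust standard errors, followed by -margins- and -marginsplot-. vce(robust) changes only the standard errors, not the coefficients, which is consistent with the unchanged AIC/BIC reported in #5.

```stata
* Model from #9 with heteroskedasticity-robust standard errors.
regress bribe2 i.partisan i.urban resp_age i.female i.dem_sup ///
    i.handle_corruption2##i.rd, vce(robust)

* Effect of handle_corruption2 (relative to its base level) in each round.
margins rd, dydx(handle_corruption2)
marginsplot
```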

I have another country that was captured around the same time this current country I am using was surveyed. I could combine this country and another country (which is politically different) for a comparative analysis. Would that make sense?

I would think so. It would expand the range of questions you can ask to include some rd-level variables. Now, there are sometimes problems with multi-country data. For example, if different languages are used and some variables are survey responses, there is the possibility that some questions/responses "don't translate well" and produce data that is distorted by "the same" responses meaning different things. Or government statistics may be defined or calculated differently in different countries. So you have to consider those things, but if the variables are similarly ascertained in all of the countries, adding them to the data expands both the range of questions you can study and the generalizability of your results.
Emeka Dim

Join Date: Jan 2018

Posts: 44
#14

13 Jan 2025, 12:44

Thank you.

I will apply vce(robust) to the models going forward.

For the country comparisons, the current data I am analyzing is for Ghana. I also have Nigerian data; both are officially English-speaking countries. For the country-level variables, I am using data from international institutions like the World Bank and from foreign ranking agencies and projects like the Heritage Foundation, Polity V, and V-Dem, whose indicators apply the same standards to all countries when ranking them politically. The idea is that, over the period of the data collection (i.e., from 2000 to 2022), I will compare Ghana, a country ranked as politically free and stable, with Nigeria, a fairly unfree and unstable country, in the regression models. My plan is to run a regression model that interacts country with time to predict bribery.
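A minimal sketch of that setup. The file names ghana.dta and nigeria.dta and the indicator name country are hypothetical. The sketch also assumes both files use the same variable names and codings, and that rd takes the same values in both countries.

```stata
* Stack the two country files. generate() creates an indicator that is
* 0 for observations already in memory and 1 for the appended file.
use ghana.dta, clear
append using nigeria.dta, generate(country)
label define country 0 "Ghana" 1 "Nigeria"
label values country country

* Country-by-round interaction predicting bribery.
regress bribe2 i.partisan i.urban resp_age i.female ///
    i.country##i.rd, vce(robust)

* Predicted bribe2 for each country in each round.
margins country#rd
marginsplot
```

If a round was fielded in only one of the two countries, the corresponding country-by-round cell is empty and its margin will not be estimable.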

I am only seeking different means of making the most of what I have.
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#15

13 Jan 2025, 13:06

Sounds good.
