Highly unequal subsample sizes in OLS regression (city-level effects)

Jeanee Miller

Join Date: Feb 2017

Posts: 3
#1

07 Apr 2025, 15:28

I am planning to estimate an OLS regression model, to gauge the relationship between various sociodemographic (Census) features and political data at the census tract level. As an example, this model will regress voter turnout on education level, income, age composition, and racial composition. Both the dependent and predictor variables will be continuous. This model will include data from several cities and I would like to estimate city-level differences to see if the relationships between variables differ across cities. I gather that the best approach is to estimate a single regression model and include dummies for the cities.
The problem is that the sample size for each city varies very widely (n = 200 for the largest city, but only n = 20 for the smallest).

I have two questions:
1. Would estimating city-level differences be impossible with the disparity in subsample sizes?
2. If so, I could use block groups instead of census tracts. This would increase the sample sizes (n = 800 for the largest city, n = 100 for the smallest). Would the remaining disparity between the two still be problematic?
Tags: ols regression, regression, small sample, statistical power
Andrew Musau

Join Date: Oct 2014

Posts: 10055
#2

08 Apr 2025, 01:40

It doesn't matter much if n = 200 for City A and n = 20 for City B. You're only estimating one dummy per city, and that's perfectly feasible even with n = 20 (so long as it's not really tiny, like n < 5). Here, you're not estimating city-specific slopes, so you're not trying to detect how education, income, etc., work differently across cities. Instead, you're just controlling for the fact that City A may have higher turnout than City B on average, regardless of sociodemographics.
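A minimal sketch of the dummy-variable specification described above, run on simulated data. Everything here is an assumption for illustration: the variable names (turnout, educ, income, city), the three city sizes, and the coefficients are hypothetical and do not come from the original poster's data.

```stata
* Simulated tract-level data: 3 hypothetical cities with 200, 60 and 20 tracts
clear
set seed 12345
set obs 280
generate city = cond(_n <= 200, 1, cond(_n <= 260, 2, 3))
generate educ = rnormal()
generate income = rnormal()
* City shifts the level of turnout only; slopes are common to all cities
generate turnout = 50 + 2*city + 3*educ + 1.5*income + rnormal(0, 5)

* One dummy per city (city 1 is the base category), common slopes
regress turnout educ income i.city

* Joint test that the city intercept shifts are all zero
testparm i.city
```

Each city contributes only one extra parameter here, which is why the n = 20 city is not a problem for this specification.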
Andrew Musau

Join Date: Oct 2014

Posts: 10055
#3

08 Apr 2025, 02:05

Of course, if you wanted to estimate city-specific slopes, you would need to include interaction terms for each predictor \(\times\) each city dummy. This multiplies the number of parameters and quickly eats up degrees of freedom, which will be problematic for the smaller cities. So your second option, using block groups, will be better than your first. You would need to examine whether the standard errors for the small cities are large and use caution when interpreting their results (probably focusing on the larger cities for slope comparisons). For this kind of data, estimating a multilevel model is often a better approach. The model allows you to estimate city-level intercepts and slopes, but it shrinks estimates toward the overall mean, especially for small cities. See

Code:

help mixed
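To make the two options concrete, here is a rough sketch on simulated data, first with city-by-predictor interactions and then with a multilevel model. The variable names, the 12 cities, their sizes, and all parameter values are hypothetical. The random slope on educ alone is an arbitrary choice for brevity, and convergence of mixed on real data with few cities is not guaranteed.

```stata
* Simulated data: 12 hypothetical cities with 200, 60 or 20 tracts each
clear
set seed 2025
set obs 12
generate city = _n
generate n_tracts = cond(_n <= 2, 200, cond(_n <= 8, 60, 20))
* City-specific deviations in the intercept and in the educ slope
generate u0 = rnormal(0, 4)
generate u1 = rnormal(0, 1.5)
expand n_tracts
generate educ = rnormal()
generate income = rnormal()
generate turnout = 50 + u0 + (3 + u1)*educ + 1.5*income + rnormal(0, 5)

* Option 1: OLS with city-specific slopes via interactions
* (11 extra parameters per interacted predictor)
regress turnout i.city##c.educ i.city##c.income
* Joint test that slopes do not differ across cities
testparm i.city#c.educ i.city#c.income

* Option 2: multilevel model with random intercept and random educ slope
mixed turnout educ income || city: educ, covariance(unstructured)

* Predicted (shrunken) city-level deviations: slope first, then intercept
predict b_educ b_cons, reffects
egen tag = tag(city)
list city n_tracts b_educ b_cons if tag
```

Comparing the standard errors of the interaction terms for the 20-tract cities with those for the 200-tract cities shows how imprecise the small-city slopes are under Option 1.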
Jeanee Miller

Join Date: Feb 2017

Posts: 3
#4

08 Apr 2025, 12:42

Thank you very much for responding. Interacting the city dummies with the predictors is what I was planning to do, so it is a great point to consider the degrees of freedom in that case.

"it shrinks estimates toward the overall mean, especially for small cities"

I'm less experienced with estimating multilevel models. Can you provide any reading material related to this part (above)?
Felix Bittmann

Join Date: Aug 2018

Posts: 660
#5

08 Apr 2025, 13:18

This one here is great: https://www.bristol.ac.uk/cmm/learning/online-course/

Best wishes

(Stata 16.1 MP)
Clyde Schechter

Join Date: Apr 2014

Posts: 29937
#6

08 Apr 2025, 14:55

Just want to add my strong personal endorsement of the Bristol University course that Felix Bittmann recommended. I took the course myself and quickly and easily learned what I needed to begin using multilevel models in practical situations. It's accommodating to Stata users, and very well taught. An excellent investment of your time.
Andrew Musau

Join Date: Oct 2014

Posts: 10055
#7

08 Apr 2025, 15:47

Originally posted by Jeanee Miller:

I'm less experienced with estimating multilevel models. Can you provide any reading material related to this part (above)?

Michael Clark has a nice explanation, accompanied by graphical demonstrations, on this topic on his GitHub page: https://m-clark.github.io/posts/2019...-mixed-models/. If you're interested in literature on multilevel models, refer to the references in the Stata PDF manual entry for the mixed command. Stata’s documentation is an excellent resource in its own right, with discussions and applications using example datasets.
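As a quick illustration of the shrinkage itself, consider the simplest case, a random-intercept model (no random slopes). This is a standard textbook result rather than anything specific to this thread, and the variance values used in the example are hypothetical. With tract \(i\) in city \(j\), city size \(n_j\), and variances \(\sigma_u^2\) (between cities) and \(\sigma_e^2\) (within cities) treated as known:

\[
y_{ij} = x_{ij}'\beta + u_j + e_{ij}, \qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)
\]

\[
\hat{u}_j = \lambda_j \left( \bar{y}_j - \bar{x}_j'\hat{\beta} \right), \qquad \lambda_j = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2 / n_j}
\]

The factor \(\lambda_j\) lies between 0 and 1 and grows with \(n_j\). For example, if \(\sigma_u^2 = 1\) and \(\sigma_e^2 = 20\), a city with \(n_j = 20\) has \(\lambda_j = 1/(1 + 1) = 0.5\), while a city with \(n_j = 200\) has \(\lambda_j = 1/(1 + 0.1) \approx 0.91\). So the small city's estimated deviation is pulled halfway toward zero (that is, toward the overall mean), while the large city's is left nearly untouched.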
