Handling missing values in multi-level models: at least 5 observations per group

Nadal Perales

Join Date: Oct 2024

Posts: 1
#1

04 Oct 2024, 09:58

Hi everyone,

I am working with a multilevel dataset of individuals nested in counties/localities/municipalities (I cannot post an example because of data privacy). I have read elsewhere that, ideally, each level-2 group (county) should contain at least 5 level-1 observations (individuals). Before I run the multilevel models, I try to enforce this by typing:

bysort county_code: gen n=_N
keep if n>=5

However, I am aware that this keeps or drops counties based only on their total number of rows, without considering whether those rows have missing values in the variables I later use in my regressions. Since I am appending several individual surveys with different numbers of observations and variables, my multilevel models end up including level-2 units with fewer than 5 observations ("min. observations per group = 1"), which, from what I understand, is not recommended.

How can I handle this without simply dropping rows, given that both the dependent and independent variables change across the models I run?

Thanks in advance,
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

04 Oct 2024, 11:23

As you do not show example data, I'll illustrate an approach to this using one of the online datasets available from StataCorp, modified by sprinkling some missing values into it.

Code:

webuse pisa2000, clear

// SCATTER SOME MISSING VALUES THROUGH THE DATA
set seed 1234
foreach v of varlist female-pass_read {
    replace `v' = . if runiform() < 0.05
}

// IDENTIFY OBSERVATIONS WITH MISSING VALUES
// OF ANY IMPORTANT VARIABLE
egen mcount = rowmiss(female-pass_read)

// IDENTIFY NUMBER OF COMPLETE OBSERVATIONS PER SCHOOL
by id_school, sort: egen complete_cases = total(mcount == 0)

// KEEP SCHOOL IF IT HAS AT LEAST 5 COMPLETE CASES
by id_school: keep if complete_cases >= 5
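Because the variables change from model to model, the same count can be tied to one model's variable list rather than to the whole `female-pass_read` range. The following is only a sketch under stated assumptions: the local macro `modelvars` and the new variables `nmiss`, `touse`, `n_complete` and `small_school` are names made up for this illustration, and the two-variable model is an arbitrary choice. It flags small schools instead of dropping them.

```stata
webuse pisa2000, clear

// SCATTER SOME MISSING VALUES, AS IN THE EXAMPLE ABOVE
set seed 1234
foreach v of varlist female-pass_read {
    replace `v' = . if runiform() < 0.05
}

// VARIABLES OF ONE PARTICULAR MODEL (HYPOTHETICAL CHOICE)
local modelvars pass_read female

// FLAG OBSERVATIONS THAT ARE COMPLETE ON THOSE VARIABLES ONLY
egen byte nmiss = rowmiss(`modelvars')
generate byte touse = (nmiss == 0)

// NUMBER OF USABLE OBSERVATIONS PER SCHOOL FOR THIS MODEL
by id_school, sort: egen n_complete = total(touse)

// FLAG, RATHER THAN DROP, SCHOOLS WITH FEWER THAN 5 USABLE CASES
generate byte small_school = (n_complete < 5)
tabulate small_school if touse
```

Changing `modelvars` and rerunning the block (after dropping the generated variables) gives the counts for a different model. Replacing the last two commands with `keep if n_complete >= 5` would apply the same restriction as above, one model at a time.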

That said, I don't think doing this is a good idea. Missing data is only occasionally a random accident. Missingness is often informative; that is, the observations that contain missing data (or are missing altogether) may well differ systematically from those that are complete. By keeping only schools (in your case, counties) with at least 5 complete cases, you may well be introducing a bias into your sample. While there is probably nothing you can do about the counties that simply have only a small number of observations available, you should still keep them in your sample for analysis. Retaining them will do no harm, and deleting them very well might.
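One informal way to look at whether missingness is related to observed values is to compare the outcome between cases with and without a given covariate. The sketch below is only an illustration: it makes `female` missing at random purely so that there is something to tabulate, and `miss_female` is a name invented for this example.

```stata
webuse pisa2000, clear

// MAKE female MISSING AT RANDOM FOR ILLUSTRATION ONLY
// (ASSUMPTION: IN REAL DATA THE MISSING VALUES ARE ALREADY THERE)
set seed 1234
replace female = . if runiform() < 0.05

// HOW MUCH IS MISSING IN EACH VARIABLE
misstable summarize female pass_read

// INDICATOR FOR MISSINGNESS IN female
generate byte miss_female = missing(female)

// DOES THE OUTCOME DIFFER BETWEEN CASES WITH AND WITHOUT female?
tabulate miss_female pass_read, row chi2
```

Because the missing values here are generated at random, no systematic difference is expected in this example. In real data, a clear difference in `pass_read` between the two rows would be a hint that missingness is informative. The chi-squared test ignores the clustering of students in schools, so it should be read as a rough check only.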

As for observations that are present in the data set but have missing values for some variables, there may be better ways of dealing with them. Look into https://statisticalhorizons.com/wp-c...aterials-1.pdf.