  • Handling missing values in multi-level models: at least 5 observations per group

    Hi everyone,

    I am working with a multilevel dataset of individuals nested in counties/localities/municipalities (I cannot post an example because of data privacy). I have read elsewhere that, ideally, I need at least 5 level-1 observations (individuals) per level-2 group (county). Before I run the multilevel models, I try to enforce this by typing:

    bysort county_code: gen n = _N
    keep if n >= 5

    However, I am aware that this only removes rows based on the number of observations per county, without considering whether there are missing values in the variables that I later use in my regressions. Since I am appending various individual surveys with different numbers of observations and variables, my multilevel models end up including level-2 units with fewer than 5 observations (the output reports "min. observations per group = 1"), which, from what I understand, is not recommended.

    How could I handle this issue without simply dropping rows, given that I change both the dependent and independent variables across the models I run?


    Thanks in advance,

  • #2
    As you do not show example data, I'll illustrate an approach to this using one of the online datasets available from StataCorp, modified by sprinkling some missing values into it.

    Code:
    webuse pisa2000, clear
    
    //    SCATTER SOME MISSING VALUES THROUGH THE DATA
    set seed 1234
    foreach v of varlist female-pass_read {
        replace `v' = . if runiform() < 0.05
    }
    
    //    IDENTIFY OBSERVATIONS WITH MISSING VALUES
    //    OF ANY IMPORTANT VARIABLE
    egen mcount = rowmiss(female-pass_read)
    
    //    IDENTIFY NUMBER OF COMPLETE OBSERVATIONS PER SCHOOL
    by id_school, sort: egen complete_cases = total(mcount == 0)
    
    //    KEEP SCHOOL IF IT HAS AT LEAST 5 COMPLETE CASES
    by id_school: keep if complete_cases >= 5
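
    Because the variables change from model to model, the same idea can be applied per model without deleting anything: flag the usable observations and pass the flag in an -if- qualifier. This is only a minimal sketch, assuming a hypothetical model of pass_read on female, high_school and college; the names miss_m1, complete_m1, n_complete_m1 and use_m1 are made up for illustration.

    ```stata
    webuse pisa2000, clear

    //    HYPOTHETICAL MODEL 1: THE VARIABLE LIST DIFFERS FROM MODEL TO MODEL
    local depvar pass_read
    local indepvars female high_school college

    //    FLAG OBSERVATIONS THAT ARE COMPLETE ON THIS MODEL'S VARIABLES ONLY
    egen byte miss_m1 = rowmiss(`depvar' `indepvars')
    gen byte complete_m1 = (miss_m1 == 0)

    //    COUNT COMPLETE CASES PER SCHOOL FOR THIS MODEL
    by id_school, sort: egen n_complete_m1 = total(complete_m1)

    //    USABLE = COMPLETE CASE IN A SCHOOL WITH AT LEAST 5 COMPLETE CASES
    gen byte use_m1 = (complete_m1 == 1) & (n_complete_m1 >= 5)

    //    RESTRICT THE ESTIMATION SAMPLE WITH -if- INSTEAD OF DROPPING ROWS
    melogit `depvar' `indepvars' if use_m1 || id_school:
    ```

    Repeating the block with a different variable list and suffix (m2, m3, and so on) gives each model its own flag while the data set itself stays intact.
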
    That said, I don't think doing this is a good idea. Missing data are only occasionally a random accident. Missingness is often informative: the observations that contain missing data (or are missing altogether) may well differ systematically from those that are complete. By keeping only schools (in your case, counties) with at least 5 complete cases, you may well be introducing a bias into your sample. While there is probably nothing you can do about the counties that simply have only a small number of observations available, you should still keep them in your sample for analysis. Retaining them will do no harm, and deleting them very well might.
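
    A rough way to see whether missingness is related to something observed is to tabulate a missingness indicator against a covariate. The sketch below is only an illustration under assumptions: it reuses the artificial missing values from the code above (so here missingness is random by construction), and miss_outcome is a made-up variable name.

    ```stata
    webuse pisa2000, clear

    //    ARTIFICIAL MISSING VALUES, AS IN THE EXAMPLE ABOVE
    set seed 1234
    replace pass_read = . if runiform() < 0.05

    //    HOW MUCH IS MISSING, VARIABLE BY VARIABLE
    misstable summarize

    //    INDICATOR FOR A MISSING OUTCOME
    gen byte miss_outcome = missing(pass_read)

    //    DOES THE SHARE OF MISSING OUTCOMES DIFFER BY AN OBSERVED COVARIATE?
    tabulate female miss_outcome, row chi2
    ```

    In real data, a clear association in a table like this would suggest that complete cases differ systematically from incomplete ones on at least that covariate. The absence of an association does not show that missingness is unrelated to unobserved values.
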

    As for observations that are present in the data set but have missing values for some variables, there may be better ways of dealing with them. Look into https://statisticalhorizons.com/wp-c...aterials-1.pdf.
