Handling Missing Data Using Missing Indicators

Fouziah Almouqati

28 May 2024, 04:30

Dear Stata Users,

I am dealing with missing data in my analysis. My dataset includes 553,426 observations and 14 predictors, all categorical. Eight of the predictors have missing values, totaling 35,810 (6.6%). I chose to create a missing-value dummy indicator for each affected predictor and then assign a fixed value of 23 to represent missing data, using:

Code:

misstable summarize total_CT_AT pre_year_cat age_grp p_sex ind_stat claim_grp triage_code pd_grp rf_source_grp arri_means_grp pre_day_type2 pre_shift treat_clin_grp covid_peak EO_Sta, generate(miss, exok)
recode ind_stat claim_grp triage_code pd_grp rf_source_grp arri_means_grp treat_clin_grp EO_Sta (. = 23)
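
As a quick sanity check on the two commands above, the generated indicators and the recoded level can be inspected. This is only a sketch: it assumes the commands ran in the order shown (indicators first, recode second), that the indicators carry the "miss" prefix from generate(), and that 23 was not already a valid category code. ind_stat is used as the example; the same check applies to the other seven recoded predictors.

```stata
* List the indicator variables created by misstable (prefix "miss" + variable name)
describe miss*

* After the recode, level 23 should hold the formerly system-missing values
* and no system-missing (.) rows should remain
tabulate ind_stat, missing

* The indicator itself: 1 = value was missing, 0 = value was observed
tabulate missind_stat
```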

Then I ran my regression with the following code, placing each missing indicator immediately after its corresponding variable:

Code:

nbreg total_CT_AT ib(first).pre_year_cat ib(first).age_grp ib(first).p_sex ib(first).ind_stat ib(first).missind_stat ib(first).claim_grp ib(first).missclaim_grp ib(first).triage_code ib(first).misstriage_code ib(first).pd_grp ib(first).misspd_grp ib(first).treat_clin_grp ib(first).misstreat_clin_grp ib(first).rf_source_grp ib(first).missrf_source_grp ib(first).arri_means_grp ib(first).missarri_means_grp ib(first).pre_day_type2 ib(first).pre_shift ib(first).covid_peak ib(first).EO_Sta ib(first).missEO_Sta, dispersion(mean) vce(cluster p_ID) irr allbaselevels

Q1: Is this method correct? I am aware that it may have some limitations, so I am open to other suggestions.

Q2: If the method is acceptable, how should I interpret the output below? It includes several notes about indicators omitted because of collinearity:

nbreg total_CT_AT ib(first).pre_year_cat ib(first).age_grp ib(first).p_sex ib(first).ind_stat ib(first).missind_stat ib(first).claim_grp ib(first).missclaim_grp ib(first).triage_code ib(first).misstriage_code ib(first).pd_grp ib(first)
> .misspd_grp ib(first).treat_clin_grp ib(first).misstreat_clin_grp ib(first).rf_source_grp ib(first).missrf_source_grp ib(first).arri_means_grp ib(first).missarri_means_grp ib(first).pre_day_type2 ib(first).pre_shift ib(first).covi
> d_peak ib(first).EO_Sta ib(first).missEO_Sta , dispersion(mean) vce(cluster p_ID) irr allbaselevels
note: 1.missind_stat omitted because of collinearity.
note: 1.missclaim_grp omitted because of collinearity.
note: 1.misstriage_code omitted because of collinearity.
note: 1.misspd_grp omitted because of collinearity.
note: 1.misstreat_clin_grp omitted because of collinearity.
note: 1.missrf_source_grp omitted because of collinearity.
note: 1.missarri_means_grp omitted because of collinearity.
note: 1.missEO_Sta omitted because of collinearity.

Is this normal?
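
A check along the following lines may show where these notes come from. It is a sketch only, assuming 23 was not already a valid code for ind_stat: in that case, level 23 of ind_stat would flag exactly the same observations as missind_stat == 1, so the two terms would carry identical information and only one of them could be estimated.

```stata
* Cross-tabulate the recoded predictor against its missing indicator;
* if the assumption holds, all missind_stat == 1 rows fall in ind_stat == 23
tabulate ind_stat missind_stat

* Count observations where "recoded to 23" and "flagged as missing" disagree;
* a count of 0 means the two variables encode the same information
count if (ind_stat == 23) != (missind_stat == 1)
```

If the count is 0 for each of the eight pairs, the omitted-indicator notes would be the expected consequence of including both the recoded level and its indicator in the same model.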

I have performed the same analysis after deleting some observations with missing values and regrouping others under the categories "Other" and "Unknown," and the coefficients have not changed significantly.

Thank you for any advice and help.

Regards,
Fouziah