How to select categorical variables in the hotdeck method to impute missing data

Ali Rezaei

Join Date: Jan 2022

Posts: 15
#1

04 Feb 2022, 21:31

Hello Everyone
I am trying to impute missing values in four price variables in my research, and I am using the hotdeck package in Stata. Could you advise me on which categorical variables to specify in the by() option? The data are a panel, structured as below:

State, Population, Longitude, Latitude, Municipality, Date, Price of Commodity 1, Price of Commodity 2, Price of Commodity 3, Price of Commodity 4
…. … …. … …. …. … … … …

Panel identifiers are Municipality and Date.

I have tried different combinations of State, Population, Longitude and Latitude as the categorical variables for imputing the missing data with the hotdeck command, but I am not sure which combination to choose. Municipalities are grouped by state. The rationale for using these variables is to enlarge the donor pool and to draw on more relevant data for the imputation.
I appreciate your comments in advance.
Tags: hotdeck, imputation, missing value
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#2

05 Feb 2022, 01:33

Ali:
probably the best advice is to take a look at the -mi- suite of commands once you have diagnosed the mechanism underlying the missingness of your data.

Kind regards,
Carlo
(Stata 19.0)
John Eiler

Join Date: Nov 2019

Posts: 50
#3

08 Feb 2022, 06:02

Well, defining the cells is pretty much the art of any use of a hotdeck, and there are no hard rules here. The objective is always the same though -- to define cells such that each one has enough donors for the choice of donor to stay genuinely random.
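
To make the "enough donors in each cell" check concrete, here is a minimal sketch in Python (not Stata, and not what the hotdeck package does internally). It counts donors and recipients per candidate cell and lists the cells that are too thin. The column names, the toy rows and the `min_donors` threshold are all assumptions made up for illustration.

```python
from collections import defaultdict

def donor_counts(rows, cell_keys, target):
    """Count donors (target observed) and recipients (target missing) per cell."""
    counts = defaultdict(lambda: {"donors": 0, "recipients": 0})
    for row in rows:
        cell = tuple(row[k] for k in cell_keys)
        kind = "recipients" if row[target] is None else "donors"
        counts[cell][kind] += 1
    return dict(counts)

def thin_cells(rows, cell_keys, target, min_donors=5):
    """Cells that hold at least one recipient but fewer than min_donors donors."""
    return sorted(
        cell
        for cell, c in donor_counts(rows, cell_keys, target).items()
        if c["recipients"] > 0 and c["donors"] < min_donors
    )

# Hypothetical toy data: None marks a missing price.
rows = [
    {"state": "A", "municipality": "A1", "price1": 10.0},
    {"state": "A", "municipality": "A1", "price1": None},
    {"state": "A", "municipality": "A2", "price1": 12.0},
    {"state": "B", "municipality": "B1", "price1": None},
    {"state": "B", "municipality": "B1", "price1": None},
]

# Compare a coarse cell definition with a finer one.
print(thin_cells(rows, ["state"], "price1", min_donors=2))
print(thin_cells(rows, ["state", "municipality"], "price1", min_donors=2))
```

Running the same check for each candidate combination of cell variables shows quickly which definitions leave recipients without a usable donor pool.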

If your municipalities are fine-grained enough but still contain enough donors, that seems like it would be the best choice, in that you capture both regional variation and urban/rural variation. You could then go to lat/long as a backup, but it's less clear how to define cells there beyond just making them big enough to contain sufficient donors. Alternatively you could look more at a nearest neighbor (or knn) approach.
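
The "finest cell first, then a backup" idea could be sketched roughly like this in Python. This is only an illustration under assumptions: the variable names, the order of the fallback levels and the `min_donors` threshold are invented here, and the hotdeck package itself has no such fallback that I know of. For panel data you would probably also want the date (or a period) in the cell definition.

```python
import random

def hotdeck_with_fallback(rows, target, cell_levels, min_donors=2, seed=0):
    """Random hot deck: try each cell definition in order (finest first) and
    draw a donor value from the first cell that has at least min_donors donors.
    Donors are always taken from the originally observed values only."""
    rng = random.Random(seed)
    out = [dict(r) for r in rows]  # copy so the input rows are left untouched
    for r in out:
        if r[target] is not None:
            continue
        for keys in cell_levels:
            donors = [
                d[target]
                for d in rows
                if d[target] is not None and all(d[k] == r[k] for k in keys)
            ]
            if len(donors) >= min_donors:
                r[target] = rng.choice(donors)
                break  # stop at the finest cell that is big enough
    return out

# Hypothetical toy data: None marks a missing price.
rows = [
    {"state": "A", "municipality": "A1", "price1": 10.0},
    {"state": "A", "municipality": "A1", "price1": 11.0},
    {"state": "A", "municipality": "A1", "price1": None},
    {"state": "A", "municipality": "A2", "price1": None},
    {"state": "B", "municipality": "B1", "price1": None},
]

filled = hotdeck_with_fallback(rows, "price1", [["municipality"], ["state"]])
for r in filled:
    print(r)
```

In this toy example the A2 recipient has no donors in its own municipality and falls back to the state cell, while the B1 recipient stays missing because no level has enough donors.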

A lot of people hate hotdeck though, and much prefer the mi approach (I sometimes hotdeck and sometimes use Stata's mi; it just depends on the situation for me). I will suggest looking at pmm though (try "help mi impute pmm"). The pmm stands for "predictive mean matching". You can read the documentation, but very loosely it could be considered a hybrid of a regression and a hotdeck.
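
To show what "hybrid of a regression and a hotdeck" means, here is a deliberately simplified Python sketch of the matching step. It is not Stata's implementation: a proper pmm also draws the regression coefficients from their posterior before predicting, which is skipped here, and the single predictor, the toy numbers and `k` are assumptions for illustration only.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def pmm_impute(x, y, k=3, seed=0):
    """Simplified predictive mean matching: regress y on x using complete cases,
    then fill each missing y with the OBSERVED y of a donor drawn at random
    from the k donors whose predicted values are closest to the recipient's."""
    rng = random.Random(seed)
    donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([d[0] for d in donors], [d[1] for d in donors])
    donor_pred = [(a + b * xi, yi) for xi, yi in donors]
    out = list(y)
    for i, (xi, yi) in enumerate(zip(x, y)):
        if yi is None:
            pred = a + b * xi
            nearest = sorted(donor_pred, key=lambda d: abs(d[0] - pred))[:k]
            out[i] = rng.choice(nearest)[1]  # borrow a real observed value
    return out

# Hypothetical toy data: x is a fully observed predictor, None marks missing y.
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 12.0, None, 15.0, 18.0, None]
print(pmm_impute(x, y, k=2))
```

The regression only decides who counts as a "similar" donor; the imputed value itself is always one that was actually observed, which is the hotdeck part of pmm.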

It's conceptually very cool and seems to be more "respectable" than hotdeck to many people -- and it's always nicer when something is built into Stata as opposed to "community contributed". I try to use pmm when I can, but Stata's implementation of it is painfully slow -- I usually have to give up on it with data sets of a couple hundred thousand observations, as it will show no progress after 15 minutes. My most recent experience with it is from Stata 15 though, so it's possible they've improved the speed in version 17.

SAS also has hotdeck built in so you might find a friendlier reception for hotdeck questions there, if you don't hate SAS as much as me ;-)
Ali Rezaei

Join Date: Jan 2022

Posts: 15
#4

10 Feb 2022, 12:47

Thank you John for your comment and very helpful suggestion.