
  • Including missing values in GLM

    Good morning all,

    I'm examining the proportion of refugees returning 'home' based on conditions in their countries of asylum and countries of origin across one hundred country-dyads in a twenty year period, using the following code in Stata 13 (with dependent/independent variables renamed here for ease of use):

    stepwise, pr(.2): glm dependent_variable independent_variable1 independent_variable2 independent_variable3 [etc], link(logit) family(binomial) robust nolog

    However, a lot of my independent variables have missing values, and I've just discovered that none of these are included in the GLM analysis - my dataset has over 1,900 rows, but only around 400 of these are actually used in the model (i.e., only the rows for which none of the information is missing). What can I do in order to include missing values in my model?

    I've heard about the MI command, and I was wondering if that might help - but I don't know how to use it, nor how it would affect my model.

    Suggestions would be much appreciated!

    Thank you!

    Chloe



  • #2
Hello Chloe,

You can use Stata to apply multiple imputation methods. There is a specific 'book' on this topic in the Stata Manual, and I strongly recommend you start by taking a look at it.

You may also type -help mi impute- in the command window to get the core information.
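To make that concrete, a minimal -mi- workflow might look like the sketch below. The variable names simply reuse the placeholders from post #1, and the choice of imputation method, number of imputations, and predictors must be adapted to the actual data:

```stata
* sketch: multiple imputation with -mi-, placeholder variable names
mi set mlong
mi register imputed independent_variable1 independent_variable2
mi impute chained (regress) independent_variable1 independent_variable2 ///
    = dependent_variable independent_variable3, add(20) rseed(12345)
* fit the substantive model on each imputed dataset and combine results;
* note that the -stepwise- prefix is not supported under -mi estimate-
mi estimate: glm dependent_variable independent_variable1 ///
    independent_variable2 independent_variable3, ///
    link(logit) family(binomial) robust
```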

That said, beware: multiple imputation methods are suited to MAR (missing at random) data. Therefore, you are also supposed to demonstrate that your missing-data pattern follows this mechanism. Again, you will find great start-up information in the Stata Manual.
    Last edited by Marcos Almeida; 20 Mar 2017, 06:43.
    Best regards,

    Marcos



    • #3
      There are also other ways to handle missing data - there is an alternative approach available in SEM/GSEM.
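For reference, the SEM route mentioned here is full-information maximum likelihood, available via method(mlmv). A sketch with the placeholder variable names from post #1; note that method(mlmv) assumes joint normality and is only available in linear -sem-, not in -gsem-, so the logit link of the original model cannot be kept:

```stata
* sketch: FIML estimation, which uses all available observations
* instead of listwise-deleting incomplete rows
sem (dependent_variable <- independent_variable1 independent_variable2 ///
    independent_variable3), method(mlmv)
```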



      • #4
        Chloe:
as an aside to Marcos's and Phil's helpful advice, with such a heavy burden of missing values the reliability of any inference is seriously at risk.
Stata applies listwise deletion to observations with missing values in any of the variables: hence, no wonder that you ended up with 400 out of 1,900 observations (see, if interested: https://www.ncbi.nlm.nih.gov/pubmed/12589867).
Besides, as Marcos warned you, it is very likely that you have missing not at random (MNAR, also written NMAR) values (which, unfortunately, you cannot prove different from MAR). If that were the mechanism underpinning the missingness of your data, it probably matches a monotone missing-data pattern (i.e., once a value of a given variable is missing, it remains missing in each subsequent wave of data).
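Before choosing a remedy, the extent and pattern of the missingness can be inspected directly. A sketch, again using the placeholder names from post #1:

```stata
* sketch: diagnose missingness before picking a fix
misstable summarize dependent_variable independent_variable1 ///
    independent_variable2 independent_variable3
misstable patterns independent_variable1 independent_variable2 ///
    independent_variable3, frequency
* count the complete cases exactly as -glm- would see them
egen nmiss = rowmiss(dependent_variable independent_variable1 ///
    independent_variable2 independent_variable3)
count if nmiss == 0
```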
        For possible fixes, I would take a look at https://www.crcpress.com/Flexible-Im.../9781439868249
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Thank you very much for these helpful replies. As you say Carlo, my missing values pose a serious limitation to my analysis. I am looking at data such as percentage of children out of school, and in many cases data is simply not available for certain years/countries.

I'll have a read through the ovarian cancer study article, which looks very relevant. However, after consultation with my supervisor, I'm very wary of using any form of imputation for my missing values - is there no way of somehow including the missing values in the analysis? If not in Stata, perhaps using different software, such as R?

          Again, thank you for your support!

          Best wishes,

          Chloe



          • #6
            is there no way of somehow including the missing values in the analysis?
            You may wish to take a look at Phil's post #3. By "somehow", I assumed you meant the "lato sensu" approach.
            Best regards,

            Marcos



            • #7
              Thank you Marcos, I will look into it. I'm afraid my knowledge of stats is very limited, but I'm doing my best to research all the suggestions.

              Best wishes,

              Chloe



              • #8
                Chloe:
                the main issue with your data is not the software you may use, but how to deal with the mechanism underpinning the missingness of your data.
                If your data are MNAR, a feasible approach is reported in https://www.crcpress.com/Flexible-Im.../9781439868249, chapter 7.2.
                Kind regards,
                Carlo
                (StataNow 18.5)



                • #9
Carlo provided great advice and Phil pointed out an excellent alternative. I gave my view in #2.

On second thought, though, and assuming your "dataset has over 1,900 rows, but only around 400 of these are actually used in the model (i.e., only the rows for which none of the information is missing)", I believe that losing around 79% of the observations from a given model is quite a problem to curb.

Maybe I'm missing something, but I fear such a task may become unfeasible, at least if we abide by the theoretical assumptions of multiple imputation methods in general.
                  Best regards,

                  Marcos



                  • #10
                    Chloe:
laudably, Marcos's reply gets our feet back on the ground.
I forgot to read your original post before replying today.
If the number of valid observations (i.e., with no missing values; observations, not rows, in Stata jargon, please) is only 21% of the theoretical sample, the results of your main (and, mandatorily, sensitivity) analyses should be taken with care.
My previous reference is still valid, I think, but if you're going to present your research to an audience, I guess the gist of your talk can hardly exceed commenting on an exercise in how to deal with extreme missingness.
                    Kind regards,
                    Carlo
                    (StataNow 18.5)



                    • #11
                      Thank you all, this is incredibly helpful.

I'm adding a lengthy limitations section where I can expand on all these challenges in depth, and I'm also discussing the various alternatives and their differing results, including:
1. the initial attempt, with serious listwise-deletion issues and only 400 observations
2. a second attempt, excluding the variable with the most missing values, leaving me with 1,600 observations (better!)
3. a third attempt, using Amelia for multiple imputation.

                      Again, thanks for your support.



                      • #12
                        Chloe:
as far as point 1 of your well-conceived dissemination strategy is concerned, I would quote https://www.ncbi.nlm.nih.gov/pubmed/12589867.
Some concerns about point 2: most of your explanation should be devoted to the missing-data mechanism and pattern of this variable.
As an aside: why use Amelia when Stata's -mi- can do the same?
                        Kind regards,
                        Carlo
                        (StataNow 18.5)



                        • #13
                          Dear all,

                          I am facing a similar issue. I am trying to examine the effect of family ownership on the deactivation of offshore companies in the Panama Papers data leak. Unfortunately, for some of the companies, ownership information is not available because of the use of bearer shares, enabling beneficial owners to maintain their full anonymity. This is a case of NMAR.
                          Furthermore, I don't have data to capture the size of offshore companies in the network - which may cause omitted variable problems. To correct for the selectivity effect arising from missing ownership information and for omitted variable problems arising from data unavailability on company size, I am using the command gsem in Stata/MP 15.1.

                          This is the list of my covariates:

                          familiar_shareholders - family-owned company (binary variable)
                          age - company's age (quantitative variable)
                          llc - legal form: llc vs. corporation (binary variable)
connected_exits_L1 - the number of deactivations of tied offshore companies in the ownership network between the beginning of the prior year and time t (quantitative variable)
                          degree centrality (quantitative variable)
                          closeness centrality (quantitative variable)
                          betweenness centrality (quantitative variable)
                          local clustering (quantitative variable)
                          inter_incorporations_L1 - number of incorporations by the focal company's intermediary at t-1 (quantitative variable)
                          inter_deactivations_L1 - number of deactivations by the focal company's intermediary at t-1 (quantitative variable)
                          jur_deactivations_L1 - number of deactivations in the focal company's jurisdiction at t-1 (quantitative variable)
                          jur_incorporations_L1 - number of incorporations in the focal company's jurisdiction at t-1 (quantitative variable)
                          post_tiea - country-jurisdiction dyad under a tax information exchange agreement (binary variable)
                          i.year - year fixed effects (binary variables)
                          i.count_jur_dyad - country-jurisdiction dyad fixed effects (binary variables)
                          bearer - bearer shares, causing familiar_shareholders = . (binary variable)

                          I am assuming that the probability of deactivation depends on a time-varying latent variable L1 (representing size), which also affects family ownership and the presence of bearer shares. In particular, family ownership is affected by L1, other exogenous covariates, and conditional on the absence of bearer shares - I am using another latent variable L2 to correct for selection. The presence of bearer shares is affected by L1, L2, and other exogenous covariates. I am using weights (k2k Coarsened Exact Matching (CEM) approach) to match family and non-family owned offshore companies in the sample based on their network positioning (degree, closeness, betweenness, clustering), legal form, and incorporation year.

                          The Stata code I am using is:

                          Code:
                           gsem (deactivation <- i.familiar_shareholders connected_exits_L1 degree closeness_adj betweenness_adj clustering age i.llc inter_incorporations_L1 inter_deactivations_L1 jur_deactivations_L1 jur_incorporations_L1 i.post_tiea i.year L1, probit) ///
                           (familiar_shareholders <- L1 L2 age inter_incorporations_L1 inter_deactivations_L1 jur_deactivations_L1 jur_incorporations_L1 i.post_tiea i.year, probit) ///
                           (bearerother <- L1 L2@1 age inter_incorporations_L1 inter_deactivations_L1 jur_deactivations_L1 jur_incorporations_L1 i.post_tiea i.year, probit) [iweight=jc_k2k_matched_cem], var(L1@1 L2@1) nolog
                          I had to remove country-jurisdiction dyad fixed effects from the model because I was receiving the following error message:
                          Code:
                           Grid search failed to find values that will yield a log likelihood value.
                          The model has been running for 15 hours. Since I will have to run it several times to test my hypotheses, I was wondering whether you could provide me with any suggestions on how to simplify the model while still correcting for endogeneity and selectivity issues.

                          Thank you for considering my request.
                          Best regards,
                          Ambra
                          Last edited by Ambra Mazzelli; 07 Nov 2019, 22:05.

