  • Analysing non-random sampled survey data

    I am attempting to analyse survey data we obtained through convenience sampling (an email with the survey attached was sent to a large number of people, with the option to undertake the survey). I am unsure how to approach this analysis, as all the information about analysing surveys with Stata seems to assume that the data were collected randomly.

    Should I attempt to analyse the data using -svyset-? If so, how do I deal with the issue that the data were not collected randomly? Or should I just run regressions on it without -svyset- and try to tackle this issue with econometric models (fixed effects?)?

    The data are set up with each column being a question and each row a respondent. Some questions are multiple-choice and others are yes/no. Numbers represent the answers in the order in which they are displayed in the survey.
    e.g.
    What is your age?
    18-25
    26-30
    31-35
    36-40
    41+

    If a respondent is 32, the entry in this question's column will be 3.
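
    For reference, a minimal sketch of how these numeric codes could be given descriptive value labels in Stata; the variable name age_group is a placeholder:
    Code:
    * attach descriptive labels to the coded answers (age_group is a placeholder name)
    label define agelbl 1 "18-25" 2 "26-30" 3 "31-35" 4 "36-40" 5 "41+"
    label values age_group agelbl
    * check the resulting coding
    tabulate age_group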

    Any help would be much appreciated!!

    Many thanks,
    Ameer

  • #2
    Do you have any information about the overall population? Did you want to target a specific demographic? What is the purpose of the survey? Do you have theoretical reasons to believe that non-response was driven by the response itself? Is there scope for measurement error?

    These are just a few questions you should ask yourself when you are in a situation like this. Solon et al. (2015), in the Journal of Human Resources, provide a very helpful guide on whether or not to include weights in survey estimation.
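
    For example, a minimal sketch of that weighted/unweighted comparison in Stata, with placeholder names (outcome y, regressor x, probability weight wt):
    Code:
    * unweighted estimate (placeholder variable names)
    regress y x
    * the same model with probability weights, for comparison
    regress y x [pweight=wt]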



    • #3
      Thank you, Maxence, for the reply. The survey investigates intergenerational diversity in the workforce, so I am looking at how age affects people's answers to questions. An example question might be "Do you have opportunities to learn at work?" or "Do you think communication between younger and older workers is an issue?".

      I have started running logistic regressions with the binary variables for the answers to the above questions as the dependent variable. I have attached a photo of an issue I'm running into. All of the independent variables are binary, and I have deliberately omitted one of them to avoid multicollinearity. However, the final variable (representing the age group 65+) is still omitted because it "predicts failure perfectly".
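
      For context, a minimal sketch of the kind of model described (variable names are placeholders; age2-age5 stand for the binary age-group indicators, with the first group left out as the base category):
      Code:
      * binary outcome regressed on binary age-group dummies (placeholder names);
      * the first dummy is omitted as the base category
      logistic learn_at_work age2 age3 age4 age5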

      Is logistic regression an appropriate method here? If so, do you know a solution to this issue? I've seen some posts on here that address this issue, but I could not follow them, and the solution was not clearly posted.

      Many thanks



      • #4
        -logistic- estimates the coefficients of the model by maximum likelihood. When a perfect prediction situation exists, as here, the maximum likelihood estimate of the coefficient is going to be infinity (or negative infinity). Because attempting to estimate that would hang Stata in a loop, it checks for these situations in advance and removes the offending observations and variables. Moreover, one might argue that if a variable is known to perfectly predict the outcome, you don't need any model for it, so Stata is doing you a favor.

        On the other hand, I notice that the number of offending observations here is, in all instances, very small, at most 3, in a data set of more than 700. This suggests that these variables are also "almost constant," and that the perfect prediction may be the result of simple chance. For example, with only 3 observations with q5 != 0, the probability that all three of them would also have q21 = 1 by chance alone is appreciable. So one might still want to include this variable in a model. The trick is not to use maximum likelihood. The -firthlogit- command uses penalized maximum likelihood, which copes with this situation well. It was written by Joseph Coveney and is available from SSC.
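
        For example, a minimal sketch, taking q21 as the outcome and q5 as a predictor as above (any other predictors are omitted here):
        Code:
        * install Joseph Coveney's command from SSC (once)
        ssc install firthlogit
        * penalized maximum likelihood logistic regression
        firthlogit q21 q5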

        That said, with these variables being so nearly constant, even with -firthlogit-, your coefficient (odds ratio) estimates are going to be very imprecise, because there will be so little information in the data about what really goes on with q21 when q5 != 0. So if your research goals require tight estimates of the effects of variables like q5, you will need a much larger sample, one that includes large numbers of observations with q5 != 0, to achieve them.



        • #5
          Many thanks for your reply, Clyde; -firthlogit- worked a treat. One more thing: is it possible to use the penalised maximum likelihood model and get odds ratios instead of coefficients (as the -logistic- command outputs), as these are more useful for me?



          • #6
            You don't show what you typed (as the FAQ requests), but my guess is that when you estimated using -firthlogit- you did not include the "or" option; see
            Code:
            h firthlogit
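
            For example, a minimal sketch with the same variables as above:
            Code:
            * report odds ratios instead of coefficients
            firthlogit q21 q5, or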



            • #7
              My apologies for not including the code; your assumption was correct. The -or- option worked. Thank you very much, and have a good day.



              • #8
                Does the data need to be balanced for penalised maximum likelihood estimates to be unbiased and consistent?

