What regression model is suitable for a dataset with mixed types of variables?

John Draper

Join Date: Jan 2024

Posts: 4
#1


16 Jan 2024, 06:39

Hello Statalist!

As the title suggests, I'm unsure how to perform a regression analysis with different types of variables. My dataset contains answers from a self-constructed survey, in which I gathered information on the following independent variables:

gender (binary)
age (continuous)
occupational status (binary: 'student' or 'non-student')
educational level (ordinal, 1 to 3)
adults in household (ordinal, 1 to 4)
household income (categorical, 1 to 7)
financial literacy score (continuous, 0 to 1)

Apart from these, I also have a measure of risk aversion for each individual, which is the dependent variable. This value (within the range 1 to 5) is the average of each individual's answers to three separate questions; the sample mean is around 3.

Put simply, I want to see how the independent variables might affect the risk aversion measure, but I am not sure whether a regular OLS regression is appropriate given these variables.

Do you have any advice? I apologize if the answer should be obvious. Have a good day.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#2

16 Jan 2024, 07:44

John:
welcome to this forum.
Is your dependent variable ordered (something like: 1=worst; 2=reasonable.....; 5=best)?

Kind regards,
Carlo
(Stata 19.0)
John Draper

Join Date: Jan 2024

Posts: 4
#3

16 Jan 2024, 09:13

Originally posted by Carlo Lazzaro

John:
welcome to this forum.
Is your dependent variable ordered (something like: 1=worst; 2=reasonable.....; 5=best)?

Thank you Carlo.

In short, yes. To provide some context, the values 1 to 5 represent different levels of constant relative risk aversion (CRRA) utility. A value of 1 indicates relatively low risk aversion, suggesting that an individual with such a value is inclined towards taking more risks. In contrast, a value of 5 indicates relatively high risk aversion.

In the survey, I asked participants three questions that let me estimate their level of risk aversion. Each question has five alternatives that correspond to levels of risk aversion ranging from 1 to 5. From these answers I have calculated an average level of risk aversion for each individual. While it is possible to conduct a regression analysis for each of the three questions separately, I'm uncertain about the most appropriate approach. I would be grateful for any guidance on this.
Nick Cox

Join Date: Mar 2014

Posts: 35433
#4

16 Jan 2024, 09:39

It's the flavour of the outcome (dependent variable) that is crucial here. I guess most researchers looking at such data would start with ologit or oprobit. The use of plain or vanilla regression is hard to defend, as it would take those grades as equally spaced.
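A minimal sketch of that starting point follows. All variable names are hypothetical (the thread does not give the real ones): risk_avg is the averaged 1 to 5 outcome, and the predictors follow the list in #1. Because an average of three items can take non-integer values, the sketch first recodes it to consecutive integer categories; that step is an assumption about the data, not something stated in the thread.

```stata
* Hypothetical variable names throughout; substitute your own.
* i. marks a categorical predictor, c. a continuous one.

* Recode the averaged outcome to consecutive integers, preserving order
egen risk_cat = group(risk_avg)

* Ordered logit and ordered probit with the same predictors
ologit  risk_cat i.female c.age i.student i.educ i.adults i.income c.finlit
oprobit risk_cat i.female c.age i.student i.educ i.adults i.income c.finlit
```

Entering educ, adults and income with i. treats each level as its own category, which avoids assuming equal spacing on the predictor side as well.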

(This looks a bit like an assignment. Please note our comments in the FAQ Advice on such matters. Also, this was cross-posted at https://www.reddit.com/r/stata/comme...for_a_dataset/ It's a rule there, and a request here, that you tell people about cross-posting.)
John Draper

Join Date: Jan 2024

Posts: 4
#5

16 Jan 2024, 09:50

Originally posted by Nick Cox

It's the flavour of the outcome (dependent variable) that is crucial here. I guess most researchers looking at such data would start with ologit or oprobit. The use of plain or vanilla regression is hard to defend, as it would take those grades as equally spaced.

(This looks a bit like an assignment. Please note our comments in the FAQ Advice on such matters. Also, this was cross-posted at https://www.reddit.com/r/stata/comme...for_a_dataset/ It's a rule there, and a request here, that you tell people about cross-posting.)

Hey Nick. Thank you for your input.

This is not a regular assignment (homework) question. A colleague and I are writing a master's thesis on this subject and are unsure about the correct regression model. As for the cross-posting, I was unaware of these rules and I do apologize.
Nick Cox

Join Date: Mar 2014

Posts: 35433
#6

16 Jan 2024, 09:55

Telling us about cross-posting is, as said, requested here. The only rules are unwritten rules.

https://www.statalist.org/forums/help is where to start, as every prompt advises.
daniel klein

Join Date: Mar 2014

Posts: 3824
#7

16 Jan 2024, 12:31

The result of averaging over 3 Likert-type items is usually considered a quasi-interval scale, at least in the social sciences. Thus, a linear model might well be a reasonable starting point. You might want to run an ordered model as a kind of robustness check. This is what you would often do when writing a paper; not sure about a master's thesis.
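As a rough sketch of that workflow, assuming hypothetical variable names (risk_avg for the averaged outcome, plus the predictors listed in #1), the linear model and the ordered check could be run side by side. The robust standard errors are an added assumption, not something suggested in the thread.

```stata
* Hypothetical variable names; substitute your own.

* Linear model on the averaged (quasi-interval) outcome
regress risk_avg i.female c.age i.student i.educ i.adults i.income c.finlit, vce(robust)
estimates store linear

* Robustness check: ordered logit on the same outcome,
* recoded to consecutive integer categories
egen risk_cat = group(risk_avg)
ologit risk_cat i.female c.age i.student i.educ i.adults i.income c.finlit, vce(robust)
estimates store ordered
```

The coefficients are on different scales, so the comparison of interest is whether signs and significance patterns agree between the two models.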
George Ford

Join Date: Aug 2014

Posts: 3120
#8

16 Jan 2024, 13:51

I did some research on Likert data in the past and, as Dan said, the typical approach is to treat the combination of multiple Likert questions as continuous. Otherwise, you've got 3 models with an ordered DV, which might be tricky to interpret (I'd try it since it's your thesis, just to learn something and maybe offer something interesting to the literature).

While I have no support for it, you might try summing the responses to the three questions if they all aim at measuring the same sort of thing.
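A small sketch of that idea, assuming the three item responses are stored in hypothetical variables q1, q2 and q3. The Cronbach's alpha step is an addition here, offered as one common way to check whether the items plausibly measure the same thing.

```stata
* Hypothetical variable names; substitute your own.

* Internal consistency of the three items (Cronbach's alpha)
alpha q1 q2 q3, item

* Summed score, ranging from 3 to 15;
* missing whenever any single item is missing
generate risk_sum = q1 + q2 + q3
```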

Anything you do will be subject to criticism as ordered responses are not continuous. Look to the literature for guidance.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#9

17 Jan 2024, 01:06

John:
a bit off topic here, but if you go with -regress- I would consider both the linear and the squared terms for -age- and search for a possible turning point.
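A sketch of how that could look, with hypothetical variable names. With a quadratic in age, the fitted outcome reaches its turning point where the derivative with respect to age is zero, that is, at age = -b1/(2*b2), where b1 and b2 are the coefficients on age and age squared.

```stata
* Hypothetical variable names; substitute your own.

* c.age##c.age enters both age and age squared
regress risk_avg i.female c.age##c.age i.student i.educ i.adults i.income c.finlit

* Turning point: age at which the marginal effect of age is zero
nlcom -_b[age] / (2 * _b[c.age#c.age])
```

It is worth checking whether the estimated turning point falls inside the observed age range before reading much into it.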

Kind regards,
Carlo
(Stata 19.0)
John Draper

Join Date: Jan 2024

Posts: 4
#10

17 Jan 2024, 05:44

Thank you for your insights, Carlo Lazzaro, Nick Cox, daniel klein and George Ford. My colleague and I have tried several methods, and it seems that an -oprobit- regression with -mfx, predict(outcome(n))- gives us a good foundation for our analysis.
daniel klein

Join Date: Mar 2014

Posts: 3824
#11

17 Jan 2024, 06:01

mfx has not been a part of official Stata for over a decade now. Use margins instead.
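A minimal sketch of the margins equivalent, with hypothetical variable names. One difference to keep in mind: mfx evaluated marginal effects at the means of the covariates by default, whereas margins reports average marginal effects unless atmeans is specified.

```stata
* Hypothetical variable names; substitute your own.
egen risk_cat = group(risk_avg)
oprobit risk_cat i.female c.age i.student i.educ i.adults i.income c.finlit

* Average marginal effects on the probability of the first (lowest) category
margins, dydx(*) predict(outcome(#1))

* Closer to the old mfx default: effects evaluated at covariate means
margins, dydx(*) predict(outcome(#1)) atmeans
```

Here outcome(#1) refers to the first category in sorted order; replace #1 with #2, #3 and so on for the other categories.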
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#12

17 Jan 2024, 06:51

Originally posted by John Draper

To provide some context, the values 1 to 5 represent different levels of constant relative risk aversion (CRRA) utility.

Given the initialism, I guess that there is a body of literature involving this type of outcome in your field of study. You might want to start there for guidance about how to analyze this kind of outcome if a consideration is to gain acceptance of your approach among your peers.

Originally posted by John Draper

I . . . have a measure of risk aversion for each individual, this is the dependent variable. This value (within the range 1 to 5) is based on calculations from three separate questions where I have calculated an average . . .

On the other hand, if the convention in your field of study doesn't foreclose the possibility, then your colleague and you might want to consider fitting a MIMIC model using gsem with the responses to each of the three individual questions as indicator variables.

Although it does require an adequate sample size, fitting such a MIMIC model is not that difficult mechanically: see Example 36g in the user's manual (Stata Structural Equation Modeling Reference Manual) for details about how to go about it.

The advantage of this approach, as opposed to averaging or summing the three questions' responses, is that you don't need to assume that each question's response weighs equally in determining the risk-aversion score.
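A rough sketch of such a model, not a definitive specification. It assumes the three item responses are stored in hypothetical variables q1, q2 and q3, names the latent variable RiskAv (gsem treats capitalized names as latent by default), and reuses the hypothetical predictor names from earlier in the thread.

```stata
* Hypothetical variable names; substitute your own.
* Measurement part: each item is an ordinal indicator of latent risk aversion
* Structural part: the latent variable is regressed on the covariates
gsem (q1 q2 q3 <- RiskAv, ologit) ///
     (RiskAv <- i.female c.age i.student i.educ i.adults i.income c.finlit)
```

By default gsem fixes the loading of the first indicator at 1 to set the scale of the latent variable; the loadings on q2 and q3 are then estimated freely, which is what relaxes the equal-weighting assumption.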