Hello everyone,
I hope this question fits here. I've used this forum a lot, but my current problem finally made me create an account.
I want to run a logistic regression to predict "fail" (company bankruptcy). One of my independent variables, "vvlt_twelve", has a U-shaped distribution.
Vvlt_twelve is a factor variable with 12 levels that represent a solvency score for a company (1 being the worst score and 12 the best possible score). It is derived from a continuous variable.
As you can see in the output below, 76% of all observations fall into the two extreme levels of "vvlt_twelve" (1 and 12), the worst and best score respectively.
35% (5,426 observations) of all bankruptcies have the highest/best possible score for this variable.
As I believe this distorts my results, is there a way in which I can account for this kind of distribution?

I think some people might want to know why the variable (vvlt_twelve) was transformed into a factor variable. This was done because the continuous variable is a ratio for which higher values represent lower solvency. However, the continuous variable can also take negative values, which is even worse for solvency than a high positive value. As such, the lowest level of the factor variable (1) represents the negative values of the continuous variable. The remaining 11 levels (2 - 12) represent the positive values of the continuous variable in decreasing order, so the highest positive ratios get a 2 and the lowest get a 12.
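To make the coding concrete, here is a minimal Python sketch of how such a score could be built. The cut points, the function name `solvency_score`, and the treatment of a ratio of exactly zero (counted as the best level) are all hypothetical and only serve as illustration; the actual boundaries are not given here.

```python
from bisect import bisect_right

# Hypothetical cut points splitting the non-negative range into 11 bins.
# The real boundaries used for vvlt_twelve are not stated in this post.
CUTS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def solvency_score(ratio, cuts=CUTS):
    """Map the continuous ratio to a score from 1 (worst) to 12 (best)."""
    if ratio < 0:
        # Negative ratios are the worst case and get the lowest level.
        return 1
    # k counts how many cut points are <= ratio, so k runs from 0 to 10.
    k = bisect_right(cuts, ratio)
    # Higher ratios mean lower solvency, so the order is reversed:
    # the lowest bin maps to 12, the highest bin maps to 2.
    return 12 - k


scores = [solvency_score(r) for r in [-0.5, 0.05, 0.55, 2.0]]
print(scores)  # [1, 12, 7, 2]
```

Note that levels 1 and 2 end up adjacent in the score even though they come from opposite ends of the continuous ratio (negative values versus the largest positive values).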
Thank you in advance for your feedback.