Hello, I am trying to analyze data from a survey of 77 patients with vitiligo.
I have compared baseline characteristics using Fisher's exact test and the Kruskal-Wallis test and found that dlqi is higher for patients who have not undergone depigmentation therapy (depigmented=0). I wanted to use OLS to see whether the relationship between "dlqi" (continuous) and "depigmented" (binary) persists when controlling for race and gender. Data are below:
I have run a few variations of this model using the "regress" command, and it seems like the best fitting model includes an interaction between "raceconsol" and "depigmented", which makes theoretical sense.
-regress dlqi depigmented##raceconsol gender-
Problem:
Regardless of the variables included, the residuals show a clear pattern, and regression diagnostics (estat hettest, linktest, etc.) tell me that I have problematic heteroskedasticity, misspecification, etc. All of the transformations of dlqi that I've tried (squaring, square root, log/ln) result in worse model fit. I've also tried including the categorical variables age and percent, but these have levels with very few observations, so I dropped them out of concern about overfitting given the small sample size; they don't improve model fit anyway. I've looked at the outlying observations, and none are obviously flawed, so I can't justify dropping them.
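For concreteness, the checks described here amount to something like the following sketch. It assumes the -dataex- example in this post has already been loaded into memory; the i. prefixes and the residual-versus-fitted plot are illustrative additions, not necessarily the exact commands that were run.

```stata
* Assumes the -dataex- example from this post is already in memory.
* Fit the interaction model; the i. prefixes make the categorical coding explicit.
regress dlqi i.depigmented##i.raceconsol i.gender

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity.
estat hettest

* Residual-versus-fitted plot, to inspect the residual pattern visually.
rvfplot, yline(0)

* Specification link test. Run it last, since it fits its own auxiliary
* regression of dlqi on the prediction (_hat) and its square (_hatsq).
linktest
```

A significant _hatsq term in the linktest output is what is usually read as a sign of misspecification.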
Should I just use robust standard errors and call it a day? Should I bother modeling this data at all? Would something like SEM (which I don't understand at all, to be honest) be more appropriate?
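To be concrete about the robust-standard-error option, it would look something like the sketch below, again assuming the -dataex- example in this post is in memory. The vce(hc3) variant is included only as an assumption: it is sometimes suggested for small samples and is not something tried in the analysis described above.

```stata
* Assumes the -dataex- example from this post is already in memory.
* Same model, with Huber-White (sandwich) standard errors.
regress dlqi i.depigmented##i.raceconsol i.gender, vce(robust)

* HC3 variant of the robust variance estimator (an assumption here,
* not part of the original analysis).
regress dlqi i.depigmented##i.raceconsol i.gender, vce(hc3)
```

Either option changes only the standard errors; the coefficient estimates, and therefore the residual pattern, stay the same.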
Thank you!
-Ashley
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float dlqi byte(depigmented raceconsol gender age percent)
 7 1 2 2 5 3
 5 0 2 2 5 3
 7 0 3 1 5 3
 0 0 1 2 4 3
 1 0 1 2 5 5
 1 0 1 2 6 2
 2 0 1 2 4 2
 0 0 1 1 5 5
 3 0 1 2 6 3
 3 0 1 2 4 4
 0 1 1 2 7 4
 0 1 1 2 6 5
 0 1 1 2 7 4
 0 1 1 2 4 4
 0 1 1 2 4 3
 0 1 1 2 4 4
 0 1 1 2 5 3
 0 1 1 2 7 5
 5 0 1 2 3 3
 5 0 1 2 7 4
 5 0 1 2 7 4
 5 0 1 2 3 1
 5 0 1 2 6 6
 0 1 3 2 6 3
 0 1 3 2 5 4
 0 1 3 2 2 4
 0 1 3 2 4 3
 3 0 1 1 3 1
 2 1 1 2 7 4
 2 1 1 2 4 4
 2 1 1 2 6 4
 0 1 1 1 4 5
 0 1 1 1 4 5
 0 1 1 1 6 4
 3 1 1 2 7 3
 3 1 1 2 4 4
17 0 3 2 7 5
 2 1 3 2 1 1
 8 0 1 2 6 4
 8 0 1 2 7 4
17 0 2 2 5 4
 3 1 3 2 1 1
 3 1 3 2 4 3
 5 1 1 2 6 5
18 0 2 2 5 4
 6 1 1 2 6 4
 3 1 1 1 5 3
 3 1 1 1 6 4
 3 1 1 1 6 3
20 0 3 2 6 4
 4 1 1 1 5 2
 8 1 1 2 5 4
 8 1 1 2 6 3
13 0 1 2 6 3
13 0 1 2 6 3
 9 1 1 2 5 4
26 1 2 2 6 4
 8 1 1 1 5 2
12 1 1 2 4 4
11 1 3 2 4 4
11 1 3 2 7 2
17 0 1 2 4 3
17 0 1 2 6 4
17 0 1 2 5 4
13 1 1 2 6 4
13 1 1 2 5 4
23 0 3 1 5 3
29 1 2 2 5 3
19 0 1 2 3 3
19 0 1 2 6 3
30 0 2 2 3 1
25 0 1 2 6 3
 0 1 1 . 6 4
 1 0 3 . 5 3
13 0 1 . 7 6
 1 0 1 . 7 4
 1 0 1 . 7 4
end
label values depigmented depigmented
label def depigmented 0 "No", modify
label def depigmented 1 "Yes", modify
label values gender gender_
label def gender_ 1 "Male", modify
label def gender_ 2 "Female", modify
label values age age_
label def age_ 1 "< 20", modify
label def age_ 2 "21-29", modify
label def age_ 3 "30-39", modify
label def age_ 4 "40-49", modify
label def age_ 5 "50-59", modify
label def age_ 6 "60-69", modify
label def age_ 7 "70+", modify
label values percent percent_
label def percent_ 1 "< 5%", modify
label def percent_ 2 "5-10%", modify
label def percent_ 3 "11-30%", modify
label def percent_ 4 "51-70%", modify
label def percent_ 5 "71-90%", modify
label def percent_ 6 "> 90%", modify