Dear Statalist users,
I have a sample of 620 observations consisting of a categorical classification variable and several potential predictor variables (all continuous).
The categorical classification variable GrowthNoGrowthMix has three categories of bacterial growth (0: no growth, 1: normal growth, 2: mixed growth).
From a clinical point of view and from logistic regression, the potential predictor variables UF_bact and UF_lc predict growth vs. no growth (AUC 0.91), while UF_cylinder and UF_epithel predict mixed growth vs. normal growth/no growth (AUC 0.75).
I would like to create and validate a decision tree for use in clinical practice to predict growth (and so avoid ordering a culture).
A straightforward approach for me would have been:
- Split the data randomly into two datasets
- Use one dataset to determine (arbitrary) cut-offs for the predictor variables that have good sensitivity/specificity for predicting mixed growth and normal growth, respectively
- Use the other dataset to validate the number of misclassifications, Gini index, entropy, etc. in the different leaves
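The split-sample idea in the list above could be sketched in Stata roughly as follows. This is only a sketch under assumptions: the 2/3 vs. 1/3 proportion is arbitrary, the helper variables u, dev and mixed are made up for illustration, and roctab is used here simply to list sensitivity and specificity at every candidate cut-off in the development part.

```stata
* Sketch only: u, dev and mixed are hypothetical helper variables
set seed 1234567
generate double u = runiform()

* 1 = development part, 0 = validation part (assumed 2/3 : 1/3 split)
generate byte dev = (u < 2/3)

* Binary reference variable: mixed growth vs. normal growth/no growth
generate byte mixed = (GrowthNoGrowthMix == 2) if !missing(GrowthNoGrowthMix)

* Sensitivity/specificity at each cut-off of UF_cylinder, development part only
roctab mixed UF_cylinder if dev == 1, detail

* Check that the outcome categories are similarly distributed in both parts
tabulate GrowthNoGrowthMix dev, column
```

The same roctab call could be repeated for the other predictors and for a growth vs. no growth reference variable; the cut-offs chosen there would then be applied unchanged to the observations with dev == 0.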
Now, I have read about the CART, CHAID, and C4.5 algorithms and I think that these would be a better choice.
In particular, the CHAID algorithm seems suitable, as chaid accommodates categorical data without a survival/time-to-event structure, in contrast to CART in Stata.
Unfortunately, I am not confident that the command I used is correct:
Code:
set seed 1234567
chaid GrowthNoGrowthMix, minnode(20) minsplit(50) xtile(UF_bact UF_cylinder UF_epithel UF_lc, n(2))
The output was:
Using the tab command to get a feel for the misclassifications, I got:
My questions are:
- Is this a reasonable and good approach, or are there better approaches?
- In logistic regression, mixed growth was "best" predicted by UF_cylinder (smallest p-value), but it is not used in the decision tree. Is that not a contradiction?
- Is there a way to optimise "minnode" and "minsplit"?
- How would you suggest measuring "goodness of fit" or validating this tree? Would you recommend splitting the dataset in two to validate the algorithm? I could not find a recommended proportion, e.g. 1/3 for validation and 2/3 for developing the tree.
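To make the last question concrete, here is a rough sketch of how leaf-level misclassification, Gini index and entropy might be computed in the validation part. It rests on assumptions: leaf is a hypothetical variable holding each observation's terminal node, dev is a hypothetical indicator (1 = development, 0 = validation), and the predicted class of a leaf is taken to be its most frequent class in the development part.

```stata
* Sketch only: leaf and dev are hypothetical variables (see text above)

* Predicted class of each leaf = modal observed class in the development part
bysort leaf: egen byte pred = mode(cond(dev == 1, GrowthNoGrowthMix, .)), minmode

* Misclassification rate in the validation part
generate byte wrong = (pred != GrowthNoGrowthMix) if dev == 0 & !missing(pred, GrowthNoGrowthMix)
summarize wrong

* Gini index and entropy (base 2) per leaf in the validation part
preserve
keep if dev == 0
contract leaf GrowthNoGrowthMix, freq(n)
bysort leaf: egen N = total(n)
generate double p = n / N
generate double sum_p2 = p^2
generate double entropy = -p * ln(p) / ln(2)
collapse (sum) sum_p2 entropy (first) N, by(leaf)
generate double gini = 1 - sum_p2
list leaf N gini entropy, noobs
restore
```

The mean of wrong reported by summarize is the proportion misclassified among validation observations; a Gini index or entropy of 0 would indicate a leaf containing a single growth category.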
Thank you very much in advance for your time and energy.
Martin