
  • Decision Tree in Stata (CHAID command)

    Dear Statalist users,

    I have a sample of 620 observations consisting of a categorical classification variable and several potential predictor variables (all continuous).
    The categorical classification variable GrowthNoGrowthMix has three categories of bacterial growth (0: no growth, 1: normal growth, 2: mixed growth).

    From a clinical point of view and from logistic regression, the potential predictor variables UF_bact and UF_lc predict growth vs. no growth (AUC 0.91), while UF_cylinder and UF_epithel predict mixed growth vs. normal growth/no growth (AUC 0.75).

    I would like to create and validate a decision tree for use in clinical practice to predict growth (and avoid ordering a culture).

    A straightforward approach for me would have been:
    - Split the data randomly into two datasets
    - Use one dataset to determine (arbitrary) cut-offs for the predictor variables with good sensitivity/specificity for predicting mixed growth and normal growth, respectively
    - Use the other dataset to validate the number of misclassifications, Gini index, entropy, etc. in the different leaves
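
    The random split itself can be sketched in Stata; this assumes Stata 16+ for splitsample, and the 2:1 proportion is purely illustrative:

    Code:
    * split the sample into training (sample==1) and validation (sample==2)
    set seed 1234567
    splitsample, generate(sample) split(2 1)
    * determine cut-offs using the training half only, e.g.:
    * summarize UF_bact if sample == 1, detail
    * then tabulate predicted vs. observed classes in sample == 2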

    Now, I have read about the CART, CHAID, and C4.5 algorithms, and I think one of these would be a better choice.
    In particular, the CHAID algorithm seems suitable, since chaid accommodates categorical data without a survival/time-to-event structure, in contrast to cart in Stata.

    Unfortunately, I am not confident that the command I used,
    Code:
    set seed 1234567
    chaid GrowthNoGrowthMix, minnode(20) minsplit(50) xtile(UF_bact UF_cylinder UF_epithel UF_lc, n(2))
    is correct.

    The output of it was:
    Graph.png

    Using the tab command to get a sense of the misclassifications, I got:

    Output.png

    My questions are:
    1. Is this a reasonable approach, or are there better approaches?
    2. In logistic regression, mixed growth was "best" predicted by UF_cylinder (smallest p-value), but it is not taken into account in the decision tree. Is that not a contradiction?
    3. Is there a way to optimise minnode() and minsplit()?
    4. How would you suggest measuring "goodness of fit" or validating this tree? Would you recommend splitting the dataset in two to validate the algorithm? I could not find a recommended proportion, e.g. 1/3 for validation and 2/3 for developing the tree.
    Maybe you can help clarify the use of the chaid command in this example.

    Thank you very much in advance for your time and energy.
    Martin

  • #2
    Hi Martin,

    It is hard to determine whether CHAID is a reasonable approach; that depends on what you want from the analysis.

    If you are trying to generate a simple predictive model, mlogit would probably offer the best (or at least good) value. If you believe there are a good number of interactions in the data (i.e., the model would follow a tree-like format), then CHAID would be a good choice.
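
    For instance, a minimal mlogit sketch using the variable names from your post (baseoutcome(0) treats "no growth" as the base category):

    Code:
    mlogit GrowthNoGrowthMix UF_bact UF_lc UF_cylinder UF_epithel, baseoutcome(0)
    * predicted probability of each outcome
    predict p0 p1 p2, pr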

    Because CHAID looks for splits and, consequently, interactions in the data, the fact that the best linear predictor is not used in the CHAID tree is not a contradiction, merely a disagreement: the models are different. It appears that the CHAID tree cannot find a variable that splits out values of 2 in the response variable well, as none of the clusters is characterized by a majority of "2"s - at least not with the values of minsplit() and minnode() currently set.

    As for the third question, that depends on what you are trying to optimize - model fit, or the size of clusters? Model fit always improves with smaller values in these options, but that tends to come at the cost of overfitting, and smaller splits would usually still be prevented by the spltalpha() and mergalpha() options (which determine the Type I error rate allowed for splitting in the chi-squared tests).

    The chaid command has a built-in Cramer's V goodness of fit metric which is reported in e(fit) as discussed in the helpfile.
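
    One way to choose among candidate values is to rerun chaid over a grid and compare e(fit) across settings, keeping the overfitting caveat in mind. The grid below is arbitrary, and the xtile() specification is taken from your post:

    Code:
    foreach mn in 10 20 30 {
        foreach ms in 40 50 60 {
            quietly chaid GrowthNoGrowthMix, minnode(`mn') minsplit(`ms') ///
                xtile(UF_bact UF_cylinder UF_epithel UF_lc, n(2))
            display "minnode=`mn' minsplit=`ms' fit=" e(fit)
        }
    }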

    For what it's worth, decision trees are quite unstable and tend not to validate well. This is what spurred the development of ensembles of trees such as random forests (or, in this case, chaidforest). Ensembles are less interpretable but tend to predict better out of sample.
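
    A minimal chaidforest call might look like the following; I am assuming here that it accepts the same xtile() syntax as chaid, so check its helpfile for the actual options:

    Code:
    set seed 1234567
    chaidforest GrowthNoGrowthMix, xtile(UF_bact UF_cylinder UF_epithel UF_lc, n(2))
    * predictions from the forest can then be compared against the observed classes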

    - joe
    Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
    ----
    Research Fellow
    Fors Marsh

    ----
    Version 18.0 MP
