  • Median split

    Hi everyone,

    my supervisor insists that I categorise a continuous variable using the median split. I have a sample of 234 observations and when performing the median split I have 10 observations that are equal to the median.

    My first question is: do I disregard the observations equal to the median, or allocate them to the "low" or "high" category?

    Secondly, when performing a median split, do I have to have exactly 50% of the observations in each of the 2 categories?

    I know this is a rudimentary question so my apologies for that. Any help would be highly appreciated!

    Best regards,
    Eliss Millen

  • #2
    Eliss:
    setting aside the wonderful https://pubmed.ncbi.nlm.nih.gov/16217841/ paper, which goes clearly against what your supervisor asks you to do, you can create a two-level categorical variable (level 0: <= the median value; level 1: > the median value).
    As far as your second question is concerned, an exact 50/50 split is what you get when no values equal the median, as in the following toy example:
    Code:
    . use "C:\Program Files\Stata16\ado\base\a\auto.dta"
    (1978 Automobile Data)
    
    . egen median_price=median(price)
    
    . count if price==median_price
      0
    
    . count if price<=median_price
      37
    
    . count if price>median_price
      37
    
    .
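A minimal sketch of the two-level variable described above, on the same auto data. The variable name high_price is illustrative, and the `if !missing(price)` guard is an addition: Stata treats missing as larger than any number, so without it missing values would be coded as "high".

```stata
sysuse auto, clear
egen median_price = median(price)

* 1 if strictly above the median, 0 if at or below it; missing stays missing
generate byte high_price = price > median_price if !missing(price)

* the two levels should match the counts from the count commands above
tabulate high_price, missing
```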
    Last edited by Carlo Lazzaro; 22 Aug 2020, 06:59.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Dear Carlo,

      thank you for your response; unfortunately I do not have the freedom to go against what she says despite it being clear that the median split is not optimal. I have used the same code, namely:

      egen median=median(MS)
      count if MS==median
      10
      count if MS <= median
      118
      count if MS > median
      116

      I don't quite understand whether it's right to have different numbers of observations in the two groups. Do you think this is alright?

      thank you!!

      Best,
      Eliss



      • #4
        I am with Carlo Lazzaro and many, many authors in finding this practice statistically unsatisfactory, not to say obnoxious. But if several values tie with the median, it would seem even more perverse to omit them just because they are awkward. And it's yet another objection to splitting on the median if it doesn't do what may be naively expected, namely guarantee even splits with equal frequencies.

        Further possibilities open up:

        1. If several values tie for median, then you're in effect obliged to check whether it makes a difference which way you split. In the auto data, 3 is the median for rep78, so there are two ways to split: (1, 2) and (3, 4, 5), compared with (1, 2, 3) and (4, 5). Results are likely to differ.

        2. Yet another way to do it is to split on below, equal to, or above the median.
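The two possibilities above can be sketched on rep78 in the auto data. This is an illustration under assumptions, not a recommended analysis: the variable names (med_rep, high_a, high_b, three) are hypothetical, and missing values of rep78 are deliberately kept missing.

```stata
sysuse auto, clear
egen med_rep = median(rep78)

* Ties allocated to the low group: (1, 2, 3) versus (4, 5)
generate byte high_a = rep78 > med_rep if !missing(rep78)

* Ties allocated to the high group: (1, 2) versus (3, 4, 5)
generate byte high_b = rep78 >= med_rep if !missing(rep78)

* Three groups: 1 = below, 2 = equal to, 3 = above the median
generate byte three = 1 + (rep78 >= med_rep) + (rep78 > med_rep) if !missing(rep78)

* Observations off the diagonal are those that change group with the tie rule
tabulate high_a high_b, missing
tabulate three, missing
```

Running the same model once with high_a and once with high_b is a direct way to see whether the allocation of ties changes the results.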

        I don't think either of these can improve on respecting the data and not binning arbitrarily, although what would be better for your analysis can't be worked out on this information.

        I guess the wording "my supervisor insists" says as much as you are going to say, but a good supervisor should be open to arguments based on results and principles.



        • #5
          I share every one of Nick Cox's words.

          I would only add that, if Eliss (and her supervisor) intend to submit a paper to a technical journal in their research field, categorizing a continuous variable will probably be considered a relevant weakness of the original statistical plan; on top of that, this criticism will be hard to rebut.
          Hence, I would discuss with Eliss' supervisor whether the suggested median approach is actually the way to go (and Eliss would have the chance to score at least one quant point more than her supervisor just by bringing a copy of the wonderful paper mentioned in #2).
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            I didn't see #3 when writing #4 but I think my comments are unchanged.

            However, once you have a binary predictor, its values being unequally frequent is not in itself fatal. Most use of indicators or dummy variables would be out of order if that were true. What can be troublesome is a very small proportion of one value and a very large proportion of the other -- which might even arise on a median split.
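As a small illustration of that point, the auto data already contain a 0/1 indicator, foreign, whose two values are not equally frequent. The choice of price as the outcome here is only for the sake of the example.

```stata
sysuse auto, clear

* frequencies of the two values of the indicator
tabulate foreign

* the mean of a 0/1 variable is the proportion of ones
summarize foreign

* unequal group sizes do not prevent its use as a predictor
regress price i.foreign
```

The same tabulate check on any median-split variable shows at a glance whether one group has become very small.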

            I wouldn't accept this practice from a graduate student or in a paper I was reviewing. That's not meant to seem dogmatic or authoritarian; the point is that there are several compelling arguments against it, much repeated in the literature. See also Frank Harrell's monograph, Regression Modeling Strategies (Springer, 2nd edition 2015).



            • #7
              Dear Carlo Lazzaro and Nick Cox,

              thank you very much for the help and for elaborating on the matter. As the analysis is part of my Master's thesis, I will discuss it again with my supervisor and insist on the initial categorisation, which was done in 3 groups.

              Wish you a great day!

              Best,
              Eliss



              • #8
                In case it helps, one more citation recommending against median split, albeit with a different proposed solution in case your advisor insists on some kind of split:

                Gelman, A., & Park, D. K. (2009). Splitting a predictor at the upper quarter or third and the lower quarter or third. The American Statistician, 63(1), 1-8. http://www.stat.columbia.edu/~gelman...ed/thirds5.pdf
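A rough sketch of a thirds-based split in the spirit of that title, on the auto data. It assumes the comparison is upper third versus lower third with the middle third set aside; the variable names price3 and upper_vs_lower are hypothetical, and the paper itself should be consulted for the recommended coding.

```stata
sysuse auto, clear

* tercile groups of price: 1 = lowest third, 3 = highest third
xtile price3 = price, nquantiles(3)

* 1 = upper third, 0 = lower third, missing = middle third (set aside)
generate byte upper_vs_lower = (price3 == 3) if inlist(price3, 1, 3)

tabulate price3 upper_vs_lower, missing
```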
                David Radwin
                Senior Researcher, California Competes
                californiacompetes.org
                Pronouns: He/Him

