Median and sample size

Helen Chang

Join Date: Apr 2018

Posts: 104
#1


01 Oct 2019, 15:29

Hi,

I use the following code to create the median of a variable, and aim to create two sub-samples based on the median (above median and below median). However, the sample sizes of the regressions on the two sub-samples are very different. Shouldn't they be similar?

egen MedianSize=median(wLogAsset)

Last edited by Helen Chang; 01 Oct 2019, 15:33.
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

01 Oct 2019, 15:50

Without seeing an example of your data it is hard to be certain. But questions like this come up frequently here, and the answer is usually this:

If there are many values of wLogAsset that are all equal to the median value, bear in mind that they are all going to go into the same subset: it would make no sense to put some of them in the above median and others in the below median subsets. As a result, the subset that gets all of them will be appreciably larger than the subset that gets none of them.
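A minimal sketch of this effect, using made-up toy data (the ten values below are hypothetical and not from the original poster's dataset). Four observations are tied at the median, so they all fall on the same side of the split:

```stata
* Hypothetical toy data: 10 observations, 4 of them tied at the median (5)
clear
input double wLogAsset
1
2
3
5
5
5
5
7
8
9
end

egen MedianSize = median(wLogAsset)

* Strictly above the median = 1; at or below the median = 0
generate byte above = wLogAsset > MedianSize if !missing(wLogAsset)
tabulate above

* Count how many observations sit below, exactly at, and above the median
count if wLogAsset <  MedianSize
count if wLogAsset == MedianSize
count if wLogAsset >  MedianSize
```

With these values, 3 observations are below the median, 4 are equal to it and 3 are above it, so the split is 7 versus 3 rather than 5 versus 5.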
Helen Chang

Join Date: Apr 2018

Posts: 104
#3

01 Oct 2019, 16:08

Thanks for the explanation! Are there better ways to divide the sample into two subsamples of equal size based on a certain variable (e.g. size)?
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#4

01 Oct 2019, 17:58

No. A median split is the closest you can come to two equal subsets where the members of one are all larger than the members of the other. Any other split will be even more unbalanced.
Nick Cox

Join Date: Mar 2014

Posts: 35444
#5

02 Oct 2019, 01:17

Why bin at all?
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#6

02 Oct 2019, 07:00

If you have other variables in your regression, it could also be that the missing values on the other variables result in different usable sample sizes.
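One way to check for this, sketched with Stata's shipped auto dataset standing in for the real data (the variable names heavy, med_weight and nmiss are illustrative assumptions, not from the original post). It compares the group sizes before and after dropping observations with a missing outcome or regressor:

```stata
sysuse auto, clear

* Median split on weight (stand-in for wLogAsset)
egen med_weight = median(weight)
generate byte heavy = weight > med_weight if !missing(weight)

* Group sizes before any regression
tabulate heavy

* Group sizes among observations with no missing outcome or regressor
egen byte nmiss = rowmiss(price mpg rep78)
tabulate heavy if nmiss == 0

* e(N) reports the number of observations each regression actually used
regress price mpg rep78 if heavy == 0
display "N used, at or below median: " e(N)
regress price mpg rep78 if heavy == 1
display "N used, above median: " e(N)
```

If the two tabulations differ, missing values in the other variables are contributing to the imbalance, independently of any ties at the median.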

What Nick is hinting at is that if, for example, you want separate parameters for the two groups, you can just create a dummy variable and then use factor variable notation in the regression to estimate separate values for each group.
e.g.:
reg y median x i.median#c.x

Assuming x is continuous. If x is an indicator variable, then
reg y median x i.median#i.x
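A runnable sketch of the same idea on Stata's shipped auto dataset, assuming a continuous x (here mpg) and a hypothetical dummy named heavy built from a median split on weight. The ## operator includes both main effects and the interaction, so it fits the same model as the first command above:

```stata
sysuse auto, clear

* Dummy for the median split (illustrative name)
egen med_weight = median(weight)
generate byte heavy = weight > med_weight if !missing(weight)

* One model with a separate intercept and slope for each group
regress price i.heavy##c.mpg

* Slope of mpg within each group
margins heavy, dydx(mpg)

* Test whether the slope differs between the two groups
test 1.heavy#c.mpg
```

The group-specific slopes from this single model match those from running the regression separately on each subsample, and the interaction term gives a direct test of whether they differ.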
Nick Cox

Join Date: Mar 2014

Posts: 35444
#7

02 Oct 2019, 07:13

Not really, although it is no fault of Phil Bromiley that my intent was unclear. What I mean is very simple: Why are you doing this? If wLogAsset is a useful predictor, why bin it? (I am a bit surprised that a variable such as log assets (?) shows many ties at all.)
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#8

02 Oct 2019, 11:19

(I am a bit surprised that a variable such as log assets (?) shows many ties at all.)

As am I. That's why in #2 I prefaced my remark with a caution that I had not seen the data and that my answer was simply the commonest cause of this situation. But I do think it would be good if the data were shown, along with the code for creating the bins, so we could know that this is really what we're dealing with.

I also endorse Nick's concern that taking a useful predictor and binning it usually just degrades the usefulness of the predictor. I didn't bring it up because this appears to be a finance project, and I've seen this practice come up so often here in finance analyses that I've kind of given up railing against it. Some disciplines have standard practices that, although counter-productive or even misleading, are just too entrenched to dislodge with reason. Selecting portfolios based on quantiles of some variable and tracking their performance over time in finance seems to be one of those.

(I'm not picking on finance here. I could cite similar examples from other disciplines. Even in my own, epidemiology, it is distressingly common to take continuous predictors of health outcomes and then trash them into quartiles or quintiles, and, even worse, then just contrast the outcomes in the top and bottom bin! Ugh!!)
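For readers who want to see what binning costs in their own data, here is a minimal sketch of the comparison, not taken from the original posts. It uses Stata's shipped auto dataset, and the names heavy, med_weight, r2_continuous and r2_binned are illustrative assumptions:

```stata
sysuse auto, clear

* Median split on the predictor
egen med_weight = median(weight)
generate byte heavy = weight > med_weight if !missing(weight)

* Fit with the predictor kept continuous
regress price weight
scalar r2_continuous = e(r2)

* Fit with the predictor reduced to above/below median
regress price i.heavy
scalar r2_binned = e(r2)

display "R-squared, continuous weight: " %5.3f r2_continuous
display "R-squared, median split:      " %5.3f r2_binned
```

Comparing the two R-squared values gives a rough sense of how much of the predictor's information survives the split. It is only one summary, and the size of the loss will depend on the data.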
Nick Cox

Join Date: Mar 2014

Posts: 35444
#9

02 Oct 2019, 11:31

#8 Clyde Schechter and I rarely disagree. That's possibly because we have never met.

Frank Harrell often posts about this in biostatistics and indeed his book Regression Modeling Strategies (Springer, Cham, 2015) has a terse but nevertheless incisive discussion.

There's one more plausible rationale for binning, as I understand it. Clinicians often want rules for decisions whenever possible, on whether to recommend some regimen, or to prescribe a drug, let alone to carry out some more invasive or dangerous procedure. The trouble is whether there are plausible grounds for treating BMIs or systolic blood pressure or whatever it is a smidgen above or below some threshold as qualitatively different. Still, a decision has to be made: leave as is or intervene.

As some clinicians lurk here, I will say no more before I get out of my depth.

I've done my level best to undermine quantile binning gently, or at least to underline its limitations:

https://www.stata-journal.com/articl...article=pr0054

https://journals.sagepub.com/doi/pdf...867X1801800311
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#10

02 Oct 2019, 11:38

#9 Nick Cox - the CONSORT guidelines for reporting the results of randomized clinical trials recommend that binning be done after modeling if needed for decision-making purposes, and provide a citation to an article that does that; just google "consort guidelines" to find these (and similar guidelines for other types of studies).
Nick Cox

Join Date: Mar 2014

Posts: 35444
#11

02 Oct 2019, 11:43

Rich Goldstein That's good to know, I guess. But even if something depends smoothly on one predictor, it would be hard to know where to bin, and no easier with a more complicated function of several predictors.

Small world: it's public that Rich was a reviewer for the first edition of Frank's book.