splitting a dataset based on the median of a variable

Serena Menny

Join Date: Nov 2022

Posts: 60
#1

21 Jun 2024, 09:30

Hello everyone,

I want to see how variable2 divides according to the median of another variable, variable1.

First, to find the median of variable1, I use this code:

egen median = median(variable1)

When I use count if variable1 <= median, I find 1,114 observations.

When I use count if variable1 > median, I find 1,101 observations.

Aren't the two counts supposed to be equal?

Thank you,
Rich Goldstein

Join Date: Mar 2014

Posts: 4408
#2

21 Jun 2024, 10:06

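They will not be equal if there are ties at the median. Missing values can also cause problems: Stata treats a missing value as larger than any number, so variable1 > median counts missing values too. Further, you are comparing "<=" with ">", which puts every observation equal to the median into the first group and can by itself produce unequal Ns.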
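To see how much each of these contributes, one option is to count the observations below, at, and above the median separately, with missing values counted on their own. This is only a sketch on the auto data: rep78 stands in for variable1 (an assumption for illustration, chosen because it has both ties and missing values), and the scalar names are made up.

```stata
sysuse auto, clear

* rep78 stands in for variable1 (assumption for illustration)
egen median = median(rep78)

* separate counts: below, at, above, and missing
count if rep78 < median
scalar n_below = r(N)
count if rep78 == median
scalar n_at = r(N)
count if rep78 > median & !missing(rep78)
scalar n_above = r(N)
count if missing(rep78)
scalar n_missing = r(N)

* without the missing() guard, missing values are counted as above the median
count if rep78 > median
scalar n_above_naive = r(N)

display "below " n_below ", at " n_at ", above " n_above ", missing " n_missing
display "unguarded count above: " n_above_naive
```

If the "at" count is large, or the unguarded count exceeds the guarded one, that accounts for the imbalance between the two groups.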

Nick Cox

Join Date: Mar 2014
Posts: 35211

21 Jun 2024, 15:36

Let's look at some data. For the auto data, setting aside repair record, there are 74 non-missing values for each numeric variable. That itself means that the median will be reported as the mean of the 37th and 38th ordered values, so it is possible in principle for exactly half of the observations to lie above the median and exactly half below.
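As a quick check of that point, here is a minimal sketch using price from the auto data. The choice of price and the scalar name med are mine, purely for illustration.

```stata
sysuse auto, clear

* with 74 non-missing values, the median is the mean of the 37th and 38th ordered values
sort price
summarize price, detail
scalar med = r(p50)
display "reported median:           " med
display "mean of 37th and 38th:     " (price[37] + price[38]) / 2

* how many observations fall strictly below and strictly above
count if price < med
count if price > med & !missing(price)
```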

For some other variables, especially those with few distinct values, that need not occur, and indeed the median can even be an extreme value.
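The indicator foreign in the auto data is a case in point: well over half of the cars are domestic (coded 0), so the median coincides with the minimum and nothing lies below it. A small sketch, with the scalar name med again just an illustrative choice:

```stata
sysuse auto, clear

* foreign is a 0/1 indicator, mostly 0, so its median equals its minimum
summarize foreign, detail
display "median = " r(p50) ", minimum = " r(min)
scalar med = r(p50)

count if foreign < med
count if foreign == med
count if foreign > med & !missing(foreign)
```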

Here is some code. labmask and tabplot are from the Stata Journal.

Code:

* labmask and tabplot are community-contributed and must be installed first
sysuse auto, clear

* holders for one row of results per variable
gen N = .
gen median = .
gen probabilitybelow = .
gen probabilityat = .
gen probabilityabove = .
gen varname = ""
local i = 1

* for each variable: the median, then the fractions below, at, and above it
qui foreach v of var price-foreign {
    replace varname = "`v'" in `i'
    su `v', detail
    replace N = r(N) in `i'
    replace median = r(p50) in `i'
    count if `v' < median[`i']
    replace probabilitybelow = r(N) / N in `i'
    count if `v' == median[`i']
    replace probabilityat = r(N) / N in `i'
    * exclude missings, which Stata treats as larger than any number
    count if `v' > median[`i'] & `v' < .
    replace probabilityabove = r(N) / N in `i'
    local ++i
}

format probability* %4.3f

* order the variables by the fraction below the median
sort probabilitybelow
gen x = _n

list varname probability* if varname != "", noobs sep(0)

* keep the results only and reshape them for plotting
keep if varname != ""
keep x varname probability*
reshape long probability, i(varname) j(which) string

* attach the variable names as value labels of x
labmask x, values(varname)

tabplot x which [iw=probability], showval(format(%4.3f) offset(0.24)) ///
    subtitle("probability below, at, or above the median") xreverse ///
    xtitle("") ytitle("") horizontal xsc(r(0.85 .) alt)
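Coming back to the original question of looking at variable2 by whether variable1 is above its median, one possible approach is an explicit indicator that leaves missing values out of both groups. This is a sketch under assumptions: mpg stands in for variable1 and price for variable2, and the names med_mpg and above are hypothetical.

```stata
sysuse auto, clear

* mpg stands in for variable1, price for variable2 (assumptions for illustration)
egen med_mpg = median(mpg)

* 1 if above the median, 0 if at or below; missing stays missing
gen byte above = mpg > med_mpg if !missing(mpg)
label define above 0 "at or below median" 1 "above median"
label values above above

* group sizes need not be equal when values are tied at the median
tabulate above

* summarize the second variable within each group
tabstat price, by(above) statistics(n mean p50)
```

Whether values equal to the median belong in the lower or the upper group is a choice to make explicitly; with many ties at the median, neither choice gives equal halves.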
