  • splitting a dataset based on the median of a variable

    Hello everyone,

    I want to split variable2 based on the median of another variable, variable1.

    First, to find the median of variable1, I use this code:

    egen median = median(variable1)

    But when I use count if variable1 <= median, I find 1,114 observations.

    When I use count if variable1 > median, I find 1,101 observations.

    Aren't the two counts supposed to be equal?


    Thank you,

  • #2
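    They will not be equal if there are ties at the median. Missing values can also cause problems: Stata treats a missing value as larger than any number, so variable1 > median counts missing values too. Further, you are comparing "<=" to ">", so every observation equal to the median is counted on one side only, and that alone can result in unequal N's.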
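To make those three points concrete, here is a minimal sketch rather than a definitive recipe. It assumes the auto data shipped with Stata, with mpg standing in for variable1 and price for variable2; the names med and group are made up for illustration.

```stata
* Sketch only: mpg stands in for variable1, price for variable2
sysuse auto, clear
egen med = median(mpg)

* Three-way split; observations with missing mpg stay missing in group
gen byte group = cond(mpg < med, 1, cond(mpg == med, 2, 3)) if !missing(mpg)
label define group 1 "below" 2 "at" 3 "above"
label values group group
tabulate group, missing

* The two counts from the question, with missing values excluded explicitly
count if mpg <= med
count if mpg > med & !missing(mpg)

* Summaries of the second variable within each group
tabstat price, by(group) statistics(n mean median)
```

Tabulating "below", "at" and "above" separately shows how many ties sit exactly at the median, which is usually what explains counts like those in the question.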


    • #3
      Let's look at some data. For the auto data, setting aside repair record (rep78), there are 74 non-missing values for each numeric variable. That itself means that the median will be reported as the mean of the 37th and 38th ordered values, so it is possible in principle for exactly half of the observations to lie above the median and exactly half below.
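A quick check of that point, as a sketch: price is just one convenient variable from the auto data, and the positions 37 and 38 assume that all 74 values are non-missing.

```stata
sysuse auto, clear
sort price
summarize price, detail

* With 74 non-missing values the median is the mean of the 37th and 38th ordered values
display r(N)
display r(p50)
display (price[37] + price[38]) / 2
assert r(p50) == (price[37] + price[38]) / 2
```

The assert stops with an error if the two quantities differ, so silence from that line means they agree.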

      For some other variables, especially those with few distinct values, that need not occur, and indeed the median can even be an extreme value.
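As one illustration of a median at an extreme, consider the indicator foreign in the auto data, which takes only the values 0 and 1. This is a sketch; the local macro med is just an illustrative name.

```stata
sysuse auto, clear
tabulate foreign

summarize foreign, detail
local med = r(p50)
display "min = " r(min) ", median = " r(p50) ", max = " r(max)

* More than half of the cars are domestic (0), so the median equals the minimum
count if foreign < `med'
count if foreign == `med'
count if foreign > `med' & !missing(foreign)
```

Here nothing lies below the median, so any "below" versus "above" split is necessarily lopsided.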

      Here is some code. labmask and tabplot are from the Stata Journal.

      Code:
      sysuse auto, clear
      
      gen N = . 
      gen median = . 
      gen probabilitybelow = . 
      gen probabilityat = . 
      gen probabilityabove = .  
      gen varname = ""
      local i = 1 
      
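      * loop over the numeric variables, storing one row of results per variable
      qui foreach v of var price-foreign { 
          replace varname = "`v'" in `i' 
          su `v', detail 
          replace N = r(N) in `i'
          replace median = r(p50) in `i'
          * fractions of the non-missing values below, at, and above the median
          count if `v' < median[`i']
          replace probabilitybelow = r(N) / N in `i'
          count if `v' == median[`i']
          replace probabilityat = r(N) / N in `i'
          * missing counts as larger than any number, so exclude it explicitly
          count if `v' > median[`i'] & `v' < . 
          replace probabilityabove = r(N) / N in `i'
          local ++i 
      }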
      
      format probability* %4.3f 
      
      sort probabilitybelow 
      
      gen x = _n
      
      list varname probability* if varname != "", noobs sep(0)
      
      keep if varname != "" 
      
      keep x varname probability* 
      
      reshape long probability, i(varname) j(which) string 
      
      labmask x, values(varname)
      
      tabplot x which [iw=probability], showval(format(%4.3f) offset(0.24)) ///
      subtitle("probability below, at, or above the median") xreverse xtitle("") ytitle("") horizontal xsc(r(0.85 .) alt)
      [Figure: median.png, tabplot of the probability below, at, or above the median for each variable]
