Testing the difference between medians

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#16

22 Mar 2020, 13:50

I said the variable tv is discrete because the medians turn out to be exactly 1 and 4, respectively. But it looks like it's not an integer; maybe, as Nick suggested, it's recorded in half-hour increments?

Unless TV takes on a wide range of values, you can't trust either the usual standard errors or those obtained via bootstrap. The median is not a smooth function of the data in that case.

You could assume that tv has a particular distribution -- such as that for a Tobit model, which allows true zeros -- and then use interval regression. Then you could compute the difference in medians from the lognormal distribution and easily obtain a valid standard error for the difference.
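
A minimal sketch of that idea, not a definitive implementation. It assumes that a report of h hours means the true time lies in [h - 0.5, h + 0.5), that latent viewing time is lognormal, and that group is coded 0/1. The variable names ln_lo and ln_hi are hypothetical. It treats zeros as left-censored rather than as true zeros, which is a simplification.

```stata
* Assumption: a reported h hours stands for true time in [h-0.5, h+0.5)
* Assumption: latent viewing time is lognormal, so ln(time) is normal
generate double ln_lo = ln(tv - 0.5) if tv > 0   // missing for tv == 0: left-censored
generate double ln_hi = ln(tv + 0.5)

* Interval regression on the log scale
intreg ln_lo ln_hi i.group

* The median of a lognormal is exp(linear index), so the difference
* in medians and its delta-method standard error come from nlcom
nlcom exp(_b[_cons] + _b[1.group]) - exp(_b[_cons])
```

The standard error is only as good as the distributional assumption, so it is worth checking how sensitive the result is to the chosen interval width.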
Comment

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#17

23 Mar 2020, 03:24

It's definitely recorded in whole hours; the question asked was "In a typical day, how many hours do you spend watching TV. Put 0 if you do not spend any time watching TV".

Code:

. tab tv

 Total time |
      spent |
watching tv |
   in hours |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        112       11.57       11.57
          1 |         69        7.13       18.70
          2 |        147       15.19       33.88
          3 |        173       17.87       51.76
          4 |        163       16.84       68.60
          5 |        100       10.33       78.93
          6 |         77        7.95       86.88
          7 |         35        3.62       90.50
          8 |         39        4.03       94.52
          9 |         14        1.45       95.97
         10 |         18        1.86       97.83
         11 |          2        0.21       98.04
         12 |          6        0.62       98.66
         13 |          3        0.31       98.97
         14 |          1        0.10       99.07
         15 |          1        0.10       99.17
         16 |          1        0.10       99.28
         19 |          1        0.10       99.38
         20 |          1        0.10       99.48
         21 |          4        0.41       99.90
         24 |          1        0.10      100.00
------------+-----------------------------------
      Total |        968      100.00

Based on this - is interval regression still the appropriate way forward?

Thanks!

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#18

23 Mar 2020, 03:54

Joe:
are there people who scored 0 because they actually did not have (access to) a TV set?

Kind regards,
Carlo
(Stata 19.0)
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#19

23 Mar 2020, 04:10

Hi Carlo,

Unfortunately I do not have that information but it's possible, particularly given the participants are people with serious mental illness, but relatively well functioning at baseline. Also possible is that they answered incorrectly for whatever reason or underestimated.

Last edited by Joe Tuckles; 23 Mar 2020, 04:16.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#20

23 Mar 2020, 07:16

A minor point about the histograms: I would use the discrete option and not let histogram choose the bin width.
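
As a sketch of what that looks like, assuming the variables are named tv and group as in the output earlier in the thread:

```stata
* One bar per distinct integer value of tv, drawn separately for each group
histogram tv, discrete percent by(group)
```

With discrete, each distinct value gets its own bar, so the spikes at 0 and the heaping at whole hours stay visible instead of being merged into wider bins.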

Jeff Wooldridge My point was that if the data were integers, then the median must be an integer or half-integer. I wasn't ruling out finer resolution in the data, which none of us could see, but it's now clear that the data are integers and cover the possible range [0, 24], credibly or not.
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#21

23 Mar 2020, 07:30

Thanks Nick. Please see the updated histograms. Do I also need to run intreg?
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#22

23 Mar 2020, 13:29

Originally posted by Nick Cox

A minor point about the histograms: I would use the discrete option and not let histogram choose the bin width.

Jeff Wooldridge My point was that if the data were integers, then the median must be an integer or half-integer. I wasn't ruling out finer resolution in the data, which none of us could see, but it's now clear that the data are integers and cover the possible range [0, 24], credibly or not.

Oh right. Stata uses the convention of averaging two values if both are medians (and then, technically, so is any point in between).
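
A toy illustration of that convention, using four made-up values rather than the tv data:

```stata
* Four integer values: both 2 and 3 (and anything in between) are medians
clear
input x
1
2
3
4
end

summarize x, detail
display r(p50)    // 2.5, the average of the two middle values
```

This is why integer data can produce a half-integer median even though no observation takes that value.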
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#23

23 Mar 2020, 13:32

As I have written somewhere, non-mathematical readers are told that averaging two middle values to get a median is a rule while more mathematical readers are told that it is only a convention, for the reason you give.

Last edited by Nick Cox; 23 Mar 2020, 13:50.
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#24

24 Mar 2020, 03:41

Hi, apologies, but could I request some clarification? I have the two median values, and I have been advised to formally test for a difference if I'm going to report the two median values. Should I therefore just report the qreg bootstrap finding, and/or the histograms, and/or run intreg?

Thanks
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#25

24 Mar 2020, 05:27

Is this coursework?
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#26

24 Mar 2020, 05:45

No
Comment

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#27

24 Mar 2020, 10:36

Do the following statistics provide differing information?

Code:

. cendif tv, by(group)
Y-variable: tv (Total time spent watching tv in hours)
Grouped by: group
Group numbers:

      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        806       83.26       83.26
          1 |        162       16.74      100.00
------------+-----------------------------------
      Total |        968      100.00
Transformation: Fisher's z
95% confidence interval(s) for percentile difference(s)
between values of tv in first and second groups:
   Percent    Pctl_Dif     Minimum     Maximum 
        50           0          -1           0 

. cid tv, by(group) median unpaired

Rank-based confidence interval for difference in  medians by group

Variable |     Obs     Estimate           K        [95% Conf. Interval]
---------+-------------------------------------------------------------
      tv |     968            0       58922              -1           0

Comment

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#28

24 Mar 2020, 11:40

Apologies for seeking further clarification. Hopefully clarification will help both myself and future visitors. Looking at research papers, it seems many people just report two medians followed by a p-value. Would it therefore be appropriate for me to use Mood's median test?

Code:

Median test

   Greater |
  than the |         group
    median |         0          1 |     Total
-----------+----------------------+----------
        no |       426         75 |       501
       yes |       380         87 |       467
-----------+----------------------+----------
     Total |       806        162 |       968

          Pearson chi2(1) =   2.3228   Pr = 0.127

   Continuity corrected:
          Pearson chi2(1) =   2.0677   Pr = 0.150

I note, though, that the p-value differs from the qreg bootstrap p-value.
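
For reference, a sketch of how output like the above can be obtained and checked. The command that produced it is not shown in the post; the layout matches Stata's built-in median command, which is an assumption here.

```stata
* Median test comparing tv across the two groups
median tv, by(group)

* The same Pearson chi-squared can be checked directly from the 2x2 counts
tabi 426 75 \ 380 87, chi2

* With many observations tied at the overall median (tv == 3 here), the way
* ties are handled matters; medianties() changes where they are counted
median tv, by(group) medianties(split)
```

The tabi line should reproduce the Pearson chi2(1) = 2.3228 reported in the table. In that table the 173 observations tied at the median are counted as "not greater than the median", so a different tie rule can move the result.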

Comment

Stephen Jenkins

Join Date: Apr 2014

Posts: 1425
#29

24 Mar 2020, 12:42

Joe: I suggest backing up and re-reading the comments on your post(s), including #16 and thereabouts, and also reading relevant literature before you post again. A key point in the comments was the distinction between distributions that are continuous and those that are not; this has implications for the definition of the "median" and thence tests. My understanding is that virtually all tests of differences in medians developed to date are for the continuous distribution case. That refers to tests available via centile (built-in), cendif (by Roger Newson, part of his somersd package on SSC), or cid (by Patrick Royston, posted on Statalist in 1995). I had not heard of Mood's median test, but I have just looked at https://en.wikipedia.org/wiki/Median_test and I suspect that it is also for the continuous distribution case. (Statistical experts -- not me -- can advise.)

BTW please have another look at the Forum FAQ and note the request to state the provenance of community-contributed commands.
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#30

24 Mar 2020, 13:40

Hi, I apologise for posting again. I am confused because Nick Cox's and Jeff Wooldridge's posts seem to be contradictory.

It appears the variable tv is discrete and it is an integer. I have run the qreg and bootstrapped it and provided histograms using the discrete option as per Nick's advice. However, Jeff states that unless TV takes on a wide range of values, you can't trust either the usual standard errors or those obtained via bootstrap.

Therefore I am not clear whether the histograms and qreg output are usable, whether I need to run intreg, or whether I should not report median differences for this variable at all.
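
For readers following along, the qreg-plus-bootstrap step referred to here is typically run along these lines. This is a sketch only: the exact commands used earlier in the thread are not shown, and the replication count and seed are arbitrary choices.

```stata
* Median regression of tv on a group indicator; the coefficient on
* 1.group is the estimated difference in medians
qreg tv i.group

* Bootstrap standard errors for the same model
bootstrap, reps(1000) seed(12345): qreg tv i.group
```

Jeff's caveat in #16 applies to both sets of standard errors: with integer data taking few distinct values, the median is not a smooth function of the data, so neither the analytic nor the bootstrap standard errors can be relied on.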
Comment
