t-test p-values: philosophical question

Richard Rovin

Join Date: Aug 2019

Posts: 22
#1


19 Dec 2019, 12:53

I am using Stata 15.1 for Mac. After 1:4 case:control matching, my dataset has 1174 cases (patients with brain tumors) and 4696 controls (patients without brain tumors). I am comparing the levels of a serum biomarker (the value is unitless, as it is a ratio). Using a two-sample t test by group (ttest serum, by(tumor)), I get the following output, from which I conclude that the difference between the means is not statistically significant.

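Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   4,696     .282771     .000907     .062152    .2809929    .2845491
       1 |   1,174    .2789174     .001807    .0619147    .2753721    .2824627
---------+--------------------------------------------------------------------
combined |   5,870    .2820003    .0008108    .0621185    .2804109    .2835897
---------+--------------------------------------------------------------------
    diff |            .0038536    .0020265               -.0001191    .0078263
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   1.9016
Ho: diff = 0                                     degrees of freedom =     5868

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9714         Pr(|T| > |t|) = 0.0573          Pr(T > t) = 0.0286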

However, when I use the immediate form t-test calculator with rounded values (ttesti 4696 0.283 0.062 1174 0.279 0.062), I get the following output, now with a "significant" p value.

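Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   4,696        .283    .0009047        .062    .2812263    .2847737
       y |   1,174        .279    .0018095        .062    .2754498    .2825502
---------+--------------------------------------------------------------------
combined |   5,870       .2822    .0008094    .0620154    .2806132    .2837868
---------+--------------------------------------------------------------------
    diff |                .004    .0020231                 .000034     .007966
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   1.9772
Ho: diff = 0                                     degrees of freedom =     5868

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9760         Pr(|T| > |t|) = 0.0481          Pr(T > t) = 0.0240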

I realize there is considerable controversy regarding reliance on p values to establish statistical significance.
So, how do I reconcile the above? I would certainly like to say the difference in the biomarker level between cases and controls is significant and therefore has clinical utility.

Thank you very much,
Richard

Last edited by Richard Rovin; 19 Dec 2019, 13:04. Reason: I tried to make output tables more readable and I wanted to add tags
Tags: p-value, t-test

William Lisowski

Join Date: Dec 2014
Posts: 10150
#2

19 Dec 2019, 13:33

If you copy and paste precisely the values output by your ttest command as arguments to your ttesti command, you get precisely the same results from ttesti.

Code:

. ttesti 4696 .282771 .062152 1174 .2789174 .0619147

Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   4,696     .282771     .000907     .062152    .2809929    .2845491
       y |   1,174    .2789174     .001807    .0619147    .2753721    .2824627
---------+--------------------------------------------------------------------
combined |   5,870    .2820003    .0008108    .0621185    .2804109    .2835897
---------+--------------------------------------------------------------------
    diff |            .0038536    .0020265               -.0001191    .0078263
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   1.9016
Ho: diff = 0                                     degrees of freedom =     5868

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9714         Pr(|T| > |t|) = 0.0573          Pr(T > t) = 0.0286
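
The move from p = 0.0573 to p = 0.0481 comes almost entirely from rounding the two means: the difference goes from 0.0038536 to 0.004 while the standard error barely changes. A minimal sketch that runs both versions and compares the stored results, assuming only that ttesti leaves its results in r() as ttest does (the scalar names are made up for this illustration):

```stata
* Exact summary statistics, as reported by -ttest-
ttesti 4696 .282771 .062152 1174 .2789174 .0619147
scalar t_exact = r(t)
scalar p_exact = r(p)

* Values rounded to three decimals, as in the original -ttesti- call
ttesti 4696 .283 .062 1174 .279 .062
scalar t_round = r(t)
scalar p_round = r(p)

* Rounding inflates the difference in means by roughly 4%
display "diff ratio = " %6.4f (.283 - .279)/(.282771 - .2789174)
display "t: " %6.4f t_exact " -> " %6.4f t_round
display "p: " %6.4f p_exact " -> " %6.4f p_round
```

The t statistic is the difference divided by its standard error, so a 4% larger numerator with an essentially unchanged denominator is enough to carry the p-value across 0.05.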

To get nicely formatted output like the above, copy your command and results from the Results window into a code block in the Forum editor using code delimiters [CODE] and [/CODE], as explained in section 12 of the Statalist FAQ linked to at the top of the page.

Last edited by William Lisowski; 19 Dec 2019, 14:21.


Igor Paploski

Join Date: Oct 2014

Posts: 174
#3

19 Dec 2019, 14:11

(...) difference in the biomarker level between cases and controls is significant and therefore has clinical utility.

Not really. Statistical significance of a difference and clinical utility of said difference deal with two fundamentally different properties of the finding. Given a large enough sample, virtually any difference between groups, however small, can be statistically significant regardless of its clinical significance, and it is also possible to have clinically important differences that are not statistically significant (which can also be related to sample size).

Since as per the title of your post your interest is in the more philosophical side of the discussion, I would suggest reading this paper, published earlier this year: https://www.nature.com/articles/d41586-019-00857-9

More important than whether the p-value is above 0.05 (it is 0.0573) or below 0.05 (it is 0.0481) is the discussion of how useful this difference is for your patients. It's likely that if you had ~100 more patients in your case group (along with the corresponding controls), and assuming the biomarker levels in both groups stayed the same, the difference would have been statistically significant, yet nothing would change regarding clinical importance. Discuss how the difference you seem to have found helps in managing patients; in my opinion this is as valuable as having a "statistical significance stamp" on whatever you found.
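
The sample-size point can be checked directly with ttesti. A rough sketch, assuming the group means and standard deviations stay exactly as observed while 100 cases and 400 matched controls are added (hypothetical sample sizes, not data from the study):

```stata
* Hypothetical: same means and SDs as in #1, 100 more cases and 400 more controls
ttesti 5096 .282771 .062152 1274 .2789174 .0619147
display "two-sided p = " %6.4f r(p)
```

Under these assumptions the two-sided p-value falls below 0.05, although the estimated difference of about 0.0039 is exactly what it was before.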

Last edited by Igor Paploski; 19 Dec 2019, 14:13.
3 likes
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2167
#4

19 Dec 2019, 14:18

Richard: In my view, you're taking the strict p-value < 0.05 prescription much too seriously. As you can see, rounding estimates and standard errors can change the later decimal places of a p-value; in your case, from something just above 0.05 to something just below. We generally rely too much on many digits when reporting precision, but I would just go with the initially reported p-value of 0.057. So you reject the null at the 6% level even though not at the 5% level. There is nothing sacred about 5%. You have pretty strong evidence against the null of no effect. It's not a slam dunk, of course. How one proceeds depends on the context. Is the estimated effect practically large? I can't even tell if it goes in the direction you expect: did you think the mean under treatment would be higher or lower? How important is a difference of 0.0039, or even of 0.0078, the upper end of the confidence interval?

Might the proper alternative be one-sided in the direction of a lower mean? In that case the p-value is about 0.0286. I think the best you can do is discuss the direction of the effect, the size of the effect, the p-value, and a confidence interval. This is an uncertain business we are in, but you should present all of the evidence.
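
The p-values and the confidence interval quoted here follow from the reported t statistic, degrees of freedom, difference, and standard error. A small sketch using Stata's ttail() and invttail() functions, with all inputs copied from the output in #1 (the scalar names are arbitrary):

```stata
scalar t_obs   = 1.9016
scalar df_obs  = 5868
scalar d_mean  = .0038536
scalar se_diff = .0020265

* Two-sided p-value: rejects at the 6% level but not at the 5% level
display "two-sided p = " %6.4f 2*ttail(df_obs, abs(t_obs))

* One-sided p-value for Ha: mean(controls) - mean(cases) > 0
display "one-sided p = " %6.4f ttail(df_obs, t_obs)

* 95% confidence interval for the difference in means
scalar half = invttail(df_obs, .025)*se_diff
display "95% CI: " %9.7f (d_mean - half) " to " %9.7f (d_mean + half)
```

The one-sided p-value is simply half the two-sided one because the observed difference lies in the direction of the alternative.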
5 likes
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#5

19 Dec 2019, 14:18

Originally posted by Richard Rovin

I would certainly like to say the difference in the biomarker level between cases and controls is significant and therefore has clinical utility.

You cannot derive clinical utility from statistical significance. Clinical utility is determined by looking at the difference between 0.283 and 0.279 and judging whether that 0.004 difference has a real, noticeable impact on the patients. The p-value tries to quantify our uncertainty due to random sampling. The distinction between a p-value of 0.048 and one of 0.057 is obviously meaningless, and calling one significant and the other not significant is not helpful.

So look at the p-value and see moderate evidence against the hypothesis that the means are equal, and then look at the means and look at what they mean for the patients in order to determine the clinical relevance of your finding.
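
One way to put the size of the difference in context, offered here as an illustration rather than something proposed in the thread, is a standardized effect size. A sketch using esizei with the summary statistics from #1:

```stata
* Cohen's d for the difference in means, computed from summary statistics
esizei 4696 .282771 .062152 1174 .2789174 .0619147
display "Cohen's d = " %5.3f r(d)
```

A d of roughly 0.06 says the group means differ by about 6% of a standard deviation. Whether a shift of that size matters for patients remains a subject-matter judgment that no test can make.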

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
4 likes
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#6

19 Dec 2019, 14:25

A couple of notes:
First, if it is a matched study, you'd want to use a matched analysis, which will give you a somewhat different answer than your unmatched analysis. But a more important consideration is that a t-test for the difference in mean exposure between cases and controls is, to my knowledge, pretty strongly deprecated by epidemiologists. Among other things, it involves a comparison of causes, given effects, rather than the reverse. My recollection is that there are some standard examples where no mean difference exists, but methods that treat case/control status as the response variable do reveal a difference. I might have read about this in one of the older versions of Rothman's epidemiological methods textbook. Perhaps one of our resident epi folks can comment.
3 likes
Richard Rovin

Join Date: Aug 2019

Posts: 22
#7

20 Dec 2019, 14:51

Thank you all so much for your prompt and thoughtful responses.

@William Lisowski: In a mock post, I did as you suggested and copied the command and results using the code delimiters and was rewarded with the beautiful Stata results table we know and love. I also used this method to reproduce the -clogit- output below.

@Igor Paploski: I did read the paper you referenced. I found the "Quit Categorizing" section especially helpful. I am sharing it with my co-authors and would like to incorporate "compatibility intervals".

@Jeff Wooldridge: We thought the mean of cases would be lower than the mean of controls. Your point is well taken, certainly one that I had not considered.

@Maarten Buis: I agree completely. Unfortunately, we still publish in a dichotomized world (the paper Igor referenced) and I was concerned that our finding would be dismissed if it wasn't labeled statistically significant. I will prepare a more nuanced discussion of the meaningfulness of our findings.

@Mike Lacy: I would welcome an epidemiologist's view point. We did run conditional logistic regression:

Code:

clogit tumor serum, group(pairid) or

Iteration 0:   log likelihood = -1887.4888
Iteration 1:   log likelihood = -1887.4687
Iteration 2:   log likelihood = -1887.4687

Conditional (fixed-effects) logistic regression

                                                Number of obs     =      5,870
                                                LR chi2(1)        =       4.02
                                                Prob > chi2       =     0.0449
Log likelihood = -1887.4687                     Pseudo R2         =     0.0011

------------------------------------------------------------------------------
       tumor | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       serum |   .3290493   .1824163    -2.01   0.045     .1110138    .9753148
------------------------------------------------------------------------------

Thanks again,
Richard

Last edited by Richard Rovin; 20 Dec 2019, 15:03. Reason: I added @ to properly acknowledge the respondents
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#8

21 Dec 2019, 02:08

Richard:
I do share all previous comments.
Although we are usually taught (at least at the beginning of our frequentist statistical training) to split reality into significant and non-significant comparisons, we learn along the way that the split is not that relevant: significant and non-significant results are equally informative.
As Clyde Schechter often wisely reminds us on this forum, the relevance of the p-value per se has been questioned in a frequently quoted ASA position paper (that you can freely access at https://amstat.tandfonline.com/doi/f...8#.Xf3eDuRYapo). This contribution goes hand in hand with the one Igor Paploski suggested.
In your case, I would also wonder whether a single predictor is actually enough to get an informative result. Maybe age, gender, comorbidities, and/or other independent variables should be considered (and, in my opinion, this would imply switching to a regression model, leaving -ttest- for simpler comparisons and/or basic statistical assignments).

Kind regards,
Carlo
(Stata 19.0)
2 likes
Richard Rovin

Join Date: Aug 2019

Posts: 22
#9

21 Dec 2019, 05:19

Carlo Lazzaro: Thank you for your response. We did match cases and controls on age, gender, race, and multiple medical comorbidities. We first ran the -clogit- analysis (output above), but then decided to also look at the difference in means.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#10

21 Dec 2019, 05:47

Richard:
I skipped one of your previous replies; my bad.
That said, it seems that the -clogit- and -ttest- results are more or less consistent (the first is barely significant, the second is barely non-significant).
However, I prefer examining the 95% CI rather than the p-value; here the CI suggests that -serum- has limited relevance, in statistical terms, as a predictor.
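
The odds ratio in #7 is expressed per one-unit change in -serum-, a change far larger than anything in the data (the overall standard deviation is about 0.062), which makes the estimate and its interval hard to read. A sketch that rescales the reported odds ratio and 95% limits to a one-standard-deviation change; using one SD as the unit is an assumption made here purely for illustration, and the scalar names are arbitrary:

```stata
* Reported odds ratio and 95% limits per one-unit change in serum (from #7)
scalar or_unit = .3290493
scalar lb_unit = .1110138
scalar ub_unit = .9753148

* Overall standard deviation of serum, from the -ttest- output in #1
scalar sd_serum = .0621185

* The log odds ratio is linear in serum, so OR per SD = OR_unit^SD
display "OR per SD = " %5.3f or_unit^sd_serum
display "95% CI: "     %5.3f lb_unit^sd_serum " to " %5.3f ub_unit^sd_serum
```

On this scale the odds ratio is roughly 0.93, with limits near 0.87 and just under 1. With the data in memory, running `lincom .0621185*serum, or` after -clogit- should give the same rescaling.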

Kind regards,
Carlo
(Stata 19.0)
1 like
Richard Rovin

Join Date: Aug 2019

Posts: 22
#11

21 Dec 2019, 06:17

No worries. I read over the ASA statement on p-values. I agree; developing the language and changing the mindset (including mine) is proving more difficult.
I also came across this 2019 editorial from the ASA: https://www.tandfonline.com/doi/pdf/...eedAccess=true The editors recommend we stop using the term "statistically significant": "We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely."

Cheers,
Richard

Last edited by Richard Rovin; 21 Dec 2019, 06:38. Reason: I added the link to the ASA paper and its seminal quote.
1 like
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#12

21 Dec 2019, 06:40

Richard:
thanks for sharing this interesting link (that I missed).

Kind regards,
Carlo
(Stata 19.0)
Dave Airey

Join Date: Apr 2014

Posts: 398
#13

23 Dec 2019, 07:21

Also take a look at a couple papers on the "2nd generation p-value" by Blume and colleagues as an alternative approach.
Richard Rovin

Join Date: Aug 2019

Posts: 22
#14

24 Dec 2019, 09:27

Dave Airey: an interval p-value is interesting. Here is a link to the paper: https://journals.plos.org/plosone/ar...type=printable

On a side note, as we shift from the all important p<0.05, are there thoughts as to how we should calculate sample size?

Best,
Richard
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#15

24 Dec 2019, 10:06

On a side note, as we shift from the all important p<0.05, are there thoughts as to how we should calculate sample size?

Dave Giles (an econometrician) recently discussed Ed Leamer's (also an econometrician) work regarding how significance levels should be decreased as the sample size grows. I think that's what you're asking here; it's a short blog post that's worth a read.

Here's a relevant excerpt:

Should we still set α = 10%, 5%, 1% if n is very, very large? (No, we shouldn't!)

Equivalently, if n is very big, what is the appropriate magnitude of the p-value below which we should decide to reject the null hypothesis? Or, equivalently again, how should the critical value for this test be modified in very large samples?

Leamer's result tells us that we should reject the null if $F > (n/q)\,(n^{q/n} - 1)$; or equivalently, if $qF = \chi^2 > n\,(n^{q/n} - 1)$

Source:
https://davegiles.blogspot.com/2019/...-you-have.html
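
To see what this rule implies for a sample of the size discussed in this thread, the critical value can be evaluated directly. A sketch assuming q = 1 restriction and n = 5,870 total observations, with the formula used exactly as quoted in the excerpt (the original result may define n slightly differently, for example net of estimated parameters); the scalar names are arbitrary:

```stata
scalar n_obs = 5870
scalar q_res = 1

* Critical value for the chi-squared statistic: n*(n^(q/n) - 1)
scalar crit = n_obs*(n_obs^(q_res/n_obs) - 1)
display "critical chi2 = " %6.3f crit

* Significance level implied by that critical value for a 1-df test
display "implied alpha = " %6.4f chi2tail(q_res, crit)

* Squared t statistic from the t test in #1, for comparison
display "observed t^2  = " %6.3f 1.9016^2
```

With these inputs the critical value is about 8.7, which corresponds to a significance level of roughly 0.003, while the squared t statistic from #1 is about 3.6. Read this way, the rule would not reject the null for these data, though comparing a squared t statistic with a chi-squared critical value is itself only a large-sample approximation.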
