The following is a guest post by Tony Hak of Rotterdam School of Management. I know Tony would love some discussion about it in the comments. I remain undecided either way, so would like to hear arguments.

**GOOD REASONS FOR NOT TEACHING SIGNIFICANCE TESTING**

It is now well understood that *p*-values are not informative and are not replicable. Soon null hypothesis significance testing (NHST) will be obsolete and will be replaced by the so-called “new” statistics (estimation and meta-analysis). This means that undergraduate courses in statistics must already teach estimation and meta-analysis as the preferred way to present and analyze empirical results. If not, then the statistical skills of the graduates from these courses will be outdated on the day these graduates leave school. But it is less evident whether or not NHST (though not preferred as an analytic tool) should still be taught. Because estimation is already routinely taught as a preparation for the teaching of NHST, the necessary reform in teaching will not require the addition of new elements in current programs but rather the *removal of the current emphasis on NHST* or *the complete removal of the teaching of NHST* from the curriculum. The current trend is to continue the teaching of NHST. In my view, however, teaching of NHST should be discontinued immediately because it is (1) ineffective and (2) dangerous, and (3) it serves no aim.

*1. Ineffective: NHST is difficult to understand and it is very hard to teach it successfully*

We know that even good researchers often do not appreciate the fact that NHST outcomes are subject to sampling variation and believe that a “significant” result obtained in one study almost guarantees a significant result in a replication, even one with a smaller sample size. Is it then surprising that our students also do not understand what NHST outcomes do tell us and what they do not tell us? In fact, statistics teachers know that the principles and procedures of NHST are not well understood by undergraduate students who have successfully passed their courses on NHST. Courses on NHST fail to achieve their self-stated objectives, assuming that these objectives include achieving a correct understanding of the aims, assumptions, and procedures of NHST as well as a proper interpretation of its outcomes. It is very hard indeed to find a comment on NHST in any student paper (an essay, a thesis) that is close to a correct characterization of NHST or its outcomes. There are many reasons for this failure, but obviously the most important one is that NHST is a very complicated and counterintuitive procedure. It requires students and researchers to understand that a *p*-value is attached to an outcome (an estimate) based on its location in (or relative to) an imaginary distribution of sample outcomes around the null. Another reason, connected to their failure to understand what NHST is and does, is that students believe that NHST “corrects for chance” and hence they cannot cognitively accept that *p*-values themselves are subject to sampling variation (i.e. chance).
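To make the point about sampling variation concrete, here is a minimal simulation sketch. It is not part of the original paper: the true effect (0.5), the known standard deviation (1), the sample size (32) and the use of a simple z-test are all assumptions chosen for illustration. The same hypothetical study is repeated many times and the *p*-value is recorded each time.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample, sigma=1.0, null_mean=0.0):
    """Two-sided z-test p-value for the sample mean, assuming sigma is known."""
    se = sigma / len(sample) ** 0.5
    z = (mean(sample) - null_mean) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_p_values(true_mean=0.5, sigma=1.0, n=32, replications=2000, seed=1):
    """Repeat the same hypothetical study many times and collect its p-values."""
    rng = random.Random(seed)
    return [
        two_sided_p([rng.gauss(true_mean, sigma) for _ in range(n)], sigma)
        for _ in range(replications)
    ]


p_values = simulate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest p: {min(p_values):.5f}, largest p: {max(p_values):.3f}")
print(f"share of replications with p < .05: {share_significant:.2f}")
```

Under these assumptions the test has a power of roughly 0.8, so about one replication in five is “non-significant” even though the effect is real and identical in every run, and the individual *p*-values range from very small to well above .05.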

*2. Dangerous: NHST thinking is addictive*

One might argue that there is no harm in adding a *p*-value to an estimate in a research report and, hence, that there is no harm in teaching NHST in addition to teaching estimation. However, the mixed experience with statistics reform in clinical and epidemiological research suggests that a more radical change is needed. Reports of clinical trials and of studies in clinical epidemiology now usually report estimates and confidence intervals, in addition to *p*-values. However, as Fidler et al. (2004) have shown, and contrary to what one would expect, authors continue to discuss their results in terms of significance. Fidler et al. therefore concluded that “editors can lead researchers to confidence intervals, but can’t make them think”. This suggests that a successful statistics reform requires a cognitive change that should be reflected in how results are interpreted in the Discussion sections of published reports.

The stickiness of dichotomous thinking can also be illustrated with the results of a more recent study of Coulson et al. (2010). They presented estimates and confidence intervals obtained in two studies to a group of researchers in psychology and medicine, and asked them to compare the results of the two studies and to interpret the difference between them. It appeared that a considerable proportion of these researchers, first, used the information about the confidence intervals to make a decision about the significance of the results (in one study) or the non-significance of the results (of the other study) and, then, drew the incorrect conclusion that the results of the two studies were in conflict. Note that no NHST information was provided and that participants were not asked in any way to “test” or to use dichotomous thinking. The results of this study suggest that NHST thinking can (and often will) be used by those who are familiar with it.
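The kind of comparison described above can be sketched numerically. The two estimates and standard errors below are hypothetical (they are not the stimuli used by Coulson et al.), and the studies are assumed to be independent with approximately normal estimates. One interval excludes zero and the other includes it, yet the interval for the *difference* between the studies comfortably includes zero, so the results are not in conflict.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def ci95(estimate, se):
    """95% confidence interval under a normal approximation."""
    return (estimate - Z95 * se, estimate + Z95 * se)


def difference(est_a, se_a, est_b, se_b):
    """Estimate and 95% CI of the difference between two independent estimates."""
    diff = est_a - est_b
    se_diff = (se_a ** 2 + se_b ** 2) ** 0.5
    return diff, ci95(diff, se_diff)


# Hypothetical numbers, chosen only for illustration: (estimate, standard error)
study_a = (3.6, 1.5)
study_b = (2.2, 1.6)

ci_a = ci95(*study_a)
ci_b = ci95(*study_b)
diff, ci_diff = difference(*study_a, *study_b)

print(f"Study A: {study_a[0]:.1f}, 95% CI [{ci_a[0]:.2f}, {ci_a[1]:.2f}]")
print(f"Study B: {study_b[0]:.1f}, 95% CI [{ci_b[0]:.2f}, {ci_b[1]:.2f}]")
print(f"A - B:   {diff:.1f}, 95% CI [{ci_diff[0]:.2f}, {ci_diff[1]:.2f}]")
```

Reading the two intervals as “significant” versus “non-significant” suggests a contradiction; reading them as estimates shows two largely overlapping ranges of plausible effect sizes.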

The fact that it appears to be very difficult for researchers to break the habit of thinking in terms of “testing” is, as with every addiction, a good reason for avoiding that future researchers come into contact with it in the first place and, if contact cannot be avoided, for providing them with robust resistance mechanisms. The implication for statistics teaching is that students should, first, learn estimation as the preferred way of presenting and analyzing research information and that they get introduced to NHST, if at all, only after estimation has become their routine statistical practice.

*3. It serves no aim: Relevant information can be found in research reports anyway*

Our experience that teaching of NHST fails its own aims consistently (because NHST is too difficult to understand) and the fact that NHST appears to be dangerous and addictive are two good reasons to immediately stop teaching NHST. But there is a seemingly strong argument for continuing to introduce students to NHST, namely that a new generation of graduates will not be able to read the (past and current) academic literature in which authors themselves routinely focus on the statistical significance of their results. It is suggested that someone who does not know NHST cannot correctly interpret outcomes of NHST practices. This argument has no value for the simple reason that it is assumed in the argument that NHST outcomes are relevant and should be interpreted. But the reason that we have the current discussion about teaching is the fact that NHST outcomes are at best uninformative (beyond the information already provided by estimation) and are at worst misleading or plain wrong. The point is all along that nothing is lost by just ignoring the information that is related to NHST in a research report and by focusing only on the information that is provided about the observed effect size and its confidence interval.

## Bibliography

Coulson, M., Healy, M., Fidler, F., & Cumming, G. (2010). Confidence Intervals Permit, But Do Not Guarantee, Better Inference than Statistical Significance Testing. *Frontiers in Quantitative Psychology and Measurement, 20*(1), 37-46.

Fidler, F., Thomason, N., Finch, S., & Leeman, J. (2004). Editors Can Lead Researchers to Confidence Intervals, But Can’t Make Them Think. Statistical Reform Lessons from Medicine. *Psychological Science, 15*(2), 119-126.

This text is a condensed version of the paper “After Statistics Reform: Should We Still Teach Significance Testing?” published in the Proceedings of ICOTS9.

It’s surely a good idea to emphasise estimation and confidence levels rather than NHST. However, much important knowledge is reported in Mss that use NHST. Are we going to deprive our students of the essential knowledge needed to understand and evaluate this literature? It might be a useful exercise to encourage students to translate the exact p-value [where given] to confidence levels. By the way, anyone who gives a 95% CI estimate and believes they have escaped NHST is deluded. The standard error should always be given; a CI at a particular confidence level is optional, since readers can decide for themselves what interval to use, although 95% is a handy guide. NB: also posted this in the blog. Best, Diana

Dear Diana,

Thank you for your response to my post.

It seems we agree on the requirement that a useful research report should report at least a point estimate and a standard error or confidence interval. The p-value follows from this information; it is not additional information that was not already there. We know everything we need: an effect size estimate and a measure of the precision of this estimate. I wonder what important information is missing in this situation. Translation from a p-value to confidence levels is only needed when information about the standard error or confidence interval is missing in the report. Fortunately, in most disciplines, it is rare that serious publications lack this information. Paradoxically, your argument then implies that NHST teaching is most (or only) useful in disciplines in which the bad practice of not reporting standard errors or confidence intervals is prominent (or, in other words, in disciplines in which bad statistical practices are routine).
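The two translations discussed in this exchange can be sketched in a few lines. This is an illustration under a normal approximation with a two-sided test against a null value of zero; the function names are made up for the example and do not come from any particular package.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_from_estimate(estimate, se, null_value=0.0):
    """Two-sided p-value implied by a point estimate and its standard error."""
    z = (estimate - null_value) / se
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def se_from_p(estimate, p, null_value=0.0):
    """Standard error implied by a point estimate and an exact two-sided p-value."""
    z = _STD_NORMAL.inv_cdf(1 - p / 2)
    return abs(estimate - null_value) / z


def confidence_interval(estimate, se, level=0.95):
    """Confidence interval at any level the reader prefers."""
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2)
    return (estimate - z * se, estimate + z * se)


p = p_from_estimate(2.0, 1.0)          # p-value from estimate and SE
recovered_se = se_from_p(2.0, p)       # and back again, when only p is reported
low, high = confidence_interval(2.0, recovered_se)
print(f"p = {p:.4f}, SE = {recovered_se:.3f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The first function shows why the p-value adds nothing once the estimate and its standard error are reported; the second is the “translation exercise” that is needed only when a report gives an exact p-value but no measure of precision.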

Tony’s article does a good job of highlighting the many issues with NHST and I am sure all statisticians can contribute many other issues. However, I think Tony has misdiagnosed the solution. I have taught and used NHST throughout my career and my experience tells me that the fundamental problem lies in the way it is taught. Too often, especially when teaching non-statisticians, NHST gets taught as a cookbook approach, i.e. follow this recipe. The end result is often a black/white view of the world where the 95% CI and 5% significance level become sacrosanct, which then leads to all the issues documented.

My approach has been to ignore all the maths and instead concentrate on developing understanding of the risks and rewards of correct and incorrect decisions. If students are taught to evaluate the rewards of a correct decision and the consequences of an incorrect decision that depended on a statistical analysis, they are far more likely to be able to decide whether to express results in terms of confidence (at whatever is an appropriate level) or in terms of risk (i.e. p values and other measures). Once students start to view problems in this fashion, they find it easier to genuinely interpret results of NHST.

Hi Nigel,

It does not take long before students fully understand the reward structure in scientific research: they get rewarded for sharp and clear-cut conclusions based on statistically significant results, and punished otherwise (see e.g., Giner-Sorolla (2012) http://pps.sagepub.com/content/7/6/562.full). This leads to a range of problems including p-hacking and publication bias, all known to be detrimental to science.

In the industry, risks and incentives may be very different, and likely complex and context-dependent. But in academia no one needs to be taught reward structures. The problem is that these structures are seriously broken, and I believe this is largely due to the practice of NHST and stats courses that tend to legitimize it.

Pierre

Dear Nigel,

Thanks for your comments.

You seem to agree with my first argument about the consistent ineffectiveness of teaching NHST. If we would forbid bad or ineffective (or cookbook) teaching of NHST, my aim of abolishing the teaching of NHST altogether would be achieved almost immediately 😉

Perhaps we should be more precise about the type of issues that require a “decision”. As Geoff Cumming has emphasized regarding psychology and medicine, most disciplines need quantification of effect sizes rather than a “decision” whether the effect does or does not exist. Also, knowledge about an effect requires a synthesis of results from multiple studies and it is just a waste of valuable information (and a source of wrong conclusions) if p-values or “significance” are synthesized rather than effect size estimates. Meta-analysis is an analysis of effect sizes (point estimates and their standard errors), not of p-values.
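A minimal sketch of what “an analysis of effect sizes” means in practice is inverse-variance pooling. The fixed-effect model used here is an assumption made for brevity (a random-effects model is a common alternative), and the three estimates and standard errors are hypothetical.

```python
from statistics import NormalDist


def fixed_effect_meta(estimates, standard_errors, level=0.95):
    """Pool independent estimates, weighting each by 1 / SE^2 (fixed-effect model)."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, pooled_se, (pooled - z * pooled_se, pooled + z * pooled_se)


# Hypothetical studies, chosen only for illustration.
estimates = [3.6, 2.2, 2.9]
standard_errors = [1.5, 1.6, 1.4]

pooled, pooled_se, ci = fixed_effect_meta(estimates, standard_errors)
print(f"pooled estimate {pooled:.2f}, SE {pooled_se:.2f}, "
      f"95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The inputs are point estimates and standard errors only. The pooled standard error is smaller than that of any single study, which is information that counting “significant” and “non-significant” studies would throw away.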

Three cheers, Tony!

I’ve written about estimation and meta-analysis as replacing, not just supplementing, NHST in a tutorial article in Psychological Science. It’s a free download from http://tiny.cc/tnswhyhow

Part of the argument rests on the devastating (imho) case that John Ioannidis makes against over-reliance on the p value, especially .05, in his famous article “Why most published research findings are false”, a free download from http://tiny.cc/mostfalse

I’d love to see further empirical investigation of the cognition of p values and estimation, and discussion of good ways to interpret estimation results without reference to p values or NHST.

An illustration of the extreme unreliability of p values, in the sense that a replication is very likely to give a p value that’s very different, is the dance of the p values, at http://tiny.cc/dancepvals

May all your confidence intervals be short!

Geoff Cumming

“It is now well understood that p-values are not informative” – that’s nonsense. p-values formalise how compatible an outcome is with the random variation under a null model. That’s of interest in many situations.

I think the problem is that people want simple yes/no-answers quickly and therefore they ignore the subtleties in NHST, overuse it, ignore other relevant information such as effect sizes, and overinterpret the results.

Many people don’t understand NHST properly, but neither do I think that people understand confidence intervals, meta-analysis or Bayesian statistics correctly. Statistics is difficult and science needs much effort to come to reliable conclusions. Many people don’t like difficulty and effort, so they will misuse whatever is put up as alternative to NHST as well.

Christian,

The claim “p-values/NHST just needs to be taught correctly” is not a fact. It’s a guess, and is supported by almost no evidence other than wishful thinking on the part of Frequentists.

What is fact, is that Frequentists had a monopoly on teaching statistics for nearly a century. There was absolutely nothing stopping them from teaching it right. They had time, intellectual capital, monetary resources, and all the professorships and acceptance they needed to teach it right. Why couldn’t they do it? How much longer will they need? Another century?

The fact is, there have been lots of subjects far harder than significance testing. Quantum Field Theory, for example, is infinitely harder, but physics departments manage to teach it right nevertheless.

Imagine if, 100 years after Newton, it turned out no predictions from Newton’s Equations were accurate, but Physicists kept excusing the failures by saying “no one’s teaching it right, we just need to teach it better”.

Everyone would rightly laugh at the physicists and draw the sensible conclusion that there’s something wrong with the equations. The sensible conclusion here is that there’s something wrong with Frequentist foundations of statistics and it’s causing statisticians to believe a whole bunch of things which simply aren’t true.

What’s also a fact is that Laplace was using Bayesian methods to do a kind of significance testing in Astronomy and produced a mass of very good reproducible science with it over 200 years ago.

The fact is p-values/NHST was a step backwards from Laplace.

StatReform: If you read my posting again, you’ll find that I didn’t claim that the problem is that p-values are taught in the wrong way. It has much more to do with the reward structure in science, see Pierre’s posting above. This reward structure does not only affect what people do with p-values but also what people do with pretty much all other approaches.

Apologies. I took “Many people don’t understand NHST properly” to imply you were in the “we just need to teach it better” camp.

Alas, even with perfect teaching some people will misunderstand statistical concepts (any of them) and some will then even think they understand them and will go on teaching them to others in a bad way. That’s by no means exclusive to p-values.

If the world was dominated by Bayesians, you could observe the same phenomenon there (I already do).

“Everyone would rightly laugh at the physicists and draw the sensible conclusion that there’s something wrong with the equations. The sensible conclusion here is that there’s something wrong with Frequentist foundations of statistics”

Very wrong, and it is quite astonishing and depressing that it persists. OTOH, physicists are not incapable of making serious [probability theory related] conceptual / foundational errors. For example, not every QT textbook is as “psiontology” and error-free as Leslie Ballentine’s. 😉

I’m quite shocked and disappointed that what I had thought was a sensible blog would go the way of the extremists. NHST, if understood to permit inferences to substantive claims, is an illicit animal, but what about hypotheses tests taught properly? Is power going out the window too? Further, as I show, confidence intervals cannot avoid fallacies of tests, and in fact are used as in and out dichotomous affairs. With a one sided confidence interval, say corresponding to a one sided normal testing T+: mu ≤ 0 vs. mu > 0, only the LOWER limit is given. How can you justify setting upper bounds? An additional principle is needed. My own account of severe testing can supply it, but confidence intervals alone cannot. I spoze you derogate the use of p-values in the discovery of the Higgs Boson recently as well. Further, there is no power without a corresponding N-P test. You’re now reminding me of the members of the “task force on statistical inference” in my Saturday night spoof. Read it carefully.

http://errorstatistics.com/2015/01/31/2015-saturday-night-brainstorming-and-task-forces-1st-draft/

I will no longer recommend this blog to students as I had. Extremely disappointed in you, Dr. Nic.

” but what about hypotheses tests taught properly”

In practice all calls for teaching it “properly” amount to simply warning students that none of these methods consistently work out the way they “should” theoretically and then getting them to use qualitative, researcher-dependent, post analysis fudging or “interpretation” of the results to nudge them back to something reasonable. In fact, you seem to take this as one of the fundamental benefits of your “objective” approach to statistics.

It’s fine for Philosophers to label suspect p-values as “nominal” and to suggest they are not “real”, but out in the real world we’d need an objective criterion or test to determine which p-values are “nominal”, one which doesn’t require an ability to read other researchers’ minds. There is no such criterion, which is why all such reforms always reduce in practice to the advice “just use your qualitative judgment to fix the answers”.

Dr. Nic, thank you for nailing it. NHST has been such a disaster that the “students won’t be able to read old papers if we don’t teach it” excuse for foisting it on another generation of statistics students doesn’t hold water.

P.s. Don’t worry about Mayo’s comment. Her advocacy for NHST boils down to nothing more than she has a strong philosophical bias toward it. She’s a philosopher who’s never done any applied statistics, never learned enough math to competently investigate statistical ideas, never programmed a data analysis on a computer, and never done any real world scientific inference. Those of us who’ve done all of those in abundance appreciate the article very much.

It is the way that hypothesis testing is taught that needs to be changed, and also the way that p-values are in practice interpreted. There certainly is no point exchanging a paradigm that blindly says “reject the null hypothesis if the p-value is less than 0.05, otherwise don’t” with one that blindly says “reject the null hypothesis if the observed (say) mean is outside a 95% confidence interval, otherwise don’t reject it”. How many people really know what a 95% confidence interval is? To fully understand the concept of a confidence interval is at least as difficult a task as fully understanding a p-value; after all, one way that a p-value could be interpreted is that 100(1-p)% is the minimum confidence level of the interval required to “capture” the observed mean (or whatever statistic we are talking about). Then there are non-minimum coverage confidence intervals, confidence intervals for the mean that do not contain the observed mean... don’t jump out of the frying pan into the fire!
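The link between a p-value and a confidence level mentioned in this comment can be checked numerically. The sketch below assumes a normal approximation and a two-sided test against zero, with a hypothetical estimate of 2.4 and standard error of 1.0: the interval drawn at confidence level 1 - p has one limit sitting exactly on the null value.

```python
from statistics import NormalDist


def p_value(estimate, se, null_value=0.0):
    """Two-sided p-value under a normal approximation."""
    z = abs(estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(z))


def interval(estimate, se, level):
    """Symmetric confidence interval at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (estimate - z * se, estimate + z * se)


estimate, se = 2.4, 1.0          # hypothetical values
p = p_value(estimate, se)
lower, upper = interval(estimate, se, level=1 - p)
print(f"p = {p:.4f}; the {100 * (1 - p):.2f}% interval is [{lower:.4f}, {upper:.4f}]")
```

Any interval at a lower confidence level excludes the null value and any interval at a higher level includes it, which is the sense in which the two quantities carry the same information.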

Maybe it needs to be taught (in a reformed manner, of course) just so that the next generation is fully equipped to get into arguments with the p-value fetishists. As a practicing scientist, I can assure you that those fetishists sometimes include reviewers of my manuscripts, who did not appreciate my attempt to leave p-values out on occasion. I do not think completely removing p-values from the curriculum would be useful in accelerating the cause of downgrading them in the branches of science with which I am most familiar.

And I agree that fully understanding confidence intervals is no easier than fully understanding p-values, or may indeed be harder.

I will just add to my previous post that I would agree with Mayo about the use of p-values in relation to estimating the power of a test, and also the “type one error rate” (although she does not mention this). The more I think about this, the more confidence intervals and p-values look like the proverbial “two sides of the one coin”, and if you throw away one side of the coin you have to throw away the other too. I do concede that less calumny is likely to occur with confidence intervals though.

Dear all,

In the last decade, a “new statistics” has emerged, particularly in psychology and medicine. An increasing number of journals are about to require that empirical results are reported as estimates, i.e. as effect sizes with confidence intervals, or have already done so. It was not the aim of my post to provide arguments for this statistical reform, which is inevitable, and hence no detailed argumentation in defense of this reform was provided. The emergence of the new statistics was taken as a given, not as something that was in need of further support. My post was aimed at raising an issue that will be practically relevant sooner rather than later, when our teaching practices need to be made useful for students who will make their careers in the context of this new set of statistical practices.

The question I wanted to raise can be specified in a very concrete way as follows: should textbooks (and courses) for undergraduate statistics teaching from now on include a chapter on NHST or not? My guess was that the default option chosen by authors would be to include such a chapter, for a number of reasons. One reason could be that inclusion of such a chapter would make the book more attractive for the old guard of teachers, i.e. for those who determine the book’s adoption (and hence the profits generated by it). A second reason could be that students with the help of this chapter could make better sense of the older research literature. A third reason could be that such a chapter suggests itself, as it were, as a logical extension of the chapter on estimation (just as is routinely done in current textbooks). In my post I have formulated three arguments as a counterweight to this “default” inclusion of an NHST chapter in our future undergraduate teaching.

My three arguments are that (1) NHST teaching, despite our best efforts, tends to be ineffective (and hence a waste of time at best, and a training in bad practices at worst); (2) NHST thinking is addictive and hence dangerous; and (3) knowledge of NHST is not needed for understanding good research reports. In the comments to my post so far, I see some endorsement of the first argument (quite a lot of mentions of bad teaching), resistance to my third argument, and quite some prevalence of thinking in terms of “testing”. Seeing the main aim of using inferential statistics as to support “decision making” and interpreting confidence intervals as just another way of assessing the level of statistical significance are, in my view, evidence of the fact that it is very difficult to break the habit of thinking in terms of “testing” when this way of thinking has become routine. As stated in my post, this stickiness of dichotomous thinking seems to me a good reason for avoiding that future researchers come into contact with it in the first place and, if contact cannot be avoided, for providing them with robust resistance mechanisms. The implication for statistics teaching is that students should, first, learn estimation as the preferred way of presenting and analyzing research information and that they get introduced to NHST, if at all, only after estimation has become their routine statistical practice, i.e. not before having advanced into their graduate studies.

I hope this was a useful clarification. I am grateful for your responses.

It is possible to teach NHST correctly. For example, by having students build intuitions on statistical error that are deep enough to reveal the absurdity of dichotomous judgments, then by introducing NHST as a curious and embarrassing episode in the history of science.

Are Statisticians a kind of Mathematicians?

Hard to believe. We statisticians use maths, it is true, but in a peculiar way. I prefer to think that “Engineers of Randomness” is a much better name than “Applied Mathematicians”. In reality our environment is not the deductive world at all but the inductive one.

A classic example,

J Cohen [1] following Pollard & Richardson, 1987

____a)____If a person is an American then he is probably not member of the Congress

____b)____This person is a member of the Congress

____c)____Therefore he is probably not an American.

What a mess. With an incongruent first premise, no credible conclusion can be stated anyway. In fact, for a person to be a member of Congress strict conditions must be fulfilled, the first of which is precisely to be an American citizen. I do not believe, nobody does, in Cohen’s good faith, quite unlike.

The Golden Rule for Significance Tests (NHST)

____a)____If A occurs, the occurrence of B is very unlikely

____b)____B is observed

____c)____Therefore A is highly unlikely.

This is very different from a plain syllogism; I prefer to call it a common-sense consequence. The occurrence of B does not exclude A, it simply makes it highly problematic. The Type I error rate is always set at a very low value: notwithstanding that A is true, B falls in the rejection region. Bad luck . . .

[1]-J. Cohen, The Earth is Round (p<.05)

While I agree that statistics cannot give firm answers about a “decision”, and that p-values contribute to the illusion that they can, in areas like medicine I have a hard time imagining a world where decision-making is going to go away, no matter what statisticians say. At some point, the FDA either approves something for use or it doesn’t. With or without meta-analysis, confidence intervals etc., statisticians will be asked, if not for a dichotomous outcome, at least for a trichotomous one (approve, reject to the point that no further trials are done, reject as an interim measure requiring more data). Decision rules are inevitable in the way society works.

Personally, I find it quite useful in my teaching to use p-values to illustrate the practical consequences of those decision rules via Ioannidis’ famous illustration. This could of course be done with any decision rule, but p-values are convenient for the purpose, and it reinforces a better understanding of them.
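For readers who have not seen it, the core of Ioannidis’ illustration is a short formula: in the absence of bias, the probability that a “significant” finding is true is PPV = (1 - β)R / (R - βR + α), where R is the pre-study odds that the tested relationship is real, 1 - β is the power and α is the significance level. The sketch below evaluates it for two hypothetical scenarios; the specific odds and power values are assumptions chosen for illustration.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Probability that a significant result reflects a true effect (no bias).

    prior_odds: pre-study odds R of true to null relationships in the field
    power:      1 - beta, the chance of detecting a true effect
    alpha:      significance threshold, the chance of flagging a null effect
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios, for illustration only.
well_powered_confirmatory = positive_predictive_value(prior_odds=1.0, power=0.8)
underpowered_exploratory = positive_predictive_value(prior_odds=0.1, power=0.2)

print(f"well-powered, even prior odds:   PPV = {well_powered_confirmatory:.2f}")
print(f"underpowered, long-shot effects: PPV = {underpowered_exploratory:.2f}")
```

The same threshold of .05 gives very different guarantees in the two scenarios, which is the practical consequence of the decision rule that this kind of classroom illustration is meant to bring out.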

Yes, decision rules are inevitable. But p-values don’t actually let you make the 3-way decision you suggest (approve because we’re sure it works; reject because we’re sure it doesn’t work; or wait for further study). A confidence interval would let you make that 3-way decision (the interval is far from 0; the interval is tight around or near 0; the interval is wide).

But a significant p-value doesn’t distinguish “CI far from 0, so we approve it” vs “tight CI just barely missing 0, so we may as well reject this drug with negligible effect” vs “loose CI just barely missing 0, so wait for more data”.

And a non-significant p-value doesn’t distinguish “tight CI including 0, so we know it’s a failure” vs “wide CI including 0, so it requires further study”.

There *are* times when it’s sensible to use a hypothesis test, e.g. to summarize something multivariate when it’d be too hard to form or interpret a multivariate CI. But for univariate effects, the CI is far more informative than the p-value.
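The three-way rule described in this comment can be written down as a small sketch. The threshold `negligible` (the smallest benefit worth approving) and the example intervals are hypothetical; a real regulatory rule would be far more involved.

```python
def three_way_decision(lower, upper, negligible):
    """Classify a confidence interval (lower, upper) for a treatment benefit.

    `negligible` is a hypothetical threshold: benefits below it are treated
    as not worth having.
    """
    if lower > negligible:
        return "approve"            # whole interval shows a worthwhile benefit
    if upper < negligible:
        return "reject"             # whole interval rules out a worthwhile benefit
    return "wait for more data"     # interval too wide to tell


def is_significant(lower, upper):
    """A 95% interval that excludes 0 corresponds to p < .05."""
    return lower > 0 or upper < 0


intervals = {
    "far from 0": (2.0, 3.0),
    "tight, barely missing 0": (0.01, 0.10),
    "loose, barely missing 0": (0.01, 5.0),
    "tight, including 0": (-0.1, 0.1),
    "wide, including 0": (-3.0, 3.0),
}
decisions = {
    name: (is_significant(lo, hi), three_way_decision(lo, hi, negligible=0.5))
    for name, (lo, hi) in intervals.items()
}
for name, (significant, decision) in decisions.items():
    print(f"{name:26s} significant={significant!s:5s} -> {decision}")
```

The first three intervals are all “significant” yet lead to three different decisions, and the last two are both “non-significant” yet lead to two different decisions, which is the distinction a bare p-value cannot make.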

I agree with what you are saying, I was not defending p-values but hypothesis testing. But to be fair to p-values, that framework does suggest looking simultaneously at p-value and at power, which when coupled is a different approach to the same thing. And while confidence intervals are more intuitive after the fact, one could argue that power calculations are more intuitive prior to an experiment.
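A prior-to-experiment calculation of the kind mentioned here might look as follows. This is a sketch for a two-sided one-sample z-test with known standard deviation; the target effect `delta`, the value of `sigma` and the conventional α = .05 and power = .80 are assumptions. The second function reports the precision (expected half-width of the 95% interval) that the same sample size buys, which is the estimation view of the same planning problem.

```python
import math
from statistics import NormalDist


def sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n for a two-sided z-test to detect a mean shift of delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / delta) ** 2)


def expected_ci_half_width(n, sigma, level=0.95):
    """Half-width of the confidence interval for the mean with n observations."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sigma / math.sqrt(n)


n = sample_size(delta=0.5, sigma=1.0)
half_width = expected_ci_half_width(n, sigma=1.0)
print(f"n = {n}, expected 95% CI half-width = {half_width:.3f}")
```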

Read http://pkpinc.com/files/Statistics_and_Reality.pdf for a new insight