Statistics: out with the 'p' and in with the 'T'
If statistics are the foundation upon which scientific research is based, then the widely used p value is one of its pillars, having been in use for more than 100 years.
The p value is prominent in medical and psychological research where the effect of a treatment is being measured. A p value of 0.05 is traditionally taken as the cut-off point of statistical significance. Generations of students have been taught this standard test of significance for an observed effect.
Recently, however, there has been a backlash against the indiscriminate use of the p value. Statisticians say that it is often misunderstood: a common misconception is that it gives the probability that the null hypothesis is true. Another is that it is a measure of evidence for the alternative. Scientific journals are questioning its over-use. Some have even banned it from their pages.
'I've been teaching the p value all my life,' says La Trobe statistician Associate Professor Bob Staudte. 'Now we've had a falling out. The p value is really a measure of surprise about the null hypothesis,' Dr Staudte says, 'but what researchers want is a test that tells them how much evidence they have for a treatment effect.' Dr Staudte, who admits to being something of a crusader, and two colleagues have developed a new method of describing evidence. It overcomes this problem and addresses one of his major complaints about the misinterpretation of p values.
A p value of 0.05 is called 'significant', he says, while a value of 0.01 is regarded as 'highly significant', indicating that the observed results (or even more extreme results) would occur by chance only one time in 100 under the null hypothesis of no effect.
'Typically, if you ask someone to compare the two p values, he or she will say that the 0.01 level has five times more evidence against the null than the 0.05 level. This is not true. There is in fact only a forty per cent difference between them.'
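The forty per cent figure can be reproduced under one assumption that the article does not spell out: that each p value is one-sided and is converted to the corresponding point on the standard normal scale. The sketch below illustrates that arithmetic only; the helper name p_to_z is hypothetical, and the book's exact calibration may differ.

```python
from statistics import NormalDist


def p_to_z(p: float) -> float:
    """Map a one-sided p value to the standard normal quantile with that upper-tail area.

    Assumption: the p value comes from a one-sided test.
    """
    return NormalDist().inv_cdf(1.0 - p)


z_05 = p_to_z(0.05)   # 'significant'
z_01 = p_to_z(0.01)   # 'highly significant'
ratio = z_01 / z_05

print(f"normal score for p = 0.05: {z_05:.3f}")   # 1.645
print(f"normal score for p = 0.01: {z_01:.3f}")   # 2.326
print(f"ratio: {ratio:.2f}")                      # 1.41
```

On this scale the 0.01 result sits about 1.41 times as far from the null as the 0.05 result, which is roughly forty per cent more, not five times more.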
The new method developed by Dr Staudte, Dr Elena Kulinskaya of Imperial College, London and Professor Stephen Morgenthaler of the Swiss Federal Institute of Technology, Lausanne, transforms a test statistic onto a calibration scale, where it is called evidence T.
This T value has a standard error of one in estimating the expected evidence for the alternative hypothesis. It is also normally distributed, which makes it easier to interpret and visualise, and it is a direct measure of evidence for the effect. Further, it is easy to combine evidence from several independent studies, such as in a meta-analysis.
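One way to see why unit standard error makes combination easy: if each study's evidence is independent and roughly normal with standard error one, a weighted sum divided by the square root of the sum of squared weights again has standard error one. The sketch below shows that general construction; the function name combine_evidence, the study values and the sample sizes are all hypothetical, and this is not claimed to be the exact rule given in the book.

```python
from math import sqrt


def combine_evidence(t_values, weights=None):
    """Combine independent evidence values that each have standard error one.

    The weighted sum is rescaled by sqrt(sum of squared weights), so the
    result also has standard error one. Equal weights are used by default.
    """
    if not t_values:
        raise ValueError("need at least one evidence value")
    if weights is None:
        weights = [1.0] * len(t_values)
    if len(weights) != len(t_values):
        raise ValueError("weights and t_values must have the same length")
    numerator = sum(w * t for w, t in zip(weights, t_values))
    denominator = sqrt(sum(w * w for w in weights))
    return numerator / denominator


# Hypothetical evidence values from three independent studies.
studies = [1.2, 0.9, 1.5]
combined_equal = combine_evidence(studies)

# Hypothetical sample sizes; weighting by sqrt(n) favours larger studies.
sizes = [25, 100, 400]
combined_weighted = combine_evidence(studies, [sqrt(n) for n in sizes])

print(f"equal weights:   {combined_equal:.3f}")
print(f"sqrt(n) weights: {combined_weighted:.3f}")
```

None of the three hypothetical studies is strong on its own, but because the standard error of the combined value stays at one, their pooled evidence can be read on the same scale as a single study's.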
'It's a simple transformation based on variance stabilisation,' Dr Staudte says. 'We were going to write a paper but it turned into a book.'
The book, Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence, was published by John Wiley & Sons in 2008. 'We will have to wait for the critical response to see if it catches on,' Dr Staudte says.
Drs Staudte and Kulinskaya presented a short course based on the book at the Statistical Society Conference ASC-2008 in Melbourne in early July.