KNOW YOUR VITAL STATISTICS 

Year : 2019  Volume : 67  Issue : 6  Page : 1513-1514
Statistical Significance versus Clinical Importance
Kameshwar Prasad Department of Neurology, All India Institute of Medical Sciences, New Delhi, India
Correspondence Address:
Dr. Kameshwar Prasad Department of Neurology, All India Institute of Medical Sciences, New Delhi India
How to cite this article:
Prasad K. Statistical Significance versus Clinical Importance. Neurol India 2019;67:1513-1514

How to cite this URL:
Prasad K. Statistical Significance versus Clinical Importance. Neurol India [serial online] 2019 [cited 2020 May 31];67:1513-1514
Available from: http://www.neurologyindia.com/text.asp?2019/67/6/1513/273608 
Conventionally, P < 0.05 has been termed “statistically significant,” but what is statistically significant is not always clinically important, and declaring a result statistically nonsignificant may amount to an unwarranted claim of “no difference.” Similarly, if a study is conducted to check the replicability of previously published results, reliance on the P-value may lead to unwarranted claims of refutation or unwarranted denial of refutation. These are some of the reasons why several experts have risen against statistical significance.[1],[2] However, others have supported retaining the P-value but at a more stringent level of significance, say P < 0.005.[3],[4]
Clinicians need to check whether statistically significant results are also clinically important. Clinical importance is a matter of judgment based on the magnitude of effects and the balance between benefits on one side and risks, costs, and inconvenience on the other. Assessing the magnitude of effects requires examination of the difference that the intervention makes (mean difference, risk difference, the number needed to treat, or any other measure of effect size such as risk ratio, odds ratio, or hazard ratio; these will be discussed in the subsequent notes in this series). The question is whether this difference in beneficial outcomes is worth the risk and costs of the intervention and whether it is meaningful for the patients. What is the range of magnitudes of difference consistent with the data (depicted as the confidence interval), and is the lower limit of the interval for the beneficial outcome beyond the minimum clinically important difference? Answering this requires clinical judgment, and clinicians should always exercise this judgment to ensure that the results are not only statistically significant but also clinically important for patients.
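The comparison of a confidence interval with the minimum clinically important difference can be illustrated with a small calculation. The following is only a sketch under stated assumptions: the trial counts (50/200 events in the control group, 30/200 in the treated group) and the minimum clinically important difference of 5 percentage points are hypothetical, the function name `risk_difference_summary` is invented for illustration, and the interval uses the simple normal (Wald) approximation rather than any particular method endorsed in this series.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_summary(events_control, n_control,
                            events_treated, n_treated,
                            mcid, level=0.95):
    """Summarise a two-group trial with a binary (bad) outcome.

    mcid is the minimum clinically important risk difference; it is a
    clinical judgment supplied by the reader, not something computed here.
    """
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated

    # Absolute risk reduction: how much the intervention lowers the risk.
    rd = risk_control - risk_treated

    # Wald standard error of a difference between two proportions.
    se = sqrt(risk_control * (1 - risk_control) / n_control
              + risk_treated * (1 - risk_treated) / n_treated)

    # Normal quantile for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower, upper = rd - z * se, rd + z * se

    return {
        "risk_difference": rd,
        # Number needed to treat is only defined here for a benefit (rd > 0).
        "nnt": 1 / rd if rd > 0 else None,
        "ci": (lower, upper),
        # Interval excludes zero: a signal is detectable beyond the noise.
        "statistically_detectable": lower > 0 or upper < 0,
        # Even the smallest plausible benefit reaches the MCID.
        "clearly_important": lower >= mcid,
    }


# Hypothetical data: 50/200 events on control, 30/200 on treatment, MCID = 0.05.
summary = risk_difference_summary(50, 200, 30, 200, mcid=0.05)
print(summary)
```

With these assumed numbers the interval excludes zero, so the difference is statistically detectable, yet its lower limit falls below the assumed minimum clinically important difference. The data are therefore compatible with a benefit too small to matter, which is exactly the situation in which clinical judgment is needed.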
It has been seen that once statistical “significance” is declared, clinicians assume that the finding is clinically significant as well. In fact, statistical significance only means (e.g., in a two-group randomized controlled trial [RCT] or any other comparison between two groups) that the observed difference between the two groups is unlikely to be because of chance alone (though this possibility is not completely ruled out). The confusion arises because of the use of the word “significance.” I propose the use of the term “statistically detectable,” which indicates that the difference observed (or any other finding) is a signal detectable beyond the noise. Statistical tests are nothing but “signal detection tests.” Once the signal is detected, clinicians have to judge the value of the signal. If the signal is not detectable, then the possible reasons for this are worthy of reflection. Is there really no worthwhile signal? Is the sample size smaller than required (inadequate power), as revealed in the width of the confidence interval (to be discussed in detail in the subsequent notes in the series)? Or are there flaws in the design and/or execution of the study that introduce so much noise that the signal is missed?
Noise in this discussion usually means “variation” among patients in the study. The variation could have many causes: the patients in the study may be prognostically heterogeneous; the eligibility criteria may be too wide or too vague to ensure entry of a homogeneous group of patients (please note, very narrow criteria may maximise homogeneity, but the results will have restricted applicability or generalizability); the study intervention may be applied at variable times in a time-sensitive situation; the skills of the surgeons may vary in a multicentric study; the dose of the intervention administered may vary across patients; the assessment of endpoints may vary; there may be interobserver and intraobserver variation in the application of a measuring scale; and others. A well-designed and executed study attempts to limit the variation by controlling the causes of variation. Still, some variation remains in spite of the best possible efforts, because not all patients respond in a similar way or have the same outcome. At the end of the study, we ask, using statistics, whether there is a signal detectable beyond the noise (which is the variation in outcome among the patients).
For example, let us have a hypothesis that females have a different (higher) intelligence quotient (IQ) than males. How do we approach this hypothesis statistically in order to reject or accept it? We first assume there is no difference (the null hypothesis). Next, we take a representative sample of males and females. We measure their IQ. As you know, all males or all females will not have the same IQ. There will be some variation among them. This variation is like noise. If there was no variation, we would not need any statistics. We, in fact, could just take one male and one female, measure their IQ, and see whether there is any difference or not.
However, as there is some variation, we need to assess many males and many females, representative of the population of males and females, and find the mean IQ of males and the mean IQ of females in the sample. The difference in the means is the signal. Noise is some measure of the variability (like a standard error) of IQ within the male and female groups.
We take the ratio of signal to noise. Then, we determine the probability of seeing a ratio this big (or bigger) if there was no difference in IQ between males and females. This probability is the P-value. If it is small (conventionally <5%), then we reject the null hypothesis of no difference and accept the alternative, i.e., that the IQ of females is different from that of males. To know whether it is higher or lower, you must look at the means and decide, and examine whether the magnitude of the difference is meaningful or not (like the clinical importance discussed above). The signal-to-noise ratio bears different names depending on the type of data. For categorical data, it may be called Chi-square (pronounced like kai-square). For numerical data, it may be called “t” or “F.” Accordingly, the names Chi-square test, t-test, or F-test, etc., are used (details are beyond the scope of this series).
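The signal-to-noise idea in the IQ example can be sketched numerically. This is an illustration under assumptions, not the analysis of any real study: the group means (102 and 100), the standard deviation (15), and the sample sizes are hypothetical, the function name `signal_to_noise` is invented here, and the P-value uses a large-sample normal approximation in place of the t distribution that a formal t-test would use.

```python
from math import sqrt
from statistics import NormalDist


def signal_to_noise(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (signal, noise, ratio, two-sided P) for two group means."""
    # Signal: the observed difference between the two sample means.
    signal = mean_a - mean_b

    # Noise: standard error of that difference, built from the variation
    # within each group and shrinking as the samples grow.
    noise = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

    ratio = signal / noise

    # Probability of a ratio at least this large in either direction if
    # there were truly no difference (normal approximation, an assumption).
    p_value = 2 * (1 - NormalDist().cdf(abs(ratio)))
    return signal, noise, ratio, p_value


# The same hypothetical 2-point IQ difference, seen in a small and a large sample.
small = signal_to_noise(102, 15, 50, 100, 15, 50)
large = signal_to_noise(102, 15, 5000, 100, 15, 5000)

for label, (signal, noise, ratio, p) in (("n = 50 per group", small),
                                         ("n = 5000 per group", large)):
    print(f"{label}: signal={signal:.1f}, noise={noise:.2f}, "
          f"ratio={ratio:.2f}, P={p:.3g}")
```

The signal is identical in both runs; only the noise changes with the sample size. In the small sample the 2-point difference is not detectable, while in the large sample it is easily detectable, although whether a 2-point difference in IQ is meaningful remains a separate judgment.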
References
1  Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305-7.
2  Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond P < 0.05. Am Stat 2019;73:1-19.
3  Ioannidis JPA. The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. JAMA 2019;321:2067-8.
4  Ioannidis JPA. The proposal to lower P value thresholds to 0.005. JAMA 2018;319:1429-30.
