
Edited by Dolores Catelan, Annibale Biggeri, Fabio Barbone

E&P 2010, 34 (5-6) September-December, pp. 91-95


Biostatistics - Methods

Reporting Uncertainty

Annibale Biggeri^{1,2}, Dolores Catelan^{1,2}, Fabio Barbone^{3}

Affiliations

1. Biostatistics Unit, ISPO Cancer Prevention and Research Institute, Florence, Italy
2. Department of Statistics "G. Parenti", University of Florence, Florence, Italy
3. Chair of Hygiene and Epidemiology, DPMSC, University of Udine, Udine, Italy

Correspondence: Fabio Barbone; fabio.barbone@uniud.it

We inform readers that an Italian version of this article, aimed at a non-specialist audience, will be published in the next issue of E&P.

We are approaching the 25th anniversary of the publication of Martin Gardner’s and Douglas Altman’s paper on the use of confidence intervals in reporting study results in medical research.¹ Two years after its publication, the International Committee of Medical Journal Editors endorsed this policy.²,³ Notwithstanding this, confidence intervals are still misused as a surrogate test of hypothesis, and the rhetoric of uncertainty hides an uncritical faith in p-values. In this contribution, we will discuss three issues:

how to interpret a confidence interval;
how to report confidence intervals in a paper;
how to report a confidence interval in an abstract.

How to Interpret a Confidence Interval

A confidence interval is a range of values for a population parameter calculated from a given sample of observations. It is meaningful when we want to make an inference beyond our study, which is the case in almost every scientific investigation. In statistical textbooks, the confidence interval is described as an interval estimate of a population parameter. The width of the interval depends on the natural variability of the phenomenon under study, the sample size and the arbitrary level of confidence. Once this last quantity is fixed, the confidence interval shows the reliability of an estimate and, in a broad sense, the half-width of the confidence interval is called the margin of error. The results of a study, when reported with confidence intervals, are expressed in the same units and on the same scale as the phenomenon investigated. This was the main justification invoked for a shift away from reporting standardized effect measures or their statistical significance. The advantage of using a confidence interval is that the degree of uncertainty is translated into the width of the interval, so anyone can immediately appraise how informative a study result is and identify the weakness related to a small sample size or to lack of control of population variability in the factor being studied. The “interpretation of confidence intervals should focus on the implications (clinical importance) of the range of values in the interval”.⁴

There is an inherent arbitrariness in the specification of the level of confidence used to calculate the confidence interval. Under some assumptions, there is a correspondence between the test of hypothesis and the interval estimate. Therefore, if we claim that results are statistically significant at the two-sided 5% level, then the 95% confidence interval will exclude the null value. This is true provided that the assumptions are satisfied, but the unwanted consequence is that too often in biomedical research confidence intervals are judged only on the criterion of excluding the null value. Warnings against the uncritical use of any prefixed level of confidence have been expressed.⁵ In order to discourage improper interpretation of confidence intervals, Sterne and Davey Smith⁴ suggested reporting intervals at the 90% confidence level.

The frequentist derivation of the confidence interval assumes, a priori, infinite repetitions of the study with the same fixed sample size. For each replicate we calculate a confidence interval, and by random sampling we select just one of them. Under a Gaussian probability model, we can build confidence intervals with a width that provides a given probability of selecting an interval that includes the population parameter. Under such a paradigm, once the study has been done and one confidence interval has been estimated, any value in it has the same probability of being equal to the population parameter. This reflects our ignorance and the exchangeability of the replicates under the random sampling paradigm. The Gaussian probability model also gives an interpretation to the arbitrary cutoff used for the confidence level: selecting a 95% confidence level is equivalent to saying that one in twenty sampled confidence intervals will exclude the population parameter.
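The repeated-sampling reading of the confidence level can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: Gaussian data with a known standard deviation, and hypothetical names and defaults (`coverage`, `n = 25`, 10,000 replicates) that do not come from the article.

```python
import random
from statistics import NormalDist


def coverage(true_mean=0.0, sd=1.0, n=25, level=0.95, n_rep=10_000, seed=42):
    """Fraction of simulated studies whose interval includes the true mean.

    Assumptions (ours, for illustration): Gaussian observations, known sd,
    interval = sample mean +/- z * sd / sqrt(n).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g. about 1.96 for 95%
    half_width = z * sd / n ** 0.5
    hits = 0
    for _ in range(n_rep):
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / n_rep


if __name__ == "__main__":
    # Should be close to 0.95: about one interval in twenty misses the parameter.
    print(coverage())
```

With a 95% level, roughly one replicate in twenty should produce an interval that excludes the true mean; the proportion fluctuates slightly with the seed and the number of replicates.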

Using likelihood theory, an interval estimate for a population parameter, or supported range, is the set of values of the parameter with likelihood ratios above a critical value. Given a probability model, e.g. a Bernoulli model for binary data or a Poisson model for disease event counts, we can calculate the data likelihood and derive an appropriate supported range from the profile likelihood ratio function of the parameter of interest.⁶

An example

Let us consider the point source study on the high frequency radio transmitter in Rome and the incidence of childhood leukemia (table 4, modified):⁷

Table 1. Childhood leukemia incident cases and expected counts by distance from putative source (see text).

The probability model is Poisson: the disease counts Y are assumed to be distributed as a Poisson random variable with mean parameter equal to θ × E, E being the expected count obtained by indirect standardization:

$$\Pr(Y \mid \theta, E) = C\,\theta^{Y} \exp(-\theta E)$$

where C is a constant ($C = E^{Y}/Y!$). The log likelihood function reduces to:

$$l(\theta) = Y \log(\theta) - \theta E$$

For the first row in the table above, the standardized incidence ratio is 1/0.16 = 6.25, and it is the maximum likelihood estimate of the relative risk θ. In fact, it is the value which corresponds to the maximum of the log likelihood function:

$$1 \times \log(6.25) - 0.16 \times 6.25 = 0.83258146$$

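A supported range is calculated by finding the values of θ at which the log likelihood falls a fixed amount below its maximum:

$$l(\theta) - l(\hat{\theta}) = Y \log(\theta) - \theta E - l(\hat{\theta}) = -1.353$$

where $\hat{\theta}$ is the maximum likelihood estimate and the cutoff value of −1.353 is arbitrary. We found that the equation is satisfied for θ = 0.6605 and θ = 22.7930.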

The log likelihood ratio (i.e. the log likelihood minus the maximum of the function) is indeed:

$$1 \times \log(22.7930) - 0.16 \times 22.7930 - 0.83258146 = -1.353$$
$$1 \times \log(0.6605) - 0.16 \times 0.6605 - 0.83258146 = -1.353$$

The same calculations for the second row in the table will give θ = 1.14185; 3.69435.

While for the first row the empirical evidence was inconclusive, because the supported range was very wide and included implausible values (from 0.66 to 22.79), the data for the 0-6 km band supported relative risks in the range 1.14 to 3.69, a range of values consistent with the epidemiological literature on environmental exposures. However, the causal interpretation of such findings is not a statistical issue.
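The supported ranges in this example can be reproduced numerically. The following is a minimal sketch, assuming the Poisson log likelihood given above and a plain bisection search; the function names (`log_lik`, `supported_range`) are ours and the root-finding method is only one of several that would work.

```python
import math


def log_lik(theta, y, e):
    """Poisson log likelihood up to a constant: Y log(theta) - theta * E."""
    return y * math.log(theta) - theta * e


def supported_range(y, e, cutoff=1.353):
    """Return (lower, mle, upper) where the log likelihood ratio equals -cutoff.

    Requires y > 0 so that the maximum likelihood estimate y / e is positive.
    """
    mle = y / e
    l_max = log_lik(mle, y, e)

    def g(theta):
        # Positive inside the supported range, negative outside it.
        return log_lik(theta, y, e) - l_max + cutoff

    def bisect(lo, hi):
        # g(lo) and g(hi) must have opposite signs.
        for _ in range(200):
            mid = (lo + hi) / 2
            if (g(lo) > 0) == (g(mid) > 0):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = bisect(1e-12, mle)
    hi = 2 * mle
    while g(hi) > 0:  # expand until we are outside the supported range
        hi *= 2
    upper = bisect(mle, hi)
    return lower, mle, upper


if __name__ == "__main__":
    print(supported_range(1, 0.16))  # first row: 1 case, 0.16 expected
    print(supported_range(8, 3.68))  # second row: 8 cases, 3.68 expected
```

Changing `cutoff` gives supported ranges at other levels, which makes the arbitrariness of the choice easy to explore.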

In this example we used the cutoff of -1.353 for the log likelihood ratio. Under a Gaussian approximation to the likelihood and by applying the frequentist approach, it would correspond to a 90% confidence level. The relationship is -2 log likelihood ratio = z² and we obtain for example:

(−2) × (−1.353) = 1.645² for a 90% confidence level
(−2) × (−1.921) = 1.96² for a 95% confidence level.
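The correspondence between a confidence level and the log likelihood ratio cutoff follows directly from the relationship above. As a small sketch (the helper name `loglik_cutoff` is hypothetical), the cutoff is z²/2 for the two-sided standard normal centile z:

```python
from statistics import NormalDist


def loglik_cutoff(confidence_level):
    """Drop in log likelihood (z^2 / 2) matching a two-sided Gaussian level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * z / 2


if __name__ == "__main__":
    for level in (0.50, 0.90, 0.95):
        print(level, round(loglik_cutoff(level), 3))
```

This holds only under the Gaussian approximation to the likelihood mentioned in the text.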

The usual approximate formula for the confidence interval is:

$$ss \pm z_{1-\alpha/2} \times \mathrm{se}(ss)$$

where ss is the generic sample statistic, for example the sample average, $z_{1-\alpha/2}$ is an appropriate centile of a theoretical sampling distribution, for example the normal or Student’s t, and se is the standard error of ss. In the case of the standardized ratio example above, the Gaussian approximation would be ss = Y/E and se(ss) = √(Y)/E and we obtain

1/0.16 ± 1.645 × √(1)/0.16 = −4.03; 16.53 and
8/3.68 ± 1.645 × √(8)/3.68 = 0.91; 3.44.
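The approximate interval is simple to compute, which makes its behaviour with small counts easy to inspect. A minimal sketch, assuming ss = Y/E and se(ss) = √(Y)/E as above (the function name `wald_interval` is ours):

```python
import math
from statistics import NormalDist


def wald_interval(y, e, level=0.90):
    """Gaussian-approximation interval for the ratio Y / E."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    estimate = y / e
    se = math.sqrt(y) / e
    return estimate - z * se, estimate + z * se


if __name__ == "__main__":
    print(wald_interval(1, 0.16))  # lower limit is negative for a ratio
    print(wald_interval(8, 3.68))
```

For 1 case versus 0.16 expected the lower limit is negative, which is impossible for a relative risk and signals that the symmetric approximation is unsuitable here.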

The Poisson likelihood for the small observed number of cases is strongly asymmetric and the approximation is not adequate. In the biostatistical literature, several approaches to confidence interval estimation are discussed.⁸ We aimed here only to reinforce the message that the empirical evidence supports a range of plausible values for the parameter (effect measure) of interest. The uncertainty in empirical research implies that we should scrutinize not one solution but a portfolio of alternatives.

Figure 1. Log likelihood ratio functions for the leukemia example (see text). Solid line: 1 case, 0.16 expected; dashed line: 8 cases, 3.68 expected; dot-dash lines: cutoff values at 0.23 and 1.92 corresponding to standard normal confidence levels of 50% and 95%.

Figure 2. Gamma(8,8) prior, posterior Gamma(8+8,8+3.68) and the posterior Gamma(1+8,1+3.68).

How to Report a Confidence Interval in a Paper

It is common practice in the epidemiological literature to report confidence intervals (CIs) after point estimates, for example as (90% CI: low; up). This may be confusing in two ways, as shown by Louis and Zeger.⁹ In tables it can be difficult to compare point estimates directly, because confidence intervals sit in adjacent columns. This seems to be a minor problem, but it underlines the fact that the researcher is pushed to consider point and interval estimates separately. Instead, both of them are summaries of the same information, which is driven by the data likelihood function. In Figure 1, we report the two log likelihood ratio functions for the data reported in Table 1. The curves show that the larger the sample size (8 cases vs 1 case), the more peaked the likelihood and the narrower the supported range. Moreover, they illustrate the second argument of Louis and Zeger:⁹ the likelihood is not the same for all the points (relative risks) in the supported range. The authors proposed to report the maximum likelihood estimate together with supported ranges corresponding to confidence levels of 50% and 95%, i.e. using the 25%-75% and 2.5%-97.5% centiles of the standard normal. Using again the data from Table 1, the way to report the point estimate and the related uncertainty should be, for the first relative risk (1 case event vs 0.16 expected), ${}_{{}_{0.36}\,2.92}\;6.25\;{}_{11.47\,{}_{27.66}}$ and, for the second (8 case events vs 3.68 expected), ${}_{{}_{0.99}\,1.70}\;2.18\;{}_{2.73\,{}_{4.06}}$.

A simpler solution would be to report only one supported range corresponding to a confidence level of 90% or 95%, e.g. for a 90% supported range we get ${}_{0.66}\;6.25\;{}_{22.79}$ and ${}_{1.14}\;2.18\;{}_{3.69}$ respectively.

The maximum likelihood estimate is written in full-size text and the limits of the supported range as left and right subscripts, nested if more than one supported range is reported. One great advantage of this solution, if it is sensible for the problem, is that the reader has an idea of the location of the whole likelihood function with respect to the null relative risk value. In the first case (1 case event vs 0.16 expected) the empirical evidence quantified by the likelihood ratio function is largely concentrated on relative risks above RR=1, information that is completely lost when reporting the two confidence limits alone; see also Rothman for a discussion of this problem.¹⁰

This approach suffers from both the arbitrariness in the definition of the confidence levels and the Gaussian assumptions needed for their interpretation according to the frequentist paradigm. A Bayesian approach will provide a credibility interval, which is simpler and easier to understand.¹¹ Combining the data likelihood with the prior distribution, Bayesian inference uses the posterior distribution for the parameter of interest (see box 1). This is a probability distribution and it is summarized by reporting a measure of central tendency (i.e. the mode, the mean or the median) together with selected centiles of the distribution. This is called the credibility interval and the associated level is the probability that the parameter of interest, given the data, has a value in that interval. Let’s consider the leukemia data of Table 1: 8 case events vs 3.68 expected and the supported ranges (50%-95%): ${}_{{}_{0.99}\,1.70}\;2.18\;{}_{2.73\,{}_{4.06}}$.

Using, for mathematical convenience, the Gamma prior, which is conjugate to the Poisson distribution, we can derive the posterior distribution in closed form: for example, if the prior is a Gamma(a,b), the posterior will still be a Gamma with parameters (a+Y, b+E), Y and E being the observed number of event cases and the expected counts, respectively. In Figure 2, we show a Gamma(8,8) prior (dashed black), the posterior Gamma(8+8,8+3.68) and the posterior Gamma(1+8,1+3.68) assuming a much more dispersed prior Gamma(1,1). The centiles (2.5% 25% mean 75% 97.5%) of these posterior distributions are ${}_{{}_{0.79}\,1.14}\;1.38\;{}_{1.59\,{}_{2.19}}$ and ${}_{{}_{0.87}\,1.46}\;1.95\;{}_{2.38\,{}_{3.42}}$.

These credibility intervals are directly interpretable: there is a 95% probability that the true unknown relative risk lies in the interval 0.87 to 3.42 under the more dispersed Gamma(1,1) prior, and there is a 90% probability that it lies in the interval 0.99 to 3.15. This example is also useful to underline the role of prior belief. If we choose the Gamma(8,8) prior, we are assuming that the range of plausible relative risks would be (2.5% 25% mean 75% 97.5%) ${}_{{}_{0.42}\,0.73}\;1.00\;{}_{1.18\,{}_{1.80}}$ (see Clayton and Hills,⁶ pages 117-119, for a discussion about the choice of prior distributions). In such a case, given the small number of events, the posterior is shifted more toward the null relative risk of one.¹²
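The conjugate update can be sketched in a few lines. The code below assumes the Gamma(a, b) prior is parameterized by shape a and rate b, so that the posterior is Gamma(a+Y, b+E) as stated above. To stay within the standard library it uses the closed-form CDF that exists only for integer shape, which covers the priors in this example; the function names are ours. Exact centiles may differ in the second decimal from those quoted in the text, which may have been obtained by a different computation.

```python
import math


def gamma_cdf_int_shape(x, shape, rate):
    """CDF of Gamma(shape, rate); valid only for a positive integer shape."""
    if x <= 0:
        return 0.0
    lam = rate * x
    term = math.exp(-lam)  # k = 0 term of the Poisson sum
    total = term
    for k in range(1, shape):
        term *= lam / k
        total += term
    return 1.0 - total


def gamma_quantile(p, shape, rate):
    """Quantile by bisection on the closed-form CDF."""
    lo, hi = 0.0, 1.0
    while gamma_cdf_int_shape(hi, shape, rate) < p:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if gamma_cdf_int_shape(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def posterior_summary(a, b, y, e, probs=(0.025, 0.25, 0.75, 0.975)):
    """Posterior Gamma(a + y, b + e): mean and selected centiles."""
    shape, rate = a + y, b + e
    return {
        "mean": shape / rate,
        "quantiles": [gamma_quantile(p, shape, rate) for p in probs],
    }


if __name__ == "__main__":
    print(posterior_summary(8, 8, 8, 3.68))  # informative Gamma(8,8) prior
    print(posterior_summary(1, 1, 8, 3.68))  # dispersed Gamma(1,1) prior
```

Comparing the two calls shows how the informative prior pulls the posterior toward a relative risk of one when the number of events is small.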

How to Report a Confidence Interval in an Abstract

Recently, there was a debate about the potential pitfalls of epidemiologic research. The credibility of scientific investigations was questioned daily by media reports of unconfirmed new risk factors.¹³ Sterne and Davey Smith issued warnings against subgroup analysis to limit discredit to epidemiological research.⁴ In fact, current epidemiological studies have large sample sizes and focus on small risk factors for population subgroups. These studies usually assess several research hypotheses, even thousands or millions in genome-wide analysis. The papers contain large tables of relative risks and confidence intervals (see Catelan and Biggeri¹⁴ for examples in descriptive epidemiology). Leaving aside the problem of testing multiple hypotheses, which is typical in genomics, here we address the arbitrariness in reporting only some results in an abstract. This is important for two reasons: abstracts are open access and may influence a large audience; and there is room for arbitrary selection of the research findings. The coverage of the confidence interval under selection has been shown to be invalid. When we select for an abstract some relative risks and their confidence intervals from a large set reported in the body of the text, those confidence intervals are too narrow and need to be adjusted for the selection process.¹⁵

Suppose we aim to study population susceptibility to air pollutants and provide a list of 20 relative risks (and confidence intervals) in the body of the manuscript. We then select the two most important relative risks (and confidence intervals) to appear in the abstract. Now, while the confidence intervals given in the body of the text are valid, since they are listed together with all the others, the same cannot be said of the two reported in the abstract. The uncertainty due to the selection process is not accounted for. The correction suggested by Benjamini and Yekutieli¹⁵ is simple:

$$ss \pm z_{1-\alpha'/2} \times \mathrm{se}(ss)$$

$$\alpha' = R \times \alpha / m$$

where 1 − α is the desired confidence level, R is the number of selected confidence intervals to be reported in the abstract and m is the total number of confidence intervals in the manuscript. Then, if we report two confidence intervals out of twenty, the centile of the sampling distribution should be chosen for α’ = 2 × α/20; i.e. for a 90% confidence interval (α = 0.10, α’ = 0.01) we must use $z_{1-\alpha'/2}$ = 2.576 instead of 1.645.

Let us explain in detail. Suppose we calculate 100 confidence intervals at confidence level 1 − α and choose to report in the abstract those intervals that exclude the null value. Then the conditional coverage probability Pr(θ ∈ CI | CI selected), which is the number of times a confidence interval includes the parameter divided by the number of times the confidence interval is selected, is no longer fixed at 1 − α. As shown by Benjamini and Yekutieli,¹⁵ it varies and depends on the value of the unknown parameter being estimated. Defining the False Coverage Rate as the number of times a confidence interval does not include the parameter divided by the number of times the confidence interval is selected, we can provide a way to properly control its expected value, having set the proportion to zero when no CI is selected. For example, suppose that the true value of the relative risk parameter is one, and that m confidence intervals are calculated. Then select a confidence interval if it excludes the null value RR=1. The conditional coverage probability is zero and the expected False Coverage Rate is one. However, if we modify our selection procedure using for example the Bonferroni correction, i.e. setting α’ = α/m, the expected False Coverage Rate is α. Formally:

$$\mathrm{FCR} = E\left(\frac{V}{R} \,\middle|\, R > 0\right) \Pr(R > 0)$$

V is the number of false coverages (selected intervals that do not include the parameter) and R the number of selected CIs. In the situation described before (RR=1), averaging over repetitions, the first factor will be one, as before, while the second factor will be exactly 0.05, the familywise error rate assured by the Bonferroni correction with α = 0.05. This procedure is too strict whenever, for some relative risks, the null hypothesis RR=1 is false. This justifies the formula reported above, which replaces α’ = α/m with α’ = R × α/m.
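The effect of selection, and of the adjustment α’ = R × α/m, can be illustrated by simulation. The sketch below rests on assumptions that are ours and not the article's: estimates are independent Normal(true effect, 1) draws on the log relative risk scale, and the R intervals "reported in the abstract" are those with the largest absolute estimates. The function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fcr_adjusted_z(alpha, n_selected, m):
    """Normal centile for alpha' = R * alpha / m (R selected out of m)."""
    alpha_adj = n_selected * alpha / m
    return STD_NORMAL.inv_cdf(1 - alpha_adj / 2)


def simulate_fcr(true_effects, n_selected=2, alpha=0.10, adjust=False,
                 n_rep=4000, seed=1):
    """Average proportion of selected intervals that miss their parameter.

    Assumed setup: each estimate ~ Normal(true effect, se = 1); the
    n_selected largest |estimates| are the ones reported.
    """
    rng = random.Random(seed)
    m = len(true_effects)
    if adjust:
        z = fcr_adjusted_z(alpha, n_selected, m)
    else:
        z = STD_NORMAL.inv_cdf(1 - alpha / 2)  # unadjusted marginal centile
    total = 0.0
    for _ in range(n_rep):
        draws = [(rng.gauss(theta, 1.0), theta) for theta in true_effects]
        selected = sorted(draws, key=lambda pair: abs(pair[0]),
                          reverse=True)[:n_selected]
        misses = sum(1 for est, theta in selected if abs(est - theta) > z)
        total += misses / n_selected
    return total / n_rep


if __name__ == "__main__":
    null_effects = [0.0] * 20  # all twenty log relative risks equal to zero
    print(round(fcr_adjusted_z(0.10, 2, 20), 3))      # about 2.576
    print(simulate_fcr(null_effects, adjust=False))   # far above 0.10
    print(simulate_fcr(null_effects, adjust=True))    # close to or below 0.10
```

Under these assumptions, the unadjusted 90% intervals for the two selected estimates miss their parameters far more often than 10% of the time, while the adjusted intervals keep the simulated False Coverage Rate near the nominal level.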

Conclusion

In this contribution, we discussed three issues: the interpretation to be given to a confidence interval; a proposal for reporting confidence intervals in a paper; and a suggestion for reporting a confidence interval in an abstract. The interpretation may be unfamiliar: it stresses the importance of the supported range and tries to weaken the connection between confidence intervals and the test of hypothesis. The proposal may seem awkward to implement in writing a paper, but it emphasizes the importance of a continuum between point and interval estimates and introduces the concept of a distribution on the parameter of interest. The Bayesian approach is natural from this point of view. The suggestion may seem provocative. We think that it is very important to limit the amount of data dredging in epidemiological research while preserving its power. We hope our suggestion stimulates the debate and helps avoid the uncritical use of statistics in the scientific literature.

References

1. Gardner MJ, Altman DG. Confidence Intervals Rather Than P Values: Estimation Rather Than Hypothesis Testing. Br Med J (Clin Res Ed) 1986; 292(6522): 746-50.
2. International Committee of Medical Journal Editors. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. Br Med J 1988; 296: 401-5.
3. Gardner MJ, Altman DG. Estimating with Confidence. Br Med J (Clin Res Ed) 1988; 296(6631): 1210-11.
4. Sterne JAC, Davey Smith G. Sifting the Evidence. What’s Wrong with Significance Tests? BMJ 2001; 322: 226-31.
5. Gardner MJ, Altman DG. Using Confidence Intervals. Lancet 1987; 1(8535): 746.
6. Clayton D, Hills M. Statistical Models in Epidemiology. Oxford, Oxford University Press, 1993.
7. Michelozzi P, Capon A, Kirchmayer U et al. Mortality from Leukemia and Incidence of Childhood Leukemia Near a High Power Radio Station in Rome, Italy. American Journal of Epidemiology 2002; 155(12): 1096-103.
8. van Belle G, Fisher L, Heagerty PJ, Lumley T. Biostatistics: A Methodology for the Health Sciences (2nd Edition). New York, Wiley, 2004.
9. Louis TA, Zeger SL. Effective Communication of Standard Errors and Confidence Intervals. Biostatistics 2009; 10(1): 1-2.
10. Rothman KJ. Epidemiology: An Introduction. Oxford, Oxford University Press, 2002.
11. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis (2nd Edition). Boca Raton, Chapman & Hall/CRC Press, 2003.
12. Clayton DG, Kaldor J. Empirical Bayes Estimates of Age-Standardized Relative Risks for Use in Disease Mapping. Biometrics 1987; 43: 671-81.
13. Taubes G. Epidemiology Faces Its Limits. Science 1995; 269: 164-69.
14. Catelan D, Biggeri A. Multiple Testing in Descriptive Epidemiology. GeoSpatial Health 2010; 4(2): 219-29.
15. Benjamini Y, Yekutieli D. False Discovery Rate-Adjusted Multiple Confidence Intervals for Selected Parameters. JASA 2005; 100(469): 71-81.