Replication Intervals
This article is originally published at https://ntguardian.wordpress.com
At the University of Utah I’ve taught MATH 1070 and MATH 3070. Both are introductory statistics classes, but I call MATH 1070 “Introductory Statistics for People Who Don’t Like Math” while MATH 3070 is “Introductory Statistics for People Who Do Like Math”, since the latter requires calculus and uses far more probability. In both classes, though, students need to learn what confidence intervals (CIs) say and don’t say, and I spend a lot of time debunking common misconceptions about what a confidence interval says.
Suppose we have a 95% CI, such as $49\% \pm 3\%$ (for example, this could be a CI from a political poll saying that Donald Trump’s approval rating is 49% with a margin of error of 3 percentage points). Here is the correct interpretation of the interval:
If we constructed many intervals using the same procedure as the one producing this interval, approximately 95% of those intervals would contain the true mean.[1]
Notice that the interpretation says nothing about whether this interval contains the true mean; the 95% is referring to the probability the procedure would capture the true mean, not the probability this interval contains the true mean (which in frequentist statistics is a nonsense question). We don’t know if this interval contains the true mean, but because the procedure should capture the mean 95% of the time, we think that the region we found is a good descriptor of where the true mean likely could be.
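To see the correct interpretation in action, here is a small simulation (my own sketch; the seed, the normal population, and the parameters are arbitrary choices) that repeats the CI procedure many times and counts how often the resulting interval captures the true mean:

```python
import numpy as np

# Simulate many 95% CIs for the mean of N(5, 2^2) data and count how
# often the *procedure* captures the true mean.
rng = np.random.default_rng(2018)
true_mean, sigma, n, reps = 5.0, 2.0, 100, 10_000

covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, sigma, size=n)
    se = x.std(ddof=1) / np.sqrt(n)        # estimated standard error
    lo, hi = x.mean() - 1.96 * se, x.mean() + 1.96 * se
    covered += (lo <= true_mean <= hi)

print(covered / reps)  # close to 0.95
```

No single interval “has a 95% chance” of containing the mean; the 95% describes the long-run behavior of the procedure, which is exactly what the simulation counts.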
Incorrect interpretations of this interval include (but of course are not limited to):
- The probability the mean is in this interval is 95%. (Frequentist statistics never makes a statement like this since this statement implies that the mean is random, when in the frequentist framework it is not.)
- This interval contains about 95% of the population. (CIs quantify uncertainty about the location of the mean and say nothing about how spread out the population is.)
- A future observation has roughly a 95% chance of being in this interval. (CIs say nothing about future data.)
- Approximately 95% of future means will be within the margin of error of the mean we estimated. (Again, CIs say nothing about future data.)
While CIs cannot be interpreted in these ways, statisticians have created other intervals that can be, which is one reason the distinction between these meanings matters. The first bullet point describes a Bayesian credible interval, the second a tolerance interval, and the third (after stripping out the troublesome probabilistic language) a prediction interval. Bayesian credible intervals replace confidence intervals in Bayesian statistics, while the other intervals serve purposes other than giving a location for the mean (and quantifying our uncertainty in that estimate).
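The difference between a CI and a prediction interval is easy to see numerically. A quick sketch (my own example with arbitrary data and seed, using large-sample normal critical values) computes both from the same dataset:

```python
import numpy as np

# CI: where the *mean* is.  PI: where a single *future observation* is.
rng = np.random.default_rng(42)
x = rng.normal(10.0, 2.0, size=200)

xbar, s, n = x.mean(), x.std(ddof=1), x.size
z = 1.96  # large-sample 95% critical value

ci = (xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n))
pi = (xbar - z * s * np.sqrt(1 + 1 / n), xbar + z * s * np.sqrt(1 + 1 / n))
print("CI:", ci)  # narrow: uncertainty about the mean shrinks with n
print("PI:", pi)  # wide: must cover ~95% of future observations
```

The PI is far wider than the CI, and unlike the CI it does not shrink to a point as the sample size grows.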
In this blog post I focus on the fourth bullet point above. I call the corresponding interval a replication interval, since its objective is not to quantify where the mean is but to describe where the mean of a future study could fall relative to the mean found in the present study, when a future author tries to replicate the study that produced the original mean.
I thought of this interval while working with a paper and trying to replicate the simulations done there. The paper reported simulated Type I error rates (the rate of rejecting the null hypothesis when the null hypothesis is true), and I was having trouble replicating the reported error rates with my own simulated data. Of course, when both datasets are randomly generated, you cannot expect to get identical numbers. That said, by how much may the two estimates differ before one suspects that they were not produced by the same process?
I call an interval describing this a replication interval, since it describes how much an estimate from a replication study may differ from the original estimate before the replicating authors should question whether the two studies describe the same process. This is certainly related to the so-called reproducibility crisis, as it describes how much the results of two studies may differ before we call into question whether the studies observed the same phenomenon. (An interesting, related idea for hypothesis testing is the replication probability; see this article in The American Statistician.)
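The situation that motivated this can be sketched as follows. Two independent Monte Carlo runs estimate the same Type I error rate, and the estimates differ by chance alone (the z-test, the sample sizes, and the seed here are my own arbitrary stand-ins, not the paper's simulation design):

```python
import numpy as np

# Two independent Monte Carlo estimates of the Type I error rate of a
# two-sided z-test of H0: mu = 0 at the 5% level, with H0 actually true.
rng = np.random.default_rng(101)

def type1_rate(n_sims=1000, n=30):
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, size=n)               # H0 is true
        z = x.mean() / (x.std(ddof=1) / np.sqrt(n))
        rejections += abs(z) > 1.96
    return rejections / n_sims

original, replication = type1_rate(), type1_rate()
print(original, replication, abs(original - replication))
```

Each estimate is a binomial proportion with standard error roughly $\sqrt{0.05 \cdot 0.95 / 1000} \approx 0.007$, so the two runs can easily differ by a full percentage point or more without anything being wrong.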
Formally, suppose we have an i.i.d.[2] dataset $X_1, \ldots, X_n, Y_1, \ldots, Y_m$; we view $Y_1, \ldots, Y_m$ as the data of the original study (in its random, unobserved form) and $X_1, \ldots, X_n$ as the data from a replication study. Call $\hat{\theta}_Y$ the estimator for some quantity $\theta$ produced using $Y_1, \ldots, Y_m$ and $\hat{\theta}_X$ the estimator of the same quantity using $X_1, \ldots, X_n$; the former will be an observed estimate while the latter is a theoretical future, replication estimate of $\theta$. We have two functions, $l(Y_1, \ldots, Y_m)$ and $u(Y_1, \ldots, Y_m)$, that are used to produce lower and upper bounds (respectively) of the interval. The replication interval satisfies

$$P\left(l(Y_1, \ldots, Y_m) \leq \hat{\theta}_X \leq u(Y_1, \ldots, Y_m)\right) \geq 1 - \alpha,$$

where $1 - \alpha$ is the desired probability of capturing the location of $\hat{\theta}_X$.
Let’s consider a simple and very useful case. Suppose we are estimating the location of the mean using the sample mean. A large-sample confidence interval is given by

$$\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}},$$

with $\bar{x}$ being the sample mean produced from a dataset of size $n$, $s$ the sample standard deviation, and $s/\sqrt{n}$ an estimate of the standard error. Let’s assume that a future study will attempt to replicate the study producing $\bar{x}$ with exactly the same procedure; in such a case, $m = n$ in our above formulation.
Let $\bar{X}_n$ be the (random version of the) sample mean of the present study and $\bar{Y}_n$ be the random future sample mean of the replication study.[3] Let $\sigma^2$ be the variance of the data. The error between the original mean and the future mean is $\bar{Y}_n - \bar{X}_n$. Notice that $E[\bar{Y}_n - \bar{X}_n] = 0$ and $\operatorname{Var}(\bar{Y}_n - \bar{X}_n) = 2\sigma^2/n$. It follows (invoking the central limit theorem as needed) that

$$\frac{\bar{Y}_n - \bar{X}_n}{\sigma\sqrt{2/n}} \xrightarrow{d} N(0, 1).$$
If we assume we know $\sigma$, then we can find a replication interval using the same algebra that produced confidence intervals and other two-sided intervals:

$$\bar{x} \pm z_{\alpha/2}\, \sigma \sqrt{\frac{2}{n}}.$$
Of course, we usually don’t know what $\sigma$ is, so we use the approximate interval:

$$\bar{x} \pm z_{\alpha/2}\, s \sqrt{\frac{2}{n}}.$$
Notice that, conveniently, the replication interval’s (RI’s) margin is $\sqrt{2}$ times the margin of error of the confidence interval. This means that if we want a replication interval and have only a two-sided confidence interval’s margin of error, we can take that margin of error and multiply it by $\sqrt{2}$ to get the replication interval.[4] As an example, take the hypothetical poll I mentioned earlier, which produced a 95% CI of $49\% \pm 3\%$. Then the (approximate) RI is $49\% \pm 3\sqrt{2}\% \approx (44.8\%, 53.2\%)$. We should expect future polls with the same sample size and the same procedures to produce an estimated approval rating within about 4.2 percentage points of the original poll.
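A quick simulation (my own sketch; the seed and parameters are arbitrary) checks that the $\sqrt{2}$-inflated margin captures a future replication mean about 95% of the time:

```python
import numpy as np

# Check that the RI margin -- sqrt(2) times the CI's margin of error --
# captures the replication study's sample mean about 95% of the time.
rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 1.0, 50, 10_000

captured = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)   # "original" study
    y = rng.normal(mu, sigma, size=n)   # replication study
    margin = 1.96 * x.std(ddof=1) * np.sqrt(2 / n)
    captured += abs(y.mean() - x.mean()) <= margin

print(captured / reps)  # close to 0.95
```

The extra $\sqrt{2}$ is there because both the original mean and the replication mean are random, so the variance of their difference is twice the variance of either one.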
So the punchline is this: take a margin of error from a CI and multiply it by about 1.5 (more precisely, $\sqrt{2} \approx 1.41$); a future, replication estimate should be within that distance of the original estimate (with the same confidence level as the CI).
EDIT: Another nifty rule of thumb: suppose you have the (estimated) standard error of an estimator that obeys some sort of central limit theorem (examples include the sample mean, the sample median, and a sample proportion). An approximate 95% RI is the estimate plus or minus three times that standard error (since $1.96\sqrt{2} \approx 2.77$, which rounds up to 3). That is, future estimates should be within three standard errors of the original estimate (with 95% confidence).
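Applying the three-standard-error rule to the poll example is a one-liner. The sample size of 1,067 below is an assumption on my part, inferred from the 3% margin of error rather than taken from any actual poll:

```python
import math

# Three-standard-error rule of thumb applied to a sample proportion:
# p_hat = 0.49, with n = 1067 (the size implied by a 3% margin of error).
p_hat = 0.49
n = 1067
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of a sample proportion
lo, hi = p_hat - 3 * se, p_hat + 3 * se
print(f"approximate 95% RI: ({lo:.3f}, {hi:.3f})")
# approximate 95% RI: (0.444, 0.536)
```

This agrees closely with the $\sqrt{2}$-inflated margin computed earlier, since $3$ is just a rounded-up $2.77$.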
I have created a video course published by Packt Publishing entitled Training Your Systems with Python Statistical Modeling, the third volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course discusses how to use Python for machine learning. The course covers classical statistical methods, supervised learning including classification and regression, clustering, dimensionality reduction, and more! The course is peppered with examples demonstrating the techniques and software on real-world data and visuals to explain the concepts presented. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
[1] Notice that I’m referring to the mean when my example involves a proportion. The two are the same; a population proportion is the population mean of binary data.
[2] We could change to a framework of two independent datasets that are each i.i.d. within-sample but not identically distributed to each other, and not much would change.
[3] Yes, I switched the use of $X$ and $Y$ from how I originally formulated the replication interval. Sue me.
[4] This should work for $t$-based intervals too.