Random sampling is a powerful eDiscovery tool that can provide you with reliable estimates of the prevalence of relevant materials, missed materials, and more.
In “Finding out How Many Red Hots are in the Jellybean Jar,” we discussed a candy contest hypothetical and the importance of sampling techniques to eDiscovery. In this Part, we review the key sampling concepts necessary to use sampling to estimate prevalence.
In order to use sampling to estimate how many red hots are mixed into the jellybean jar, we need to understand some basic sampling concepts, including: sampling frame, prevalence, confidence level, and confidence interval, as well as how each affects required sample size. We also need to understand that whenever we refer to sampling here, we are referring to “simple random sampling” in which any item within the sampling frame has an equal chance of being randomly selected for inclusion in the sample.
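As a minimal sketch of what “simple random sampling” looks like in practice, the Python snippet below draws items from a sampling frame so that every item has the same chance of selection. The document IDs, the frame size, the sample size of 1,000, and the function name are all hypothetical and chosen only for illustration.

```python
import random

def draw_simple_random_sample(frame, sample_size, seed=None):
    """Pick sample_size items from frame without replacement, each equally likely."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return rng.sample(frame, sample_size)

# Hypothetical sampling frame: 100,000 document IDs left after initial culling.
document_ids = [f"DOC-{i:06d}" for i in range(1, 100_001)]

sample = draw_simple_random_sample(document_ids, 1_000, seed=42)
print(len(sample), "documents selected, for example:", sample[:3])
```

Recording the seed alongside the sample is one simple way to let someone else reproduce the same draw later.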
“Sampling frame” refers to the set of materials from which a sample will be taken. In the context of our jellybean example, the sampling frame would be the full contents of the enormous jar of jellybeans and red hots. In the context of eDiscovery, your sampling frame will typically be the pool of materials available after any initial, objective culling has taken place (i.e., what’s left for assessment and review after initial de-NISTing, deduplication, and date restriction during processing).
In addition to being your sample source, your sampling frame also affects the size of the samples you will need to take. As we will discuss below, sample size is primarily determined by how reliable and precise you want your results to be, but the size of your sampling frame also has some effect. As your sampling frame gets bigger, your sample size will also need to get bigger, but only up to a point. Beyond that point, the effect levels off, so the sample size needed for a frame of 100,000 items (jellybeans, documents, etc.) is roughly the same as the sample size needed for 1,000,000 of them, which is roughly the same as the sample size needed for 10,000,000 of them. Of the factors discussed here, sampling frame size has the weakest effect on sample size.
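The leveling-off effect can be illustrated with a short Python sketch. It assumes the standard textbook sample-size formula for a proportion with a finite population correction, which may not be exactly what any particular eDiscovery tool or online calculator uses. The function name and the default settings (95% confidence, +/-2% margin of error, 50% assumed prevalence) are illustrative choices.

```python
import math
from statistics import NormalDist

def required_sample_size(frame_size, confidence=0.95, margin=0.02, prevalence=0.5):
    """Textbook sample size for estimating a proportion, with finite population correction."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score, about 1.96 at 95%
    n0 = z**2 * prevalence * (1 - prevalence) / margin**2  # size for an unlimited frame
    n = n0 / (1 + (n0 - 1) / frame_size)  # shrink the result for a finite frame
    return math.ceil(n)

for frame_size in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{frame_size:>12,} items -> sample of {required_sample_size(frame_size):,}")
```

Under these assumptions, the three largest frames call for samples of 2,345, 2,396, and 2,401 items respectively, while the two smallest frames call for noticeably smaller samples.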
“Prevalence” is how much of something there is within your sampling frame. For example, it could be how many red hots there are in your jellybean jar, or it could be how many relevant documents there are in your collected materials. It could also be how many documents are privileged, how many require redaction, or any other binary property you want to measure.
In the math underlying sampling, the prevalence of what you are seeking is also a factor that can affect the required sample size for some purposes. When the purpose of your sampling is to estimate prevalence, however, you do not yet know this value, so you need to plug in an assumption for it, and to be safe, you plug in the most conservative value (i.e., the one that results in the largest sample size). For prevalence, this is 50%, meaning that half the sampling frame is what you’re looking for and half is not. Most sampling features in eDiscovery tools and online calculators will default to this value and may not even give you the option to change it.
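To see why 50% is the conservative choice, the sketch below computes a sample size for a range of assumed prevalence values. It assumes the standard formula for a proportion, in which sample size is proportional to prevalence times (1 - prevalence), and it ignores the size of the sampling frame for simplicity. The function name and the 95% / +/-2% settings are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_assumed_prevalence(prevalence, confidence=0.95, margin=0.02):
    """Sample size for a very large frame, given an assumed prevalence."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score
    return math.ceil(z**2 * prevalence * (1 - prevalence) / margin**2)

# Assumed prevalence from 10% to 90% in steps of 10 percentage points.
sizes = {p / 10: sample_size_for_assumed_prevalence(p / 10) for p in range(1, 10)}
for prevalence, size in sizes.items():
    print(f"assumed prevalence {prevalence:.0%} -> sample of {size:,}")

print("largest sample at:", f"{max(sizes, key=sizes.get):.0%}")
```

The required size peaks at an assumed prevalence of 50% and falls off symmetrically on either side, so a sample sized for 50% is large enough whatever the true prevalence turns out to be.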
“Confidence level” is a measurement of how reliable your results are (or will be). It is expressed as a percentage, and most commonly, you will see discussion of 90%, 95%, or 99% confidence levels. What these numbers technically mean is that, if you reran the same sampling process 100 times in a row, you would expect the resulting estimate, plus or minus its margin of error, to capture the true value about 90 times out of 100, or 95 times out of 100, or 99 times out of 100.
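One way to get a feel for what a confidence level means is to simulate rerunning the sampling process many times. The sketch below is only an illustration under stated assumptions: a hypothetical true prevalence of 30%, a sample of 1,000 items per run, a frame large enough that each draw can be treated as independent, and the simple normal-approximation range around each estimate. It counts how often that range contains the true prevalence.

```python
import random
from statistics import NormalDist

def interval_coverage(true_prevalence, sample_size, confidence, trials, seed=0):
    """Fraction of simulated samples whose estimate +/- margin contains the true prevalence."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score
    hits = 0
    for _ in range(trials):
        # Each draw is a "hit" with probability equal to the true prevalence.
        positives = sum(rng.random() < true_prevalence for _ in range(sample_size))
        estimate = positives / sample_size
        margin = z * (estimate * (1 - estimate) / sample_size) ** 0.5
        if estimate - margin <= true_prevalence <= estimate + margin:
            hits += 1
    return hits / trials

coverage = interval_coverage(true_prevalence=0.30, sample_size=1_000,
                             confidence=0.95, trials=500)
print(f"{coverage:.1%} of the simulated ranges contained the true prevalence")
```

With a 95% confidence level, the printed share should land close to 95%, though it will vary somewhat with the seed and the number of trials.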
The higher you want your confidence level to be, the larger the sample size you will need to use to achieve it, and confidence level has a stronger effect on sample size than sampling frame size or prevalence does. For example, if you were taking a sample from a sampling frame of 100,000 items, and you wanted a margin of error of +/-2% (which we will discuss further below), here is how your required sample size would vary with your desired confidence level:
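A comparison of this kind can be reproduced with a short Python sketch. It assumes the standard sample-size formula for a proportion with a finite population correction and a 50% assumed prevalence. A given tool or calculator may round or compute slightly differently, so treat the output as approximate. The function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_at_confidence(confidence, frame_size=100_000, margin=0.02, prevalence=0.5):
    """Sample size for a given confidence level, holding frame size and margin fixed."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score
    n0 = z**2 * prevalence * (1 - prevalence) / margin**2  # size for an unlimited frame
    return math.ceil(n0 / (1 + (n0 - 1) / frame_size))  # finite population correction

by_confidence = {level: sample_size_at_confidence(level) for level in (0.90, 0.95, 0.99)}
for level, size in by_confidence.items():
    print(f"{level:.0%} confidence -> sample of {size:,}")
```

Under these assumptions, the required sample grows from 1,663 items at 90% confidence to 2,345 at 95% and 3,982 at 99%.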
“Confidence interval” is a measurement of how precise your results are. It is expressed in percentage points, and most commonly, you will see discussion of confidence intervals of 2%, 4%, and 10%. Even more commonly, you will see discussions refer to “margin of error” with references to +/-1%, +/-2%, and +/-5%. These margins of error are actually the equivalents of those confidence intervals. The margin of error is just framed in terms of plus or minus half the range, and the confidence interval is framed in terms of the full range.
The narrower you want your range of uncertainty to be, the larger the sample size you will need to use to achieve it, and confidence interval (or margin of error) has the strongest effect on needed sample size. For example, if you were taking a sample from a sampling frame of 100,000 items, and you wanted a confidence level of 95%, here is how your required sample size would vary with your desired range of uncertainty:
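This comparison can also be sketched in Python, again assuming the standard sample-size formula for a proportion with a finite population correction and a 50% assumed prevalence. Other calculators may differ slightly in rounding. The function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_at_margin(margin, frame_size=100_000, confidence=0.95, prevalence=0.5):
    """Sample size for a given margin of error, holding frame size and confidence fixed."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score
    n0 = z**2 * prevalence * (1 - prevalence) / margin**2  # size for an unlimited frame
    return math.ceil(n0 / (1 + (n0 - 1) / frame_size))  # finite population correction

by_margin = {margin: sample_size_at_margin(margin) for margin in (0.05, 0.02, 0.01)}
for margin, size in by_margin.items():
    # A margin of +/-m corresponds to a confidence interval 2*m wide.
    print(f"+/-{margin:.0%} (interval of {2 * margin:.0%}) -> sample of {size:,}")
```

Under these assumptions, tightening the margin of error from +/-5% to +/-2% raises the required sample from 383 to 2,345 items, and tightening it to +/-1% raises it to 8,763. Because the margin is squared in the formula, halving it roughly quadruples the sample, before the frame-size correction is applied.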
Upcoming in this Series
Next, in the final Part of this short series, we will discuss the application of these sampling concepts to our jellybean contest and to eDiscovery.