A multi-part series on the fundamentals eDiscovery practitioners need to know about effective early case assessment in the context of electronic discovery
In “Clearing the Fog of War,” we reviewed the uncertainty inherent in new matters and the three overlapping goals of ECA. In this Part, we begin our review of available tools and techniques with an overview and with a discussion of sampling.
As we discussed in the last Part, ECA encompasses three distinct-but-connected goals: traditional ECA, early data assessment (“EDA”), and downstream activity preparation (“Downstream Prep”). Thankfully, practitioners have a wide array of tools and techniques at their disposal to work towards each of these goals during the ECA phase of an eDiscovery project.
In almost any modern document review platform (e.g., Relativity), case teams have a powerful set of tools for investigating their collected ESI. The specific bells and whistles vary from platform to platform, but the core functions almost always include: searching tools, email threading tools, duplicate handling tools, conceptual analysis tools, and random sampling tools. Effective ECA involves leveraging as many of these features as are helpful – like aligning a series of overlapping lenses to bring your quarry into sharp focus. Which lenses are helpful will depend on the specifics of your matter, your ESI, and your available time and resources.
We will discuss each of these core functions in turn and how they can be leveraged for the three goals of ECA.
One of the most powerful tools in your ECA toolkit is sampling. There are a lot of ways to find materials you expect to be in a collection of ESI, but sampling is a terrific way to also find materials you didn’t know to look for, the unknown unknowns. For our purposes, sampling comes in two flavors: judgmental sampling and formal sampling.
Judgmental sampling is the informal process of looking at some randomly selected materials to get an anecdotal sense of what they contain, whether that’s sampling from a particular source or particular search results or a particular time period. You’re not reviewing a particular number of documents or taking a defined measurement with a particular strength; you’re getting an impression and making an intuitive assessment.
Formal sampling is just the opposite: you are reviewing a specified number of randomly-selected documents with the goal of taking a defined measurement with a particular strength. Typically that measurement is either of how much of a particular thing there is within a collection (estimating prevalence) or of how effective a particular search is (testing classifiers).
Estimating prevalence is the process of reviewing a simple random sample of a given collection of materials to estimate how much of a given kind of thing is present. You might estimate the prevalence of relevant materials, of privileged materials, or of materials requiring redaction or special steps. The size of the sample you need is dictated primarily by how precise you want your estimate to be (margin of error) and how certain about it you want to be (confidence level), and, to a lesser extent, by how large your collection of materials is (sampling frame). Most often you will be dealing with sample sizes of a few thousand (e.g., a sample of 2,345 for a confidence level of 95% and a margin of error of +/-2% in a collection of 100,000 documents).
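To make the arithmetic behind that example concrete, here is a minimal sketch of one common way to compute such a sample size. It assumes the standard normal-approximation formula, a worst-case prevalence of 50%, and a finite population correction; none of those choices is stated above, and the function name and parameters are purely illustrative. Your review platform may use a slightly different method.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.02, p=0.5):
    """Simple random sample size for estimating a proportion.

    Assumptions (not from the article): normal approximation, worst-case
    prevalence p=0.5, and a finite population correction.
    """
    # z-score for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite collection
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it to account for the finite size of the collection
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(100_000))      # 2345
    print(sample_size(10_000_000))   # 2401
```

Note how little the answer moves when the collection grows a hundredfold: the margin of error and confidence level do most of the work, and the size of the sampling frame matters only at the edges.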
Testing classifiers is the process of seeing how effective and efficient a particular classifier – typically a search of some kind – actually is. Using this technique, you can estimate how much of what you’re seeking a given search is likely to return (recall) and how much of what it returns is actually what you’re seeking rather than irrelevant material (precision). These measurements are taken by running the searches against a control set, which is made by pre-reviewing and coding a sufficiently large random sample. Comparing the search results to the already-completed coding allows for the iterative refinement of searches to increase their recall and precision before they are applied to the full collection.
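The comparison against a control set can be sketched in a few lines. This is an illustration only: the document IDs, the coding, and the function name are all hypothetical, and a real control set would contain far more than five documents.

```python
def recall_precision(control_coding, search_hits):
    """Measure a search against a pre-coded control set.

    control_coding: dict mapping document ID -> True if coded relevant.
    search_hits: IDs the search returned (hits outside the control set
    are ignored, since they have no coding to compare against).
    """
    relevant = {doc for doc, is_relevant in control_coding.items() if is_relevant}
    hits = set(search_hits) & set(control_coding)
    true_positives = len(relevant & hits)
    # Recall: share of the relevant documents that the search found
    recall = true_positives / len(relevant) if relevant else None
    # Precision: share of the search's hits that are actually relevant
    precision = true_positives / len(hits) if hits else None
    return recall, precision


if __name__ == "__main__":
    # Hypothetical control set coding and search results
    control_coding = {
        "DOC-001": True,
        "DOC-002": True,
        "DOC-003": False,
        "DOC-004": False,
        "DOC-005": True,
    }
    search_hits = {"DOC-001", "DOC-002", "DOC-003", "DOC-004"}
    recall, precision = recall_precision(control_coding, search_hits)
    print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.67 precision=0.50
```

Rerunning the same measurement after each change to a search shows whether a refinement actually improved recall, precision, or both.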
Sampling supports traditional ECA by helping you rapidly learn about the materials you have, what might be in them, and how best to surface more of what you need – all while reducing the risk of missing the unknown unknowns that comes with relying on searching alone. Reviewing a randomly-selected cross section exposes you to a representative slice of what you have and of the different language your custodians use, which helps you better plan your next ECA steps.
Beyond just reviewing the random sample for traditional ECA, leveraging prevalence estimation is invaluable for EDA and for Downstream Prep. Accurately estimating what you have enables you to: (a) prioritize the materials you have; (b) find gaps requiring further collection; and, (c) estimate your needed project resources, optimal project workflows, and likely project costs and durations (including assessing the viability of a TAR solution or the need for additional objective culling). As projects progress, prevalence estimations can also provide a yardstick against which to measure progress and completeness.
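As a rough illustration of how a prevalence estimate feeds project planning, the sketch below turns sample results into an estimated range of documents in the full collection. The counts are hypothetical, the function name is illustrative, and the interval uses a simple normal approximation with a finite population correction, which is an assumption rather than a method prescribed here (it is also unreliable when prevalence is very low).

```python
import math
from statistics import NormalDist


def project_prevalence(sample_size, positives, population, confidence=0.95):
    """Estimate prevalence from a simple random sample and project it
    onto the full collection (normal approximation; an assumption)."""
    p = positives / sample_size
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Finite population correction: sampling a real fraction of the
    # collection tightens the estimate slightly
    fpc = math.sqrt((population - sample_size) / (population - 1))
    margin = z * math.sqrt(p * (1 - p) / sample_size) * fpc
    low, high = max(0.0, p - margin), min(1.0, p + margin)
    return {
        "prevalence": p,
        "margin": margin,
        "low_docs": round(low * population),
        "high_docs": round(high * population),
    }


if __name__ == "__main__":
    # Hypothetical: 469 relevant documents found in a sample of 2,345
    # drawn from a collection of 100,000
    result = project_prevalence(2345, 469, 100_000)
    print(result)
```

A range like this can be multiplied by expected review rates and costs to rough out staffing, budget, and duration, and revisited later as a yardstick for completeness.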
Testing classifiers, too, benefits EDA and Downstream Prep. Imagine iteratively refining searches for your own use, or negotiating with another party about which searches should be used, armed with precise, reliable information about their relative efficacy. As we saw frequently in our TAR case law survey, and as has been suggested in non-TAR cases, judges prefer argument and negotiation based on actual information to that based merely on conjecture and assumption.
And, the more effective the searches you develop, the more you reduce the volume of material left for downstream review and production.
Upcoming in this Series
In the next Part, we will continue our review of available tools and techniques with a discussion of searching and filtering.