Over the nine years since it first rose to prominence in eDiscovery, technology-assisted review has expanded to include numerous new tools, more potential workflows, and a variety of legal issues.
In “Alphabet Soup: TAR, CAL, and Assisted Review,” we discussed TAR’s rise to prominence and the challenges that it has created for practitioners. In this Part, we review some key terms and concepts.
Technology-Assisted Review (TAR) was already intimidating to practitioners back when it was just predictive coding. As we discussed, TAR has evolved and expanded over the years to include a dizzying array of new acronyms, alternative approaches, and branded offerings. So, what do practitioners really need to know to get a handle on TAR? What are the key terms and concepts, and how do they fit together?
Let’s start with a brief, high-level explanation of TAR, Technology-Assisted Review. TAR refers to any of several workflows in which humans’ classifications of documents (e.g., as relevant or not relevant) are used to guide additional document classifications performed by software rather than humans. Those software classifications may be presented as binary results (e.g., relevant vs. not relevant) or as probability scores (e.g., 85% certain this is relevant). Such classifications are not based on exact text matches like keyword searches, but on more complex semantic analysis of the language in the documents.
Most such workflows are iterative, with several rounds in which the software’s classifications are evaluated by, or compared to, humans to further improve the accuracy of those classifications. Most also include a variety of sampling steps to estimate the amount of relevant material, to measure the workflow’s efficacy, and to test for missed materials at the end.
When leveraged successfully, a TAR process can evaluate and classify a large volume of electronically stored information (ESI) using substantially fewer hours of manual, human document review than other approaches, thereby reducing cost and time while generally achieving superior results. It is not, however, suitable for all reviews. Small reviews, very complex reviews, and reviews heavy in multimedia, for example, are all generally better approached in other ways.
Underpinning TAR are complex maps often referred to as indexes. Generally speaking, indexes are the enormous tables of information that are used to power search features. Most common are inverted indexes, which essentially make it possible to look up documents by the specific words within them. Inverted indexes are like more elaborate versions of the indexes you find in the backs of books. TAR workflows are all powered by various kinds of multidimensional maps sometimes referred to as semantic indexes.
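To make the back-of-the-book analogy concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how any particular review platform builds its indexes; the document IDs, the sample text, and the function names `build_inverted_index` and `search_all` are all hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document collection.
documents = {
    "DOC-001": "The merger agreement was signed on Friday.",
    "DOC-002": "Lunch on Friday? The usual place.",
    "DOC-003": "Draft merger terms attached for review.",
}
index = build_inverted_index(documents)

print(sorted(search_all(index, "merger")))         # ['DOC-001', 'DOC-003']
print(sorted(search_all(index, "merger friday")))  # ['DOC-001']
```

Note that every answer here is a plain yes or no: a document either contains the words or it does not. That limitation is what the multidimensional maps are meant to overcome.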
These kinds of maps and indexes look not just at the individual words in each document but at other details. For example, they might also evaluate what words co-occur, how frequently they co-occur, and how close together they appear. These details are then used to plot the position of each document on the multidimensional map and reveal the relative positions of documents. To oversimplify: more closely related documents will appear closer together, and unrelated ones will appear farther apart. These maps can then be used to reveal more than just a binary yes or no response to a search; you can instead get results scored on how likely they are to be responsive to your search (i.e., how close or far away the result was on the map).
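The idea of "closeness on a map" can be sketched in a few lines. In the deliberately simplified example below, each document becomes a vector of plain word counts and closeness is measured with cosine similarity. Real semantic indexes use richer features, such as the co-occurrence and proximity details described above; the word-count vectors, the sample documents, and the function names here are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a document as word counts (a stand-in for richer features)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Score from 0.0 (no shared words) to 1.0 (same direction on the map)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (score, doc_id) pairs, closest documents first."""
    q = to_vector(query)
    scored = [(cosine_similarity(q, to_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical collection and query.
documents = {
    "DOC-001": "merger agreement signed by both boards",
    "DOC-002": "lunch order for the team on friday",
    "DOC-003": "draft merger agreement terms for board review",
}
ranked = rank_by_similarity("merger agreement review", documents)
for score, doc_id in ranked:
    print(f"{doc_id}: {score:.2f}")
# DOC-003: 0.65
# DOC-001: 0.47
# DOC-002: 0.00
```

Instead of a yes-or-no hit list, every document receives a score, and the results can be ordered from nearest to farthest.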
As we noted above, technology-assisted review refers to a family of similar-but-distinct workflows that rely upon similar-but-distinct mathematical approaches and software tools. What practitioners need to know is that these approaches break down into two broad categories. We will refer to these categories as TAR 1.0 and TAR 2.0.
TAR in eDiscovery began with TAR 1.0 and the launch of predictive coding solutions from Recommind, Equivio, and various licensees of the CAAT tools from Content Analyst Company (including Relativity, which later acquired the company). Broadly speaking, these workflows involve leveraging a sampling process to create a training set or seed set, which the chosen software then uses to find other similar documents. These results are then reviewed and coded, and when completed, that coding is used to improve the software’s results. This training cycle is iterated multiple times until an acceptable quality of results is achieved. The effectiveness of the whole process is measured using a previously prepared control set and/or an additional random sampling effort.
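As a rough illustration of what measuring against a control set can look like, the sketch below compares hypothetical human coding of a small control set with the software's classifications of the same documents. Recall and precision are used because they are common yardsticks for this kind of comparison; which measures a given workflow actually relies on is an assumption here, as are the sample data and the function name `control_set_metrics`.

```python
def control_set_metrics(human_labels, software_labels):
    """Compare software calls to human coding on the same control set.

    Both arguments map document IDs to True (relevant) or False (not relevant).
    Returns (recall, precision).
    """
    true_pos = false_pos = false_neg = 0
    for doc_id, human_relevant in human_labels.items():
        predicted_relevant = software_labels[doc_id]
        if predicted_relevant and human_relevant:
            true_pos += 1
        elif predicted_relevant and not human_relevant:
            false_pos += 1
        elif human_relevant:
            false_neg += 1
    # Recall: share of the truly relevant documents the software found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # Precision: share of the software's "relevant" calls that were correct.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return recall, precision


# Hypothetical five-document control set.
human = {"A": True, "B": True, "C": False, "D": False, "E": True}
software = {"A": True, "B": False, "C": True, "D": False, "E": True}

recall, precision = control_set_metrics(human, software)
print(f"recall={recall:.2f} precision={precision:.2f}")
# recall=0.67 precision=0.67
```

In a TAR 1.0 workflow, a comparison like this would typically be rerun after each training round to judge whether the results have reached an acceptable quality.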
TAR 2.0 refers to more recent tools and workflows developed to leverage a mathematical approach called support vector machines (as well as logistic regression and others). Rather than being based on identifying the similarities in a large, prepared training set like TAR 1.0, these workflows are characterized by continuous active learning that updates relevance scoring and prioritization for all documents dynamically as each additional document is coded by a reviewer.
This is accomplished by focusing on a single, binary classification (i.e., relevant to topic X or not relevant to topic X) and analyzing the differences in language between successive, single example documents to identify the “hyperplane” that best divides the relevant examples from the non-relevant examples on a multidimensional map. Each additional example the software analyzes and maps can lead the software to identify a more efficient hyperplane between the two groups, improving its classifications.
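The loop described above can be sketched in miniature. The example below is not how any commercial TAR 2.0 tool is implemented. To stay self-contained, it swaps in a simple perceptron for a support vector machine: a perceptron also learns a dividing hyperplane from one example at a time, but it does not look for the widest margin the way an SVM does. The documents, the human coding, and every name in the code are hypothetical.

```python
import re
from collections import defaultdict


def features(text):
    """Reduce a document to its set of lowercased words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class Perceptron:
    """Linear classifier: a positive score falls on the 'relevant' side."""

    def __init__(self):
        self.weights = defaultdict(float)
        self.bias = 0.0

    def score(self, words):
        return self.bias + sum(self.weights.get(w, 0.0) for w in words)

    def update(self, words, relevant):
        target = 1 if relevant else -1
        # Move the hyperplane only when the example is on the wrong side
        # of it (or exactly on it).
        if target * self.score(words) <= 0:
            for w in words:
                self.weights[w] += target
            self.bias += target


def continuous_active_learning(documents, reviewer, budget):
    """Serve the top-scoring unreviewed document, learn from the coding, repeat."""
    model = Perceptron()
    vectors = {doc_id: features(text) for doc_id, text in documents.items()}
    coded = {}
    while len(coded) < min(budget, len(documents)):
        unreviewed = [d for d in vectors if d not in coded]
        # Rescore everything still unreviewed and pick the current best.
        next_doc = max(unreviewed, key=lambda d: model.score(vectors[d]))
        coded[next_doc] = reviewer(next_doc)
        model.update(vectors[next_doc], coded[next_doc])
    return model, coded


# Hypothetical collection and the coding a human reviewer would apply.
documents = {
    "D1": "merger agreement draft",
    "D2": "lunch menu friday",
    "D3": "merger terms revised",
    "D4": "friday lunch plans",
    "D5": "board approves merger",
    "D6": "parking garage closed",
}
human_coding = {"D1": True, "D2": False, "D3": True,
                "D4": False, "D5": True, "D6": False}


def reviewer(doc_id):
    return human_coding[doc_id]


model, coded = continuous_active_learning(documents, reviewer, budget=3)
print(list(coded))  # ['D1', 'D3', 'D5']
```

After a single relevant example, the remaining merger documents rise to the top of the queue, so all three relevant documents are reviewed before any of the non-relevant ones.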
Most providers now offer both TAR 1.0 and TAR 2.0 solutions. Relativity Assisted Review, for example, was initially a TAR 1.0 solution that was later updated to add a TAR 2.0 solution.
Because TAR workflows involve delegating some amount of traditionally human decision-making to software, these workflows all include at least some use of sampling to ensure a reasonable level of quality and completeness is achieved. In the context of Technology-Assisted Review, we are referring to formal sampling, in which a specified number of documents selected by simple random sampling is reviewed with the goal of taking a defined measurement with a particular strength (e.g., 95% confidence, +/- 2%). Formal sampling may be leveraged at various points in the workflow, including to estimate the amount of relevant material, to measure the workflow’s efficacy, and to test for missed materials at the end.
Additionally, simple random sampling is used in most TAR workflows to select representative materials to review. Whether creating a training set, creating a control set, or creating iterative training batches, using simple random sampling helps ensure you get a representative mix of materials and helps keep elusion low by revealing unknown unknowns in your collection.
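For readers who want to see the arithmetic, here is a small sketch of formal sampling. It uses the standard sample-size formula for estimating a proportion, n = z² × p × (1 − p) / e², where z is 1.96 for 95% confidence, e is the margin of error, and p = 0.5 is the most conservative assumption about the underlying rate. It assumes a collection large enough that no finite-population correction is needed. The function names, the document IDs, and the seed are hypothetical.

```python
import math
import random


def sample_size(margin_of_error, z_score=1.96, expected_proportion=0.5):
    """Documents to review for a given margin of error (large collections)."""
    p = expected_proportion
    n = (z_score ** 2) * p * (1 - p) / margin_of_error ** 2
    # Round first so floating-point noise cannot push an exact answer up by one.
    return math.ceil(round(n, 9))


def draw_simple_random_sample(doc_ids, n, seed=None):
    """Pick n documents so that each one has an equal chance of selection."""
    rng = random.Random(seed)
    doc_ids = list(doc_ids)
    return rng.sample(doc_ids, min(n, len(doc_ids)))


def estimate_elusion(sample_coding):
    """Share of a sample from the unreviewed pile that reviewers coded relevant."""
    return sum(sample_coding) / len(sample_coding)


n = sample_size(0.02)
print(n)  # 2401 documents for 95% confidence, +/- 2%

# Hypothetical pile of 50,000 documents the software classified as not relevant.
unreviewed_pile = [f"DOC-{i:06d}" for i in range(50_000)]
sample = draw_simple_random_sample(unreviewed_pile, n, seed=42)
print(len(sample))  # 2401
```

Once reviewers have coded the sampled documents, passing their True/False relevance calls to `estimate_elusion` gives the estimated share of relevant material left behind. For example, 24 relevant documents in a sample of 2,401 would suggest an elusion rate of roughly 1%.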
Upcoming in this Series
In the next Part, we will continue our discussion of assisted review with a look at its applications and effectiveness.