Advanced Analytic Tools, ECA Fundamentals Series Part 5


A multi-part series on the fundamentals eDiscovery practitioners need to know about effective early case assessment in the context of electronic discovery

In “Clearing the Fog of War,” we reviewed the uncertainty inherent in new matters and the three overlapping goals of ECA.  In “Sampling a Well-Stocked Toolkit,” we began our survey of available tools and techniques with an overview and with a discussion of sampling.  In “Searching and Filtering for Fun and Profit,” we continued our survey with a discussion of searching and filtering options.  In “Threading, Duplicates & Near-Duplicates,” we turned our attention to tools for handling threading and duplicates.  In this Part, we conclude our survey of tools and techniques with a review of advanced analytic tools and TAR workflows.

After thread and duplicate management tools, the final major tools and techniques available for pursuing the three goals of ECA are advanced analytic tools, powered by semantic indexing and other advanced mathematical analyses, including: concept searching, concept clustering, categorization, TAR 1.0 and 2.0 workflows, and new AI tools.

Semantic Indexing

As we noted briefly in our discussion of traditional searching and filtering tools, there are more sophisticated types of indices than the traditional inverted indices used to power basic search functions.  These semantic indices analyze the available materials in a different way to power different kinds of features.  Whether created by “latent semantic analysis” or “probabilistic latent semantic analysis” or another related mathematical approach, these indices are designed to go beyond just listing all of the words in a document to reveal the semantic content of those words.

This semantic analysis is accomplished by analyzing the co-occurrence of unique terms across the collection of documents (e.g., how often does the term “fire” appear with the term “employee,” and how often does it appear with the term “extinguisher”?).  This analysis of co-occurrences is used to create an n-dimensional map (like a traditional map of Cartesian coordinates, but with many more dimensions than just x, y, and z).  The more frequently unique terms co-occur, the stronger the relationship between them; and the more co-occurring terms two documents share, the closer to each other those documents will appear on the map.  Dense clusters of such documents suggest key topic areas or concepts in the document collection (e.g., employee termination discussions in one area of the map and fire safety discussions in another).
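As a rough illustration of the idea, here is a minimal latent-semantic-analysis sketch in Python, using numpy's singular value decomposition as a stand-in for a production semantic indexing engine.  The documents and terms are invented for illustration, and a real index would use far more dimensions and refined weighting:

```python
# Minimal latent semantic analysis (LSA) sketch. The four "documents"
# and their terms are invented illustrations, not real data.
import numpy as np

docs = [
    "employee termination severance employee",
    "fire extinguisher safety drill",
    "employee fire termination",          # mixes both topics
    "extinguisher safety inspection fire",
]

# Build a term-document count matrix (rows = terms, columns = documents).
vocab = sorted({w for d in docs for w in d.split()})
tdm = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Truncated SVD projects documents into a low-dimensional "semantic map";
# documents whose terms frequently co-occur land near each other.
U, S, Vt = np.linalg.svd(tdm, full_matrices=False)
k = 2                                     # keep 2 latent dimensions
doc_coords = (np.diag(S[:k]) @ Vt[:k]).T  # one 2-D point per document

def dist(i, j):
    """Distance between two documents on the semantic map."""
    return np.linalg.norm(doc_coords[i] - doc_coords[j])
```

With these toy inputs, the two fire-safety documents end up closer to each other on the map than either is to the employee-termination document, even though they were never labeled by topic.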

Features Powered by Semantic Indexing

Semantic indices are used to power a variety of branded features in different eDiscovery software platforms, but regardless of name or variation, there are three key functions that underlie most such features:

Concept Searching

Searching against a semantic index does not require an exact match in the way that searching against an inverted index does.  Instead, the terms or phrases you search are mapped onto the existing index and documents that are close enough to those search terms on the map will be returned as results – even if none of the exact terms you searched appear.  Some concept searching features are referred to as natural language search features, and some also offer an option to search for more documents like a given example document, which may be a real sample or a synthetic one, created for the purpose.

Another advantage of searches against these indices is that they can reveal more than just a binary yes or no.  Because of the nuanced multidimensionality of the index, you can get results scored by how responsive they are to your search (i.e., how close or far away each result was on the map).
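The graded nature of these results can be sketched with a toy similarity scorer.  The document names, 2-D coordinates, and query position below are invented stand-ins for what a real semantic index would compute:

```python
# Concept-search scoring sketch: instead of a binary keyword match, each
# document gets a graded score based on its angle from the query in the
# semantic map. Names and coordinates are invented illustrations.
import math

doc_map = {
    "termination_memo.msg":  (0.9, 0.1),
    "severance_plan.docx":   (0.8, 0.2),
    "fire_drill_notice.msg": (0.1, 0.9),
}
query = (1.0, 0.0)  # a query about employee terminations

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Rank all documents by score -- even ones sharing no exact search terms.
ranked = sorted(doc_map.items(), key=lambda kv: cosine(query, kv[1]),
                reverse=True)
```

Here the two termination-related documents receive high scores and the fire-drill notice a low one, giving a ranked spectrum rather than a yes/no answer.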

Concept Clustering

Concept clustering is an automated, unsupervised process in which software analyzes the semantic index that has been created.  Rather than looking for the closest matches to a user-provided search, the software looks for the densest clusters of related materials it has identified and groups those results together into clusters defined by their most frequently occurring terms.  How dense a cluster must be to qualify is typically a customizable property.  Those clusters can then provide an alternative way to explore a collection of documents, to learn about the scope of topics and range of materials it contains, and to identify areas for further exploration.
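A minimal k-means sketch can stand in for the proprietary clustering engines these platforms use.  The document names and 2-D semantic-map coordinates are invented, and real clustering operates over many more dimensions with tunable density thresholds:

```python
# Unsupervised clustering sketch (k-means) over toy 2-D semantic-map
# coordinates. Names and coordinates are invented illustrations.
points = {
    "hr_memo_1": (0.9, 0.1), "hr_memo_2": (0.85, 0.15),
    "safety_1":  (0.1, 0.9), "safety_2":  (0.2, 0.8),
}

def kmeans(points, centroids, iters=10):
    """points: name -> (x, y); centroids: starting (x, y) guesses."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for name, (x, y) in points.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: (x - centroids[i][0]) ** 2
                                        + (y - centroids[i][1]) ** 2)
            groups[nearest].append(name)
        # Recompute each centroid as the mean of its assigned points.
        # (A production implementation would also handle empty clusters.)
        centroids = [(sum(points[n][0] for n in g) / len(g),
                      sum(points[n][1] for n in g) / len(g))
                     for g in groups if g]
    return groups

clusters = kmeans(points, centroids=[(1.0, 0.0), (0.0, 1.0)])
```

No user supplies topics or examples; the software discovers that the HR memos and the safety documents form two distinct groups on its own.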


Categorization

Categorization is akin to a hybrid between concept searching and concept clustering.  It is a process in which a user selects a set of example documents to define a cluster for the software, and then the software attempts to find all the other documents that should go in that cluster with the examples provided.  This is one of the basic workflows underlying technology-assisted review – or what is now referred to as TAR 1.0.
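The user-defined-cluster idea can be sketched with a simple nearest-centroid check.  The document names, coordinates, and distance threshold are invented illustrations of what a real categorization engine would compute internally:

```python
# Categorization sketch: a user tags a few example documents, and the
# software finds other documents within a distance threshold of their
# centroid on the semantic map. All names and values are invented.
import math

semantic_map = {
    "example_1":   (0.9, 0.1),   # user-tagged example documents
    "example_2":   (0.8, 0.2),
    "candidate_a": (0.85, 0.15),
    "candidate_b": (0.1, 0.9),
}
examples = ["example_1", "example_2"]

# The centroid of the user-provided examples defines the category.
cx = sum(semantic_map[n][0] for n in examples) / len(examples)
cy = sum(semantic_map[n][1] for n in examples) / len(examples)

def in_category(name, threshold=0.3):
    x, y = semantic_map[name]
    return math.hypot(x - cx, y - cy) <= threshold

matches = [n for n in semantic_map
           if n not in examples and in_category(n)]
```

The candidate near the examples is pulled into the category; the distant one is left out.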

Technology-Assisted Review

Technology-assisted review is used to refer to a family of workflows that leverage categorization (or similar functions), in combination with sampling, to achieve a reliable document review process that requires significantly fewer hours of manual, human review than traditional all-manual approaches.  Since its initial rise to prominence in 2011, the available array of TAR tools has expanded and evolved, and eDiscovery service providers have continued to develop new workflows to leverage them in useful ways for the diverse range of projects their clients face.

Although full deployment of a TAR workflow is typically part of the review phase of an eDiscovery project, these workflows – or limited versions of them – may also be leveraged to explore a collection during ECA, to organize and prioritize it for a more traditional review process, or to create a yardstick against which to measure a more traditional review process.

TAR 1.0 – LSI, Predictive Coding

TAR 1.0 is used to refer to the initial, categorization-based workflows offered in eDiscovery – many of which were, and are, referred to as predictive coding.  Broadly speaking, these workflows involve leveraging a sampling process to create a training set or seed set (i.e., a user-defined cluster or clusters), which the chosen software then uses to find other similar documents.  These results are then reviewed and coded, and that coding is used to improve the software’s results.  This training cycle is iterated multiple times until an acceptable quality of results is achieved.  The effectiveness of the whole process is measured using either a previously prepared control set or an additional random sampling effort.
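The overall shape of that iterative cycle can be sketched as a training loop measured against a control set.  The centroid "model," the coordinates, the coding labels, and the recall target are all invented placeholders, not a real review platform:

```python
# TAR 1.0 workflow skeleton: train on a seed set, have reviewers code
# additional documents, fold that coding back in, and repeat until a
# control set shows acceptable recall. All values are invented.
import math

def train(coded):
    # Placeholder model: centroid of the relevant examples coded so far.
    rel = [doc for doc, label in coded if label]
    cx = sum(x for x, _ in rel) / len(rel)
    cy = sum(y for _, y in rel) / len(rel)
    return lambda doc: math.hypot(doc[0] - cx, doc[1] - cy) < 0.3

# Pre-coded control set used only to measure effectiveness.
control_set = [((0.9, 0.1), True), ((0.8, 0.2), True), ((0.1, 0.9), False)]

def recall(model):
    rel = [doc for doc, label in control_set if label]
    return sum(model(d) for d in rel) / len(rel)

truth = {(0.85, 0.15): True, (0.15, 0.85): False}  # reviewers' coding calls
coded = [((1.0, 0.5), True), ((0.0, 1.0), False)]  # initial seed set
uncoded = list(truth)

model = train(coded)
while recall(model) < 0.95 and uncoded:   # iterate the training cycle
    doc = uncoded.pop(0)
    coded.append((doc, truth[doc]))       # human review supplies the label
    model = train(coded)                  # retrain with the new coding
```

Each pass through the loop is one training iteration: review, code, retrain, and re-measure until the quality target is met.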

TAR 2.0 – SVM, Continuous Active Learning

TAR 2.0 is used to refer to more recent workflows developed to leverage new tools based on a mathematical approach called “support vector machines” (SVM).  Rather than being based on identifying the similarities in a large, prepared training set like categorization and TAR 1.0, these workflows are characterized by “continuous active learning” that updates relevance scoring and prioritization for all documents dynamically as each additional document is coded by a reviewer.

This is accomplished by focusing on a single, binary classification (i.e., relevant to topic X versus not relevant to topic X) and analyzing the differences in language between successive, single example documents to identify the “hyperplane” that best divides the relevant examples from the non-relevant examples on a multidimensional map.  Each additional example the software analyzes and maps can lead the software to identify a more efficient hyperplane between the two groups, improving its classifications.
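The one-document-at-a-time updating of a separating hyperplane can be sketched with a simple perceptron update rule, standing in for a production SVM trainer (both learn a linear hyperplane, though an SVM optimizes its position differently).  The 2-D features and labels are invented:

```python
# Continuous-active-learning flavor sketch: each newly coded document
# updates a linear separating hyperplane, and every document can be
# re-scored immediately. A perceptron update stands in for a real SVM
# trainer; features and labels are invented illustrations.

stream = [  # (features, label): 1 = relevant, -1 = not relevant
    ((1.0, 0.1), 1), ((0.9, 0.2), 1),
    ((0.1, 1.0), -1), ((0.2, 0.9), -1),
]
w = [0.0, 0.0]   # hyperplane normal vector
b = 0.0          # hyperplane offset

for (x1, x2), label in stream:            # one coded document at a time
    score = w[0] * x1 + w[1] * x2 + b
    if score * label <= 0:                # misclassified -> move hyperplane
        w[0] += label * x1
        w[1] += label * x2
        b += label

def relevance_score(x1, x2):
    """Signed distance proxy: positive = predicted relevant."""
    return w[0] * x1 + w[1] * x2 + b
```

After the stream is processed, every remaining document can be scored and re-prioritized against the current hyperplane without waiting for a large batch retraining step.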

These workflows emphasize speed over structure, and so they work best in situations where there is a clear, binary classification decision to make and where family groups and other contextual factors are less important than overall speed.

New AI Tools

Beyond these advanced analytic tools, a new breed of tools billing themselves as artificial intelligence is also becoming available.  These new software solutions seek to aid in, or perform, a wide range of common legal tasks from legal research, to contract analysis, to brief review, and more.  Among them are a few tools focused on eDiscovery ECA and other investigative activities, which purport to automatically identify key documents and people, reveal gaps and suspicious behavior, and assist in other useful ways through analysis and visualization.

Advanced Analytic Tools and the Three Goals

These advanced analytic tools can yield benefits for each of the three ECA goals:

  1. For Traditional ECA, advanced analytic tools are some of the most powerful. Concept searching lets you find relevant materials even before you know the best search terms to use; concept clustering lets you explore a cross-section of topics to find unknown unknowns and identify areas for further exploration; and categorization can let you use a few relevant examples – including synthetic ones you make – to find more.  And, if you are dealing with a time-sensitive, investigative matter, TAR 2.0 workflows may be able to rapidly surface relevant materials, and AI analysis and visualization tools may be able to provide other kinds of assistance in completing your picture of what happened.
  2. For EDA, concept clustering can provide a valuable overview of your materials – including revealing an absence of things you expected or the presence of things you don’t need, which can inform decisions about ongoing collection activities. Additionally, AI tools may be able to reveal gaps requiring further collection through their communication mapping and other analysis and visualization tools.
  3. For Downstream Prep, concept clustering can help organize and prioritize subsequent review activity – including both areas to review first and irrelevant areas to skip (e.g., travel emails, fantasy football emails, etc.). Employing categorization or a TAR workflow of some kind can also be used for the same purpose, or to create a yardstick against which subsequent manual efforts can be measured.  AI tools may also help with prioritization of subsequent review work.

Upcoming in this Series

Up next, in the final Part of this series, we will discuss integrating these available tools and techniques into effective ECA workflows.

About the Author

Matthew Verga

Director of Education

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. A fourteen-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
