A multi-part series on the fundamentals eDiscovery practitioners need to know about the processing of electronically-stored information
In “Why Understanding Processing is Important,” we discussed understanding processing in the context of lawyers’ duty of technology competence. In this Part, we review the key steps in processing and some common tools used to complete them.
Broadly speaking, there are four main activities that take place during processing: expansion, extraction and normalization, indexing, and objective culling. In this Part, we will discuss the first three of these activities and touch on the tools commonly used to complete them. The fourth activity, objective culling, requires a whole post of its own and will be covered later in this series.
The first thing that must be accomplished when processing ESI for discovery is the expansion of all containers. Container files are files that contain other files within them. Common end-user examples include PST files, which can contain large numbers of individual email messages, and ZIP files, which can contain any combination of file types. Additionally, the collection process often generates container files that must also be expanded (e.g., hard drive images). Beyond container files that hold groups of files, many file types can also contain individual embedded objects.
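To make the idea of expansion concrete, here is a minimal sketch in Python that recursively expands ZIP containers, including ZIP files nested inside other ZIP files. It is an illustration only, not how any particular processing tool works: the function name `expand_container` and the path format are our own assumptions, and it handles only ZIP, not PST files or disk images. Note also that it treats anything with a ZIP structure as a container, which would include modern Office documents; a real tool would decide per file type whether to expand.

```python
import io
import zipfile


def expand_container(data: bytes, path: str = "") -> list:
    """Recursively expand a ZIP container held in memory.

    Returns (path, content) pairs for every non-container file found,
    including files inside nested ZIP archives. The path records which
    container(s) each file came from.
    """
    results = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            member_path = f"{path}/{info.filename}" if path else info.filename
            content = archive.read(info)
            if zipfile.is_zipfile(io.BytesIO(content)):
                # Nested container: expand it rather than keeping it as one file.
                results.extend(expand_container(content, member_path))
            else:
                results.append((member_path, content))
    return results
```

Calling `expand_container(zip_bytes, "collection.zip")` on a ZIP that itself contains another ZIP returns a flat list of the individual files, each with a path showing the chain of containers it was found in.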
Examples of embedded objects include an attachment to an email, a logo image in an email signature, or a spreadsheet chart embedded in a Word document. Embedded objects can be handled in different ways, and it is common to separate some to handle as discrete files and to leave others embedded. For example, it is common to extract all email attachments as their own files (though tracked in family groups with their parent email), and it is common to leave in-line images embedded in their documents so they can be reviewed in context. Decisions about how to handle different kinds of embedded objects can affect both searchability and reviewability later.
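The family-group idea can be sketched with Python's standard `email` library. In this hedged example, each attachment becomes its own record but keeps a pointer back to its parent email. The function name `extract_family`, the record fields, and the document ID scheme (`DOC0001.001`) are hypothetical choices for illustration, not a standard.

```python
from email import message_from_bytes, policy


def extract_family(raw_email: bytes, parent_id: str) -> list:
    """Split one email into a parent record plus one record per attachment.

    Every attachment record carries the parent's ID so the family can be
    reassembled during review.
    """
    message = message_from_bytes(raw_email, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    family = [{
        "doc_id": parent_id,
        "parent_id": None,  # the email itself is the head of the family
        "name": str(message["subject"]),
        "content": body.get_content() if body is not None else "",
    }]
    for number, part in enumerate(message.iter_attachments(), start=1):
        family.append({
            "doc_id": f"{parent_id}.{number:03d}",
            "parent_id": parent_id,
            "name": part.get_filename(),
            "content": part.get_content(),  # bytes for binary attachments
        })
    return family
```

The sketch reads only the plain-text body and the attachments that the library reports; how in-line images and other embedded objects are treated is exactly the kind of handling decision described above.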
As we noted in the last Part, ESI for discovery can come in hundreds of different file formats, each of which could require a different piece of software to view natively. To spare litigants on either side from having to use all those different source applications, and to enable integrated management, searching, and review across file types, ESI must undergo a process of extraction and normalization. In this process, all of the content is extracted from your collection of electronic files, and that extracted content is normalized into a consistent, usable format.
The content to be extracted includes all of the text from the files (e.g., the body of an email or a Word document), and it may include any objects embedded within documents, as noted above. For files with imaged text (e.g., scanned documents), it is common for optical character recognition (OCR) to be used to attempt extraction of the available text. Beyond a file’s primary textual content, its metadata must also be extracted (e.g., created by, date modified), and its relationships to any other files must be documented (e.g., whether it is, or has, an attachment).
All of this extracted content is normalized into database fields that can be displayed consistently by document review software to facilitate downstream discovery activities like early case assessment and review. Decisions may need to be made at this stage about how hidden content in documents should be handled and what version of a document and its text should be extracted and displayed (e.g., tracked changes in Word, speaker notes in PowerPoint). Typically, the native files are still available too, linked to their extracted, normalized content, so that the native versions can be viewed instead when necessary. (For example, the extracted text from a spreadsheet is not very comprehensible compared to the native version of that spreadsheet.)
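As a rough illustration of normalization, the sketch below maps metadata that arrives under different names in different file types onto one consistent set of fields, while keeping a link back to the native file. Everything named here is an assumption made for the example: the `NormalizedDocument` fields, the `FIELD_ALIASES` mapping, and the rule that timestamps without a time zone are treated as UTC. Real tools use far larger schemas and make such choices configurable.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical mapping: the same concept appears under different
# names depending on the source file type.
FIELD_ALIASES = {
    "author": ("Author", "creator", "From"),
    "date_modified": ("ModifyDate", "last_modified", "modified"),
}


@dataclass
class NormalizedDocument:
    doc_id: str
    native_path: str  # link back to the native file
    extracted_text: str
    author: Optional[str] = None
    date_modified: Optional[datetime] = None
    parent_id: Optional[str] = None  # family relationship, if any


def first_present(raw_metadata: dict, aliases: tuple) -> Optional[str]:
    """Return the value of the first alias found in the raw metadata."""
    for key in aliases:
        if key in raw_metadata:
            return raw_metadata[key]
    return None


def to_utc(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp and express it in UTC."""
    if value is None:
        return None
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption for this sketch: treat zone-less timestamps as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def normalize(doc_id, native_path, text, raw_metadata, parent_id=None):
    """Build one consistent record from file-type-specific inputs."""
    return NormalizedDocument(
        doc_id=doc_id,
        native_path=native_path,
        extracted_text=text.replace("\r\n", "\n").strip(),
        author=first_present(raw_metadata, FIELD_ALIASES["author"]),
        date_modified=to_utc(
            first_present(raw_metadata, FIELD_ALIASES["date_modified"])
        ),
        parent_id=parent_id,
    )
```

Converting every timestamp to a single time zone is one example of a normalization decision that affects what reviewers see later, for instance when documents are sorted or filtered by date.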
In order to make all of those downstream activities possible, all of the extracted content also needs to be indexed. Indexing is the process of creating the enormous tables of information that are used to power search features. Most common are inverted indices, which essentially make it possible to look up documents by the words within them. Inverted indices are like more elaborate versions of the indices you find in the backs of books. Decisions about how indices should be generated and what common words (e.g., articles, prepositions) they should skip affect the completeness of search results later. A search can only find what the index contains.
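A toy inverted index shows why these decisions matter. The sketch below, in Python, maps each word to the set of documents containing it and skips a small list of common words. The stop word list, the tokenization rule (lowercase runs of letters and digits), and the function names are all simplifying assumptions; production search engines such as dtSearch are far more elaborate.

```python
import re
from collections import defaultdict

# A deliberately tiny, hypothetical stop word list.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "to", "and"})


def tokenize(text: str) -> list:
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents: dict, stop_words=STOP_WORDS) -> dict:
    """Map each indexed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term not in stop_words:
                index[term].add(doc_id)
    return dict(index)


def search(index: dict, query: str) -> set:
    """Return the documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

With this index, a search for "agreement" returns every document containing that word, but a search for "the" returns nothing at all, because "the" was skipped when the index was built, even though the word appears in the documents.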
More sophisticated semantic indices are created to power features like concept searching, concept clustering, and technology-assisted review. These multi-dimensional indices come in a few varieties, but essentially, they all document how frequently words appear in the same documents as other words. From these co-occurrences, the tools identify clusters of related terms that define a topical area and transcend any one specific keyword. Some customization of this process is also possible, which will affect what results the index reveals.
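The raw material for such an index can be sketched very simply: count how often each pair of words appears in the same document. The Python below does only that counting step. It is not how any commercial semantic indexing tool works internally; those tools apply further mathematics on top of co-occurrence data. The function names and the letters-only tokenization are assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents: list) -> Counter:
    """Count, for each pair of terms, how many documents contain both."""
    counts = Counter()
    for text in documents:
        # Use each distinct term once per document, in sorted order,
        # so every pair is always stored as (earlier, later).
        terms = sorted(set(re.findall(r"[a-z]+", text.lower())))
        counts.update(combinations(terms, 2))
    return counts


def related_terms(counts: Counter, term: str, top_n: int = 3) -> list:
    """List the terms that most often share a document with `term`."""
    scores = Counter()
    for (first, second), shared_docs in counts.items():
        if first == term:
            scores[second] += shared_docs
        elif second == term:
            scores[first] += shared_docs
    return scores.most_common(top_n)
```

Even in this crude form, terms such as "invoice" and "payment" that repeatedly share documents end up linked to each other, while unrelated terms do not. That link between words, rather than any single keyword, is what concept-based features build upon.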
All of these core processing activities are completed using specialized software tools. These tools are able to automatically perform expansion, extraction, normalization, and traditional indexing for many common file types, and they can generally be operated manually or customized as needed to handle more challenging or less common file types. How wide a range of common file types and issues can be handled automatically varies from tool to tool, as does how easily (and to what extent) custom solutions can be implemented to go beyond that range.
Those tools might come in the form of off-the-shelf processing software like eCapture by IPRO or LAW PreDiscovery by LexisNexis. They might come integrated with an enterprise review platform or with an enterprise collection platform. Because of the frequent need for adaptability, customizability, and scalability, many providers (like XDD) develop proprietary processing tools to handle these activities. Such organizations may use a tool like dtSearch to perform the necessary traditional indexing.
Semantic indices are created using specialized software like the CAAT tools from Content Analyst Company, which are integrated into (and now owned by) Relativity.
Upcoming in this Series
In the next Part, we will review some common errors and special cases that can arise during these processing activities.