A multi-part series on the fundamentals eDiscovery practitioners need to know about the processing of electronically stored information
In “Why Understanding Processing is Important,” we discussed understanding processing in the context of lawyers’ duty of technology competence. In “Key Activities and Common Tools,” we discussed the core processing activities and some of the tools used to complete them. In “Common Exceptions and Special Cases,” we discussed scenarios requiring extra work and decisions during core processing activities. In this Part, we review the options for objective culling.
In addition to expansion, extraction, normalization, and handling exceptions and special cases, processing also includes several types of objective culling that are used to reduce the amount of material that must be worked with throughout the subsequent phases of a discovery project, saving both time and money. The objective culling options commonly employed during processing are de-NISTing, deduplication, and content filtering. It is important to understand the operation and limitations of these culling options so that you can make informed decisions about how to deploy them in your projects.
Basic de-NISTing is a standard step performed in almost all discovery processing efforts. De-NISTing removes the known files that make an operating system or a software program work, such as executables, device drivers, initialization files, and others, which together can make up more than half the volume captured in a drive image. The trend towards more targeted collection methods and fewer full images has reduced the impact of de-NISTing on overall volume somewhat, but it is still an important step to remove material certain to be irrelevant.
De-NISTing does not accomplish this by file type or file name, but by identifying specific, known files that exactly match those originally released by the developer – thereby ensuring that they contain no user alterations or user-created content and are, therefore, irrelevant. The process works by using “hashing” to compare collected files against lists of known system files.
Hashing is a technique by which sufficiently unique file fingerprints can be generated by feeding the files into a cryptographic hash function (commonly MD5 or SHA-1) that generates a fixed-length output called a “hash” or “hash value” (e.g., a string of 32 numbers and letters). Identical files produce identical hash values, and hash values can be easily compared by software to automatically identify matches.
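As a rough illustration of how this works, the short Python sketch below computes an MD5 hash for a file, reading it in chunks so that even very large files can be fingerprinted without loading them entirely into memory. (This is a generic example, not the implementation used by any particular processing tool.)

```python
import hashlib

def hash_file(path: str) -> str:
    """Compute the MD5 hash of a file's contents, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read 8 KB at a time until the file is exhausted
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()  # a fixed-length, 32-character hex string
```

Two files with byte-for-byte identical contents will always produce the same hash value, which is what makes automated matching possible.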
The “NIST” in the name of this culling option is an acronym for the National Institute of Standards and Technology, which maintains the National Software Reference Library (NSRL) – the source of the Reference Data Sets used for this process. The NSRL’s lists of hash values for known files are updated a few times per year, but because of the volume and diversity of software in use today, and the frequency with which software is updated, those lists are rarely complete enough to remove all of the irrelevant software and system files that are present.
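Conceptually, the de-NISTing comparison itself is simple set membership: any collected file whose hash appears in the reference list is removed. A minimal sketch (with made-up hash values for illustration; a real NSRL Reference Data Set contains tens of millions of entries):

```python
def denist(file_hashes: dict, known_hashes: set) -> dict:
    """Keep only files whose hash does NOT appear in the reference set."""
    return {path: h for path, h in file_hashes.items() if h not in known_hashes}

# Hypothetical reference hashes and collected files, for illustration only
nsrl_hashes = {"a1b2c3", "d4e5f6"}
collected = {
    "C:/Windows/System32/driver.sys": "a1b2c3",   # matches a known system file
    "C:/Users/jdoe/Documents/memo.docx": "9f8e7d",  # user-created document
}
remaining = denist(collected, nsrl_hashes)
```

Only the user document survives the comparison; the known system file is culled.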
To compensate, some supplemental filtering is often performed along with de-NISTing. This supplemental filtering may be accomplished using either “stop filters” (also called “exclusion filters”) or “go filters” (also called “inclusion filters”). Stop filters exclude specified file types and leave everything else included, while go filters do the opposite, excluding everything and leaving only specified file types included. The difference is what happens to your unknown unknowns (i.e., stop filters let them pass through to later steps and go filters eliminate them at this point).
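The difference between the two filter types can be made concrete with a small sketch (generic code, with an unfamiliar `.xyz` extension standing in for an "unknown unknown"):

```python
def file_ext(name: str) -> str:
    """Return the lowercase file extension."""
    return name.rsplit(".", 1)[-1].lower()

def stop_filter(files, excluded_exts):
    """Exclusion filter: remove listed types; everything else,
    including unknown unknowns, passes through."""
    return [f for f in files if file_ext(f) not in excluded_exts]

def go_filter(files, included_exts):
    """Inclusion filter: keep only listed types; unknown unknowns
    are eliminated at this point."""
    return [f for f in files if file_ext(f) in included_exts]

files = ["memo.docx", "driver.sys", "mystery.xyz"]
```

Applying `stop_filter(files, {"sys"})` keeps the memo and the mystery file, while `go_filter(files, {"docx"})` keeps only the memo – the mystery file never reaches later steps.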
Deduplication leverages the same hashing process described above for de-NISTing, but uses it to compare the collection against itself rather than against an outside file list. The operation of computers and enterprise information systems naturally produces many identical copies of files and messages in many different places, and those copies add no additional relevant information. Hashing a collection and scanning the values for duplicates typically allows for a significant volume reduction, and it is a standard discovery processing step for this reason.
Deduplication can be performed on either the file level or the family group level, but the family group level is far more common because the integrity of family groups is typically maintained throughout discovery. For example, if the same spreadsheet was attached to two different emails, neither copy would be removed, but if two copies of the same email with the same spreadsheet attachment were present, one of those identical family groups would be removed.
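One common way to implement family-level deduplication is to derive a single hash for each family group by combining the hashes of its members, so that two families match only when every member matches. The sketch below illustrates the idea (a simplified, generic approach – real tools may combine member hashes differently):

```python
import hashlib

def item_hash(content: bytes) -> str:
    """Hash one item (e.g., an email body or an attachment)."""
    return hashlib.md5(content).hexdigest()

def family_hash(members) -> str:
    """Hash a family group by combining the sorted hashes of its members."""
    combined = "".join(sorted(item_hash(m) for m in members))
    return hashlib.md5(combined.encode()).hexdigest()

def dedupe_families(families):
    """Keep the first occurrence of each unique family group."""
    seen, kept = set(), []
    for fam in families:
        h = family_hash(fam)
        if h not in seen:
            seen.add(h)
            kept.append(fam)
    return kept
```

Run against the spreadsheet example above, the same attachment on two different emails yields two different family hashes (both families are kept), while two identical email-plus-attachment families yield the same hash (one is removed).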
Deduplication can also be performed on either the custodian level or globally, but global deduplication is far more common (since most review programs can easily track everywhere a duplicate appeared for later reporting or restoration as needed).
Finally, processing typically affords you the option to perform some content filtering, including date range filtering and keyword filtering (and potentially filtering based on certain metadata fields and values). Date range filtering at this point in a discovery project is very common, as there are often clear temporal limits on relevance that can be applied. There are still some specifics to be aware of, however.
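A date range filter is conceptually straightforward, as the sketch below shows (the field name `sent_date` is hypothetical, not any particular tool's schema). One design choice worth noting: items with no value in the chosen date field are retained here rather than silently discarded, since missing metadata is not evidence of irrelevance.

```python
from datetime import date

def date_filter(items, start, end, field="sent_date"):
    """Keep items whose date falls within [start, end], inclusive.
    Items with no value in the chosen field are retained for
    separate handling rather than silently discarded."""
    return [i for i in items if i.get(field) is None or start <= i[field] <= end]

# Hypothetical items for illustration
items = [
    {"name": "old.msg", "sent_date": date(2015, 1, 1)},
    {"name": "in_range.msg", "sent_date": date(2020, 6, 1)},
    {"name": "undated.docx", "sent_date": None},
]
kept = date_filter(items, date(2020, 1, 1), date(2021, 12, 31))
```

Here the 2015 message is culled, while the in-range message and the undated document both survive.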
Keyword filtering is done less often at this point in a project, because the available search and analysis tools are often less robust than those available during later phases and are generally not directly accessible by the case team (requiring a technician to execute searches and report back, making iterative testing and improvement very cumbersome). However, in cases with keywords that are fixed (whether by negotiation or by court order), it can be safe and useful to employ.
Upcoming in this Series
Next, in the final Part of this series, we will review some additional processing steps required prior to ECA and review, as well as the key takeaways from this series.