eDiscovery AI
May 13th

How Important is Active Learning for eDiscovery?

George Socha

Active Learning is a process in which humans train software to identify documents that meet certain criteria. Relevance is a commonly used criterion, but Active Learning also serves other objectives, such as locating privileged content.

Active Learning and its kin go by many names. These include computer-assisted review (CAR), continuous active learning (CAL), infinite learning, machine-assisted review, predictive coding, predictive ranking, simple active learning (SAL), simple passive learning (SPL), supervised machine learning, and technology-assisted review (TAR). Care should be taken when using these terms, as they are not necessarily interchangeable.

With Active Learning, people start by classifying documents. The system presents a reviewer with a set of documents, and the reviewer classifies each one, for example marking Document One as "relevant" and Document Two as "not relevant".

As reviewers classify content, the software learns from their decisions in real time. As it learns, it continuously re-evaluates the likely responsiveness of the remaining documents. Based on these reassessments, the Active Learning system refines its rankings, moving the remaining documents it considers most likely to be relevant to the front of the review queue.
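The review loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the word-vote scoring model, the function names (`train`, `score`, `review_loop`), the sample documents, and the simulated `reviewer` are all assumptions made up for the example. Production systems use far more capable models, but the shape of the loop is the same: a reviewer decides, the model updates, and the queue is re-ranked.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled):
    """Toy model (an assumption): each word gets +1 per relevant
    document it appears in and -1 per non-relevant document."""
    weights = Counter()
    for text, is_relevant in labeled:
        for word in set(tokenize(text)):
            weights[word] += 1 if is_relevant else -1
    return weights

def score(weights, text):
    # Unseen words contribute 0 (Counter returns 0 for missing keys).
    return sum(weights[word] for word in set(tokenize(text)))

def review_loop(documents, reviewer, batch_size=1):
    """Re-rank the unreviewed documents after every batch of decisions."""
    labeled = []             # (document, decision) pairs made so far
    queue = list(documents)  # documents not yet reviewed
    order = []               # order in which documents reached the reviewer
    while queue:
        weights = train(labeled)
        # Highest score first; ties keep their existing order.
        queue.sort(key=lambda doc: score(weights, doc), reverse=True)
        batch, queue = queue[:batch_size], queue[batch_size:]
        for doc in batch:
            labeled.append((doc, reviewer(doc)))
            order.append(doc)
    return order

# Hypothetical documents and a simulated reviewer, for illustration only.
docs = [
    "lunch menu friday",
    "merger pricing agreement",
    "parking garage closed",
    "pricing discussion with competitor",
    "merger timeline draft",
]

def reviewer(text):
    return bool({"merger", "pricing"} & set(tokenize(text)))

order = review_loop(docs, reviewer)
for position, doc in enumerate(order, start=1):
    print(position, doc)
```

In this toy run the parking notice starts third in the collection but is reviewed last, because each "relevant" decision pulls the remaining merger and pricing documents ahead of it.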

Active Learning in eDiscovery

Lawyers and related professionals have been using Active Learning and similar tools in electronic discovery in lieu of manual review for over a decade. The exact starting date is not clear. "For TAR, there was no 'Mr. Watson, come here!' moment of invention. Rather, TAR developed and took hold gradually," noted the authors of the 2012 book TAR for Smart People 2.0. "It is impossible to pinpoint when any of the names we now use for this process - not just TAR but also 'computer-assisted review' and 'predictive coding' - first came into use."

Norton Rose Fulbright was an early adopter of these technologies. “We rolled out predictive coding in 2008, raising a lot of eyebrows,” noted Florinda Baldridge, the firm's US Director of Global E-Discovery and Litigation Technology. “The results were positive from the very beginning and as we look to the future active learning will continue to play a key role in the work we do.”

In his 2012 decision in Da Silva Moore v. Publicis Groupe, Magistrate Judge Andrew Peck permitted the parties to move forward with TAR. That decision was widely cited as a judicial endorsement and accelerated adoption. Today, most eDiscovery platforms offer some form of Active Learning.

When used well, Active Learning helps legal teams deliver higher-quality results faster, saving time and money. With Active Learning, otherwise burdensome steps for human reviewers, such as building seed sets, training sets, and control sets, can be scaled back or eliminated. In addition, because Active Learning is designed to put the most responsive documents at the front of the queue, it gives attorneys more time to make good use of those documents in the matters they are handling.

Active Learning algorithms can be used in myriad ways. At the outset of a matter, Active Learning can work as an early case assessment tool, helping lawyers identify potentially significant issues, people, and events. Review teams commonly use Active Learning to prioritize content, putting potentially pertinent documents at the head of review queues. Active Learning can be deployed for more than just relevance; it can be used, for example, to help find privileged content or materials germane to specific issues in a case. Active Learning also can be used with incoming productions, both to assess the thoroughness of the production and to look for key content.

Real World Examples

Here are three case studies showing how Active Learning and its kin can deliver better results than traditional approaches such as linear review.

30 TB in 60 Days

By using Reveal's Brainspace supervised learning technology, a litigation support provider (LSP) was able to help an AmLaw 100 law firm review 30 terabytes of data in under 60 days. The law firm represented a global entertainment company that received a Second Request from the U.S. Department of Justice in connection with a proposed $60 billion acquisition. The DOJ gave the company and its outside lawyers 60 days to analyze, review, and produce responsive material from a 30-terabyte document collection. The data contained a high volume of duplicates, and the law firm and LSP needed to work with an overly inclusive list of keyword terms.

The LSP used Reveal Brainspace's Intelligence Coding workflow, which leverages logistic regression to provide predictive ranks for auto-coding documents. Using that approach, the LSP and law firm were able to review on average 40% to 50% fewer training documents than with other supervised learning platforms, auto-code approximately 85% of the document review population, and still achieve agreed-upon recall and precision rates, producing just under 222,000 documents to the DOJ within the 60-day deadline.
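To make the idea of logistic-regression "predictive ranks" concrete, here is a minimal sketch in plain Python. It is not Brainspace's workflow or API: the tiny training set, the bag-of-words features, the hand-written gradient-descent trainer, and the 0.9 / 0.1 auto-coding cutoffs are all assumptions chosen for illustration. The point is only that logistic regression turns reviewer decisions into a probability-like score between 0 and 1, which can then drive ranking or auto-coding.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def featurize(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocab]

def fit_logistic(X, y, epochs=500, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the loss with respect to the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predictive_rank(text, vocab, w, b):
    x = featurize(text, vocab)
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def auto_code(rank, high=0.9, low=0.1):
    # Hypothetical cutoffs; real thresholds are set per matter and validated.
    if rank >= high:
        return "relevant"
    if rank <= low:
        return "not relevant"
    return "needs human review"

# Hypothetical reviewer decisions (1 = relevant, 0 = not relevant).
training = [
    ("merger pricing agreement", 1),
    ("pricing call with competitor", 1),
    ("lunch menu friday", 0),
    ("parking garage closed", 0),
]
vocab = sorted({word for text, _ in training for word in text.lower().split()})
X = [featurize(text, vocab) for text, _ in training]
y = [label for _, label in training]
w, b = fit_logistic(X, y)

unreviewed = ["draft merger pricing memo", "friday lunch order"]
ranks = {doc: predictive_rank(doc, vocab, w, b) for doc in unreviewed}
for doc, rank in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{rank:.3f}  {auto_code(rank):<18}  {doc}")
```

Documents scoring between the two cutoffs stay with human reviewers, which is one simple way a team could auto-code the confident majority of a population while keeping people on the uncertain remainder.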

Many Languages and a Diverse Low-Richness Data Set

As part of an international governmental investigation, a company facing a tight production deadline needed to organize a review of 12.5 million documents that included multiple languages and that had been selected with an overly broad set of keyword terms.

A total of 60,000 previously reviewed documents were used to train Reveal Brainspace's Continuous Multimodal Learning (CMML) model, an integrated set of features designed to support flexible interactive supervised learning workflows. The CMML Predictive Ranks then were used to prioritize review, ensuring that the most relevant documents got reviewed first. In addition, Reveal's patented Diverse Active Learning was used to build a small training round from a widely diverse data set, further optimizing the Predictive Ranks. Using these approaches, the company reduced the review volume by 85%, from 1.8 million documents to 280,000, and achieved a richness level of 3.7% from a low-richness data set, a 270% improvement over traditional keyword searching results. The company saved about 19,000 review hours, or over $750,000 in attorney fees.
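For readers who want to check the arithmetic behind these figures, a short calculation follows. The inputs come straight from the case study above; the implied hourly rate is a derived number, not one the case study states, and because the savings are described as "over" $750,000 it should be read as a floor.

```python
# Figures quoted in the case study above.
docs_before = 1_800_000
docs_after = 280_000
hours_saved = 19_000
fees_saved = 750_000  # "over $750,000", so treat as a lower bound

reduction_pct = (docs_before - docs_after) / docs_before * 100
implied_rate = fees_saved / hours_saved  # derived, not stated in the source

print(f"Review volume reduction: {reduction_pct:.1f}%")     # 84.4%, reported as 85%
print(f"Implied review rate: at least ${implied_rate:.2f}/hour")  # $39.47/hour
```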

37 Days from Collection to Production, $1 Million Saved

Facing a tight deadline and with high stakes riding on the outcome, a law firm needed to review 1.3 million documents in response to a DOJ antitrust Second Request.

McDermott Discovery turned to Reveal AI (formerly known as NexLP Story Engine) and its machine learning and artificial intelligence functions. Subject matter experts trained the system in three and a half days, and the system reached stability after only 1,300 sample documents, as opposed to the 8,000 to 10,000 documents typically needed with traditional machine learning applications. With the aid of Reveal AI, the firm produced 129,000 documents just 37 days after collecting the data, saving nearly $1 million in lawyer review fees.

Learn More

This discussion just touches on the many ways Reveal's AI and Active Learning can accelerate the process of evaluating documents and data, help you locate the content you most need to know about as you investigate your matters, and enable you to find more like it when you need to respond to production demands.

If your organization is interested in learning about how Reveal and Reveal AI can help you with your litigation and investigative needs, contact Reveal to learn more. We’ll be happy to show you how our authentic artificial intelligence takes review to the next level, with our AI-powered, end-to-end document review platform.