The Exquisite eDiscovery Magic of Data Anomaly Detection
Anomaly detection is a powerful tool for anyone seeking to understand the nuances of complex datasets. While many eDiscovery and machine learning tools offer some basic ability to detect data anomalies, few tap into the possibilities opened up when a wide array of anomaly detection capabilities are incorporated into the search, data analysis, and display functions of an eDiscovery platform.
Today, we take a look at the exquisite magic that is data anomaly detection.
What Is Anomaly Detection?
Anomaly detection, also called outlier detection, is a means of finding unexpected, hopefully useful patterns in big data. Definitions include:
- "[T]he process of identifying unexpected items or events in data sets, which differ from the norm." (Toward Data Science.)
- "[A] step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior." (Anodot.)
- "[A]ny process that finds the outliers of a dataset; those items that don’t belong." (BMC.)
- "[A] technique used to identify unusual patterns that do not conform to expected behavior, called outliers." (Oracle.)
Why Does Anomaly Detection Matter?
Anomaly detection is an exploratory tool, helping you find meaningful stories in your data and pointing you in directions you might not have considered. Anomaly detection is also a critical thinking tool, giving you the power to test stories, find weaknesses in them, and figure out how to make them stronger or tear them down.
Here's a real-world example: I first encountered the notion of searching for anomalies or outliers in a matter we began working on in the late 1990s and took to trial in 2002. We represented the defendant. The plaintiff's case appeared to be compelling. The plaintiff alleged that a defect in a component part from our client caused the plaintiff's product to fail prematurely. The plaintiff sought hundreds of millions of dollars in damages, and later well over a billion. Information gleaned from the plaintiff's initial productions seemed to support their story of the case.
We were not content to accept the other side's version of events (what good defense counsel would?), so we pushed for more discovery. After several court orders, the plaintiff ultimately delivered to us the full contents of the IBM AS/400 they used to run their business. We received a huge amount of data, by the standards of the day. It included manufacturing, component parts, sales, customer complaint, complaint inspection, and other related information, some of it structured and some unstructured. The data came to us on 40+ tapes. We restored the data from the tapes and converted it to SQL. We loaded that data onto a tower PC that sat in my office.
At the same time, we sent inspectors to examine the failed products. (There was no question as to whether the products failed; the issue was why they failed.) The inspectors dug into the failed products, sometimes literally, took tens of thousands of photographs, and recorded what may have been hundreds of thousands of measurements.
One of the expert witnesses we worked with was an econometrician, part economist and part statistician. Also on the team were experts in finance, forensic accounting, and data analytics of various stripes. Together, we explored the plaintiff's data along with the inspection data. We formulated and executed SQL searches and dumped the results into Excel and SAS for further, more nuanced analyses.
Our econometrician insisted that we search for outliers. Here is how he explained the process and our objectives, as I recall:
- Data tells stories, if you can figure out how to read it. It can verify or refute the accuracy of a story that already has been formulated. It also can help you find stories you did not realize the data contained.
- With care and creativity (and the right expertise and experience), you can map out a story's data into a visualization that looks like a somewhat jagged bell curve.
- Once you map the data, you generally can ignore most of the information in the middle part of the curve. It's only a baseline of information that conforms to what you already know.
- Rather, you want to focus on two areas. Look at the far edges of the curve and look for any upward spikes that break away from the standard deviation. These areas represent abnormalities, or pieces of data that might indicate weaknesses in the story or that could suggest a new story whose curve better matches the data.
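To make the idea of looking at the edges of the curve concrete, here is a minimal Python sketch of one common way to flag outliers: measuring how many standard deviations each value sits from the mean (a z-score). This is not necessarily the method used in the matter described here, which relied on SQL, Excel, and SAS. The function name `find_outliers`, the threshold of 2.0, and the weekly complaint counts are all invented for illustration.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return (index, value, z_score) for points far from the mean.

    threshold is an assumed cutoff in standard deviations, not a rule
    taken from the article.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [
        (i, v, (v - mu) / sigma)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]

# Hypothetical weekly complaint counts; the spike is invented for illustration.
weekly_complaints = [12, 14, 11, 13, 15, 12, 14, 48, 13, 12]
outliers = find_outliers(weekly_complaints)
for index, value, z in outliers:
    print(f"week {index}: {value} complaints (z = {z:.2f})")
```

With this sample data, only the week holding the value 48 is flagged; the remaining weeks form the baseline in the middle of the curve.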
Using this outlier approach, we crafted a new story that our experts thought better fit the data. When we mapped out the new story, we found far fewer spikes and little of interest at either end of the curve. The new story: The plaintiff grew too fast. As the company grew, it lost control of the quality of its product, hence the failures.
Ever since working on that matter, I have been on the lookout for eDiscovery technology that would give me the ability to find and explore multivariate anomalies - and I have sought technology that gave me the ability to do this without having to deploy an extensive and expensive team of experts to get there.
What You Can Do With Anomaly Detection
Today's data scientists have developed algorithms that deliver on that vision for use cases many of us never would have foreseen back then. Here are some of the things for which anomaly detection is used these days:
- Bank fraud and anti-money laundering: Anomaly detection software can analyze banking transactions in real-time, alerting bank employees of deviations from typical patterns of behavior.
- Defects on concrete structures: Deep learning can be used for anomaly detection of defects on concrete structures, working from image datasets.
- Medical diagnoses: Collected physiological data can be analyzed for prediction or diagnosis in areas such as examining endoscopy images for ulcerative colitis.
- Cyber attacks: Anomaly-based Intrusion Detection Systems (AIDS) can be used to identify unknown and obfuscated malicious software (malware).
- Catastrophic weather events: Weather data can be searched for anomalies to better predict catastrophic weather events, helping reduce the human and financial costs of those crises.
Anomalies Detected in eDiscovery
Just as anomaly detection methods can be used to find outliers in banking, construction, and medicine, they can be used in eDiscovery to find behavioral or other patterns that help build up or tear down the story of a case.
Reveal AI helps you look for a wide range of data anomalies. Using criteria such as date range, sender, receiver, concepts, tone, named entities, and domains, Reveal AI can identify unusual behavior. It searches for patterns that break with typical activity.
Reveal AI gives you the ability to find and work with anomalous data in various ways. Anomalies are brought to light via the use of entities, sentiment, topics of interest, and related people. Specific ways of working with that data include cards and baseball cards, discussed below, as well as others we will address another day.
Reveal AI lets you look for anomalous data about entities, both named entities and custom entities. To better understand what you can do with entities and anomaly detection, first it helps to know what we mean by entities.
A named entity is an extracted piece of data identified by proper name by Reveal AI. The data might be extracted from a person, place, thing, event, category, or formatted data such as a credit card number. By default, Reveal AI identifies over 21 entity types; you can train the system to identify yet more. The standard named entity types include:
- Person (discussion about a person)
- Geo-political (city, state, country)
- Organization (company)
- Money (currency discussed)
- Temporal (dates discussed)
- Law (legal jargon)
- Quantity (metrics)
- Groups (e.g. Democrats, Republicans)
- Location (specific areas)
- Technology (jargon)
- Topic (conceptual focus of sentence)
- Category (concept contained in document; hierarchical to Super Category)
- Event (e.g. Super Bowl)
- Ordinal (numbers)
- Product (discussion about a product)
- Work of Art (discussion about music, books, etc.)
- Summary Phrase (extracted important phrases from documents)
- Law Firm (Law Firms mentioned in document)
- Super Category (broad concept contained in document; hierarchical to Category)
A custom entity is an entity type that you create with AI models. You build custom entities using your data and you design them to fit your workflow. Once you create a custom entity, you can use it the same ways you would use any other named entity, including to create yet more AI models.
Once you begin to work with entities, you can start using information associated with those entities. Here are some of the ways you can do that.
Reveal AI analyzes the use of language for sentiment. Sentiment is the tone of a communication and it can be negative or positive. A writer using words and tone associated with negative connotations is expressing negative sentiment. An author who uses words and tone having positive connotations is expressing positive sentiment.
The author of an email message might write, "Everything is a mess and we need to shut this down right away. However, the staff is nice." This communication contains verbal classifiers for both negative sentiment ("Everything is a mess...") and positive sentiment ("...the staff is nice").
Reveal AI also looks at sentiment in the aggregate and displays the results in cards (more on cards below). For the example below (and using Enron data), I selected the card view and sorted the cards by sentiment, from highest sentiment to lowest.
In the highlighted card, you see that Jeff Dasovich sent email to Richard Shapiro with negative sentiment late at night and on weekends. He did this nine times more than normal - six times over four weeks - from July 6, 2001 to August 3, 2001.
By finding negative and positive sentiments, Reveal AI adds greater context to the patterns you find in data and the stories you can construct from those patterns.
By analyzing sentiment, Reveal AI enables users to search content for emotions that include intent, opportunity, pressure, rationalization, positivity, and negativity.
For searching, emotional scores are arranged into five groups: any score; no score; low (1-3); medium (3-7); and high (7-infinity).
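As a rough illustration of how score groups like these could work as search filters, here is a small Python sketch. It is not Reveal AI's implementation. The function name `matches_group` is hypothetical, and because the ranges as listed share the boundaries 3 and 7, the sketch assumes half-open ranges, so a score of exactly 3 counts as medium and exactly 7 counts as high.

```python
from typing import Optional

# Assumed half-open ranges [low, high). The article does not say which
# group owns the shared boundaries 3 and 7.
RANGES = {
    "low": (1.0, 3.0),
    "medium": (3.0, 7.0),
    "high": (7.0, float("inf")),
}

def matches_group(score: Optional[float], group: str) -> bool:
    """Return True if a score belongs to the named search group.

    A score of None stands for a document with no emotional score.
    """
    if group == "no score":
        return score is None
    if score is None:
        return False
    if group == "any score":
        return True
    low, high = RANGES[group]
    return low <= score < high

print(matches_group(2.5, "low"))       # True
print(matches_group(3.0, "medium"))    # True under the half-open assumption
print(matches_group(None, "no score")) # True
```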
Topics of Interest
Topics are the most representative and informative phrases that Reveal AI finds in the text available to it. Reveal AI computes topics based on summary phrases from each communication as well as the context of the communication if it is part of a larger thread. A single communication can contain multiple topics.
Topics of interest are topics that are anomalous in one of two ways. Hotly debated topics of interest are topics from a person's documents with medium or high negative sentiment. At unusual hours topics of interest are topics from a person's documents that were sent after business hours or on weekends.
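The two tests described here can be sketched in a few lines of Python. This is a hypothetical illustration, not Reveal AI's logic: the business hours of 9:00 to 17:00 on weekdays, the cutoff of 3 for medium negative sentiment, and the function names are all assumptions made for the example.

```python
from datetime import datetime
from typing import Optional

BUSINESS_START_HOUR = 9   # assumed start of the business day
BUSINESS_END_HOUR = 17    # assumed end of the business day
MEDIUM_SCORE_FLOOR = 3.0  # assumed lower bound of "medium" sentiment

def is_unusual_hours(sent: datetime) -> bool:
    """True if a message was sent on a weekend or outside business hours."""
    if sent.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return True
    return not (BUSINESS_START_HOUR <= sent.hour < BUSINESS_END_HOUR)

def is_hotly_debated(negative_score: Optional[float]) -> bool:
    """True if the negative sentiment score is medium or high."""
    return negative_score is not None and negative_score >= MEDIUM_SCORE_FLOOR

# July 7, 2001 was a Saturday; July 9, 2001 was a Monday.
print(is_unusual_hours(datetime(2001, 7, 7, 11, 0)))   # True, weekend
print(is_unusual_hours(datetime(2001, 7, 9, 23, 30)))  # True, late night
print(is_unusual_hours(datetime(2001, 7, 9, 10, 0)))   # False
```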
If you go to a baseball card activity page (more on baseball cards below), the topics of interest are displayed just below the baseball card.
Topics of interest vary from one person to the next, as shown by the three examples below.
In Reveal AI parlance, related people are people who have out-of-the norm but meaningful connections with an individual.
There are three categories under related people where communications fall outside the norm. Close confidants shows the top five people the individual in question communicated with most frequently. Tenuous communications displays the top four people this person communicated with where their communications show high pressure. External connections lists the top four external domains the person communicated with.
Here is how that information appears:
Aggregating Information About Anomalies
Reveal AI's cards represent patterns in a user's data, a way to organize and present anomalous information. When you go to the card view, the platform dynamically builds a set of cards.
Each card represents a story found in the dataset. Here, for example, is a card showing that during a one-week period, Ken Lay received email messages from a Hotmail account, a pattern that does not appear elsewhere in the data. This might not be important, but it could be something worth exploring.
You can sort cards by score, count, weeks, sentiment, work shifts, and start date - each a way to bring outliers to the forefront. With each of these options, you can sort in descending or ascending order.
Score is the uniqueness of the pattern. In the example above, the cards are sorted by score, from most unique to least. To help you get a quick understanding, levels of uniqueness are color coded. Cards whose patterns are very unique are blue. Moderately unique cards are green. Common cards are yellow.
Count tells the number of times a pattern was detected in the dataset. In the single-card example above, the pattern was detected 127 times.
Weeks shows the number of weeks a pattern exists. In that same example, the pattern appeared during only one week.
Sentiments, discussed above, are identified in cards as positive or negative.
Work shifts show when a pattern took place. Options include late at night and on weekends, in the evening, and during business hours.
Start date lets you sort by the first date in a pattern.
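To show how sorting on fields like these brings outliers to the top, here is a minimal Python sketch. The `Card` class and `sort_cards` function are hypothetical stand-ins, not Reveal AI's data model. The count of 127 and the single week come from the Ken Lay example; the numeric scores, the dates, and the other two cards are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Card:
    description: str
    score: float      # assumed numeric uniqueness; higher is more unique
    count: int        # times the pattern was detected
    weeks: int        # number of weeks the pattern spans
    start_date: date  # first date in the pattern

def sort_cards(cards, key, descending=True):
    """Return a new list of cards ordered by the named field."""
    return sorted(cards, key=lambda card: getattr(card, key), reverse=descending)

cards = [
    Card("Hypothetical pattern B", 0.40, 12, 6, date(2000, 11, 6)),
    Card("Ken Lay receives email from a Hotmail account", 0.95, 127, 1, date(2001, 5, 7)),
    Card("Hypothetical pattern C", 0.70, 45, 3, date(2001, 8, 13)),
]

by_score = sort_cards(cards, "score")                         # most unique first
by_date = sort_cards(cards, "start_date", descending=False)   # earliest first

for card in by_score:
    print(f"{card.score:.2f}  {card.description}")
```

Sorting by score in descending order puts the most unusual pattern first, which is the ordering an investigator would typically start from.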
A baseball card is a profile that Reveal's artificial intelligence dynamically builds around an entity. A search for communications about Vince Kaminski returns a visual communications map. Notice it also displays a baseball card for Vince Kaminski.
If you click on the icon at the bottom-left corner of the baseball card, that opens the activities page for Vince Kaminski.
Right away, we see indications of anomalous behavior.
Topics of interest, discussed above, are shown on the left. The hotly debated section lists the top seven topics Vince Kaminski wrote about with medium or high negative sentiment. For him, those topics are meeting, resume, interview, conference, research, risk, and invitation. The at unusual hours section shows the top seven topics Vince Kaminski wrote about after work hours: meeting, interview, resume, research, conference, risk, and options.
Related people, also discussed above, are shown below topics of interest. This area shows that Vince Kaminski's closest confidants were Shirley Crenshaw, Stinson Gibner, William X Smith, and Mike A Roberts; he had the most tenuous communications with Shirley Crenshaw, Stinson Gibner, Vasant Shanbhogue, and Mike A Roberts; and his most frequent external connections were with the domains rice.edu, lacima.co.uk, upenn.edu, and utexas.edu.
The baseball card activities page shows much more. There are sections for:
- Email addresses, showing 55 different Vince Kaminski email addresses Reveal AI found in the dataset;
- Pseudonyms, with 32 different names for Kaminski identified in the dataset;
- Business cards, indicating that Kaminski held positions as Managing Director and Managing Director - Research and including titles, addresses, phone numbers, email addresses and the like for each position;
- Concepts, 4,367 different topics from Kaminski's data;
- Communicators, listing 2,838 people in Kaminski's communication list and, for each person with whom Kaminski communicated, showing communications (the number of emails between Kaminski and the individual), social status (the number of emails sent versus the number of emails received), owned documents (the number of documents for which Kaminski is the custodian), custodian (whether Kaminski is the custodian), wrote (the number of emails Kaminski wrote), and read (the number of emails Kaminski read);
- Similar communicators, listing people who share similar entities with Kaminski.
Ready to Go Forth and Detect?
When implemented in an eDiscovery platform, anomaly detection systems can be exquisitely powerful tools. They can enable you to search for local outliers. This capability means you can use the system's anomaly detection algorithm to "tell me something I don't know", where it surfaces potentially potent behavioral patterns. With those patterns, you can build your story of the case, test the story told to you by your client, interviewees, and other witnesses, and look for weaknesses in what you expect to be the other side's story of the case.
You also can turn to anomaly detection techniques to "find more like this". If you find a communication expressing negativity on a key subject, anomaly detection can leverage machine learning models to help you locate other similar communications, even ones sharing none of the same words.
In this post, I discussed only a few of the many forms of anomaly detection available with Reveal. If your organization is interested in optimizing its eDiscovery process through innovative use of anomaly detection, or in learning more about our AI-powered end-to-end document review platform, contact us.