Getting to Know You: Entity Extraction in Action
As a teacher I've been learning
You'll forgive me if I boast
And I've now become an expert
On the subject I like most
Getting to know you
– The King and I, Rodgers and Hammerstein
Lawsuits and investigations often focus on individuals and their actions. As attorneys and legal professionals, we need to figure out who is key and who is not. We have to learn what they did, with whom, and when. We must understand what motivated them. Yet at the start of each new matter, we generally have only a glimmer of an answer to questions like these. Even months or years into a lawsuit or investigation, we may still be trying to come up with answers.
You can get the jump on this process with the right technology. Tools in Reveal AI, for example, help you see critical details about potentially important people such as:
- Email addresses they used,
- Pseudonyms associated with them,
- Positions they have held,
- Concepts contained in their data,
- People with whom they have communicated, and
- People who discussed similar entities.
The machinery behind the scenes: Entity Extraction
Before we get into how you can make good use of these features, let’s take a pause to discuss where they come from. Everything in the list above is the result of a technology called “entity extraction”.
Entity extraction is a form of unsupervised machine learning.
Unsupervised machine learning is essentially an exercise in having computers “tell me something I don’t know”, as we noted in Legal AI Software: Taking Document Review to the Next Level. Computer algorithms are pointed at data. The algorithms organize that data based on patterns, similarities, and differences. The algorithms work on their own; they do not rely on people to train them. They can, however, learn from their own experience. Unsupervised machine learning can be used to identify entities as well as concepts and even images in documents and feed that information to legal teams.
An entity is a piece of data identified in Reveal by proper name. An entity can be a person, place, thing, event, category, or even a piece of formatted data such as a credit card number. In Reveal, entities can be merged automatically, for example pulling together multiple email addresses for a single person. Entities are identified by the system when data is processed. Having identified an entity, the system then extracts information about that entity and makes it available for you to use as you review and analyze data.
By default, Reveal identifies over 21 entity types. If you want, you can enhance entity types already built into the platform. You also can create custom entities using Reveal’s training tools.
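To make the idea of a "piece of formatted data" concrete, here is a minimal sketch of pulling two kinds of formatted entities out of text. It is not how Reveal works internally: it uses simple regular expressions rather than machine learning, and the patterns, the function name `extract_entities`, and the sample sentence are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns for two kinds of formatted entities. These are
# deliberately simple and will miss or over-match real-world data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def extract_entities(text):
    """Return (entity_type, value) pairs: all emails first, then card numbers."""
    found = [("email", m.group(0).lower()) for m in EMAIL_RE.finditer(text)]
    found += [("card_number", m.group(0)) for m in CARD_RE.finditer(text)]
    return found


# Made-up sample text, not taken from any real data set.
sample = "Contact Vince.Kaminski@enron.com about card 4111 1111 1111 1111."
entities = extract_entities(sample)
print(entities)
```

Pattern matching like this only covers data with a predictable shape. Names, places, and events need statistical or learned models, which is where a platform's entity extraction does the heavy lifting.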
Entity Extraction in Action: Get to know your witnesses
One powerful way to work with entities is via Reveal’s baseball cards and All Activity Pages. These give you quick access to information about people you might be interested in.
To get to a baseball card, go to an individual. You can accomplish this in various ways. You can perform a search. You might select an individual from the Communication facet. You might click on an icon in the communication view. There are other paths you can take, as well, to get to an individual.
This brings up the individual’s baseball card, shown in greater detail below.
With that baseball card, you immediately start to learn useful information about the individual.
In the upper portion of the baseball card, you can see basic information:
- Name: At the top of the card is the name by which that person most frequently appears in the data (in this example, Vince J Kaminski).
- Email address: Immediately below is the person’s most frequently used email address (firstname.lastname@example.org) as well as the total number of email addresses associated with that individual (55).
- Topics: Next are shown the three topics, or concepts, that appear most frequently in communications associated with that person (here, meeting, model, and energy).
In the bar across the bottom are actions you can take:
- View all activity: If you click on the vcard icon, that opens the person’s All Activity Page. On that page you can see email addresses, pseudonyms, positions, concepts, communications, and related people.
- Add to search: By clicking on the magnifying glass icon, you can add the person to a search.
- View documents: The View Documents button takes you to documents associated with that person.
- Selfie mode: Using the Selfie mode, you can see messages that person has sent to himself or herself.
- Ignore: By choosing the Ignore button, you tell the platform to hide that person, for example to keep information about that person from creating an unwanted distraction.
All Activity Pages
On an individual’s All Activity page, you can see email addresses that person has used, pseudonyms associated with that person, positions that person has held, concepts contained in that person’s data, people with whom that person has communicated, and people who discussed entities similar to those that person has discussed. The page also shows Topics of Interest and Related people, covered in an earlier post, The Exquisite eDiscovery Magic of Data Anomaly Detection.
It can be challenging to determine what email addresses an individual has used. While some addresses are obvious, many others can be hard or even impossible to guess. Using entity extraction, Reveal’s platform scours the ESI loaded into it, searching for every email address for each person. The platform then associates all those email addresses with that person.
At the Email addresses tab, you can see all email addresses found for a person, the domain for each email address, and the number of communications containing that address.
In this example, the Email addresses tab on Vince Kaminski’s All Activity page lists 55 email addresses that the system has concluded were used in connection with Kaminski.
Even skimming the listed addresses highlights the challenge of trying to guess at or search for the addresses that were used. Here are some of the ways email addresses used by Kaminski were formatted – and these are only addresses ending in @enron.com:
- [informal first name].[last name]@enron.com
- [informal first name].[middle initial].[last name]@enron.com
- [informal first name]_[middle initial]_[last name]@enron.com
- [first initial][truncated last name]@enron.com
- [first initial][last name]@enron.com
- [formal first name].[last name]@enron.com
- [formal first name].[middle initial].[last name]@enron.com
- [informal first name].[incorrectly spelled last name]@enron.com
In addition, there are internal email addresses, email addresses ending with other domains such as aol.com, and email addresses containing misspellings such as ennron.com or eron.com.
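The address formats listed above suggest why grouping addresses by person is hard to do by hand. The sketch below shows one crude heuristic for it: reduce each address to a key made of the first initial and the first five letters of the last name, then group on that key. This is an illustrative assumption, not Reveal's method. The helper names `person_key` and `group_addresses` and every sample address are hypothetical, and a rule this blunt would wrongly merge different people with similar names.

```python
import re
from collections import defaultdict


def person_key(address):
    """Crude grouping key: (first initial, first five letters of last name)."""
    local = address.lower().split("@", 1)[0]
    tokens = [t for t in re.split(r"[._]", local) if t]
    if not tokens:
        return None
    if len(tokens) == 1:
        # Fused form such as "vkamins": first initial followed by last name.
        first_initial, last = tokens[0][0], tokens[0][1:]
    else:
        # Separated form: first token is the first name, last token the surname.
        first_initial, last = tokens[0][0], tokens[-1]
    return (first_initial, last[:5])


def group_addresses(addresses):
    """Group addresses that share the same crude person key."""
    groups = defaultdict(list)
    for address in addresses:
        groups[person_key(address)].append(address)
    return dict(groups)


# Made-up addresses that follow the formats described in the text.
sample_addresses = [
    "vince.kaminski@enron.com",
    "vince.j.kaminski@enron.com",
    "vince_j_kaminski@enron.com",
    "vkamins@enron.com",
    "vkaminski@enron.com",
    "vincent.kaminski@enron.com",
    "vince.kaminsky@enron.com",
    "vince.kaminski@ennron.com",
    "shirley.crenshaw@enron.com",
]
groups = group_addresses(sample_addresses)
for key, members in groups.items():
    print(key, len(members))
```

Truncating the last name lets the misspelled and shortened forms land in the same group, at the cost of precision. A real system would weigh more evidence, such as display names and who the messages were exchanged with.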
The platform also shows the number of messages associated with each address. Of the more than 9,500 messages sent by or to Kaminski, the largest number, about 27%, were not even to or from a corporate email address. Rather, they were sent from or to an AOL address.
Most people go by more than one name in written communications. Reveal AI looks for the different names that might be used for a person, including misspellings, nicknames, acronyms, and names as they appear in email addresses. It resolves those different names – or pseudonyms – and merges them into a single person to facilitate search and analysis.
At the Pseudonym tab on the All Activity page, you can see the various names found for an individual, as well as the number of times those names appear in the data.
For Vince Kaminski, the platform found 31 pseudonyms. Some are obvious, such as vincent kaminski. Others, like v kaminski, could be predicted or found with straightforward searches. Yet others could be more difficult to locate through manual approaches, such as wincenty.
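A short sketch shows why some pseudonyms are easy to find and others are not. It scores candidate names against one canonical name using the string similarity measure in Python's standard `difflib` module. This is an assumed stand-in for illustration, not the technique Reveal uses; the function name `likely_pseudonyms` and the 0.6 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher


def likely_pseudonyms(canonical, candidates, threshold=0.6):
    """Return candidates whose character similarity to canonical meets the threshold."""
    matches = []
    for name in candidates:
        score = SequenceMatcher(None, canonical, name.lower()).ratio()
        if score >= threshold:
            matches.append(name)
    return matches


# Name forms mentioned in the article, compared against one canonical form.
candidates = ["vincent kaminski", "v kaminski", "v kamins", "wincenty"]
matches = likely_pseudonyms("vince j kaminski", candidates)
print(matches)
```

Spelling variants score well on character similarity, but a name like wincenty does not. Linking it to the same person takes other evidence, such as shared email addresses or context in the messages, which is the kind of resolution the platform handles for you.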
Reveal AI constructs business cards for individuals from ESI loaded to the platform. With these business cards, you can get a sense of the organizations an individual worked at, the positions the individual held, that person’s street addresses, and so on.
The first Manager Director – Research entry shown above, for example, comes from an April 10, 2001 email message containing an invitation to a presentation to be given by a group of Rice University students.
The Concepts tab shows concepts – themes or ideas – found in the documents loaded to the system.
For Kaminski, the platform shows 4,933 concepts. Some appear frequently, such as meeting at 639 times, model at 581, and energy at 563. Many others are far less frequent, like generation at 80, privileged at 20, or prison at 1.
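Counts like these can be pictured with a simple tally. The sketch below counts word occurrences across a few documents after dropping common filler words. It is only a rough analogy: real concept detection groups themes and ideas rather than counting single words, and the stop-word list, the function name `concept_counts`, and the sample messages are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list for this example only.
STOP_WORDS = {"to", "the", "before"}


def concept_counts(documents):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up messages, not taken from any real data set.
documents = [
    "Meeting to review the energy model",
    "Energy model update before the meeting",
    "Meeting moved to Friday",
]
counts = concept_counts(documents)
top_three = counts.most_common(3)
print(top_three)
```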
Communicators are the people this person has communicated with, either by sending messages to them or by receiving messages from them. For these purposes, a person such as Vince J Kaminski comprises all forms of that person that the platform found in the data (vince kaminski j, kaminski, v kaminski, v kamins, and so on).
For each communicator, you can see additional information:
- Communications: The number of emails between the person whose All Activity page you are looking at (in this example, Vince Kaminski) and the person listed in the left-most column. Here, there were 2,344 communications between Kaminski and Shirley Crenshaw.
- Social Status: The social status is the global reciprocal ratio – the number of emails sent over the number of emails received.
- Owned Documents: Owned documents is the number of documents for which the All Activity person is a custodian. In this example, Kaminski is the custodian of 41,476 documents, and Lopez is not the custodian for any documents.
- Custodian: The information in this column indicates whether the person named on the left is a custodian, true if yes and false if no.
- Wrote and Read: These two columns show the numbers of communications the person listed on the left wrote and the number that person read.
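The Social Status definition above, emails sent over emails received, can be sketched in a few lines. This assumes each message is reduced to a (sender, recipient) pair. The function name `social_status`, the choice to return `None` when someone has received nothing, and the sample messages are all hypothetical, and the platform's actual calculation may differ.

```python
from collections import Counter


def social_status(messages):
    """Map each person to emails sent divided by emails received.

    Returns None for anyone who received no messages, since the
    ratio is undefined in that case.
    """
    sent = Counter(sender for sender, _ in messages)
    received = Counter(recipient for _, recipient in messages)
    people = set(sent) | set(received)
    return {
        person: (sent[person] / received[person] if received[person] else None)
        for person in people
    }


# Made-up (sender, recipient) pairs for illustration.
messages = [
    ("kaminski", "crenshaw"),
    ("kaminski", "crenshaw"),
    ("crenshaw", "kaminski"),
    ("kaminski", "lopez"),
]
status = social_status(messages)
print(status)
```

In this toy data, a ratio above 1 means the person sends more than they receive, and a ratio below 1 means the reverse.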
Similar communicators are people who discuss entities similar to those discussed by the person whose All Activity page you are looking at. The similar entities could be people, organizations, locations, and so on. As a practical matter, if you were interested in entities that Kaminski discussed, you might want to look at Shirley Crenshaw’s communications because those two people have a Discussed Entity Similarity
Getting to Know You
With Reveal’s baseball card and All Activity page, you have unparalleled abilities to quickly become an expert on anyone for whom you have data. You have access to the email addresses they used, which opens up opportunities to probe data you might not otherwise have thought to look at. You can see what names they used, as well as names others used for them. You get insight into the organizations they worked for and the positions they held.
You can dig into the concepts discussed in messages they sent as well as ones they received. You can see with whom they exchanged messages and learn more about the nature of how they communicated with others. You also have the option of expanding your search, examining others who also communicated with similar people.
With these capabilities, you can get a jump start when you try to figure out who to depose or put on the stand. You are better positioned to identify pitfalls and opportunities for the next deposition you defend. And you have the tools to better build that most critical of work products, your story of the case.
Learn more about how you can harness the power of entities
If you and your organization would like to learn more about how you can harness the AI-driven power of entities – or want to learn more about how Reveal uses AI as an integral part of its AI-powered end-to-end legal document review platform – contact us.