eDiscovery Leaders Live: Dr. Irina Matveeva and Dr. David Lewis
Each week on eDiscovery Leaders Live, I chat with a leader in eDiscovery or related areas. Our guests on April 2 were Dr. Irina Matveeva and Dr. Dave Lewis. Dr. Matveeva is Reveal’s Chief Data Scientist & Head of Machine Learning. Dr. Lewis is Executive Vice President, AI Research, Development, & Ethics at Reveal.
Irina, Dave, and I focused on artificial intelligence in eDiscovery. We started with efforts to use AI to deliver “simple” solutions to complex problems, then talked about the importance of holding the AI discussion at the right level. Irina gave us a little background on DLA Piper’s Aiscension, which her team worked on. Dave and then Irina offered their thoughts on the “every case is special” objection and how to respond to it. We also looked at whether a PhD or MS is needed to make effective use of AI in discovery and how much of the AI plumbing attorneys really need to understand. Irina and Dave shared information about their teams and what those folks do, a discussion that morphed into learning about AI more generally. Finally, at my request, Dave and then Irina peered into their crystal balls to offer thoughts about where AI might take discovery in the future.
Recorded live on April 2, 2021 | Transcription below
Note: This content has been edited and condensed for clarity.
Welcome to eDiscovery Leaders Live, hosted by ACEDS, and sponsored by Reveal. I am George Socha, Senior Vice President of Brand Awareness at Reveal. Each Friday morning at 11 am Eastern, I host an episode of eDiscovery Leaders Live where I get a chance to chat with luminaries in eDiscovery and related areas.
We have two guests here with us this week; both my colleagues at Reveal, Dr. Irina Matveeva and Dr. David Lewis. Dr. Matveeva is Chief Data Scientist & Head of Machine Learning here at Reveal and Dr. Lewis is Executive Vice President, AI Research, Development, & Ethics at Reveal. Let me give you a little bit about their background and then we will launch into our discussion.
Irina, as I mentioned, is Chief Data Scientist & Head of Machine Learning here. She is responsible for Reveal’s data science organization and applying machine learning and natural language processing throughout the Reveal platform. She also is an adjunct professor at the Illinois Institute of Technology, has nearly a decade of both practical and academic experience in natural language processing, co-chaired the TextGraphs workshops in 2007, 2008, 2011, and 2012, and has served as a reviewer for multiple prestigious journals and publications. Dr. Matveeva has a PhD from the University of Chicago.
Dave Lewis, Executive Vice President, AI Research, Development, & Ethics at Reveal, is responsible for artificial intelligence research, development, and ethics issues throughout Reveal's software and approach to eDiscovery. Prior to joining Reveal, he held positions at Brainspace, which recently merged with Reveal; at AT&T Labs; Bell Labs; and the University of Chicago; along with co-founding a machine learning start-up and consulting on numerous cases. In 2006 he was elected a Fellow of the American Association for the Advancement of Science, and in 2017 he and W.A. Gale won the ACM SIGIR Test of Time Award for the invention of uncertainty sampling. Dave has his PhD from the University of Massachusetts Amherst.
Irina, Dave, welcome.
Thanks George. Good to be here.
Hello George. Thank you. Thank you for having us.
Using AI to Deliver “Simple” Solutions to Complex Problems
Glad to have you both. Looking at the outline of topics we could cover in the 25 or so minutes coming up, I think we've got about 12 hours of content. Obviously, we’re not going to get to all of that, but I would like to start with some highlights of what Reveal is doing now and where it's going, which is all about artificial intelligence. As we think about the use of artificial intelligence, especially with respect to electronic discovery, Irina, what are some of the problems that we are able to solve with AI?
We can solve, really, a variety of problems. The way we’re approaching it here at Reveal is that we're thinking about solutions. It's not even so much about individual problems, because we can tackle these with AI. It's more what are the complex solutions or the complex problems which we can solve with our AI solutions. We're trying to hide the complexity from our users. From the user perspective, it should be as simple as a click of a button and then get some results.
We've been working on the privileged communication detection solution, and sexual harassment detection solution, and a number of others. And of course we've partnered with DLA Piper’s Aiscension organization to discover cartels and potential cartel behavior. That's another example of the problems which we can solve using our AI solutions.
Holding the AI Discussion at the Right Level
Dave is it about particular machine learning technologies, is that what we should be focusing on as we look at these solutions, or is there a different area of focus?
Well, for people like Irina and me, absolutely, that's what our job is. But I think for the users of the system, it's much more about process and workflow. I don't think that attorneys and investigators need to be reading the latest deep learning literature and things like that.
What we try to do is make it easy for people to manage these technologies in the context of a workflow where their human experts are able to get their knowledge into the system, have it amplified by the software, and then get the jobs done that they need to get done.
I think the educational thing that's important for people in AI, is to help the users understand what the software can do, what it can't do, how you can maximally save costs, what some of the failure modes are, things like that. But I think sometimes people in eDiscovery have gotten a little too focused on support vector machines versus logistic regression versus deep learning or things like that where it’s probably a level or two up that's more important.
DLA’s Aiscension: A Concrete Example of Reusable AI Models
Let's try taking it, then, a level or two up. We're not going to talk about support vector machines or any of that, at least not right now. I'm going to switch back to you, Irina, to talk about a concrete example of the use of these technologies with what DLA Piper has done with Aiscension.
That's fine. That is, I believe, a perfect example. Obviously I'm part of the project and I'm extremely excited about it, but I believe it really showcases what AI can offer. We partnered with the Aiscension team. What they brought to the table was the expertise of their lawyers, with decades' worth of experience working on cartel cases. They worked with our data science team and educated us about the cues and the signals and features, what they are trying to find in the data when they’re approaching a new case, because every case is different and yet there are some commonalities in cartel behavior. They really conveyed that knowledge to us: what they are looking for, what their playbook is, so to speak. We worked on encoding the steps of that playbook in the models and filters and NLP technologies that we are using within our system.
In the end, we have a model that will go into their own model library and that they can use and reuse in future cases. Our data science team did a very thorough evaluation on multiple existing cases with historic data we had available, and we showed that those models do find cartel behavior in new cases, cases the model has not seen before.
Addressing the “Every Case is Special” Retort
For both of you, and this is something you alluded to a moment ago, Irina: there are and there will be any number of lawyers who will say something akin to what you said a little earlier, and which I’ve been hearing since I started practicing in the 1980s: every case is special, every case is unique, you cannot take a machine or anything like that, point it at the case, and just have it do things, because the lawyers have to look at the case and figure out how to handle this unique, special, unlike-anything-else thing. We know that's not really true, but how do you deal with that in something like this? Irina, did you run into something like that with DLA?
Sure. Even on the DLA project, some team members were pretty skeptical, and they were very open about that. But we're also learning to work with our clients and understand how they see the problems and how they see the use of technology.
The first thing is that we're using AI as augmented intelligence. Of course the lawyers will be in the loop, of course we were learning from the expertise of their Aiscension team and we will be learning in the future. We still are working on partnering with them. And of course the human will be making the final decision and yes, their lawyers will be reviewing documents, understanding the players and so on and so forth.
There will be this involvement. The model cannot solve it completely. At the same time, it can really augment the process, it will give you great candidates, and it will point you right to the pocket of data where the problem may be. It's much faster, more efficient, and you use your time in the best possible way.
Using Data Features and Signals
The second explanation, more from the machine learning perspective - so now, Dave, I'm going just a little bit under the hood - is that machine learning technology does work, and it works in really very powerful ways. We see them in our daily lives.
Here we did a very thorough evaluation. We were scientific and skeptical about the approach. We said we would not believe it until we saw that it worked. We convinced ourselves, and then our clients, from a practical perspective, could also convince themselves that this approach is scientifically solid.
The model learns from hundreds of thousands of parameters; it depends on the model, of course. It's a very powerful technology and it can really pick up on the most important features or signals in the data. It can actually pick up on the things that are common across multiple cases. Although each case might be different and unique in its own right, there will be some commonalities. What is cartel behavior? There is a definition, and a document has to exhibit some components of that behavior to be responsive if you're looking for cartels. Those commonalities are common, as the word says, across cases. The model can pick up on those features and signals.
Dave, you must have run across the same concerns and objections, right?
Interpretability of Models
Yeah. I think this comes up particularly in the context of the Brainspace portable models capability. This is much like the things that Irina is talking about, where people are producing models that they intend to apply to multiple cases.
I think one of the big ways that we've addressed that concern is through an emphasis on interpretability of the models. We have the model insights capability, that lets you visualize what the most important features are.
Let's say, for instance, you've got a model that's trained for fraud detection, you’ve used it on three cases, and you’re bringing it into a fourth case. As you then tune it, and we always encourage people to tune models to each new dataset, you can watch through model insights how the important terms are changing. You might have a certain phrase that was very predictive on case number two but on case number four is actually not predictive, or even anti-predictive. You can see things like that because we've always emphasized visibility into how the model is working.
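As a rough illustration of what this kind of cross-case feature inspection can look like, here is a minimal sketch using simple per-term log-odds scores as a stand-in for a linear model's learned weights. The data, the "territory" phrase, and the case numbers are all hypothetical; this is not Reveal's or Brainspace's actual model insights implementation:

```python
from collections import Counter
from math import log

def term_weights(docs, labels, smoothing=1.0):
    """Per-term log-odds of the responsive class: a crude stand-in for
    the per-feature weights a linear text classifier would learn."""
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos if label else neg).update(set(doc.lower().split()))
    n_pos = sum(1 for l in labels if l)
    n_neg = len(labels) - n_pos
    weights = {}
    for term in set(pos) | set(neg):
        p = (pos[term] + smoothing) / (n_pos + 2 * smoothing)
        q = (neg[term] + smoothing) / (n_neg + 2 * smoothing)
        weights[term] = log(p / q)
    return weights

# Hypothetical "case 2": "territory" strongly marks responsive documents.
case2_docs = ["let us split the territory", "territory allocation agreed",
              "lunch menu attached", "quarterly report draft"]
case2_labels = [1, 1, 0, 0]

# Hypothetical "case 4": "territory" appears only in non-responsive noise.
case4_docs = ["pricing call notes agreed", "agreed to fix prices",
              "sales territory map update", "new territory hire announcement"]
case4_labels = [1, 1, 0, 0]

w2 = term_weights(case2_docs, case2_labels)
w4 = term_weights(case4_docs, case4_labels)
print(f"'territory' weight in case 2: {w2['territory']:+.2f}")  # positive
print(f"'territory' weight in case 4: {w4['territory']:+.2f}")  # negative
```

Surfacing such weights is one way a tool can show a reviewer that a term which helped in one matter is actively misleading in the next.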
No PhD Required
From what I'm hearing, the two of you, PhD holders, have put a fair amount of work into developing the ability for people to have access to and make use of models and reuse those models. Does everyone need a PhD on staff to have the ability to make use of these things?
We’d be failing at our job if that were the case, right? The goal is to make the technology usable by people who are domain experts, in this case in the law or in investigation or whatever, and to give them insight, through visual analytics, statistical measures, and training, which is important as well, into how to think at the level of process. How is the behavior of this AI system going to interact with the human processes? Because the human processes around these projects are always extensive. But they should not have to understand the math, not have to go back to school.
I can't count the number of attorneys who’ve told me, “I went to law school because I didn't like math”. We, and Irina’s team as well, emphasize giving people qualitative insights into how these systems work without them having to dig into the software or the math.
That's right. We believe our platform provides the features, the tools, solutions, the library of the models so that our users can just use them and use their own workflows and expertise and really not be experts in data science or AI.
The Plumbing: How Much Do Attorneys Need to Know?
There has been, and continues to be, a lot of discussion, back to something one of you mentioned, about the plumbing of all of this, SVM versus this versus that and so on, and a lot has been written on that front. Both my brothers have PhDs; they look at me, with my JD, as, well, you know, the dummy in the family. It has seemed to me, as I look at this from the lawyer’s perspective, that what I really care about most is getting results that help me move my case along. And yes, I need to have an understanding of how I got to those results, but I'm not sure I could always figure out what was going on in my own head either, if I were just thinking about things. So how does that play out: the results achieved, and ways of making sure they are supportable results, not unduly biased by something, versus looking at the mechanics of how it operates? Who wants to take a stab at that? Dave?
I mean, obviously that's a big question to unpack. I think you're right to emphasize that there are several components. There's obviously the design of the overall workflow, which often includes human checks and balances and reviews and whatnot. And I think that's an area where attorneys have been doing that for decades, really. How do you manage teams of reviewers? How much coffee do you give them at four in the afternoon, that sort of thing?
Now there's this new level, which is that the machine, informed by human knowledge, is making certain discriminations between documents. If you design a workflow that way, maybe that means no human being ever looks at certain documents, and you have to understand that. People do things like statistical sampling, various kinds of searching, and other analytics to try to understand what's not being looked at in those circumstances.
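The statistical sampling Dave mentions is often used to estimate the rate of responsive documents left in the set no human reviews (sometimes called the elusion rate). Below is a minimal sketch with hypothetical data and a standard normal-approximation confidence interval; it is not any particular product's workflow:

```python
import random
from math import sqrt

def estimate_prevalence(population, sample_size, is_responsive, z=1.96, seed=42):
    """Draw a simple random sample from an un-reviewed population and
    estimate its responsive rate with a normal-approximation 95% CI."""
    rng = random.Random(seed)  # fixed seed for a reproducible illustration
    sample = rng.sample(population, sample_size)
    hits = sum(1 for doc in sample if is_responsive(doc))
    p = hits / sample_size
    margin = z * sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical null set: 10,000 docs the model classified as non-responsive,
# of which 2% are actually responsive (unknown to the reviewer in practice).
null_set = [{"id": i, "responsive": i % 50 == 0} for i in range(10_000)]

p, lo, hi = estimate_prevalence(null_set, 400, lambda d: d["responsive"])
print(f"estimated elusion rate: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

A sample of a few hundred documents gives a defensible interval around the rate of responsive material in the pile nobody reads, which is exactly the question a workflow designer needs answered.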
To bring up a very topical issue, we're increasingly seeing model audits, particularly from financial services organizations. Because of their regulatory needs, they have certain requirements for understanding, in a pretty formal way, how AI models work, documenting that, and understanding how it affects their business. I think this is an area where the boundaries are under negotiation right now, but some of them feel that eDiscovery models come under that as well. And so we've had several audits of our technology, with long questionnaires and long interviews from a lot of teams. I think this is all very much to the good, that people are developing formal processes. The law is about routinizing processes for understanding, and making sure that people with the appropriate responsibility understand how the technology is affecting the legal process.
Irina, your thoughts?
Sure. To add to that, I believe it is great that our clients are curious and aware that there are support vector machines and deep learning; deep learning especially is such an exciting area. I think that is really beneficial for everyone, and it is good that they are keeping us honest. It’s a very good question. Dave and I, and our teams, are working on staying up to speed, knowing all the latest academic results and results from our peers, so that we can answer those questions in the best possible way.
I also agree that the evaluation aspect of it is extremely important. Technology assisted review has been used in eDiscovery for a long time, and there are very good processes around the defensibility of those results. I think it's great to take that experience even further and apply it to more and more AI technologies. Because, George, as you're saying, at the end of the day it needs to produce results that are helpful, efficient, correct, and so on, whatever is required by the given workflow.
But understanding, really very deeply, how the exact algorithm works under the hood… If somebody is curious and would like to take some Coursera courses and educate themselves, that is wonderful, but I don't think it should be required for anyone to use the technology in practice.
I always think about my phone. There is so much AI that goes into so many applications, but I can just see that it detects faces in my pictures, or that I can search for pictures. I’m a user of that technology. I’m curious how it is done, and I would like to see a blog post with the highlights of the technology. But at the end of the day, I’d like it to work correctly.
That’s really my objective here. I don’t want to oversimplify the workflows that our users are facing, but I think it's a similar approach. I would like the conversation to be around, “Is it really the best?” Can we at Reveal show we are using the best possible technology? And then, how do we evaluate that it is producing the results we expect?
Reveal’s AI Team
Jay Leib, our colleague, is fond of saying some variation of “we've got the strongest, or the best, AI team in the industry”. Our AI team starts with the two of you, but it doesn't end with the two of you. Tell us a little more about the team of people we have working on all these issues.
Sure. We have four full-time data science team members, and one more will be joining very shortly, so it will be five very soon. Everyone has been on our team for many years, except obviously for the new person, and they really have been growing together with our team and our application. We’ve been figuring things out, had a lot of productive discussions, some things that did not work out in the end, but we have had a lot of interesting projects, and I think it has created a very productive and creative dynamic within the team.
We are excited. We are always learning about new technologies. Individual team members might be more interested in one or another area of machine learning or AI, so they will bring their own ideas to the table. It's definitely not just me; we have a whole team working on all of the wonderful things that we've accomplished so far.
One of my hats is I run the AI engineering team. I've got a couple of long-time developers who have been in the industry for over 10 years.
One thing I would mention, because this gets talked about a lot in the development community: neither of the guys on my team did his degree in artificial intelligence. They are people who did master's degrees in computer science, long-time developers who over the years learned a lot about AI. They are experts in their areas now.
I do think this is something that gets talked about a lot in the dev community, whether you can move into AI, and I really do think that it is like many other complicated technologies in computer science that good developers can take on, whether that's AI or cloud computing or parallel processing or whatnot.
I always encourage people, and Irina mentioned Coursera - the resources that are online for self-education now are just astonishing. I have always encouraged people to go learn. And statistics too, because I’m a statistician….
And the same goes for young and aspiring data scientists. As for the degrees and backgrounds on our part of the team, everyone joined as an intern first. They did a very good job, liked the team, everything worked out, and they've been with us for years now.
Don’t let anything stop you. If you’re excited, if you’re passionate, please pursue this career. Definitely.
Peering into a Murky Crystal Ball
We've got just a few minutes left. I'd like both of you to sit down with your murky crystal balls, peer into the future - here you go - and give us your thoughts about where possibly we may be going with all of this. Not disclosing of course anything you shouldn't disclose, but thoughts about what the future of all of this might look like for us.
I guess the question is, how far out a future are we talking about here?
This is the analog to the question I ask of some of the guests, which is, “If you could have your ideal eDiscovery platform, what would it look like? Assume no limitations of any sort whatsoever”. That is why you get to use a murky crystal ball.
I should look at these old videos to see what we should be working on then. I think we’re going to see increased sophistication in small increments of the degree of understanding that language technology has of documents.
When it comes to AI in eDiscovery, what I always try to stress is that people have this attitude of the law as kind of pokey or conservative, but the law was really early to adopt AI. In eDiscovery in particular, they've been way ahead of other industries. The early applications were search and yes/no classification and things like that.
I think what we'll increasingly see, and Irina’s team works a lot on this, is bringing deeper insights: relationships between documents, and operating not just at the document level but at the level of the important domain entities, people, transactions, organizations, the relationships among those, and understanding them over time.
We're not going to have the Star Trek kind of computer that says, “Give me the perfect legal strategy for this case”, and it comes back with a diagram and three documents or something. But what we’ll have is increased understanding of the meaningful entities in a legal case or investigation or whatever, and increasingly good analytics for displaying those to users.
I'm a great believer that while AI has these hype waves, the actual progress has largely been slow and steady for the past 60 years, and I think that's what we will continue to see, along with an increased ability to communicate that progress to users, which has gotten more emphasis over the past few years.
Thanks Dave. Irina, let’s close with your thoughts.
Thank you. I think it will really be about moving towards understanding language. We have natural language processing technologies, but now with deep learning we have a new set of capabilities. We introduced BERT technology, and we're now using multilingual BERT, where you build your model in one language and it works on 104 languages out of the box, without doing anything extra. These are great things.
This future is now, so to speak; it's very tangible. I believe it will become more and more interactive. I would like to let our users interact with the system and really ask natural language questions, maybe insert some information about the case, and have the system adapt, adjust, and recommend. It will have stronger and stronger technologies for understanding communications and content, as Dave was saying, personalities, and so on.
I really imagine, maybe a little bit further into the future, but I really imagine it being much more interactive in that sense, and really providing information, surfacing it almost dynamically as the user goes and interacts with the system.
Well, thank you. Thanks, Irina. Thanks, Dave. Irina is Reveal’s Chief Data Scientist & Head of Machine Learning. Dave is Reveal’s Executive Vice President, AI Research, Development, & Ethics.
I am George Socha, this has been eDiscovery Leaders Live, hosted by ACEDS and sponsored by Reveal.
Thank you for joining us today. Please join us next Friday, April 9th, when our guest will be Tony Millican of Trinity Industries. Thanks.