What Do Data Scientists Do?
Data Scientist. You've heard the term. It seems that everyone has one - every large corporation, major law firm, legal services provider, every eDiscovery software company. At Reveal, we have a whole team. But who are these people and what do data scientists do?
Data science is a comparatively new discipline. The term "data scientist" was only coined in 2008. In simple terms, as noted on the University of Wisconsin Data Science Program website, "a data scientist’s job is to analyze data for actionable insights."
Data Science Life Cycles
Data science follows a life cycle. As is to be expected with any new discipline, many, varied, and sometimes conflicting descriptions of that life cycle abound. UC Berkeley School of Information, for example, offers a model with five stages:
- Capture: data acquisition, data entry, signal reception, data extraction;
- Maintain: data warehousing, data cleansing, data staging, data processing, data architecture;
- Process: data mining, clustering/classification, data modeling, data summarization;
- Analyze: exploratory/confirmatory, predictive analysis, regression, text mining, qualitative analysis; and
- Communicate: data reporting, data visualization, business intelligence, decision making.
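To make those five stages a little more concrete, here is a minimal Python sketch that walks a handful of toy records through them. It is only an illustration: the sample data, the function names, and the choice of a simple frequency count are all assumptions made for this example, not part of the UC Berkeley model.

```python
from collections import Counter

# Capture: hypothetical raw records, as they might arrive from a form or export.
RAW = [" Email ", "email", "CHAT", "", "chat ", "Email", None, "memo"]


def maintain(records):
    """Maintain: drop empty values, trim whitespace, normalize case."""
    return [r.strip().lower() for r in records if r and r.strip()]


def process(records):
    """Process: summarize the data by counting each document type."""
    return Counter(records)


def analyze(counts):
    """Analyze: find the most common type and its share of all records."""
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())


def communicate(label, share):
    """Communicate: turn the finding into a sentence a reader can act on."""
    return f"Most common type: {label} ({share:.0%} of records)"


clean = maintain(RAW)
report = communicate(*analyze(process(clean)))
print(report)  # Most common type: email (50% of records)
```

Real projects replace each function with far heavier machinery, but the hand-off from one stage to the next tends to look much like this.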
Others describe the process somewhat differently, such as this 10-step framework from Master’s in Data Science—Your Guide to Data Science Graduate Programs in 2021, where the authors note that "though no two data scientists will come up with precisely the same steps for their work, most data science projects follow a similar trajectory and will have at least some steps in common with other data science efforts."
Types of Data Scientist
There are many types of data science job titles, such as those laid out in a list from Projectpro:
- Statisticians and mathematicians, who "analyze data and apply mathematical and statistical techniques to solve problems";
- Data engineers, "responsible for finding trends in data sets and developing algorithms to help make raw data more useful to the enterprise";
- Machine learning scientists, who "work in the research and development of algorithms that are used in adaptive systems..., build methods for predicting product suggestions...and product demand..., and explore Big Data to automatically extract patterns";
- Actuarial scientists, involved in "evaluating risks and maintaining the economic stability of insurance or financial organizations";
- Business analytics practitioners, focused on "data, statistical analysis and reporting to help investigate and analyze business performance, provide insights, and drive recommendations to improve performance";
- Software programming analysts, who "assess the information and systems requirements of various departments and then work with computer programmers to develop the necessary applications"; and
- Spatial data scientists, who focus on "the special characteristics of spatial data, i.e., the importance of 'where'."
Data Scientists' Areas of Expertise
Data scientists have varying and overlapping areas of expertise. Here are some of the skill sets, drawn from a list published on Medium, that you might see on a data scientist's LinkedIn profile:
- Data engineering and data warehousing: Transforming large amounts of data into useful formats for analysis;
- Data mining and statistical analysis: Data scientists excel at using statistics to perform exploratory data analysis and create predictive models designed to reveal patterns and trends in unstructured data;
- Cloud and distributed computing: Designing and implementing cloud and distributed computing enterprise infrastructure and platforms;
- Database management and architecture: Designing, deploying, and maintaining databases used for high volume, complex data transactions;
- Business intelligence and strategy: Leveraging software and services to transform data into actionable insights that help guide organizations' strategic business decisions;
- Machine learning and cognitive computing development: Getting the input needed to feed models, building data pipelines, performing testing and benchmarking, and building the models themselves;
- Data visualization and presentation: Presenting data in a visually compelling fashion;
- Operations-related data analytics: Using data analytics tools provided by others to identify opportunities for improving business operations;
- Market-related data analytics: Data analysts use customer, sales, and marketing data to track performance and find opportunities;
- Sector-specific data analytics: Data analytics particular to a specific industry such as healthcare, finance, or, of course, law.
Industries Where Data Scientists Work and Things They Work On
The field of data science encompasses a wide range of industries and career paths. Data scientists work in the corporate world, health care, entertainment, government, and, of course, legal. According to Springboard, the three industries employing the most people in data scientist roles are finance, which includes banks, investment firms, insurance firms, and the real estate sector; professional services; and information technology.
A sampling of other industries employing data scientists includes:
- Life sciences and pharmaceuticals;
- Utilities such as water, electricity, and natural gas;
- Food and beverage, from restaurants to beverage transportation to food manufacturing;
- Industrial goods such as construction, tool manufacturing, and metal fabrication;
- Agriculture, such as farming, aquaculture, and forestry; and
- Legal, where data scientists have been critical in the development, rollout, and use of such capabilities as TAR (technology assisted review) and other AI- and analytics-driven advancements for analyzing data in connection with lawsuits and investigations, legal research, contracting platforms and smart contracts, and much more.
Projects Data Scientists Work On
Data scientists work on an ever-widening range of projects and provide services to all manner of stakeholders. They scrub data, investigate data, visualize it, organize it by clusters, and apply machine learning to it. They might use data and modeling to define crime hotspots and predict law enforcement needs in a city.
Data scientists have been active at the federal level. They used address data to help respond to the devastation in Puerto Rico caused by Hurricane Irma and Hurricane Maria. They worked on building a software program to help the National Nuclear Security Administration (NNSA) proactively respond to emerging infrastructure needs by recommending building component repairs and replacements at the most opportune time. They put together a prototype through the Census Bureau’s Opportunity Project to better assess where volunteers should direct litter-clearing efforts.
Data scientists help build dashboards that allow teams to work together more effectively by, for example, visually tracking, displaying, and analyzing key performance indicators.
What Data Scientists Deliver
Whatever industry they are working in and whatever software engineering project they are working on, data scientists likely will deliver results that do some combination of the following as well as many others not listed here:
- Classify content (spam or not spam, for example);
- Provide recommendations (think Amazon or Netflix);
- Detect patterns and group similar content together (like with the Brainspace cluster wheel);
- Detect anomalies (Reveal AI can identify patterns of behavior, such as a custodian typically sending email messages to a colleague during work hours using work email addresses, and then call out deviations from those baselines, such as a burst of late-night messages between the two using personal email addresses); and
- Recognize types of content (languages used in text, attributes of pictures such as the presence of mold or industrial equipment).
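The anomaly detection idea in that list, establishing a baseline and then calling out deviations from it, can be sketched in a few lines of Python. This is a deliberately simple, rule-based illustration and not a description of how Reveal AI works: the message log, the assumed working hours, the `min_baseline` threshold, and the function names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical message log: (sender, recipient, address type, ISO timestamp).
MESSAGES = [
    ("alice", "bob", "work", "2021-03-01T10:15:00"),
    ("alice", "bob", "work", "2021-03-02T14:05:00"),
    ("alice", "bob", "work", "2021-03-03T09:30:00"),
    ("alice", "bob", "work", "2021-03-04T16:45:00"),
    ("alice", "bob", "personal", "2021-03-04T23:40:00"),
    ("alice", "bob", "personal", "2021-03-04T23:52:00"),
    ("carol", "dave", "personal", "2021-03-05T22:10:00"),
]

WORK_HOURS = range(8, 18)  # assumed working hours: 08:00 through 17:59


def is_typical(address_type, timestamp):
    """A message is typical if it uses a work address during work hours."""
    hour = datetime.fromisoformat(timestamp).hour
    return address_type == "work" and hour in WORK_HOURS


def find_deviations(messages, min_baseline=3):
    """Return messages that break a sender/recipient pair's baseline."""
    by_pair = defaultdict(list)
    for sender, recipient, address_type, timestamp in messages:
        by_pair[(sender, recipient)].append((address_type, timestamp))

    deviations = []
    for pair, items in by_pair.items():
        typical_count = sum(1 for m in items if is_typical(*m))
        if typical_count < min_baseline:
            continue  # too little typical traffic to call it a baseline
        for address_type, timestamp in items:
            if not is_typical(address_type, timestamp):
                deviations.append((pair, address_type, timestamp))
    return deviations


for pair, address_type, timestamp in find_deviations(MESSAGES):
    print(pair, address_type, timestamp)
```

In this toy log, only the two late-night personal messages between alice and bob are flagged. The single message from carol to dave is not, because that pair has no established baseline to deviate from. Production systems learn such baselines statistically rather than from fixed rules.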
Educational Background and Skills Needed to Become a Data Scientist
To land a position as a data scientist, it helps to have a relevant bachelor's or master's degree, such as one in computer science, mathematics, IT, statistics, or another related field. Work experience is always useful, as are other capabilities such as strong problem-solving skills, the ability to work individually and with a team, an understanding of data collection and analysis, and strong verbal and visual communication skills. Programming skills in widely used languages like Python and SQL and experience with Hadoop are also useful in this field.
Data Science at Reveal
Reveal has a strong data science team, as far as we know the most robust in the industry. Our data science team really has two parts, the data science team itself and the AI engineering team.
Reveal's data science team is led by Dr. Irina Matveeva, Chief Data Scientist & Head of Machine Learning. One of a small number of women to lead a data science team, Dr. Matveeva is responsible for Reveal’s data science organization and applying machine learning and natural language processing approaches throughout the Reveal platform. She is an Adjunct Professor at the Illinois Institute of Technology (IIT) and has nearly a decade of both practical and academic experience in natural language processing. Dr. Matveeva received her Ph.D. from the University of Chicago. She co-chaired the TextGraphs workshops in 2012, 2011, 2008, and 2007, and is a reviewer for multiple prestigious journals and publications.
Reveal's AI engineering team is led by Dr. David Lewis, Executive Vice President, AI Research, Development, & Ethics. Dr. Lewis is responsible for artificial intelligence research, development, and ethics issues throughout Reveal's software and services. Prior to joining Reveal, he held positions at Brainspace, AT&T Labs, Bell Labs, and the University of Chicago, along with co-founding a machine learning startup and consulting on numerous legal cases. He received his Ph.D. from the University of Massachusetts at Amherst. Dr. Lewis was elected a Fellow of the American Association for the Advancement of Science in 2006, and in 2017 he and W. A. Gale won the ACM SIGIR Test of Time Award for the invention of uncertainty sampling.
Most of Reveal's data science team members (I am including both the data science and AI engineering teams) have been with the team for years. Generally they started as interns and have expanded and deepened their expertise just as we have grown our data science capabilities. They are responsible for, among other things, the AI capabilities in Reveal Review, Reveal AI, and Brainspace.
In addition to the skills they have learned on the job, our data science team members bring a wealth of academic experience with credentials including PhD, Master of Science, Bachelor of Science, Bachelor of Engineering, and Bachelor of Technology degrees from University of Chicago; University of Massachusetts Amherst; Illinois Institute of Technology; Michigan State University; University of Mumbai; Sinhgad Academy of Engineering; and Jaypee University of Information Technology.
Having such a robust data science team has enabled Reveal to build a platform powered by cutting-edge artificial intelligence and machine learning. You can see the results in Brainspace's visual analytics, in how active learning is woven into Review, and in the platform's ability to work deftly with images, foreign language content, and audio files. And you can hear it in what our customers have to say about us: "Intuitive yet robust artificial intelligence", "Best in class legal technology", "Truly outstanding range of AI-driven products".
With the power and technical skills of its data science team, Reveal continues to make the platform even stronger, working every day to develop compelling solutions to the next round of challenges facing legal.
Want to learn more?
I recently chatted with Irina and Dave on eDiscovery Leaders Live, a weekly program hosted by ACEDS and sponsored by Reveal where I chat with leaders in eDiscovery and related fields. During the session, Irina, Dave and I focused on artificial intelligence in eDiscovery. We started with efforts to use AI to deliver “simple” solutions to complex problems and talked about the importance of holding the AI discussion at the right level.
Irina gave us a little background on DLA Piper’s Aiscension, which her team worked on. Dave and then Irina offered their thoughts on the “every case is special” objection and how to respond to it. We also looked at whether a PhD or MS is needed to make effective use of AI in discovery and how much of the AI plumbing attorneys really need to understand.
Irina and Dave shared info about their teams and what those folks do, a discussion that morphed into learning about AI more generally. Finally, at my request Dave and then Irina peered into their crystal balls to offer thoughts about where AI might take discovery in the future. The video of the session is available on ACEDS social media platforms and the video and transcript of our discussion are available here.