Supporting the EDRM Message ID Hash

Dinesh Karamchandani and George Socha

In 2020, a team of volunteers gathered to take on a long-standing eDiscovery challenge: how to better enable cross-platform email duplicate identification.

The result of the group’s efforts is the EDRM Message Identification Hash (EDRM MIH), released by EDRM on January 11, 2023. Details about the project are at the EDRM DupeID page.

Reveal supports EDRM MIH and we recently added instructions for using it to our User Documentation site.


The Project’s History

The EDRM Duplicate Identification Project was the brainchild of Beth Patterson, the director and founder of ESPconnect and an industry stalwart from Australia. To hear directly from Beth about her role in the DupeID project, check out the eDiscovery Leaders Live discussion with her earlier this year.

Beth first floated the idea for the project in 2020. The project kicked off in earnest in March 2021, with over a dozen participants. As the project progressed, the team expanded to over 20 members coming from all the constituencies you would expect: software companies, service providers, law firms, and corporations. They also came from across the globe: Australia, Finland, Japan, Israel, the UK, and the US.

The group’s initial mission was to develop a best practice specification for hashing electronic documents and data to identify exact duplicates. The hope was that this could improve efficiencies and generate significant cost savings. Over time we narrowed our focus, aiming for something more immediately achievable than our initial objective.

After more than two years of evaluation, experimentation, and testing, in January 2023 we released version 1.0 of the EDRM Message Identification Hash Specification. That specification defines a process to identify duplicate email messages across platforms.

The EDRM MIH

The EDRM MIH is an MD5 hash value. That value is generated from the Message-ID header field of an RFC-compliant email message.

There are some qualifiers for any tool that creates EDRM MIH values:

  • If the email message does not have a Message-ID or if the Message-ID is not valid, no EDRM MIH may be generated.
  • The EDRM MIH must be generated using the complete Message-ID value, including the flanking angle brackets.
  • When the EDRM MIH is generated, the character case of the Message-ID value must not be altered.
  • If an email message contains more than one Message-ID value, the EDRM MIH must be generated using the first Message-ID value declared in the parent email message header.
  • EDRM MIH values must be generated only for email messages, not for any other items.
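
The qualifiers above can be illustrated with a minimal Python sketch. It is not Reveal's implementation or the specification's reference code. The function name edrm_mih, the simplified validity pattern (which is looser than the full RFC 5322 grammar), and the choice to hash the UTF-8 bytes of the value are assumptions made for this example.

```python
import hashlib
import re
from typing import Optional

# Simplified check for "<" id-left "@" id-right ">".
# Assumption for this sketch only; the full RFC 5322 msg-id grammar is stricter.
_MESSAGE_ID_RE = re.compile(r"<[^<>@\s]+@[^<>@\s]+>")


def edrm_mih(message_id: Optional[str]) -> Optional[str]:
    """Return an MD5 hex digest for a Message-ID value, or None if it is unusable."""
    if message_id is None:
        return None  # no Message-ID, so no MIH
    # Remove only whitespace around the value; brackets and case stay untouched.
    candidate = message_id.strip()
    if not _MESSAGE_ID_RE.fullmatch(candidate):
        return None  # invalid Message-ID, so no MIH
    # Hash the complete value, including the flanking angle brackets.
    # Encoding as UTF-8 is an assumption; valid Message-IDs are plain ASCII.
    return hashlib.md5(candidate.encode("utf-8")).hexdigest()
```

Calling edrm_mih("<CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>") returns a 32-character hexadecimal string, while a value without angle brackets returns None. A real implementation should be checked against the sample data set and index in the project's deliverables.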

The EDRM MIH is intended to be used for cross-platform deduplication. Even so, there are situations where the EDRM MIH may not be appropriate to use. The project’s deliverables (described below) call out 10 examples and discuss each in greater detail:

  • Draft messages without Message IDs,
  • Spam and fraudulent messages,
  • System-generated emails,
  • Malformed or corrupted Message IDs,
  • Messages with prepended or appended headers, footers, and signatures,
  • Messages with message group and alias addressing,
  • Messages with BCCs,
  • Messages with stripped or corrupted attachments,
  • Messages with time anomalies, and
  • Items that are not email messages.

The Project’s Deliverables

The project’s first set of deliverables is the EDRM Email Duplicate Identification Toolkit, designed to facilitate cross-platform identification of duplicate email messages. The Toolkit has six components, all available from the EDRM DupeID page. The components are:

  • Cross Platform Email Duplicate Identification. An 18-page document from the EDRM Duplicate Identification Project Team consisting of:
    • An overview of the project,
    • A list of contributors to the project,
    • The EDRM Email Duplicate Identification Specification (v1.0), and
    • The EDRM Email Duplicate Identification Guidelines (v1.0).
  • Introducing the EDRM Email Duplicate Identification Specification and Message ID Hash (MIH). A 4-page whitepaper by Craig Ball meant as a non-technical reference for those wanting to understand why and how to use the EDRM MIH.
  • EDRM Email Duplicate Identification. A one-page infographic covering the highlights of the EDRM MIH.
  • FINAL EDRM MIH Example Data 20240123.zip. A sample data set to use for testing and verifying implementation of the EDRM MIH.
  • FINAL EDRM MIH Sample Email Index. An Excel spreadsheet to be used for verifying Message ID extraction and MIH calculations for all email in the example data set.
  • Small Dataset MIH Calculator. An Excel-based tool for generating EDRM MIH values for small sets of Message IDs.

Implementation at Reveal

Earlier this year, Reveal added support for EDRM MIH to Reveal 11. This means:

  • Users of Reveal 11 can generate EDRM MIH hash values when they process data.
  • Those hash values are stored in a new field in Reveal 11 called EDRM_MSGID.
  • The values from EDRM_MSGID can be included in load files generated from Reveal 11.

Now, Reveal has added an article about the specification to our support system, the Reveal 11 Knowledgebase. The article, EDRM Message ID Hash (EDRM MIH), provides step-by-step instructions for how to use the specification in Reveal 11.

Process Overview

As our platform processes email messages, it looks at each message’s Message-ID header line to see whether that line contains a valid Message-ID value. It is looking for text with this format:

“<” id-left “@” id-right “>”

Let’s start with the following Message-ID header line from a hypothetical email message:

Message-ID: <CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>

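If we compare the content of the header line with the required format, we see that the content conforms to the format. The value is enclosed in angle brackets, with an id-left of CALckR-a8UDkRjO4xJyjd_s0GPxQWw and an id-right of mail.gmail.com, separated by “@”.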

If the platform finds information in the required format, it passes the bracketed value to its EDRM MIH generator. In this example, the value it passes looks like this:

<CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>

Next, the platform uses that information to generate an EDRM MIH hash value. With this example, the hash value obtained is:

1de319c276884bd0c9e2f1621ada26cc

Finally, the platform adds this hash value to the EDRM_MSGID field for the email message.

Note that the EDRM MIH may only be generated if an email has a valid Message-ID value that has not been altered in any way. Where more than one Message-ID value is contained within an email, the MIH must be generated using only the first Message-ID value declared in the parent email message headers.

For detailed steps, go to the article in the Reveal 11 Knowledgebase.
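
For readers who want to see the flow end to end, here is a rough Python sketch of the process described in this section, using only the standard library. It is an illustration under stated assumptions, not Reveal's code. The function name process_message, the dictionary used as a stand-in for a document record, and the simplified format check are all hypothetical.

```python
import hashlib
import re
from email import message_from_string

# Simplified "<" id-left "@" id-right ">" check; an assumption for this sketch.
MESSAGE_ID_RE = re.compile(r"<[^<>@\s]+@[^<>@\s]+>")


def process_message(raw_email: str) -> dict:
    """Parse one raw email and fill a hypothetical record's EDRM_MSGID field."""
    msg = message_from_string(raw_email)
    record = {"EDRM_MSGID": None}

    # get_all returns every Message-ID header in the order declared.
    headers = msg.get_all("Message-ID") or []
    if not headers:
        return record  # no Message-ID header: no MIH

    # Only the first declared Message-ID is considered.
    first = str(headers[0]).strip()
    if MESSAGE_ID_RE.fullmatch(first):
        # Hash the bracketed value exactly as found (UTF-8 bytes assumed).
        record["EDRM_MSGID"] = hashlib.md5(first.encode("utf-8")).hexdigest()
    return record
```

Given a message whose headers contain two Message-ID lines, this sketch hashes only the first one. A message with no Message-ID header, such as an unsent draft, leaves EDRM_MSGID empty.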

Uses

EDRM MIH values are available for a variety of uses, including as part of load files generated for document productions.
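
As one illustration of such a use, the Python sketch below groups records from different platforms by their EDRM_MSGID value to surface candidate duplicates. The record layout, the doc_id field, and the function name group_by_mih are hypothetical, and the hash strings in the usage example are placeholders rather than real MD5 values. Matches should be treated as candidates only, since the deliverables describe situations where the EDRM MIH may not be appropriate.

```python
from collections import defaultdict


def group_by_mih(records):
    """Map each MIH shared by two or more records to the list of their doc IDs."""
    groups = defaultdict(list)
    for rec in records:
        mih = rec.get("EDRM_MSGID")
        if mih:  # skip items with no MIH (drafts, non-email items, and so on)
            groups[mih].append(rec["doc_id"])
    # Keep only hash values that occur more than once.
    return {mih: ids for mih, ids in groups.items() if len(ids) > 1}


# Placeholder records, as if read from two platforms' load files.
records = [
    {"doc_id": "A-001", "EDRM_MSGID": "hash-one"},
    {"doc_id": "B-107", "EDRM_MSGID": "hash-one"},
    {"doc_id": "A-002", "EDRM_MSGID": "hash-two"},
    {"doc_id": "B-108", "EDRM_MSGID": None},
]
candidates = group_by_mih(records)
```

Here candidates contains a single entry that pairs A-001 with B-107, because those are the only two records sharing a hash value.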

Learn More

Today, we discussed how Reveal has incorporated the new EDRM Message Identification Hash into Reveal 11. If your organization is interested in learning more about how Reveal maintains its position as a leader, with its AI-powered end-to-end legal document review platform, contact us.