This article provides a detailed description of all the features and options associated with Email Threading.
Overview
The purpose of email threading is to:
- Identify email messages and attachments in the dataset.
- Identify duplicate email messages.
- Find messages that belong to the same email thread.
- Mark which messages contain unique content not present in any other message.
- Determine the hierarchy and sort order of messages within each thread.
Email Threads filtering may be applied on the Dashboard screen. See Filtering Email Threads at the end of this article for details.
Analyzing Email Threads
Field Mapping
Email threading relies on certain document metadata fields during processing, such as the From, To, and CC email headers. Prior to initial processing, it is important to examine the data to be processed and properly configure the field map settings so that these metadata fields are available to the processing engine.
Fields are assigned field categories by the schema, which then controls how the contents of those fields are treated. These categories give the field contents a role in email threading:
Table 1. Metadata Fields Required by Email Threading
Reveal Field Name | Reveal Display Name | Type / Purpose
Body Text (The body text used is based upon order <OCR / Loaded / Extracted>) | | At least 300 documents with text are required.
ITEMID | Item ID | ID - The numeric identifier of the document.
SUBJECT_OTHER | Email Subject | Required for faster processing time.
ATTACHMENT_LIST | Attachment List | “Attachment”
BCC | Bcc | Email only, Email's "BCC" field.
CC_ADDRESSES | Cc | Email only, Email's "CC" field.
SENT_DATE | Date Sent | All known (non-custom) DATE, TIME pairs are combined into a single field value when ingested for the histogram.
SENDER | From | Email only, Email's "From" field.
PARENT_ITEMID | Parent ID | Review internal family identifier. Auto generated upon family build.
RECIPIENT | To | Email only, Email's "To" field.
Identifying email
Processing begins by scanning the metadata of each document in the dataset to determine which documents are email messages or attachments. Any documents referenced in the attachment field, or which have a non-empty parent_id field, are classified as email attachments. Documents with a non-empty from field that are not attachments are classified as email messages.
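The classification rules above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the dictionary keys (id, from, parent_id, attachments) are hypothetical stand-ins for the mapped fields, and the sketch assumes the attachment field lists document ids.

```python
def classify_documents(docs):
    """Label each document as 'attachment', 'email', or 'other'.

    Each doc is a dict with hypothetical keys: 'id', 'from',
    'parent_id', and 'attachments' (assumed to be a list of document ids).
    """
    # Every id that appears in some document's attachment list.
    referenced = {att for doc in docs for att in doc.get("attachments") or []}

    labels = {}
    for doc in docs:
        if doc["id"] in referenced or doc.get("parent_id"):
            # Referenced as an attachment, or has a non-empty parent id.
            labels[doc["id"]] = "attachment"
        elif doc.get("from"):
            # Non-empty From field and not an attachment.
            labels[doc["id"]] = "email"
        else:
            labels[doc["id"]] = "other"
    return labels


docs = [
    {"id": 1, "from": "ann@example.com", "attachments": [2]},
    {"id": 2, "from": "", "parent_id": 1},
    {"id": 3, "from": "bob@example.com", "attachments": [4]},
    {"id": 4, "from": "carl@example.com"},  # attached email: still an attachment
    {"id": 5},
]
print(classify_documents(docs))
```

Note that the attachment test runs first, so an email that is attached to another email (document 4 here) is classified as an attachment even though its From field is populated.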
Identifying duplicates
Duplicate email messages are detected by the same process that creates exact duplicate groups for the entire data set. The set of fields used for exact duplicate detection is configurable at data set creation time. The attachments of a document are taken into account in exact duplicate detection, though only the contents of the attachments are compared, not their keys or filenames.
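One common way to build exact duplicate groups is to hash the selected fields together with the attachment contents. The sketch below assumes that approach purely for illustration; the article does not say how the engine computes its groups, and the field names and the (filename, content) attachment layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def duplicate_key(doc, fields):
    """Hash the chosen fields plus attachment contents (filenames are ignored)."""
    h = hashlib.sha256()
    for field in fields:
        h.update(str(doc.get(field, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent field values cannot run together
    # Hash each attachment's bytes; sort the digests so attachment order is irrelevant.
    digests = sorted(
        hashlib.sha256(content).digest()
        for _filename, content in doc.get("attachments", [])
    )
    for digest in digests:
        h.update(digest)
    return h.hexdigest()


def group_exact_duplicates(docs, fields):
    """Return lists of document ids that share the same duplicate key."""
    groups = defaultdict(list)
    for doc in docs:
        groups[duplicate_key(doc, fields)].append(doc["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


base = {"from": "ann@example.com", "subject": "Q3", "body": "see attached"}
docs = [
    {**base, "id": 1, "attachments": [("q3.xlsx", b"data")]},
    {**base, "id": 2, "attachments": [("copy of q3.xlsx", b"data")]},  # same content
    {**base, "id": 3, "attachments": [("q3.xlsx", b"other")]},  # different content
]
print(group_exact_duplicates(docs, ["from", "subject", "body"]))
```

Documents 1 and 2 land in the same group because their attachments have identical contents, even though the filenames differ. Document 3 shares a filename with document 1 but not the content, so it stays on its own.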
Grouping Messages into Threads
The next step finds all messages that belong together in the same email thread. A pair of messages is considered to be part of the same thread if either (a) Conversation Index information is available indicating that they are in the same thread, or (b) they have a similar subject and share a significant portion of their body text, as described next.
For efficiency, messages are first separated into groups with the same normalized subject. The normalized subject is the original subject with prefixes such as Re: and Fw: removed. Messages with the same normalized subject are then compared by their body content.
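A minimal sketch of subject normalization and grouping follows. The prefix list (Re, Fw, Fwd) and the case-insensitive, whitespace-collapsing comparison are assumptions made for the example; the article names only Re: and Fw: and does not spell out the full rule set.

```python
import re
from collections import defaultdict

# Assumed prefix list; the article gives only "Re:" and "Fw:" as examples.
_PREFIX = re.compile(r"^\s*(?:re|fw|fwd)\s*:\s*", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading reply/forward prefixes repeatedly, then tidy whitespace and case."""
    previous = None
    while previous != subject:
        previous = subject
        subject = _PREFIX.sub("", subject, count=1)
    return " ".join(subject.split()).lower()


def group_by_subject(messages):
    """Map each normalized subject to the ids of the messages that share it."""
    groups = defaultdict(list)
    for msg in messages:
        groups[normalize_subject(msg["subject"])].append(msg["id"])
    return dict(groups)


messages = [
    {"id": 1, "subject": "Budget review"},
    {"id": 2, "subject": "RE: Budget review"},
    {"id": 3, "subject": "Fw: re: Budget  review"},
    {"id": 4, "subject": "Lunch?"},
]
print(group_by_subject(messages))
```

Only messages inside the same group would then go on to the more expensive body comparison.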
To compare the body content of messages, an approach called shingling is used. In this approach, each document is represented as a set of unique shingles, where shingles are the n-grams (sequences of n consecutive words) found in the body text.
As an example, the list of 3-grams for the document "a rose is a rose is a rose" would be as follows:
- “a rose is”
- “rose is a”
- “is a rose”
- “a rose is”
- “rose is a”
- “is a rose”
Removing duplicates, we are left with the following set of unique shingles:
{ “a rose is”, “rose is a”, “is a rose” }
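The same example can be reproduced with a short Python function. This sketch assumes shingles are built from whitespace-separated words, which matches the example above but may differ from the tokenization the engine actually uses.

```python
def shingles(text, n=3):
    """Return the set of unique word n-grams in the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


print(sorted(shingles("a rose is a rose is a rose")))
# ['a rose is', 'is a rose', 'rose is a']
```

Because the result is a set, the six overlapping 3-grams collapse to three unique shingles. A text with fewer than n words produces an empty set.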
Once documents are represented as sets of shingles, set intersection operations are used to measure the percentage of a message’s content that is contained within another message.
When comparing two messages (A and B), the engine determines the percentage of message A’s shingles that are contained within message B. If this percentage meets or exceeds the containment threshold (100% by default), the messages are assigned to the same thread and message B is marked as a response to message A.
The following property can help you fine-tune an Email Threading session:
- Containment-Threshold - This property sets the percentage of an email’s shingles that need to be contained within another email for it to be considered contained within that email. To make inclusiveness more conservative, set the containment-threshold to 1.0 (that is, 100%).
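The containment measure described above can be sketched as a set intersection. The function names, the word-based shingles, and the treatment of the threshold as a fraction between 0.0 and 1.0 are assumptions made for illustration.

```python
def shingles(text, n=3):
    """Return the set of unique word n-grams in the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a, b, n=3):
    """Fraction of A's shingles that also appear in B (0.0 to 1.0)."""
    shingles_a, shingles_b = shingles(a, n), shingles(b, n)
    if not shingles_a:
        return 0.0  # nothing to compare, so treat A as not contained
    return len(shingles_a & shingles_b) / len(shingles_a)


def is_contained(a, b, threshold=1.0, n=3):
    """True when B holds at least `threshold` of A's shingles."""
    return containment(a, b, n) >= threshold


original = "please send the quarterly report by friday"
reply = "will do thanks please send the quarterly report by friday"

print(containment(original, reply))  # 1.0
print(containment(reply, original))  # 0.625
print(is_contained(original, reply))  # True
```

The measure is asymmetric: the reply contains every shingle of the original, but the original holds only five of the reply's eight shingles. That asymmetry is what lets the reply be treated as a response to the original rather than the other way around.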
Identifying unique messages
After assigning messages to threads, the next step is marking each message unique if it contains content that is not contained in any other message in the thread.
A message is marked as a unique message when its body is compared to the bodies of the other messages in the thread and none is found that meets the containment threshold. A message is marked as a unique attachment if it has an attachment that is not contained in any of the responses to the message. When comparing attachments, the actual content of the attachments is used, not the filenames or document keys.
Note
Messages are always marked unique unless there is another message (that is, a reply or forward) that contains the text of that message. When a response edits the original text inline, both the original and the response are marked unique unless the edits are very small or trivial. Date is not used to decide whether a message is unique.
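A rough sketch of the unique-message rule: a message is unique when no other message in the thread contains it at the threshold. The helper names and the word-based shingles are assumptions, the sketch covers message bodies only (not attachments), and it assumes exact duplicates were already handled in the earlier step.

```python
def shingles(text, n=3):
    """Return the set of unique word n-grams in the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a, b, n=3):
    """Fraction of A's shingles that also appear in B (0.0 to 1.0)."""
    shingles_a, shingles_b = shingles(a, n), shingles(b, n)
    if not shingles_a:
        return 0.0
    return len(shingles_a & shingles_b) / len(shingles_a)


def mark_unique(thread, threshold=1.0):
    """Map message id -> True when no other message in the thread contains it."""
    result = {}
    for msg_id, body in thread.items():
        result[msg_id] = not any(
            containment(body, other_body) >= threshold
            for other_id, other_body in thread.items()
            if other_id != msg_id
        )
    return result


thread = {
    "m1": "can we move the meeting to monday",
    "m2": "yes monday works can we move the meeting to monday",
    "m3": "no monday is bad can we shift the meeting to tuesday",
}
print(mark_unique(thread))
# {'m1': False, 'm2': True, 'm3': True}
```

Message m1 is fully quoted in m2, so it is not unique. Message m3 rewrites the quoted text inline ("shift", "tuesday"), so neither it nor any other message contains it and it stays unique. Dates play no part in the decision.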
Assigning hierarchy and sort order
Messages within a thread are assigned an overall sort order and an indentation level. The indentation level of each message is first computed based on the response relationships of the messages. Messages have an indentation level one greater than the message that they are a response to. Messages that are not a response to any other message have an indentation level of 0.
Messages are sorted in a hierarchical manner such that the message with the lowest indentation level and earliest date comes first, followed by all messages that are a response to that message. A second sort order is available that instead puts more inclusive messages earlier in the ranking.
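The default hierarchical ordering can be sketched as a depth-first walk over the response relationships. The data layout (a parent key and a date key per message) is hypothetical, and the sketch assumes each message responds to at most one other message and that the relationships contain no cycles.

```python
from datetime import datetime


def thread_order(messages):
    """Return (message id, indent level) pairs in hierarchical display order.

    messages maps id -> {"parent": id or None, "date": datetime}.
    """
    # Index the responses under the message they respond to.
    children = {}
    for msg_id, msg in messages.items():
        children.setdefault(msg["parent"], []).append(msg_id)

    def by_date(ids):
        return sorted(ids, key=lambda i: messages[i]["date"])

    ordered = []

    def visit(msg_id, indent):
        ordered.append((msg_id, indent))
        # Responses sit one level deeper, earliest first.
        for child in by_date(children.get(msg_id, [])):
            visit(child, indent + 1)

    # Messages that respond to nothing start at indent level 0.
    for root in by_date(children.get(None, [])):
        visit(root, 0)
    return ordered


messages = {
    "a": {"parent": None, "date": datetime(2023, 1, 1)},
    "b": {"parent": "a", "date": datetime(2023, 1, 2)},
    "c": {"parent": "a", "date": datetime(2023, 1, 3)},
    "d": {"parent": "b", "date": datetime(2023, 1, 4)},
}
print(thread_order(messages))
# [('a', 0), ('b', 1), ('d', 2), ('c', 1)]
```

Message d is dated later than c but appears before it, because all responses to b are listed before moving on to the next response to a.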
Note
Attachments also get their own threading information, such as ThreadId, Sort Order, and Indent Level.
Filtering Email Threads
Reveal 11.10 added Email Threads filters to the left Sidebar, which may be applied when reviewing project documents. This filter type has two subsets:
- Action Type offers options for filtering by message send, reply, reply-all, or forward, as well as whether an email thread document contains any value or no value for the action type metadata.
- Is Unique offers options for filtering messages by True or False (whether or not a message contains content that is not found in any other message in the thread). This subset can also filter on whether an email thread document contains any value or no value for Is Unique.
Last Updated 12/27/2023