*This functionality is available for Advanced users only.
Why use Deduplication?
Many collection efforts result in multiple copies of the same files (documents, mailboxes, etc.). For example, if you collect John Smith’s and Cindy Loh’s mailboxes, and they often communicate, there will likely be duplicate emails across the two custodians.
Uploading and storing a file multiple times not only wastes storage space but also adds unnecessary complexity, confusion, time, and cost.
Introduction to Deduplication settings
When you open a new Nextpoint database, Deduplication is turned on by default. However, we suggest you review your import settings before initializing any imports to ensure the desired settings are selected.
To view and/or edit your Deduplication settings, go to SETTINGS > Import and click Edit.
Note: If working in a Prep database, navigate to MORE > SETTINGS > Import and click Edit.
Disable or Enable Deduplication using the pill-shaped toggle at the top right of the Deduplication Settings screen.
When off / disabled, no deduplication will occur and all files will be imported (most often recommended when importing data that has been produced to you).
When on / enabled, you can also change the definition of a duplicate by editing the criteria described in further detail below.
Once you’re satisfied with your Deduplication settings, click Back.
Understanding Deduplication settings
When enabled, Nextpoint processes only one copy of duplicate files as defined by your Settings. Upon a confirmed duplicate occurrence, the following fields will be merged from the new file into the existing/duplicate document:
- mailbox_file
- mailbox_path
- recipients
- cc
- root_folder
- file_path
- shortcut
- custodians (assigned during import)
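As a rough illustration of the merge described above, the sketch below folds the listed fields from a newly encountered duplicate into the document already in the database. The dictionary layout, the function name merge_duplicate, and the choice to store each field as a list of unique values are assumptions made for this example; they are not Nextpoint's actual data model.

```python
# Fields the article lists as merged when a duplicate is confirmed.
MERGE_FIELDS = [
    "mailbox_file", "mailbox_path", "recipients", "cc",
    "root_folder", "file_path", "shortcut", "custodians",
]


def merge_duplicate(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with merge-field values from `new` added.

    Assumption: each merge field holds a list of values, and merging
    means appending any value that is not already present.
    """
    merged = dict(existing)
    for field in MERGE_FIELDS:
        values = list(existing.get(field, []))
        for value in new.get(field, []):
            if value not in values:
                values.append(value)
        merged[field] = values
    return merged


# Hypothetical records for the same file collected from two custodians.
existing_doc = {
    "custodians": ["John Smith"],
    "file_path": ["/smith/inbox/report.docx"],
}
new_doc = {
    "custodians": ["Cindy Loh"],
    "file_path": ["/loh/sent/report.docx"],
}
merged_doc = merge_duplicate(existing_doc, new_doc)
```

In this sketch only one document survives, but it records both custodians and both file paths, which mirrors the idea that the duplicate's provenance is kept even though the file itself is not imported twice.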
Below is a closer look at the criteria you can employ in your settings to define “what makes a duplicate?”:
Content Hash
When Deduplication is enabled, documents/emails with the same content hash value (or “electronic fingerprint”) are always considered to be an exact match. That is why the setting says “Always On”.
The hash is built from the contents of the file, and any change in the content of the file changes the hash. Therefore, if any attribute of one file is different from the next, they will not have the same content hash, will not be subject to deduplication, and both files will be preserved.
This is our least aggressive deduplication option (assuming Message ID is Ignored and Context Criteria is Included).
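To make the "electronic fingerprint" idea concrete, here is a minimal sketch that hashes file contents with MD5 using Python's standard hashlib module. The helper name content_hash and the sample byte strings are illustrative assumptions, and the sketch does not claim to reproduce exactly which bytes Nextpoint feeds into its hash.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the MD5 hex digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


original = b"Quarterly report, final version."
exact_copy = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"  # one character changed

# Identical content yields an identical hash, so these two would match.
same = content_hash(original) == content_hash(exact_copy)

# Any change to the content yields a different hash, so these would not match.
different = content_hash(original) != content_hash(edited)
```

Both same and different evaluate to True here: byte-for-byte copies share a fingerprint, while a single changed character produces a different one.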
The following describes how this deduplication setting handles loose documents and emails:
LOOSE DOCUMENTS: If Content Hash is different for two documents, we don't import, and move on to check the Context Settings (tab #3).
EMAILS: If Content Hash is different for two emails, we don't import and move on to check the Email Message ID (tab #2).
Email Message ID
In addition to content hash, emails may contain another unique identifier, the Email Message ID, which is generated by the client program (Outlook, Gmail, etc.) or the first email server at the moment the email is sent.
The following describes how this deduplication setting handles emails (this setting does not apply to loose documents):
Ignore Email Message ID
If Message ID is set to Ignore, then we import the email, because only Content Hash matches (as described above) will be considered duplicates.
Use Ignore Email-Message-ID to dedupe conservatively.
Include Email Message ID
If Message ID is set to Include, then we will check for a match here.
- If Message ID is different, we will import the NewEmail because it's a new/different file. In other words, only emails with the same Email Message ID will be considered to be an exact match, even if their content hash does not match.
- If Message ID is the same, we are going to move on to check for Context (#3 below).
This is considered to be the most aggressive form of deduplication (assuming Content Hash is enabled and Context Criteria is set to Include) and is recommended when importing emails.
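The Message ID step above can be sketched as a small decision function. It reads the Message-ID header with Python's standard email module and applies the Ignore / Include rules to two emails whose content hashes differ. The function name message_id_step, the "ignore" / "include" strings, the return labels, and the sample messages are all hypothetical; treating a missing Message-ID as "different" is also an assumption of this sketch.

```python
import hashlib
from email import message_from_string


def message_id_step(new_raw: str, existing_raw: str, setting: str) -> str:
    """Apply the Message ID step to two emails whose content hashes differ.

    `setting` is "ignore" or "include" (hypothetical names for the toggle).
    Returns "import" or "check_context".
    """
    if setting == "ignore":
        # Only Content Hash matches count as duplicates, so import.
        return "import"
    new_id = message_from_string(new_raw)["Message-ID"]
    existing_id = message_from_string(existing_raw)["Message-ID"]
    if new_id is None or new_id != existing_id:
        # Different (or missing) Message ID: treated as a new email.
        return "import"
    # Same Message ID: the Context Criteria decide what happens next.
    return "check_context"


# Two copies of one sent email; an extra header changes the bytes.
john_copy = (
    "Message-ID: <abc123@example.com>\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon.\n"
)
cindy_copy = (
    "Message-ID: <abc123@example.com>\n"
    "Received: from mail.example.com\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon.\n"
)

hashes_differ = (
    hashlib.md5(john_copy.encode()).hexdigest()
    != hashlib.md5(cindy_copy.encode()).hexdigest()
)
with_ignore = message_id_step(cindy_copy, john_copy, "ignore")
with_include = message_id_step(cindy_copy, john_copy, "include")
```

With these samples the hashes differ, so Ignore leads to "import", while Include recognizes the shared Message ID and hands the decision to the context check.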
Context Criteria
A file's context is determined by its load file values or location within a folder structure.
Sometimes, two or more files may have the same content hash and/or Email Message ID, but appear in different contexts. For example, copies of the same loose email file may be loaded at different times from different custodians’ folders.
Ignore Context
Choosing Ignore Context means that any files meeting your File Match Criteria will be deduplicated, even if they appear in different contexts. It’s considered to be a more aggressive choice.
LOOSE DOCUMENTS: If you have selected to Ignore Context, we base deduplication on the MD5 hash, so the document will be deduped: “This is a match, dedupe the document.”
EMAILS: If you have selected to Ignore Context, then we will dedupe: “This NewEmail has the same Message ID as another and we aren't paying attention to anything else, so we will dedupe and not bring in NewEmail.”
Include Context
Choosing Include Context ensures that matching files that appear in different contexts are not subject to deduplication, and both files will be preserved. This is the default setting.
LOOSE DOCUMENTS
- If you have selected to Include Context, and NewDoc + ExistingDoc have different load file values, then we will bring in both files.
- If you have selected to Include Context, and NewDoc + ExistingDoc have the same load file values, then we will dedupe and only bring in one file.
- Variances in the following document attributes will be recorded in the ExistingDoc: mailbox_file, mailbox_path, recipients, cc, root_folder, file_path, shortcut, custodians (assigned during import)
EMAILS: If you have selected to Include Context, then we will look at all fields (whether intrinsic or load file values):
- If everything in NewEmail matches ExistingEmail except the merge fields listed above, then we dedupe.
- If anything in NewEmail other than the merge fields is different, we will NOT dedupe and will instead import that NewEmail.
- Example: If everything else is the same except something like a BCC or Date_received value, we will import that NewEmail.
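The Include Context rule for emails can be sketched as a field-by-field comparison that skips the merge fields. The dictionary representation, the key names such as bcc and date_received, and the function name is_context_duplicate are assumptions made for illustration only.

```python
# Merge fields are excluded from the comparison, per the rule above.
MERGE_FIELDS = {
    "mailbox_file", "mailbox_path", "recipients", "cc",
    "root_folder", "file_path", "shortcut", "custodians",
}


def is_context_duplicate(new_email: dict, existing_email: dict) -> bool:
    """Return True if the emails match on every field except merge fields."""
    compared = (set(new_email) | set(existing_email)) - MERGE_FIELDS
    return all(new_email.get(f) == existing_email.get(f) for f in compared)


existing_email = {
    "subject": "Lunch",
    "bcc": "",
    "date_received": "2024-01-05",
    "custodians": ["John Smith"],
}
# Differs only in a merge field (custodians): treated as a duplicate.
same_context = dict(existing_email, custodians=["Cindy Loh"])
# Differs in BCC, which is not a merge field: imported as a new email.
different_bcc = dict(existing_email, bcc="legal@example.com")

dedupe_same = is_context_duplicate(same_context, existing_email)
dedupe_bcc = is_context_duplicate(different_bcc, existing_email)
```

Here dedupe_same is True and dedupe_bcc is False, matching the BCC example in the list above.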
Merges and Conflicts
When importing with Include Context selected, Nextpoint will check for context merges and conflicts. If any documents are nearly identical, but conflict on any of the fields listed below, they will not be considered context duplicates and both files will be preserved.
The following are conflict fields:
- Author
- Bates Start
- Bates End
- Document Date
- Email Message ID
- Email Reply ID
- Subject/Title
If you do not want to keep duplicates that result from conflicts (variances) in the aforementioned fields, we suggest importing with Context Criteria set to Ignore Context.
Note: An email's location within a container file (MBOX, PST) is not considered Context Criteria. This is because the email is technically part of the container, rather than an independent document.
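A conflict check along these lines can be sketched as follows. The snake_case keys standing in for the conflict fields, the function name conflicting_fields, and the definition of a conflict as "the two values are not equal" are assumptions; the article does not specify how Nextpoint compares the values internally.

```python
# Hypothetical keys standing in for the conflict fields listed above.
CONFLICT_FIELDS = [
    "author", "bates_start", "bates_end", "document_date",
    "email_message_id", "email_reply_id", "subject_title",
]


def conflicting_fields(new_doc: dict, existing_doc: dict) -> list:
    """Return the conflict fields on which the two documents differ."""
    return [
        field for field in CONFLICT_FIELDS
        if new_doc.get(field) != existing_doc.get(field)
    ]


existing_doc = {"author": "J. Smith", "subject_title": "Q3 Report"}
new_doc = {"author": "C. Loh", "subject_title": "Q3 Report"}

conflicts = conflicting_fields(new_doc, existing_doc)
```

In this example conflicts contains only "author", so under Include Context the two documents would not be treated as context duplicates and both would be preserved.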