Deduplication Settings

*This functionality is available for Advanced users only.

Why use Deduplication?

Many collection efforts result in multiple copies of the same files (documents, mailboxes, etc.). For example, if you collect John Smith’s and Cindy Loh’s mailboxes, and they often communicate, there will likely be duplicate emails across the two custodians.

Uploading and storing a file multiple times not only wastes storage space but also adds unnecessary complexity, confusion, time, and cost.

Introduction to Deduplication settings

When you open a new Nextpoint database, Deduplication is turned on by default. However, we suggest you review your import settings before starting any imports to make sure the settings you want are selected.

To view and/or edit your Deduplication settings, go to SETTINGS > Import and click Edit. 

Note: If working in a Prep database, navigate to MORE > SETTINGS > Import and click Edit.

[Screenshot: Deduplication settings view]

Disable or Enable Deduplication using the pill-shaped toggle at the top right of the Deduplication Settings screen.

When the toggle is off (disabled), no deduplication occurs and all files are imported. This is most often recommended when importing data that has been produced to you.

When the toggle is on (enabled), you can also change the definition of a duplicate by editing the criteria described in further detail below.

Once you’re satisfied with your Deduplication settings, click Back.

Understanding Deduplication settings

When Deduplication is enabled, Nextpoint processes only one copy of duplicate files, as defined by your Settings. When a duplicate is confirmed, the following fields are merged from the new file into the existing document:

  • mailbox_file
  • mailbox_path
  • recipients
  • cc
  • root_folder
  • file_path
  • shortcut
  • custodians (assigned during import)
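As an illustration only, the merge described above can be sketched in Python. This is not Nextpoint's implementation: the function name `merge_duplicate`, the idea that each field holds a list of values, and the "append values not already present" rule are all assumptions made for the example. Only the field names come from the list above.

```python
# Field names taken from the list above; everything else is hypothetical.
MERGED_FIELDS = [
    "mailbox_file", "mailbox_path", "recipients", "cc",
    "root_folder", "file_path", "shortcut", "custodians",
]

def merge_duplicate(existing: dict, incoming: dict) -> dict:
    """Return a copy of `existing` with the merged fields extended by
    any values from `incoming` that are not already present.
    Assumes each merged field holds a list of values."""
    merged = dict(existing)
    for field in MERGED_FIELDS:
        values = list(existing.get(field, []))
        for value in incoming.get(field, []):
            if value not in values:
                values.append(value)
        merged[field] = values
    return merged

# The existing document was imported from John Smith's mailbox.
existing = {"title": "Budget", "custodians": ["John Smith"],
            "mailbox_path": ["Inbox/Finance"]}
# The same email is later found in Cindy Loh's mailbox.
incoming = {"title": "Budget", "custodians": ["Cindy Loh"],
            "mailbox_path": ["Sent Items"]}

merged = merge_duplicate(existing, incoming)
print(merged["custodians"])    # ['John Smith', 'Cindy Loh']
print(merged["mailbox_path"])  # ['Inbox/Finance', 'Sent Items']
```

In this sketch only one copy of the document is kept, but it still records both custodians and both mailbox locations.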

Below is a closer look at the criteria you can employ in your settings to define “what makes a duplicate?”:

1 | Content Hash 2 | Message ID 3 | Context

Content Hash

When Deduplication is enabled, documents/emails with the same content hash value (or “electronic fingerprint”) are always considered to be an exact match. That is why the setting says “Always On”.

The hash is built from the contents of the file, and any change in the content of the file changes the hash. Therefore, if the contents of two files differ in any way, they will not have the same content hash, will not be subject to deduplication, and both files will be preserved.
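The short Python sketch below shows the general idea of a content hash. It uses SHA-256 from the standard library as an assumption; this article does not state which hash algorithm Nextpoint uses, and the helper name `content_hash` is hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return a hex digest of the file contents.
    SHA-256 is an assumption made for this example."""
    return hashlib.sha256(data).hexdigest()

original = b"Quarterly report, final version."
exact_copy = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"  # one character changed

# Identical contents produce identical hashes.
print(content_hash(original) == content_hash(exact_copy))  # True
# Any change in the contents produces a different hash.
print(content_hash(original) == content_hash(edited))      # False
```

Because the hash depends only on the bytes in the file, two exact copies match no matter where they were collected from, while a single changed character is enough to make the files distinct.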

This is our least aggressive deduplication option (assuming Message ID is Ignored and Context Criteria is Included).

Here is how this deduplication setting handles loose documents and emails:

LOOSE DOCUMENTS: If the Content Hash is the same for two documents, we don't import yet and instead move on to check the Context Settings (tab #3).

EMAILS: If the Content Hash is different for two emails, we don't import yet and instead move on to check the Email Message ID (tab #2).

 

 

 
