Understanding Deduplication

Annie Johnson

February 06, 2026 17:28

Deduplication at the time of import prevents existing documents and email families from entering your database multiple times. The deduplication settings selected in the import workflow determine what counts as a 'Duplicate' for the import batch at hand.

When deduplication is turned off at the time of import, no deduplication will occur and all files will be imported.

NextGen Deduplication Experience

Deduplication has been simplified in NextGen processing. Instead of configuring file match and context criteria, you now only choose whether deduplication is on or off during import.

Deduplication OFF: All copies of files (including duplicates already in the database) are imported as separate documents. Exact copies can still be viewed in the Related Documents panel.
Deduplication ON: Files are evaluated using MD5 hash matching. Emails are checked by MD5 hash or Message ID. Matching files are treated as duplicates and only one copy is retained in the database.
Family-level deduplication: Attachments are deduplicated within their family. If the same attachment exists under a different parent email, it will be treated as a new document.
Metadata handling: When duplicates are merged, select file path and mailbox metadata fields are concatenated. Other metadata and custom fields remain unchanged.
Optional BCC merge: A BCC merge option allows blind-copy recipients from duplicate emails to be combined into a single record (the document image will still only display the initial BCC value).
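
The NextGen rules above can be sketched in a few lines of Python. This is an illustrative model only, not Nextpoint's implementation: the Doc class, the import_batch function, and the choice to store concatenated file paths as a list are all assumptions made for the example. Family-level handling of attachments and the BCC merge option are left out.

```python
import hashlib
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for an imported file or email."""
    content: bytes
    file_paths: List[str] = field(default_factory=list)
    message_id: Optional[str] = None  # only emails carry a Message ID

    @property
    def md5(self) -> str:
        return hashlib.md5(self.content).hexdigest()


def import_batch(docs: List[Doc], dedupe_on: bool) -> List[Doc]:
    """Return the documents retained after import (hypothetical helper)."""
    if not dedupe_on:
        # Deduplication OFF: every copy becomes its own document.
        return [replace(d, file_paths=list(d.file_paths)) for d in docs]

    retained: List[Doc] = []
    by_md5: Dict[str, Doc] = {}
    by_message_id: Dict[str, Doc] = {}
    for doc in docs:
        match = by_md5.get(doc.md5)
        if match is None and doc.message_id is not None:
            # Emails may also match on Message ID.
            match = by_message_id.get(doc.message_id)
        if match is not None:
            # Duplicate: keep one copy and concatenate its file paths.
            match.file_paths.extend(doc.file_paths)
            continue
        kept = replace(doc, file_paths=list(doc.file_paths))
        retained.append(kept)
        by_md5[kept.md5] = kept
        if kept.message_id is not None:
            by_message_id[kept.message_id] = kept
    return retained
```

In this sketch, a batch holding two byte-identical files plus two emails that share a Message ID yields four documents with deduplication off and two with it on, each retained copy listing both source paths.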

The information below reflects the legacy deduplication experience and may differ from what you see in NextGen databases.

How do I set Deduplication Settings?

When importing data, Nextpoint will make pre-set recommendations for Deduplication settings in Step 2 of the import workflow. The dedupe selections are populated based on the type of data detected from the File Room (or selected in Step 1).

If you would like to modify the recommended settings, make sure the applicable toggle is turned ON and click the gear to open the settings pop-up.

Location of Deduplication Settings

How to Modify Recommended Deduplication Settings

Upon clicking the gear icon, you will be presented with the option to toggle File Match Criteria and Context Criteria ON or OFF. Read more below on how File Match and Context Criteria factor into the deduplication process.

Default Deduplication Settings per Import Type

Outlined below is a list of the various import types and their associated deduplication settings:

How does Deduplication work?

FIGURE 1

INITIAL PROCESSING

1 | First, you queue your files for processing by initiating an import. See the three different file types queued for processing on the far left in Figure 1 (above).

2 | Once you initiate an import, Nextpoint begins processing by extracting files from their containers (zips, pst, box), extracting any attachments from their parent emails, and extracting metadata.
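
To make the extraction step more concrete, here is a minimal sketch that pulls attachments and two metadata fields out of a raw email using Python's standard email package. It illustrates the idea only and is not how Nextpoint processes mail; the extract_family name and the shape of the return value are assumptions.

```python
from email import message_from_bytes, policy
from typing import Dict, List, Optional, Tuple

Attachment = Tuple[Optional[str], bytes]  # (filename, decoded bytes)


def extract_family(raw_email: bytes) -> Tuple[Dict[str, Optional[str]], List[Attachment]]:
    """Split a raw email into metadata and attachments (hypothetical helper)."""
    msg = message_from_bytes(raw_email, policy=policy.default)

    message_id = msg["Message-ID"]
    subject = msg["Subject"]
    metadata = {
        "email_message_id": str(message_id).strip() if message_id is not None else None,
        "email_subject": str(subject) if subject is not None else None,
    }

    # iter_attachments() skips the message body and yields the attached parts.
    attachments = [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    return metadata, attachments
```

The parent email and its attachments returned here together form the email family referred to in the steps that follow.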

FILE MATCH CRITERIA

3 | Next is the First Deduplication Pass, in which we look for a matching expansive hash across all three file types.

For a loose file, the expansive hash is the MD5 hash value.
For an email family, the expansive hash is formulated from the MD5s of all members of the email family.

Once an expansive hash is identified, we compare it against other files being processed and files already in the database. Any matches are placed in a Dedupe Queue. Loose files without a match are imported.
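
A rough sketch of the expansive hash idea follows. The article does not say how the member MD5s of an email family are combined, so the combination rule below (hashing the member digests joined in family order) is purely an assumption, as are the function names.

```python
import hashlib
from typing import Sequence


def md5_hex(data: bytes) -> str:
    """MD5 digest of raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def expansive_hash(members: Sequence[bytes]) -> str:
    """Expansive hash of a loose file (one member) or an email family.

    `members` holds the raw bytes of each family member, parent first.
    """
    if not members:
        raise ValueError("a family needs at least one member")
    member_md5s = [md5_hex(m) for m in members]
    if len(member_md5s) == 1:
        # Loose file: the expansive hash is simply its MD5.
        return member_md5s[0]
    # ASSUMPTION: combine the member MD5s by hashing them joined together.
    # The real formula is not documented in this article.
    return md5_hex("".join(member_md5s).encode("ascii"))
```

Under this sketch, expansive_hash([attachment]) and expansive_hash([email, attachment]) differ even though the attachment bytes are identical, so a loose file never matches the same file inside a larger email family.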

4 | Also during the First Deduplication Pass, any email files not already placed in the Dedupe Queue for a matching expansive hash are checked for matching Message IDs.

Again, we look for matches against what already exists in the database and other files currently being processed. If we find a matching set, we add it to the Dedupe Queue. If no match is found, the file is imported.
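
Steps 3 and 4 can be pictured as a single pass that routes each incoming item either to the Dedupe Queue or straight to import. The sketch below is a simplification under stated assumptions: the Item class and first_pass function are hypothetical, expansive hashes are taken as already computed, and within a batch the first copy of a matching set is treated as imported while later copies are queued.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical record for a loose file or an email family."""
    name: str
    expansive_hash: str
    message_id: Optional[str] = None  # None for loose files


def first_pass(incoming: Iterable[Item], existing: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Return (dedupe_queue, imported) for one import batch."""
    existing = list(existing)
    known_hashes = {item.expansive_hash for item in existing}
    known_ids = {item.message_id for item in existing if item.message_id}

    dedupe_queue: List[Item] = []
    imported: List[Item] = []
    for item in incoming:
        if item.expansive_hash in known_hashes:
            # Step 3: expansive hash matches the database or this batch.
            dedupe_queue.append(item)
        elif item.message_id and item.message_id in known_ids:
            # Step 4: emails that missed on hash are checked by Message ID.
            dedupe_queue.append(item)
        else:
            imported.append(item)
            known_hashes.add(item.expansive_hash)
            if item.message_id:
                known_ids.add(item.message_id)
    return dedupe_queue, imported
```

Items that land in the queue are not discarded yet; what happens to them depends on the Context Criteria setting.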

Note: Turn File Match Criteria ON for more aggressive deduplication using expansive hash and email Message ID, as described in steps 3 and 4. Turn it OFF to deduplicate more conservatively and treat only Content Hash matches as duplicates.

CONTEXT CRITERIA

5 | Next, we address the Dedupe Queue.

If Context is OFF, we take everything in the Dedupe Queue and merge any field values that differ between copies (e.g., the file_path of file A differs from that of file B). We keep the first copy of the file that entered the database and discard the other(s).
If Context is ON, we take each set of duplicates* and handle field value conflicts as follows:
1. If any field from our Conflict Field List does not match within a set of duplicates, we remove both files from the Dedupe Queue and import both.

  If there are no conflicts present in the Conflict fields:
2. We evaluate fields from our Merge Field List. If any Merge Field does not match within a set of duplicates*, we keep one copy of the file, merge the mismatched values into the respective field, and discard the last copy to enter the database.
3. We evaluate fields from our Ignore Field List. If any Ignore Field does not match within a set of duplicates, we leave those fields untouched and keep only the first copy of the file that entered the database.

*A set of duplicates can be two files within the same import OR a single file in an import that matches a file already in the database.
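
The Context Criteria rules above can be sketched as a function that resolves one pair of duplicates. This is an illustration under assumptions, not Nextpoint's code: resolve_pair is a hypothetical name, the two field sets are small illustrative subsets (file_path is the merge example given above; treating author and email_sent as conflict fields is assumed here), merged values are stored as a list, and the same merge fields are assumed to apply when Context is OFF.

```python
from typing import Any, Dict, List

# Illustrative subsets only; the assignment of fields to each set is an assumption.
CONFLICT_FIELDS = {"author", "email_sent"}
MERGE_FIELDS = {"file_path", "custodians"}
# Every other field is treated as an Ignore field: the first copy's value wins.


def resolve_pair(first: Dict[str, Any], later: Dict[str, Any], context_on: bool) -> List[Dict[str, Any]]:
    """Return the record(s) kept for a pair of duplicates.

    `first` is the copy that entered the database first.
    """
    if context_on and any(first.get(f) != later.get(f) for f in CONFLICT_FIELDS):
        # Rule 1: a conflict field differs, so both copies are imported.
        return [first, later]

    kept = dict(first)  # start from the first copy; ignore fields stay as-is
    for name in MERGE_FIELDS:
        a, b = first.get(name), later.get(name)
        if a != b:
            # Rule 2: keep both values of a mismatched merge field.
            kept[name] = [v for v in (a, b) if v is not None]
    # Rule 3: nothing to do for ignore fields; the later copy is discarded.
    return [kept]
```

For two copies that differ only in file_path and title, this sketch keeps one record whose file_path holds both paths and whose title comes from the first copy. If email_sent also differs, both records are kept when Context is ON and only one when it is OFF.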

Conflict Field List

author	document_last_author	email_sent
bcc	document_subject	email_subject
created_date_time	email_author	last_print_date
document_author	email_message_id	modified_date_time
document_date	email_reply_id	key_document

Merge Field List

cc	file_path	recipients
custodians	mailbox_file	root_folder
file_name	mailbox_path	shortcut

Ignore Field List

All User Generated Fields

Default Fields not on Merge/Conflict Lists

app_name	confidentiality_status	expansive_hash	privileged_status
batch_id	created_on	has_markups	redaction_notes
bates_end	delete_at_gmt	highlight_notes	relevancy_status
bates_range_end	document_title	id	supported_filetype
bates_range_start	document_properties	npcase_id	title
bates_stamped	document_type	number	updated_at_gmt
bates_start	email_received	page_notes	user_id
billing_size	encrypted	prefix	verified_page_count

Note: All deduplication is evaluated at the family level. If a loose file is added to your case and that same file is later added as part of a larger email family (or vice versa), no deduplication will occur.

Note: Turn Context Criteria ON for more conservative deduplication that takes field context into account. Turn it OFF to deduplicate more aggressively, relying on content hash matches alone.

Import Type	Deduplication Setting
Manual	Dedupe - OFF
Single Mailbox	Dedupe - ON, File Match - ON, Context - ON
Multiple Files	Dedupe - ON, File Match - ON, Context - ON
Production with load file	Dedupe - OFF