Near Dupe FAQ

Eric Sieker

January 28, 2026 19:17

Nextpoint’s Near Dupe feature helps identify documents that are highly similar but not exact duplicates, allowing reviewers to quickly recognize overlapping content while preserving meaningful differences. By grouping related documents and assigning similarity scores, Near Dupe analysis helps reduce review volume, improve organization, and streamline document review workflows.

How does our Near Dupe algorithm work?

Nextpoint’s near-duplicate identification tool analyzes document content to identify files that are highly similar but not exact copies. It works as follows:

- Content Analysis Using Shingling: Each document is broken into overlapping segments of text (shingles), creating a structured representation of its content. Documents that share many of these segments are more likely to be considered near-duplicates.
- Similarity Measurement with Jaccard Similarity: These document representations are compared using Jaccard similarity to calculate how much content overlaps between documents. This method identifies high similarity even when documents are not identical.
- Efficient and Accurate Processing: Near-duplicate detection is designed to scale efficiently across large document sets, helping reduce review volume without sacrificing accuracy.
- Configurable Sensitivity: Similarity thresholds may be adjusted to meet the needs of a specific matter, allowing near-duplicate results to be tuned based on review strategy.
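
For readers who want to see the idea in concrete terms, the following is a minimal Python sketch of shingling and Jaccard similarity in general. It is an illustration only, not Nextpoint's implementation: the function names, the word-based splitting, and the shingle size of three words are all assumptions made for this example.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word segments (shingles) in text.

    Assumption: shingles are built from lowercase, whitespace-separated
    words with k=3. The product's actual settings are not documented here.
    """
    words = text.lower().split()
    if len(words) < k:
        # Very short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(a & b) / len(a | b)


doc_a = "the parties agree to the terms of this contract"
doc_b = "the parties agree to the terms of this agreement"

# Each document yields 7 shingles; 6 are shared and 8 are distinct overall.
score = jaccard(shingles(doc_a), shingles(doc_b))
print(score)  # 0.75
```

In this example a single changed word removes one shared shingle from each document, so the two texts score 0.75 rather than 1.0. Longer documents with the same one-word difference would score closer to 1.0.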

Using Nextpoint’s near-duplicate feature helps streamline document review by reducing redundancy, improving organization, and saving both time and cost.

How are scores calculated? What do they mean?

Near-duplicate scores represent the degree of similarity between documents, based on Jaccard similarity calculations: the content two documents share, measured against the total distinct content across both. A higher score means more overlap. Each score is evaluated against a defined threshold to determine whether the documents qualify as near-duplicates.
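
As a rough illustration of how a score is compared against a threshold, here is a short Python sketch. The threshold value of 0.80, the 0-to-1 score scale, the "at or above" comparison, and the document IDs are all hypothetical; the actual threshold and score display for a matter may differ.

```python
# Hypothetical cutoff on a 0-to-1 scale; the real threshold is configurable.
THRESHOLD = 0.80


def is_near_duplicate(score, threshold=THRESHOLD):
    """Assumption: a score at or above the threshold qualifies."""
    return score >= threshold


# Made-up similarity scores of three documents against a reference document.
scores = {"DOC-002": 0.93, "DOC-003": 0.80, "DOC-004": 0.41}

near_dupes = [doc_id for doc_id, s in scores.items() if is_near_duplicate(s)]
print(near_dupes)  # ['DOC-002', 'DOC-003']
```

Raising the threshold narrows results to documents that are closer to identical, while lowering it groups documents with looser overlap.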

What is considered a Standard Near Duplicate Analysis? What makes it more custom?

A Standard Near Duplicate Analysis evaluates documents across the database to identify groups of files that share a high degree of similarity. Related documents are associated at the document level and assigned similarity scores to help guide review decisions.

Factors that may result in a more customized Near Duplicate Analysis include, but are not limited to:

- Targeted comparisons between specific data sets
  - For example, comparing Folder A to Folder B, where one data set serves as the primary reference
- Reviewing results in bulk versus document-by-document
  - For example, viewing related document groupings within a grid view
- Review strategies focused on reducing review volume
  - For example, setting aside documents produced to you when substantially similar versions have already been reviewed from your own data
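
To make the targeted-comparison idea more concrete, the sketch below compares every document in one set against a primary reference set and reports the best match for each. It is a simplified, brute-force illustration with hypothetical names (`match_against_primary`, the folder contents, the 0.70 threshold) and does not reflect how Nextpoint runs or scales a custom analysis.

```python
def shingles(text, k=3):
    """Set of overlapping k-word segments (assumed: lowercase words, k=3)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_against_primary(primary, candidates, threshold=0.70):
    """Map each candidate document to its best primary match, if any.

    Only matches scoring at or above the (hypothetical) threshold are kept.
    """
    primary_sets = {doc_id: shingles(text) for doc_id, text in primary.items()}
    matches = {}
    for cand_id, text in candidates.items():
        cand_set = shingles(text)
        best_id, best_score = None, 0.0
        for prim_id, prim_set in primary_sets.items():
            s = jaccard(cand_set, prim_set)
            if s > best_score:
                best_id, best_score = prim_id, s
        if best_id is not None and best_score >= threshold:
            matches[cand_id] = (best_id, best_score)
    return matches


# Folder A is the primary reference; Folder B is compared against it.
folder_a = {"A-1": "the parties agree to the terms of this contract"}
folder_b = {
    "B-1": "the parties agree to the terms of this agreement",
    "B-2": "please find attached the quarterly sales report",
}

matches = match_against_primary(folder_a, folder_b)
print(matches)  # {'B-1': ('A-1', 0.75)}
```

Here B-1 is flagged as substantially similar to A-1, so it could be set aside if A-1 has already been reviewed, while B-2 has no match in the primary set and would still need review.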
