Near Dupe FAQ




How does our Near Dupe Algorithm work?

Our near-duplicate identification tool works like an intelligent filter for your documents. It sifts through large volumes of content and flags documents that are almost, but not quite, the same. Think of it as finding twins in a crowd.

Smart Scanning with Shingling: Each document is broken into short, overlapping sequences of text called shingles, and the set of shingles acts as the document's fingerprint. Documents with similar text produce similar fingerprints.
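To make the fingerprint idea concrete, here is a minimal sketch of word-level shingling. It is an illustration only, not Nextpoint's implementation: the function name `shingles`, the choice of word-based shingles, and the shingle length `k=3` are all assumptions made for this example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text.

    Illustrative only: word-based shingles and k=3 are assumptions.
    """
    words = text.lower().split()
    # Slide a window of k words across the text, one word at a time.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


fingerprint_a = shingles("the quick brown fox jumps")
fingerprint_b = shingles("the quick brown fox leaps")

# The two sentences differ by one word, so most shingles are shared.
shared = fingerprint_a & fingerprint_b
print(sorted(shared))  # ['quick brown fox', 'the quick brown']
```

Because the fingerprints are plain sets, two documents that differ by a word or two still have most of their shingles in common.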

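Intelligent Matching with Jaccard Similarity: Just like matching fingerprints, our tool compares these patterns to find matches. Jaccard similarity measures how much two fingerprints overlap: the number of shingles the two documents share, divided by the total number of distinct shingles across both. Documents don't need to be exactly the same to be considered a match; they just need to be very close.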
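As a rough sketch of the comparison step, Jaccard similarity on two sets can be computed as below. The function name `jaccard` and the sample fingerprints are hypothetical, and returning 0.0 when both sets are empty is a convention chosen for this example, not something the tool is known to do.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    if not union:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints (sets of three-word shingles).
doc_a = {"the quick brown", "quick brown fox", "brown fox jumps"}
doc_b = {"the quick brown", "quick brown fox", "brown fox leaps"}

score = jaccard(doc_a, doc_b)
print(score)  # 2 shared shingles / 4 distinct shingles = 0.5
```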

Accuracy Meets Speed: Our technology ensures that this matching is done quickly and accurately, so you don't have to worry about duplicates cluttering your system or missing out on important, unique documents.

Customizable: We know that different cases have different needs. That's why our tool lets you decide how similar documents need to be in order to be considered duplicates.

With Nextpoint’s near-duplicate feature, your document review will be cleaner, more organized, and more efficient, saving time and money.

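How are scores calculated? What do they mean?

We score each pair of documents by their Jaccard similarity, which ranges from 0 (no shared content) to 1 (identical fingerprints), so a higher score means the two documents are more alike. A pair is treated as near-duplicates when its score meets the threshold. The threshold can be adjusted to change how sensitive the detection is: a higher threshold flags only very close matches, while a lower threshold also flags looser ones.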
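The effect of the threshold can be sketched in a few lines. Everything here is illustrative: the function name `flag_pairs`, the document IDs, and the scores are made-up values for the example, and the `>=` comparison is an assumption about how a threshold would be applied.

```python
def flag_pairs(scores, threshold):
    """Return (pair, score) items at or above threshold, best match first."""
    flagged = [(pair, s) for pair, s in scores.items() if s >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical similarity scores for three document pairs.
example_scores = {
    ("DOC-1", "DOC-2"): 0.92,
    ("DOC-1", "DOC-3"): 0.75,
    ("DOC-2", "DOC-3"): 0.10,
}

strict = flag_pairs(example_scores, threshold=0.9)  # only the closest pair
loose = flag_pairs(example_scores, threshold=0.7)   # also the 0.75 pair
print(len(strict), len(loose))  # 1 2
```

Raising the threshold shrinks the set of flagged pairs; lowering it surfaces more loosely related documents.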


What is considered a Standard Near Duplicate Analysis? What makes an analysis more custom?

A Standard Near Duplicate Analysis compares every document to every other document currently in the database and identifies those that are not identical but share a high degree of similarity. Then, at the document level, we cluster the documents that are similar to the one you are viewing and provide a similarity score for each.
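The two stages described above (score every pair, then gather the matches for the document being viewed) can be sketched as follows. This is a simplified, brute-force illustration under stated assumptions, not the production algorithm: the helper names, the two-word shingles, the 0.6 threshold, and the sample documents are all hypothetical.

```python
from itertools import combinations


def shingles(text, k=2):
    """Set of k-word shingles; word-based shingles are an assumption."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_pair_scores(docs, k=2):
    """Score every document against every other document exactly once."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    return {(a, b): jaccard(sets[a], sets[b])
            for a, b in combinations(sorted(sets), 2)}


def cluster_for(current_id, pair_scores, threshold):
    """Documents similar to current_id, with scores, most similar first."""
    matches = []
    for (a, b), score in pair_scores.items():
        if current_id in (a, b) and score >= threshold:
            matches.append((b if a == current_id else a, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


docs = {
    "A": "please review the attached contract today",
    "B": "please review the attached contract tomorrow",
    "C": "lunch menu for friday",
}
pair_scores = all_pair_scores(docs)
cluster = cluster_for("A", pair_scores, threshold=0.6)
print(cluster)  # document B, which shares 4 of 6 distinct shingles with A
```

Comparing every pair directly is fine for a handful of documents but grows quadratically with collection size, so this sketch should be read as an explanation of the idea rather than a description of how a large database is actually processed.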

Aspects which may make a Near Duplicate analysis more custom include, but are not limited to:

  • Comparison of two particular data sets
    • E.g. folder A to folder B where one document set is the “master”
  • Reviewing from a bulk perspective rather than document by document
    • E.g. I want to see clusters of documents in my grid view
  • Analyses targeted at minimizing your review (eliminating or setting aside documents)
    • E.g. “I don’t want to review documents from this document set produced to me if I’ve already reviewed them in my own native client data.”
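As one possible illustration of the custom scenarios above, the sketch below compares an incoming set against a "master" set and separates documents that closely match something already in the master set from those that still need review. The function and variable names, the two-word shingles, the 0.8 threshold, and the sample documents are all assumptions made for this example; a real custom analysis may be set up differently.

```python
def shingles(text, k=2):
    """Set of k-word shingles; word-based shingles are an assumption."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_by_master(master, incoming, threshold=0.8, k=2):
    """Split incoming docs into those matching the master set and the rest."""
    master_sets = {doc_id: shingles(text, k) for doc_id, text in master.items()}
    already_seen, needs_review = {}, []
    for doc_id, text in incoming.items():
        fingerprint = shingles(text, k)
        best_id, best_score = None, 0.0
        # Find the closest document in the master set.
        for master_id, master_set in master_sets.items():
            score = jaccard(fingerprint, master_set)
            if score > best_score:
                best_id, best_score = master_id, score
        if best_score >= threshold:
            already_seen[doc_id] = (best_id, best_score)
        else:
            needs_review.append(doc_id)
    return already_seen, needs_review


# Hypothetical data: folder A is the master, folder B was produced to you.
folder_a = {"M1": "please review the attached contract today"}
folder_b = {
    "P1": "please review the attached contract today",
    "P2": "lunch menu for friday",
}
already_seen, needs_review = split_by_master(folder_a, folder_b)
print(already_seen)   # {'P1': ('M1', 1.0)}
print(needs_review)   # ['P2']
```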