Data Mining Search Guide


Data Mining uses a powerful search syntax called dtSearch. Because dtSearch differs from the search syntax used in Nextpoint databases, some translation may be required.

Documents are searchable with scans after processing is completed. Consultation on terms and syntax is available for an additional hourly charge. 

Text Searches

As in a Nextpoint database, Data Mining uses Boolean searching for text searches. A Boolean search request consists of a group of words or phrases linked by connectors, such as AND and OR, that indicate the relationship between them.

Examples:

| Search Request | Meaning |
| --- | --- |
| apple and pear | both words must be present |
| apple or pear | either word can be present |
| apple w/5 pear | "apple" must occur within 5 words of "pear" |
| apple not w/12 pear | "apple" must occur, but not within 12 words of "pear" |
| apple and not pear | "apple" must be present and "pear" cannot be present |
| name contains smith | the field name must contain smith |
| apple w/5 xfirstword | apple must occur in the first five words of the document |
| apple w/5 xlastword | apple must occur in the last five words of the document |

Warning

Exact phrases should be set off by quotation marks. If you paste from a text editor that uses curly quotation marks, replace them with straight quotation marks so that your search runs correctly.

“test phrase” >> "test phrase"
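
If you prepare search terms in a script before pasting them into Data Mining, the replacement can be automated. The following is a minimal Python sketch, not part of Data Mining itself; the function name `straighten_quotes` is hypothetical.

```python
# Map curly (typographic) quotation marks to their straight equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(query: str) -> str:
    """Return the query with curly quotes replaced by straight quotes."""
    return query.translate(CURLY_TO_STRAIGHT)


print(straighten_quotes("\u201ctest phrase\u201d"))
```

Running your term list through a helper like this before pasting avoids having to fix each phrase by hand.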

If you use more than one connector (and, or, contains, etc.), use parentheses to indicate precisely what you want to search for. For example, apple and pear or orange could mean (apple and pear) or orange, or it could mean apple and (pear or orange). For best results, always enclose expressions with connectors in parentheses. Example:

(apple and pear) or (name contains smith)
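
To see why the grouping matters, the short Python sketch below evaluates both readings of apple and pear or orange against the same document. It only models the Boolean logic with a set of words; it is an illustration and does not call dtSearch, and the function names are hypothetical.

```python
def reading_one(words: set) -> bool:
    """(apple and pear) or orange"""
    return ("apple" in words and "pear" in words) or "orange" in words


def reading_two(words: set) -> bool:
    """apple and (pear or orange)"""
    return "apple" in words and ("pear" in words or "orange" in words)


# A document that mentions orange but neither apple nor pear.
doc = {"orange", "juice"}

print(reading_one(doc))  # True: the "or orange" branch is satisfied
print(reading_two(doc))  # False: apple is required and is missing
```

The same document is a hit under one reading and a miss under the other, which is why explicit parentheses are recommended.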

Field Searches

We allow searches on any metadata field we are able to extract from a file. To run a search on a specific field in Data Mining, use the following syntax:

(field_name contains "content in the field")

Examples:

(email_from contains ("jsmith@nextpoint.com" or "jjohnson@nextpoint.com"))

(family_date contains 10/25/2021)

(has_children contains true)
 
To search only the text of a file and exclude all metadata fields, use the following syntax:
//text contains ("search request" or term)
Example:
//text contains ("John Smith")
This will hit on files that mention the name "John Smith" but not ones that only contain "John Smith" in one of the metadata fields. 
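
When you need many field searches, it can help to generate them so that the quoting and parentheses stay balanced. The following Python sketch builds search strings in the field syntax shown above. The helper name `field_query` is hypothetical, and always quoting values is an assumption based on the syntax template, not a dtSearch requirement.

```python
def field_query(field: str, *values: str) -> str:
    """Build a field search such as (field contains ("a" or "b"))."""
    if not values:
        raise ValueError("at least one value is required")
    # Wrap each value in straight double quotes (assumes values contain none).
    quoted = [f'"{v}"' for v in values]
    if len(quoted) == 1:
        return f"({field} contains {quoted[0]})"
    return f"({field} contains ({' or '.join(quoted)}))"


print(field_query("email_from", "jsmith@nextpoint.com", "jjohnson@nextpoint.com"))
print(field_query("//text", "John Smith"))
```

The first call reproduces the email_from example above with both quotation marks closed; the second produces a text-only search.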
 
Fields can vary based on the data type. For emails, we generally extract the following field information (if available):
import_path
ancestry
mailbox_path
file_type
file_size
md5
s3_path
status
project_id
batch_id
searchability
author
content_type
creation_date
creator
subject
language
email_from
email_to
email_cc
email_bcc
email_subject
email_date
email_content_type
email_message_id
has_children
family_date
file_id
family_id
 
  • Any file extracted from another file (loose files from a zip, attachments from emails, etc.) will have an ancestry field.
  • Any file that has other files extracted from it will have a has_children field (value is true/false).
  • Any file directly or indirectly extracted from a mailbox will have a mailbox_path field.
  • Most (if not all) text-based files will have author, content_type, creation_date, and language fields.
  • Emails and their attachments will have a family_date field. This is like Nextpoint's "master_date" field. It uses the family parent's creation_date value, but it's inherited by all children in the family.
All files should have the following metadata fields which can be searched on:
import_path
file_type
file_size
md5
s3_path
status
searchability
project_id
batch_id
file_id
family_id

 

 

Special Characters

Search terms may include the following special characters:

| Character | Meaning |
| --- | --- |
| ? | matches any character |
| = | matches any single digit |
| * | matches any number of characters |
| % | fuzzy search |
| # | phonic search |
| ~ | stemming |
| & | synonym search |
| ~~ | numeric range |
| ## | regular expression |

 

Fuzzy Searching

Fuzzy searching will find a word even if it is misspelled.  For example, a fuzzy search for apple will find appple.  Fuzzy searching can be useful when you are searching text that may contain typographical errors (such as emails), or for text that has been scanned using optical character recognition (OCR). 

Add fuzziness selectively using the % character.  The number of % characters you add determines the number of differences dtSearch will ignore when searching for a word.  The position of the % characters determines how many letters at the start of the word have to match exactly.  Examples:

ba%nana

Word must begin with ba and have at most one difference between it and banana.

b%%anana

Word must begin with b and have at most two differences between it and banana.
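
The rule described above can be approximated in a few lines of Python. This is only a sketch of the rule as stated in this guide, not dtSearch's actual matching algorithm: it assumes that a "difference" means one inserted, deleted, or substituted character (Levenshtein edit distance), and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def fuzzy_matches(pattern: str, word: str) -> bool:
    """Check a word against a pattern such as ba%nana or b%%anana."""
    prefix = pattern.split("%", 1)[0]    # letters that must match exactly
    allowed = pattern.count("%")         # number of tolerated differences
    target = pattern.replace("%", "")    # the word being searched for
    return word.startswith(prefix) and edit_distance(word, target) <= allowed


print(fuzzy_matches("ba%nana", "bannana"))   # True: one extra letter
print(fuzzy_matches("ba%nana", "bonana"))    # False: does not begin with ba
print(fuzzy_matches("b%%anana", "bonana"))   # True: begins with b, one difference
```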

Phonic Searching

Phonic searching looks for a word that sounds like the word you are searching for and begins with the same letter.  For example, a phonic search for Smith will also find Smithe and Smythe.

To ask dtSearch to search for a word phonically, put a # in front of the word in your search request. Examples:

#smith

#johnson

Stemming

Stemming extends a search to cover grammatical variations on a word. For example, a search for fish would also find fishing. A search for applied would also find applying, applies, and apply.

To add stemming selectively, add a ~ at the end of words that you want stemmed in a search.  Example: apply~

The stemming rules included with dtSearch are designed to work with the English language.  

Synonym Searching

Synonym searching finds synonyms of a word that you include in a search request.  For example, a search for fast would also find quickly. You can enable synonym searching selectively by adding the & character after certain words in your request.  Example:

improve& w/5 search

Numeric Range Searching

A numeric range search is a search for any numbers that fall within a specified range.  To add a numeric range component to a search request, enter the upper and lower bounds of the search separated by ~~ like this:

apple w/5 12~~17

This request would find any document containing apple within 5 words of a number between 12 and 17.

Notes

  • A numeric range search includes the upper and lower bounds (so 12 and 17 would be retrieved in the above example).
  • Numeric range searches only work with integers greater than or equal to zero and less than 2,147,483,648.
  • For purposes of numeric range searching, decimal points and commas are treated as spaces and minus signs are ignored. For example, -123,456.78 would be interpreted as: 123 456 78 (three numbers).  
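
The notes above can be illustrated with a small Python sketch that shows how a string of text breaks down into numbers for range matching. It models only the rules listed in this guide (inclusive bounds, separators treated as spaces, leading minus signs ignored) and is not dtSearch's implementation; the function names are hypothetical.

```python
import re

MAX_VALUE = 2_147_483_648  # exclusive upper limit stated in the notes above


def numeric_tokens(text: str) -> list:
    """Return the integers that a numeric range search would see in text."""
    # Decimal points and commas are treated as spaces.
    cleaned = re.sub(r"[.,]", " ", text)
    numbers = []
    for token in cleaned.split():
        token = token.lstrip("-")  # a leading minus sign is ignored
        if re.fullmatch(r"[0-9]+", token) and int(token) < MAX_VALUE:
            numbers.append(int(token))
    return numbers


def in_range(text: str, low: int, high: int) -> bool:
    """True if any number in text falls within low~~high, bounds included."""
    return any(low <= n <= high for n in numeric_tokens(text))


print(numeric_tokens("-123,456.78"))   # [123, 456, 78]
print(in_range("apple 17", 12, 17))    # True: the upper bound is included
print(in_range("apple 18", 12, 17))    # False
```

One practical consequence: a search for 100~~200 would hit on the value 1,150.25, because it is read as the three numbers 1, 150, and 25.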

Regular Expressions

Regular expression searching provides a way to search for advanced combinations of characters. A regular expression included in a search request must be quoted and must begin with ##.

Examples:

Apple and "##199[0-9]"

This would hit on a file containing the word "Apple" and the number 1994 (or 1990, 1991...1999). 

Apple and "##19[0-9]+"

This would hit on a file containing the word "Apple" and the number 194 (or 1964 or 1983302002...).

Special characters in a regular expression are:

| Regular expression | Effect |
| --- | --- |
| . (period) | Matches any single character. Example: "sampl." would match "sample" or "samplZ" |
| \ | Treat next character literally. Example: in "\$100", the \ indicates that the pattern is "$100", not end-of-line ($) followed by "100" |
| [abc] | Brackets indicate a set of characters, one of which must be present. For example, "sampl[ae]" would match "sample" or "sampla", but not "samplx" |
| [a-z] | Inside brackets, a dash indicates a range of characters. For example, "[a-z]" matches any single lower-case letter. |
| [^a-z] | Indicates any character except the ones in the bracketed range. |
| .* (period, asterisk) | An asterisk means "0 or more" of something, so .* would match any string of characters, or nothing |
| .+ (period, plus) | A plus means "1 or more" of something, so .+ would match any string of at least one character |
| [a-z]+ | Any sequence of one or more lower-case letters. |

 

Limitations

  • A regular expression must match a single whole word. For example, a search for "##app.*ie" would not find "apple pie".
  • Only letters and numbers are searchable. Characters that are not indexed as letters are not searchable even using regular expressions, because the index does not contain any information about them.
  • Because the dtSearch index does not store information about line breaks, searches that include beginning-of-line or end-of-line regular expression criteria (^ and $) will not work.
  • No case or other conversion is done on regular expressions, so a regular expression must match the case of the information stored in the index.  If an index is case-insensitive, all letters in the regular expression must be lower-case.  If a character is not searchable in the index, then it cannot be included as a searchable character in the regular expression.  Non-searchable characters in a regular expression are not ignored as they are in other search expressions.
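
The whole-word and case limitations are easier to see with a rough model. The Python sketch below splits text into lower-case words and tests the pattern against each word on its own. It uses Python's re module, which is not the regular expression engine used by dtSearch, and it assumes a case-insensitive index that stores words in lower case; the function name `regex_hits` is hypothetical.

```python
import re


def regex_hits(pattern: str, text: str) -> list:
    """Return the indexed words that the pattern matches in full."""
    # Model the index: only letters and digits are kept, stored in lower case.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # The pattern must match one whole word; it cannot span two words.
    return [w for w in words if re.fullmatch(pattern, w)]


print(regex_hits("app.*ie", "apple pie"))        # []: no single word matches
print(regex_hits("199[0-9]", "Apple in 1994"))   # ['1994']
print(regex_hits("Apple", "Apple in 1994"))      # []: the index holds "apple"
print(regex_hits("apple", "Apple in 1994"))      # ['apple']
```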

Performance

A regular expression is like the * wildcard character in its effect on search speed: the closer to the front of a word the expression is, the more it will slow searching. "appl.*" will be nearly as fast as "apple", while ".*pple" will be much slower.

Searching for numbers

The = wildcard, which matches a single digit, is faster than regular expressions for matching patterns of numbers. For example, to search for a social security number, you could use "=== == ====" instead of the equivalent regular expression.
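
To check a digit pattern against sample text before running it in Data Mining, you could translate the wildcards into ordinary regular expressions. The Python sketch below does this for the =, ?, and * characters and matches the pattern against consecutive words. It assumes that punctuation such as hyphens separates words in the index, which may not hold for every index; the function names are hypothetical.

```python
import re


def wildcard_to_regex(word: str) -> str:
    """Translate one wildcard word into a regular expression string."""
    parts = []
    for ch in word:
        if ch == "=":
            parts.append("[0-9]")   # any single digit
        elif ch == "?":
            parts.append(".")       # any character
        elif ch == "*":
            parts.append(".*")      # any number of characters
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def phrase_matches(pattern: str, text: str) -> bool:
    """True if consecutive words in text match each word of the pattern."""
    regexes = [re.compile(wildcard_to_regex(w)) for w in pattern.split()]
    words = re.findall(r"[A-Za-z0-9]+", text)  # assumed word boundaries
    n = len(regexes)
    if n == 0:
        return False
    return any(
        all(rx.fullmatch(w) for rx, w in zip(regexes, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


print(phrase_matches("=== == ====", "SSN 123-45-6789 on file"))  # True
print(phrase_matches("=== == ====", "call 1234 56 789"))         # False
```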

For additional information about dtSearch syntax, review the following documentation (from which this search guide was adapted): https://support.dtsearch.com/webhelp/dtsearch/search_requests_overview.htm 

 

Next up: Data Mining - Exporting Reports and Data 

Or view one of the other support resources in the Data Mining series:

Data Mining – Getting Started

Data Mining - Project Dashboard

Data Mining – Uploading and Importing Data

Data Mining - Searching and Slicing Data

Data Mining - Glossary
