Searches and Slices
The primary function of Data Mining is to run search terms on large sets of documents. Both Searches and Slices let you narrow your data set so that you review only the most relevant and useful data. Searches focus primarily on the terms found in the text of your data, while Slices let you combine search groups and restrict your final sets by the properties of your data. Here is how "Search" works in Data Mining.
Begin by clicking the "Search" tab.
Search Builder
Building your search:
- In the search builder input field (1), you can manually enter your searches or paste from your external documentation into the field.
- In the search builder, each line item is equal to one search. If you would like to start a new search within the input field simply press enter/return on your keyboard.
- Note: We strongly suggest running searches in sets together, rather than individually when possible. This will be the most time and cost efficient way to search.
- For example, it is considerably faster to run 100 searches together than it is to run them one at a time.
- You can also work out the syntax in any outside text editor and copy/paste them into the search builder.
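If you prepare your searches outside the app, a small script can assemble them into the one-search-per-line format described above. This is only a sketch: `build_search_batch` is a hypothetical helper, not part of Data Mining, and its output is plain text that you would still paste into the search builder by hand.

```python
# Hypothetical helper: the search builder is a web form, so this script only
# prepares the text that you paste into the input field.
def build_search_batch(terms):
    """Return one search per line, skipping blank and duplicate entries."""
    seen = set()
    lines = []
    for term in terms:
        # Collapse stray whitespace and line breaks so each search stays on one line.
        cleaned = " ".join(term.split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    return "\n".join(lines)


batch = build_search_batch([
    "apple w/5 pear",
    '"test phrase" OR single OR word',
    "",                 # blank entries are dropped
    "apple w/5 pear",   # duplicates are dropped
])
print(batch)
```

Removing duplicates before pasting keeps the Saved Searches table from filling up with repeated rows.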
Currently, only the following metadata fields can be searched via the search page:
mailbox_path
author
subject
email_from
email_to
email_cc
email_bcc
email_subject
You can filter all other available metadata and file properties via the slice builder when creating your slice.
Assign to a search group:
- While you can allow searches to run individually, it is usually best to assign them to a search group. This way you can run reports on the set or include them in later slices and exports.
- To add to a search group, click on the dropdown menu below "Add to Search Group" (2) and select an existing group or choose the "Create New" option at the top of the menu.
Save Searches:
- You can also put a date restriction on individual searches at this point or you can add date restrictions in the "slice" step. If you restrict dates in this step, the date range will be connected with each individual search term. If you want to be able to adjust this date restriction on later iterations of your project, it is best to restrict your dates in the "slice" section.
- If you choose to include a date restriction, note that searches are inclusive of the input dates (so 10/29/2022 to 10/31/2022 would include 3 total days).
- Once your searches are in the input box and you have grouped them as you want, click on the "Save Searches" button (3). Then, the searches will transfer to the "Saved Searches" tab (4).
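The inclusive date behavior described above can be checked with a short calculation. This is a minimal sketch using Python's standard `datetime` module; the function names are illustrative and are not part of Data Mining.

```python
from datetime import date


def inclusive_day_count(start, end):
    """Number of calendar days covered when both endpoints are included."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return (end - start).days + 1


def in_date_restriction(day, start, end):
    """True when `day` falls inside the range, counting both endpoints."""
    return start <= day <= end


start, end = date(2022, 10, 29), date(2022, 10, 31)
print(inclusive_day_count(start, end))                      # 3
print(in_date_restriction(date(2022, 10, 31), start, end))  # True
```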
Saved Searches
The Saved Search table showcases a list of all of the searches you have created within this project.
- Here you can see how each line item within the search builder (as mentioned in the previous step) appears as its own row with related data, conditions applied, and slice assigned. Clicking on the search term will show the term along with any date restrictions applied to it.
- Each line item includes specific data relating to that search including file and family count, uniqueness and search proportion. See Glossary.
- "0" results means that there were no hits for that search.
- Empty results means that the search has not run yet. Click the "Calculate Results" button (2) to view the results of the search.
- The "Calculate Results" button refreshes both new and old searches with updated hit counts based on all documents currently in the database. If you have multiple searches (or even multiple slices) to run you should add them all to the search table before clicking the "Calculate Results" button.
- "Error calculating results" means that an internal error occurred on this search. Reach out to the support team to identify the issue and possible next steps. If you would like to retry these searches, copy them to the builder, edit as needed, and run them again.
- If you select a search group in the "Search Groups" column of the search page, the Search Group Details modal opens and gives you more detailed insight into your search group, including its search terms. In this modal, you can also remove searches from a search group.
- You can review, compare, and contrast these search groups and the data that they yield for context as to what you may want to export later on.
- You can add new search terms to a group on the fly by selecting the "+Add" button next to unassigned search terms.
- The "Copy to Builder" button (4) will copy selected terms to the builder where they can be edited and rerun with modified conditions. This button will only be active when one or more terms is selected.
- To clear out your list of saved searches, you can select the ones you want to remove from the table and click on the "Archive/Unarchive" button (5). This action is reversible and you can review archived searches at the bottom of your saved searches chart (they will have "True" in the "Archived" column of the chart).
- Search hit counts only refresh after the "Calculate Results" button is pushed. The "Last Updated" date/time (6) lets the user know the last time that the searches were updated. If you import new document sets and want prior search sets to include the new documents, you need to click the "Calculate Results" button to recalculate the results.
Next up: Data Mining Search Guide
Or view one of the other support resources in the Data Mining series:
Data Mining - Project Dashboard
Data Mining – Uploading and Importing Data
Data Mining - Searches and Search Groups
In Data Mining, your data can not only be searched for keywords, but it can also be sliced to combine complex file filters and parameters with your searches so that you can pull out very specific data sets for your review.
Slice Builder
- In the "Slice" tab, you can access a number of options to build your slice:
- Name your slice (1).
- Select from the "Slice Field" options below the editor (2).
- Select the specific criterion within that slice field to add to your slice (3).
- Apply a date restriction (4) filter (if applicable). Note that the date restriction is slice field specific. If you apply a date restriction to one part of your slice but use "OR" connectors, it is possible that your resulting data set will include files outside your date range.
- Click on the "Add to Slice" button at the bottom of your "Slice Fields" chart.
- Add additional slice field options as necessary.
- Adjust your connectors (6) to fit the specific needs of your data set. Note that all connectors inside each grouping must be identical. Mixing AND, OR, and NOT connectors within the same slice grouping separated by parentheses could result in unpredictable and inconsistent results.
- When your slice is ready, click on the "Create" button to slice your data set.
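To see why a date restriction on one slice field does not limit the whole slice when "OR" connectors are used, consider the following sketch. The file records and the two filter functions are made-up illustrations (only the field names `file_type`, `creation_date`, and `custodian` come from this guide), and the code models the logic rather than the slice builder itself.

```python
from datetime import date

# Made-up file records for illustration only.
files = [
    {"file_id": 1, "file_type": "csv", "custodian": "Someone Else",
     "creation_date": date(2022, 10, 30)},
    {"file_id": 2, "file_type": "csv", "custodian": "Someone Else",
     "creation_date": date(2021, 1, 5)},
    {"file_id": 3, "file_type": "pdf", "custodian": "Benjamin Rogers",
     "creation_date": date(2021, 1, 5)},
]
start, end = date(2022, 10, 29), date(2022, 10, 31)


def csv_in_range(f):
    """Slice field with a date restriction attached to it."""
    return f["file_type"] == "csv" and start <= f["creation_date"] <= end


def rogers(f):
    """Slice field without any date restriction."""
    return f["custodian"] == "Benjamin Rogers"


# Joining the two fields with OR: the date restriction only guards the first field.
or_hits = [f["file_id"] for f in files if csv_in_range(f) or rogers(f)]
outside_range = [
    f["file_id"] for f in files
    if (csv_in_range(f) or rogers(f)) and not start <= f["creation_date"] <= end
]
print(or_hits)        # [1, 3]
print(outside_range)  # [3]
```

File 3 is dated outside the range but still lands in the result, because it qualifies through the unrestricted field.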
Saved Slices
Once you create a slice, it saves to your "Saved Slices" (1) and then calculates your results. The time this processing takes depends on the number of documents in your database, but when it finishes running, you can view the file count of specific hits and the family count of your hits (2) including their full families. If you want to review the specific syntax of a slice, click on the name of the slice (3).
Slice Troubleshooting
As your slices become increasingly complex, the connector restrictions described above may limit your ability to isolate a specific data set. In these cases (specifically if you need to mix connectors but cannot adjust the slice builder to fit your needs), try breaking your slice into separate slices. For example, let's say you want to include all hits from a specific search group OR all spreadsheets from a specific custodian in the same slice. You may want your slice to look something like:
(Search Group:"2-9-24 set") OR ((file_type:xl* OR file_type:csv) AND (custodian:"Benjamin Rogers"))
The current iteration of the slice builder cannot use double parentheses or different connectors within the same group. To run this search you need to create 2 slices. First slice:
(file_type:xl* OR file_type:csv) AND (custodian:"Benjamin Rogers")
Then run that slice OR your search group to create your final slice:
(Search Group:"2-9-24 set") OR (Slice:"Rogers Spreadsheets")
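The two-step approach returns the same files as the nested expression would, which you can confirm by modeling each filter as a set of file IDs. The IDs below are invented for illustration; only the logic is the point.

```python
# Invented file IDs standing in for the hits of each filter.
search_group_hits = {1, 2, 3}   # Search Group:"2-9-24 set"
xl_files = {3, 4}               # file_type:xl*
csv_files = {5, 6}              # file_type:csv
rogers_files = {4, 5, 9}        # custodian:"Benjamin Rogers"

# The nested expression the slice builder cannot express directly.
desired = search_group_hits | ((xl_files | csv_files) & rogers_files)

# Step 1: the first slice, saved here under the name "Rogers Spreadsheets".
rogers_spreadsheets = (xl_files | csv_files) & rogers_files

# Step 2: the final slice combines the search group with the saved slice.
final_slice = search_group_hits | rogers_spreadsheets

print(sorted(final_slice))       # [1, 2, 3, 4, 5]
print(final_slice == desired)    # True
```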
If you need help creating searches or slices to meet your needs, reach out to support@nextpoint.com for support.
Data Mining - Data Slices
Data Mining uses a powerful search syntax called dtSearch. There are differences between dtSearch and the search syntax employed in Nextpoint databases, so some translation may be required.
Documents are searchable with scans after processing is completed. Consultation on terms and syntax is available for an additional hourly charge.
Like in a Nextpoint database, Data Mining uses boolean searching for text searches. A "boolean" search request consists of a group of words or phrases linked by connectors such as AND and OR that indicate the relationship between them.
Examples:
Search Request | Meaning |
apple and pear | both words must be present |
apple or pear | either word can be present |
apple w/5 pear | "apple" must occur within 5 words of "pear" |
apple not w/12 pear | "apple" must occur, but not within 12 words of "pear" |
apple and not pear | "apple" must be present and "pear" cannot be present |
name contains smith | the field name must contain smith |
apple w/5 xfirstword | apple must occur in the first five words of the document |
apple w/5 xlastword | apple must occur in the last five words of the document |
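As a rough illustration of the proximity connectors, the sketch below checks word positions in a piece of text. It is a toy model and not dtSearch itself: the tokenizer, the function names, and the way the distance is counted (difference in word positions) are all assumptions, and the real engine may count boundaries differently.

```python
import re


def word_positions(text, word):
    """Positions of `word` in the text, using a simple lower-cased tokenizer."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [i for i, w in enumerate(words) if w == word]


def within(text, first, second, distance):
    """Toy version of `first w/N second`."""
    return any(
        abs(i - j) <= distance
        for i in word_positions(text, first)
        for j in word_positions(text, second)
    )


def not_within(text, first, second, distance):
    """Toy version of `first not w/N second`: some occurrence of `first`
    has no occurrence of `second` within the distance."""
    seconds = word_positions(text, second)
    return any(
        all(abs(i - j) > distance for j in seconds)
        for i in word_positions(text, first)
    )


near = "an apple beside a pear"
far = "apple " + "filler " * 10 + "pear"
print(within(near, "apple", "pear", 5))      # True
print(within(far, "apple", "pear", 5))       # False
print(not_within(far, "apple", "pear", 5))   # True
```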
Warning
Exact phrases should be set off by quotation marks.
"test phrase" OR single OR word
If you use more than one connector (and, or, contains, etc.), you should use parentheses to indicate precisely what you want to search for. For example, apple and pear or orange could mean (apple and pear) or orange, or it could mean apple and (pear or orange). For best results, always enclose expressions with connectors in parentheses. Example:
(apple and pear) or (name contains smith)
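The two readings of apple and pear or orange really do return different documents. The sketch below enumerates every combination of which words a document contains and lists the cases where the two groupings disagree. It uses plain Python booleans and does not call dtSearch.

```python
from itertools import product


def grouping_a(apple, pear, orange):
    """(apple and pear) or orange"""
    return (apple and pear) or orange


def grouping_b(apple, pear, orange):
    """apple and (pear or orange)"""
    return apple and (pear or orange)


# Each tuple says whether a document contains (apple, pear, orange).
disagreements = [
    combo
    for combo in product([False, True], repeat=3)
    if grouping_a(*combo) != grouping_b(*combo)
]
print(disagreements)
# [(False, False, True), (False, True, True)]
```

In both disagreeing cases the document contains orange but not apple: the first grouping returns it, the second does not.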
Field Filtering (in the Slice Section)
The following metadata fields can be used to filter for hits, but they must be applied in the "Slices" section of your Data Mining app. Fields can vary based on the data type. For emails, we generally extract the following field information (if available):
import_path
ancestry
file_type
file_size
md5
s3_path
status
project_id
batch_id
searchability
content_type
creation_date
creator
language
email_date
email_content_type
email_message_id
has_children
family_date
file_id
family_id
- Any file extracted from another file (loose files from a zip, attachments from emails, etc.) will have an ancestry field
- Any file that has other files extracted from it will have a has_children field (value is true/false)
- Any file directly or indirectly extracted from a mailbox will have a mailbox_path field
- Most (if not all) text-based files will have author, content_type, creation_date, and language fields.
- Emails and their attachments will have a family_date field. This is like Nextpoint's "master_date" field. It uses the family parent's creation_date value, but it's inherited by all children in the family.
All files should have the following metadata fields which can be searched on:
import_path
file_type
file_size
md5
s3_path
status
searchability
project_id
batch_id
file_id
family_id
Search terms may include the following special characters:
Character | Meaning |
? | matches any single character |
= | matches any single digit |
* | matches any number of characters |
% | fuzzy search |
# | phonic search |
~ | stemming |
& | synonym search |
~~ | numeric range |
## | regular expression |
Fuzzy Searching
Fuzzy searching will find a word even if it is misspelled. For example, a fuzzy search for apple will find appple. Fuzzy searching can be useful when you are searching text that may contain typographical errors (such as emails), or for text that has been scanned using optical character recognition (OCR).
Add fuzziness selectively using the % character. The number of % characters you add determines the number of differences dtSearch will ignore when searching for a word. The position of the % characters determines how many letters at the start of the word have to match exactly. Examples:
ba%nana
Word must begin with ba and have at most one difference between it and banana.
b%%anana
Word must begin with b and have at most two differences between it and banana.
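The rules above can be modeled with a short script. This is a sketch under stated assumptions, not dtSearch's actual algorithm: it treats a "difference" as one single-character insertion, deletion, or substitution (edit distance), and it assumes the % characters sit together in the term. `fuzzy_match` and `edit_distance` are illustrative names.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_match(term, candidate):
    """Toy model of a term such as ba%nana (assumption: differences = edit distance)."""
    max_differences = term.count("%")
    exact_prefix = term.split("%", 1)[0]   # letters before the first % must match
    word = term.replace("%", "")
    return (
        candidate.startswith(exact_prefix)
        and edit_distance(candidate, word) <= max_differences
    )


print(fuzzy_match("ba%nana", "bannana"))    # True: one difference, starts with "ba"
print(fuzzy_match("ba%nana", "bonana"))     # False: does not start with "ba"
print(fuzzy_match("b%%anana", "bonana"))    # True: starts with "b", one difference
print(fuzzy_match("ba%nana", "bandanna"))   # False: two differences
```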
Phonic Searching
Phonic searching looks for a word that sounds like the word you are searching for and begins with the same letter. For example, a phonic search for Smith will also find Smithe and Smythe.
To ask dtSearch to search for a word phonically, put a # in front of the word in your search request. Examples:
#smith
#johnson
Stemming
Stemming extends a search to cover grammatical variations on a word. For example, a search for fish would also find fishing. A search for applied would also find applying, applies, and apply.
To add stemming selectively, add a ~ at the end of words that you want stemmed in a search. Example: apply~
The stemming rules included with dtSearch are designed to work with the English language.
Synonym Searching
Synonym searching finds synonyms of a word that you include in a search request. For example, a search for fast would also find quickly. You can enable synonym searching selectively by adding the & character after certain words in your request. Example:
improve& w/5 search
Numeric Range Searching
A numeric range search is a search for any numbers that fall within a specified range. To add a numeric range component to a search request, enter the upper and lower bounds of the search separated by ~~ like this:
apple w/5 12~~17
This request would find any document containing apple within 5 words of a number between 12 and 17.
Notes
- A numeric range search includes the upper and lower bounds (so 12 and 17 would be retrieved in the above example).
- Numeric range searches only work with integers greater than or equal to zero, and less than 2,147,483,648.
- For purposes of numeric range searching, decimal points and commas are treated as spaces and minus signs are ignored. For example, -123,456.78 would be interpreted as: 123 456 78 (three numbers).
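The notes above can be sketched in a few lines. This models only the rules stated in this guide (separators become spaces, a leading minus sign is dropped, bounds are inclusive); how dtSearch handles other punctuation is not covered here, so tokens such as 12-17 are simply skipped in this sketch. The function names are illustrative.

```python
MAX_EXCLUSIVE = 2_147_483_648  # upper limit stated in the notes above


def numeric_tokens(text):
    """Split text into the integers a numeric range search would see."""
    cleaned = text.replace(",", " ").replace(".", " ")
    numbers = []
    for token in cleaned.split():
        token = token.lstrip("-")  # minus signs are ignored
        if token.isascii() and token.isdigit():
            numbers.append(int(token))
    return numbers


def in_numeric_range(text, low, high):
    """Toy version of low~~high: True if any number falls within the bounds."""
    if not 0 <= low <= high < MAX_EXCLUSIVE:
        raise ValueError("bounds must satisfy 0 <= low <= high < 2,147,483,648")
    return any(low <= n <= high for n in numeric_tokens(text))


print(numeric_tokens("-123,456.78"))             # [123, 456, 78]
print(in_numeric_range("apple 17", 12, 17))      # True, bounds are inclusive
print(in_numeric_range("apple 18", 12, 17))      # False
```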
Regular Expressions
Regular expression searching provides a way to search for advanced combinations of characters. A regular expression included in a search request must be quoted and must begin with ##.
Examples:
Apple and "##199[0-9]"
This would hit on a file containing the word "Apple" and the number 1994 (or 1990, 1991...1999).
Apple and "##19[0-9]+"
This would hit on a file containing the word "Apple" and the number 194 (or 1964 or 1983302002...).
Special characters in a regular expression are:
Regular expression | Effect |
. (period) | Matches any single character. Example: "sampl." would match "sample" or "samplZ" |
\ | Treat next character literally. Example: in "\$100", the \ indicates that the pattern is "$100", not end-of-line ($) followed by "100" |
[abc] | Brackets indicate a set of characters, one of which must be present. For example, "sampl[ae]" would match "sample" or "sampla", but not "samplx" |
[a-z] | Inside brackets, a dash indicates a range of characters. For example, "[a-z]" matches any single lower-case letter. |
[^a-z] | Indicates any character except the ones in the bracketed range. |
.* (period, asterisk) | An asterisk means "0 or more" of something, so .* would match any string of characters, or nothing |
.+ (period, plus) | A plus means "1 or more" of something, so .+ would match any string of at least one character |
[a-z]+ | Any sequence of one or more lower-case letters. |
Limitations
- A regular expression must match a single whole word. For example, a search for "##app.*ie" would not find "apple pie".
- Only letters and numbers are searchable. Characters that are not indexed as letters are not searchable even using regular expressions, because the index does not contain any information about them.
- Because the dtSearch index does not store information about line breaks, searches that include beginning-of-line or end-of-line regular expression criteria (^ and $) will not work.
- No case or other conversion is done on regular expressions, so a regular expression must match the case of the information stored in the index. If an index is case-insensitive, all letters in the regular expression must be lower-case. If a character is not searchable in the index, then it cannot be included as a searchable character in the regular expression. Non-searchable characters in a regular expression are not ignored as they are in other search expressions.
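The whole-word and case limitations can be illustrated with Python's `re` module. This is only an approximation: Python's regular expression engine is not the one dtSearch uses, and the tokenizer below (lower-cased letters and digits, mimicking a case-insensitive index) is an assumption. The pattern is passed without the leading ## that dtSearch requires.

```python
import re


def regex_word_hits(pattern, text):
    """Return indexed words that the pattern matches as a single whole word."""
    # Assumption: a case-insensitive index stores lower-cased letters and digits only.
    indexed_words = re.findall(r"[a-z0-9]+", text.lower())
    compiled = re.compile(pattern)
    return [w for w in indexed_words if compiled.fullmatch(w)]


print(regex_word_hits("app.*ie", "apple pie"))              # [] (spans two words)
print(regex_word_hits("199[0-9]", "Apple shipped in 1994")) # ['1994']
print(regex_word_hits("19[0-9]+", "194 1964 19 2019"))      # ['194', '1964']
print(regex_word_hits("Appl.", "Apple"))                    # [] (pattern must be lower-case)
print(regex_word_hits("appl.", "Apple"))                    # ['apple']
```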
Performance
A regular expression is like the * wildcard character in its effect on search speed: the closer to the front of a word the expression is, the more it will slow searching. "appl.*" will be nearly as fast as "apple", while ".*pple" will be much slower.
Searching for numbers
The = wildcard, which matches a single digit, is faster than regular expressions for matching patterns of numbers. For example, to search for a social security number, you could use "=== == ====" instead of the equivalent regular expression.
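To show what the = wildcard pattern is doing, the sketch below translates each word of a wildcard phrase into a Python regular expression and looks for consecutive words that match. It is an illustration rather than dtSearch behavior: the tokenizer is an assumption (how hyphens and other punctuation are indexed depends on the index settings), and the function names are made up.

```python
import re


def wildcard_to_regex(word):
    """Translate the ?, = and * wildcards of one search word into a Python regex."""
    mapping = {"=": r"\d", "?": ".", "*": ".*"}
    return "".join(mapping.get(ch, re.escape(ch)) for ch in word)


def phrase_hits(phrase, text):
    """Find runs of consecutive words that match each word of the phrase."""
    patterns = [re.compile(wildcard_to_regex(w)) for w in phrase.split()]
    if not patterns:
        return []
    # Assumption: words are runs of letters and digits; punctuation separates them.
    words = re.findall(r"[A-Za-z0-9]+", text)
    hits = []
    for i in range(len(words) - len(patterns) + 1):
        window = words[i:i + len(patterns)]
        if all(p.fullmatch(w) for p, w in zip(patterns, window)):
            hits.append(" ".join(window))
    return hits


print(wildcard_to_regex("==="))                               # \d\d\d
print(phrase_hits("=== == ====", "SSN 123 45 6789 on file"))  # ['123 45 6789']
print(phrase_hits("=== == ====", "call 1234 56 7890"))        # []
```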
For additional information about dtSearch syntax, review the following documentation (from which this search guide was adapted): https://support.dtsearch.com/webhelp/dtsearch/search_requests_overview.htm
Next up: Data Mining - Exporting Reports and Data