The first step in mining your data is to upload and import it into the Data Mining tool.
Step 1: Getting Started with Imports
- First-Time User: As a first-time user, you will be prompted to import your data immediately from the Dashboard.
- Returning User: To import new data, navigate to the Import tab and select “New Import.”
Step 2: Naming your Import and Selecting Your Source for Data Mining
To import files directly into a Data Mining project, users can add any outside S3 source, including their Nextpoint database(s). Once a location has been added and successfully verified, the source is saved and available for all future imports. Additionally, each Data Mining project comes with a pre-created Data Mining Repository, which can be used directly to house source data.
Naming Your Import and Selecting an Existing Source
- Name your import. This name will appear later on your import batch list, so make the name clear and unique to this import data set.
- Select the source of the data (your Data Mining S3 repository, a Nextpoint File Room, or an external S3 repository). If you need to add a new source location, see “Adding a New Source Location” below.
- Click the "Next" button.
Adding a New Source Location
To add a non-Data Mining S3 location to your Data Mining project:
- Click on “Add New” at the bottom of a new import window.
- Name your new s3 location (e.g. Hoven v. Enron Discovery Database).
- Copy and paste your AWS Access Key ID.
- Copy and paste your Secret Access Key.
- Copy and paste your File Room Path. In a Nextpoint database, all of these can be found in the “Settings” tab under “Import,” in the “File Room” section. For more information about accessing your AWS keys, visit this support article.
- Click “Add” and confirm that the system was able to verify your credentials. You should see a green “Success” checkmark next to the new source you added. (A sketch for pre-checking these credentials yourself follows below.)
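Verification happens on Nextpoint's side, but if you would like to sanity-check an access key pair and path before adding the source, a minimal boto3 sketch such as the following can confirm the keys can read the location. The key and path values here are placeholders, not real credentials.

```python
import boto3
from botocore.exceptions import ClientError

# Placeholder values; use the keys and File Room Path from your
# Nextpoint database's Settings > Import > File Room section.
ACCESS_KEY_ID = "AKIA..."                      # hypothetical Access Key ID
SECRET_ACCESS_KEY = "..."                      # hypothetical Secret Access Key
FILE_ROOM_PATH = "example-bucket/file_room/"   # hypothetical "bucket/prefix" path

bucket, _, prefix = FILE_ROOM_PATH.partition("/")

s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
)

try:
    # Listing a single object under the prefix is enough to prove
    # the credentials can read the location.
    s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    print("Success: credentials can read", FILE_ROOM_PATH)
except ClientError as err:
    print("Verification failed:", err.response["Error"]["Code"])
```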
If you select your Data Mining S3 repository, a tool tip on the import screen labeled “How do I transfer files into my Data Mining repository?” provides the AWS Access Key ID, Secret Access Key, and File Path for that repository. Using these keys, you can use any standard S3 transfer tool to move your data into the repository.
If uploading the data yourself is not possible, reach out to your client success manager for other options.
Note
Linking data from your file room to the Data Mining tool will create a copy of the data in the Data Mining repository. If you are uploading new data, we recommend placing it directly into your Data Mining repository.
DM S3 Repository
If you choose to import directly from your Data Mining S3 repository, the same tool tip, “How do I transfer files into my Data Mining repository?”, will guide you through pulling your data into your DM repository. The required AWS Access Key ID, Secret Access Key, and File Path are provided there for input into your external sources.
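As one example of such a transfer, here is a minimal boto3 sketch that uploads a local file into the repository. The credentials, repository path, and file name are placeholders; substitute the values shown in the tool tip.

```python
import boto3

# Placeholder values; substitute the Access Key ID, Secret Access Key,
# and File Path shown in the tool tip.
s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

REPO_PATH = "dm-repo-bucket/project-123/"   # hypothetical "bucket/prefix" path
bucket, _, prefix = REPO_PATH.partition("/")

# Upload a local PST into the Data Mining repository.
s3.upload_file(
    Filename="collection/mailbox.pst",      # local file to upload (hypothetical)
    Bucket=bucket,
    Key=prefix + "mailbox.pst",
)
print("Uploaded mailbox.pst to", REPO_PATH)
```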
Step 3: Selecting Data for Import
- Review the selections from the previous step, such as the import name and the selected location/file path; the data within the selected source can be seen in the table below them.
- Optionally, assign one or more custodians to the import (if applicable). At this time, custodians added to a batch are assigned to all files in the batch, so a custodian cannot be assigned to only part of a batch.
- Select the folder or file for import. At this time, only a single folder or file is eligible for import in one batch.
- Click “Import.”
- The import list will show the batch as “processing” until the batch is “complete.” On rare occasions a batch will show as “failed,” at which point you should contact the support team to identify the issue with the import.
- To download a CSV of your import batch list, click “Export CSV” at the bottom of the batch list.
- If you click on the three-dot menu next to any import batch, you have the option to “Edit Import Details” or “Download CSV Error Report.” The error report will list the files that encountered errors during the import. (A sketch for summarizing this report follows below.)
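Once downloaded, the batch list or error report is an ordinary CSV and can be summarized with a few lines of Python. A minimal sketch follows; the column names used here are assumptions, so check the header row of your actual export.

```python
import csv

# Hypothetical sketch: summarize a downloaded CSV error report.
# The column names below are assumptions; check your export's header row.
with open("error_report.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} files reported errors")
for row in rows[:10]:
    print(row.get("file_name"), "-", row.get("error"))
```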
Considerations for Importing
| Full text + Metadata | Metadata only | Can be Identified, not processed |
| --- | --- | --- |
| pst, zip, mbox, eml, msg, jpg, png, tiff, bmp, gif, rtf, txt, doc, docx, xls, xlsx, ppt, pptx, dat, data, csv, htm, html, mht, mhtml, xml | mp3, wav, flac, mp4, m4v, m4a, mov, mpg | ics, vcf, flv, pnm, pbm, pgm, ppm, ps, svg, emlx, mbx, anything encrypted |
Nextpoint will assign custodians upon request. Please note that the custodian of a piece of data is not intrinsic to that data; rather, a custodian is an employee or other person or group with ownership, custody, or control over potentially relevant information. For example, an individual custodian's electronically stored information (ESI) usually includes their mail file, whereas a group custodian's ESI may include a shared network folder. Because of this, custodians cannot be assigned without direction as to how the data was collected.
Email archives collected and combined into a single PST file with multiple folders can be split among multiple custodians after processing has been completed. Assignment of more than 10 custodians in a single import may be billed as an additional hourly charge.
Documents are standardized and processed into Coordinated Universal Time (UTC) unless otherwise requested. This time zone is used for all date filters and to standardize all datetime metadata fields. A time zone offset can be provided in document metadata to record the offset from GMT in which the data was processed; for example, data processed in GMT-5 would be populated with -5.00.
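To make the standardization concrete, here is a minimal Python sketch (not Nextpoint's internal code) showing how a Date Sent recorded in GMT-5 maps into UTC:

```python
from datetime import datetime, timedelta, timezone

# Data processed in GMT-5: the metadata offset field would read -5.00.
offset = timezone(timedelta(hours=-5))

date_sent_local = datetime(2024, 3, 1, 9, 30, tzinfo=offset)
date_sent_utc = date_sent_local.astimezone(timezone.utc)

print(date_sent_utc.isoformat())  # 2024-03-01T14:30:00+00:00
```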
The master date of a document is the date used for filtering and date restrictions. For emails and their attachments, the master date is generated from the Date Sent of the parent email; for efiles, from the last modified date.
When applying date restrictions, the kept documents are inclusive of the chosen dates (based on the master date described above).
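Expressed as a small sketch (over a hypothetical document structure, not Nextpoint's schema), master date selection and the inclusive date restriction look like this:

```python
from datetime import date

def master_date(doc):
    # Emails and their attachments inherit the parent email's Date Sent;
    # efiles fall back to their last modified date.
    # `doc` is a hypothetical dict, not Nextpoint's schema.
    if doc.get("parent_email_date_sent"):
        return doc["parent_email_date_sent"]
    return doc["last_modified"]

def within_restriction(doc, start, end):
    # Inclusive on both ends: a document dated exactly on the chosen
    # start or end date is kept.
    return start <= master_date(doc) <= end

efile = {"last_modified": date(2020, 1, 31)}
print(within_restriction(efile, date(2020, 1, 1), date(2020, 1, 31)))  # True
```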
We deduplicate documents based on email message ID, or MD5 hash if no email message ID is available. Any files with matching email message IDs (or MD5 hashes) will be deduplicated: only one native copy is stored in the system, and their metadata is merged by default. That said, documents within different document families will not be deduplicated in a way that splits up a family; attachments with matching MD5 hashes that are attached to two different emails will be retained as separate documents. Deduplication is done globally within each project, across all batches and custodians.
Currently, this feature cannot be turned off or customized.
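As a rough illustration of that logic (a minimal sketch over a hypothetical document structure, not Nextpoint's implementation), deduplication can be modeled as keying each document on its email message ID when present, otherwise on the MD5 of its native file, scoped by family so families are never split:

```python
import hashlib

def dedupe_key(doc):
    # Key on the email message ID when present, otherwise on the MD5
    # of the native bytes. `doc` is a hypothetical dict.
    if doc.get("message_id"):
        return ("message_id", doc["message_id"])
    return ("md5", hashlib.md5(doc["native_bytes"]).hexdigest())

def dedupe(docs):
    # Keep one native copy per key and merge metadata into it.
    # Including the family ID in the key mirrors the rule above: an
    # attachment with a matching MD5 on a different parent email keeps
    # its own copy rather than splitting the family.
    kept = {}
    for doc in docs:
        key = (doc.get("family_id"), dedupe_key(doc))
        if key in kept:
            kept[key].setdefault("custodians", []).extend(doc.get("custodians", []))
        else:
            kept[key] = doc
    return list(kept.values())
```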
Upon import into the Discovery platform, Nextpoint dedupes email families and loose files globally across all custodians. To do so, an MD5 hash value is generated: for emails, from Date Sent, Sender Name, Sender Email Address, Recipient Email Addresses, Display To, Display CC, Display BCC, Subject, Body, Attachment Names, and Attachment Size; for loose files, from the bit stream of the file.
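A sketch of what generating such a hash could look like follows; the field order, delimiter, and normalization here are illustrative assumptions, as this article does not publish the exact recipe.

```python
import hashlib

def email_dedupe_hash(email):
    # Concatenate the fields listed above. The field order and the
    # delimiter are illustrative assumptions, not Nextpoint's recipe.
    parts = [
        email["date_sent"],
        email["sender_name"],
        email["sender_email_address"],
        email["recipient_email_addresses"],
        email["display_to"],
        email["display_cc"],
        email["display_bcc"],
        email["subject"],
        email["body"],
        email["attachment_names"],
        str(email["attachment_size"]),
    ]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()

def loose_file_dedupe_hash(path):
    # Loose files are hashed over the raw bit stream of the file.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```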
Archives that extract zero files, or whose extracted file count does not match the expected count (coming soon), will be addressed on import in a quality-control pass. Individual file processing and indexing errors will not be addressed, only reported upon.
The following services are also available on import:
- Video/Audio Transcription*
- Language Detection (will occur on all imports) and Translation*
- Image Recognition*
- Entity Recognition/PII*
*These services may incur additional costs. Reach out to your client success representative for details.
Next up: Data Mining - Project Dashboard
Or view one of the other support resources in the Data Mining series:
Data Mining - Searching and Slicing Data