Ben Wolf January 29, 2010
This is not really a standard that can be defined or met. Processing by its nature is taking data that results from the normal course of business and converting it to a different format by which a legal course of business can be performed. What we can tell you is that we have developed a powerful, scalable processing engine that addresses the vast majority of data requirements with a sensible and easy-to-understand approach.
Our approach is to define a standardized set of data processes that meet a reasonableness requirement for converting the data into a format for review.
This standardized process treats each file the same way a user would handle it manually: opening the file in an application appropriate to the file type, printing it, and indexing the output. It is an exact parallel to the manual paper process of sending each file to a printer to be printed on paper, then scanned, then OCR'd, and then populated into a database. Nothing less. The key differentiator is that instead of paper, we print to PDF format and use the PDFs as the source of text extraction and image generation. It's fundamentally "electronic printing."
We provide full reporting on each file that was extracted, each file processed, and each file that was not processed.
File types that we support processing natively, without any manual processing, can be found here:
http://support.nextpoint.com/entries/63323-file-types
___________________________________________________________________________________________________________________________
We have built a proprietary processing engine; we have not licensed another company's technology. Our software does use a number of open-source technologies, including the Open Office SDK, which we use to process most file types. You can find out more about the Open Office project here: http://download.openoffice.org/sdk/index.html
It depends on what you consider metadata and when metadata is relevant for particular file types. Our processing logic populates the document date for all file types when available, and all metadata is preserved within the native files, which are linked to the image for each record. As a general rule, we treat metadata from e-mail messages as reliable enough to populate into our predefined fields, and -- unless coded to a particular custodian during the uploading process -- we do not populate other file types with metadata because it's generally unreliable. For example, the author indicated in a Microsoft Word document is as often not the author as it is the author.
All that said, the original file is always linked to allow for immediate download and a manual review of key documents of interest. If, based on your findings, the standard service is not sufficient, we can provide customized data management services to meet your needs.
___________________________________________________________________________________________________________________________
We extract the message files, preserve email threads and attachments, and populate the metadata automatically.
We convert these to PST files at no charge, and process them as PST files.
To the extent this unstructured data has appeared in the message inbox of a user, it is processed as an email. The most sensible approach is to export each of these items from the native applications as CSV files, which we will then process as distinct reports.
To the extent these file types are embedded within the natives, but have not been extracted, we preserve the native file data in the original file, but do not provide a reviewable document record. We also cannot provide assurance that our standard processing service - or any service - will detect, identify or extract text from these data types if the following conditions exist:
Again, if you would like us to perform customized data management services to determine if any of these factors is present, please let us know and we will be happy to provide an estimate.
Simply put, our standard processing service is designed to extract the vast majority of data quickly and efficiently, and to present that information in a way that is meaningful to non-technical reviewers. It is not possible to anticipate and correct poor preservation and collection practices.
Microsoft Outlook and Lotus Notes are sophisticated containers for a multitude of data types. It's not possible or desirable to extract every byte of unstructured data out of these containers and make it reviewable in a meaningful way. However, the original files are always maintained in the event additional technology services are required on these file types. Again, best practice is to generate individual exports for each data type - calendar, contacts, journal, custom task lists, or other custom forms - from the native applications during the collection process.
We don't automatically unhide or unlock cells in spreadsheets; we open the spreadsheets and "print to PDF."
Automatically altering the format of the document would make many spreadsheets literally incomprehensible, as anyone who has worked on a complex spreadsheet knows. The formatting of the spreadsheet as it would be opened and looked at by the user is the standard we use for preparing the file for review. And as always, the native file is available on demand, with a link for immediate download should someone be interested in manually reviewing how the data within the spreadsheet was structured.
Word processing files are opened using OpenOffice and exported to a PDF, and text is extracted from the resulting PDF.
If the native word processing file is set by default to display changes, comments, annotations or redlines, then these items will appear in the imaged version of the document associated with the document record. If the file is defaulted to hide this additional data, these items will not appear in the reviewable image, but will however remain present in the native file linked to the record.
Our approach again is not to alter the underlying document as it was intended to be saved by the user. Automatically revealing any items would fundamentally be altering the underlying native file. Also, as anyone who works regularly with word processing files knows, changes and annotations not displayed normally are present in almost every file. Automatically rendering these comments and annotations would make many documents incomprehensible and slow down the review process.
Given our experience with processing native files, these exist with such regularity as to render it a meaningless notification. Reviewers should assume all documents have a high probability of additional annotations hidden within the file - the extent to which these need to be reviewed as part of the normal course of your project is an issue for you to determine, and should you require some customized technical support to address this issue, we would be happy to provide an estimate for presenting this data. It is not that we cannot notify or extract this data, but we do not provide it as part of our normal standard processing service.
PowerPoint files are opened using OpenOffice and exported to a PDF, and text is extracted from the resulting PDF.
If the PowerPoint file is defaulted to open in Notes Pages, then those notes pages will be present in the exported PDF and therefore included in the search index. If the native file is not defaulted that way, the notes page comments are not present in the PDF and are not extracted, but they remain present within the native file that is linked to the processed record.
___________________________________________________________________________________________________________________________
The data first needs to be loaded to our highly secure web servers, where our processing servers can get to work on it.
Yes, further support can be found here:
http://support.nextpoint.com/entries/113764-faq-s-on-uploading
Yes, particularly for large datasets, it makes sense for us to do the uploading for you. We will upload your data to our servers at no charge, but if additional pre-processing work is required to upload those files, we will prepare an estimate for you to approve prior to undertaking that work.
___________________________________________________________________________________________________________________________
When native files are uploaded, they are processed into images that you can view, redact, stamp, and produce, which results in additional data. Indexes are also generated that speed your search results. Sophisticated image processing means you can navigate through documents intuitively without waiting for individual pages to load. Additionally, since color is an important piece of metadata that would be critical to any reviewer, all files are automatically processed as color images.
Storage costs are based on the amount of data that is being hosted. Data processed into the cloud will expand as a result of image and index generation. In addition, a copy of the native files and the container files if present are preserved.
So if you upload a zip file, you can expect that the original zip file, the extracted contents of that zip file, the processed images from the contents, the extracted text and metadata from the contents, and all associated annotations made by users are preserved. We do not delete anything as a matter of practice. Users are free to delete any or all of these data stores to reduce overall file size, but you can rest assured that you are in control of that decision.
The degree of expansion varies by data type, and data may in some instances expand up to 5 times the original file size. If a representative sample of the data can be provided, Nextpoint is happy to provide an estimate of the post-processing data volume at no charge. We are also happy to provide consultation as to what data stores may be safely deleted. It's not complicated -- we think best practice is for the user to understand what has happened to their data, when, why, and how.
We preserve all metadata, and color is a part of that information. Not preserving the color would be the equivalent of eliminating any images, bold type, or italicized type in a particular document. It is relevant information.
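As a rough planning aid, the expansion figure above can be turned into simple arithmetic. This is only a sketch: the function name and the default factor are illustrative assumptions, it reads "up to 5 times" as an upper bound on total hosted volume, and an estimate from a representative sample will be more accurate.

```python
# Hypothetical helper: upper-bound estimate of hosted volume after processing.
# The factor of 5 is the "in some instances" ceiling quoted in the text,
# not a typical value; actual expansion varies by data type.
MAX_EXPANSION_FACTOR = 5.0


def estimate_hosted_gb(uploaded_gb: float,
                       expansion_factor: float = MAX_EXPANSION_FACTOR) -> float:
    """Return the estimated hosted size in GB for a given upload size."""
    if uploaded_gb < 0 or expansion_factor <= 0:
        raise ValueError("sizes and factors must be positive")
    return uploaded_gb * expansion_factor


# Example: a 10 GB upload could occupy up to 50 GB at the quoted ceiling.
worst_case = estimate_hosted_gb(10)
```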
You can perform date and keyword searches in order to cull the amount of data.
A dedupe form exists in the application if you would like your data deduplicated. Please fill out the request form and we will confirm the deduplication protocol with you at no extra charge.
We dedupe by md5 hash, email header, and full-text extraction (OCR or native). Document populations can be deduped by custodian or across an entire population. Container files such as email or zip are only considered duplicates if the contained/children files are also duplicates.
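To make the hash-based part of this concrete, here is a minimal sketch of MD5 deduplication, either per custodian or across the whole population. It is an illustration, not our production code: the function names and the tuple layout are assumptions, and the email-header and full-text comparisons are not shown. The container rule is approximated by hashing the sorted hashes of the children, so two containers match only when their contents match.

```python
import hashlib
from collections import defaultdict


def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def find_duplicates(docs, per_custodian=False):
    """Group document ids that share an MD5 hash.

    docs: iterable of (doc_id, custodian, content_bytes) tuples (assumed layout).
    per_custodian: if True, documents only match within the same custodian.
    Returns a list of id groups that contain more than one document.
    """
    groups = defaultdict(list)
    for doc_id, custodian, content in docs:
        scope = custodian if per_custodian else None
        groups[(scope, md5_of(content))].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def container_hash(child_contents):
    """Hash a container (zip, email) by its children, ignoring their order."""
    child_hashes = sorted(md5_of(c) for c in child_contents)
    return md5_of("".join(child_hashes).encode("ascii"))
```

With this sketch, the same file held by two custodians is reported as a duplicate across the population but not when deduping by custodian.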
Metadata is also used either to reject duplicates (in which case they are treated as "near duplicates") or to merge them. An example of merged metadata would be when two documents are exact duplicates but the location field is different: Document 1a was found on John's Desktop and Document 1b was found in Sue's documents folder. We either merge those locations by concatenating them and removing one of the duplicates, or we leave the records separate, depending on your preference.
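The location merge can be sketched as follows. The record layout, the field names, and the "; " separator are assumptions made for illustration; the point is only that the first record is kept and the other locations are concatenated onto it.

```python
def merge_duplicates(records, merge_locations=True):
    """Collapse exact duplicates, concatenating their location fields.

    records: list of dicts with 'id', 'md5' and 'location' keys (assumed layout).
    merge_locations: if False, records are returned unmerged (kept separate).
    """
    if not merge_locations:
        return [dict(r) for r in records]

    merged = {}
    for record in records:
        kept = merged.get(record["md5"])
        if kept is None:
            # First time this hash is seen: keep a copy of the record.
            merged[record["md5"]] = dict(record)
        elif record["location"] not in kept["location"].split("; "):
            # Duplicate found elsewhere: append its location and drop the record.
            kept["location"] += "; " + record["location"]
    return list(merged.values())


records = [
    {"id": "1a", "md5": "abc", "location": "John's Desktop"},
    {"id": "1b", "md5": "abc", "location": "Sue's documents folder"},
]
merged = merge_duplicates(records)
```

Here `merged` holds a single record for Document 1a whose location lists both places the file was found.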
When you search for your documents, include a date search such as
document_date:"february 20, 2007"
or include a range of dates
document_date:>"february 20, 2007"
Perform the following search: document_date:NULL
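If you build these searches from a script or a spreadsheet export, the three forms shown above can be generated with a small helper. This is a sketch only: the function names are hypothetical, and it reproduces just the syntax shown in the examples, so other date formats or operators are not assumed to work.

```python
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def format_date(d: date) -> str:
    """Format a date the way the examples above write it."""
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"


def on_date(d: date) -> str:
    """Search for documents dated exactly d."""
    return f'document_date:"{format_date(d)}"'


def after_date(d: date) -> str:
    """Search for documents dated after d."""
    return f'document_date:>"{format_date(d)}"'


def undated() -> str:
    """Search for documents with no document date."""
    return "document_date:NULL"
```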