Data Processed to Images, Including Previously Produced Data
First, the data set will contain up to 3 folders (at the very least, this type of data set must contain an "Images" folder):
IMAGES - this folder contains the document pages, each a one-page image file
TEXT - this folder contains the OCR text information, and can be either one text file per page, or one text file per document
NATIVES - this folder contains any native files that accompany each document
Image page files will be in the .tif or .jpg format, and the files will be named by their bates numbers. If included, the OCR text and native files will also be named by the corresponding bates numbers. Here is what a common single-page image data set looks like:
This data structure is consistent across most imports of this type, so Nextpoint has built in some automation that makes load files simpler to prepare. The load file only needs to answer the following questions:
1) Which documents need to be imported, and where are they?
When single-page image files are located in a folder called IMAGES and are named by their bates values, you only need to use the column headers bates_start and bates_end. The combination of these two headers serves two functions:
- To tell Nextpoint which page files to use to assemble each document image (the page boundaries for each document). Nextpoint will automatically find the image files based on the bates numbers, since they are in an IMAGES folder.
- To assign the appropriate bates values to each document as it is processed into Nextpoint.
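To illustrate how a bates_start and bates_end pair can define the page boundaries of one document, here is a minimal Python sketch. It is not Nextpoint's implementation. The function name expand_bates_range is hypothetical, and the sketch assumes each bates value is a shared prefix followed by a zero-padded number (for example ABC000001), which may not hold for every production.

```python
import re


def expand_bates_range(bates_start: str, bates_end: str) -> list[str]:
    """List every page bates number from bates_start to bates_end, inclusive.

    Assumption: both values share one prefix and end in a zero-padded number.
    """
    pattern = re.compile(r"^(.*?)(\d+)$")
    start = pattern.match(bates_start)
    end = pattern.match(bates_end)
    if not start or not end:
        raise ValueError("bates values must end in digits")
    if start.group(1) != end.group(1):
        raise ValueError("bates_start and bates_end must share a prefix")

    first, last = int(start.group(2)), int(end.group(2))
    if last < first:
        raise ValueError("bates_end comes before bates_start")

    # Keep the zero padding used by bates_start so names match the page files.
    width = len(start.group(2))
    return [f"{start.group(1)}{n:0{width}d}" for n in range(first, last + 1)]


# Hypothetical three-page document.
pages = expand_bates_range("ABC000001", "ABC000003")
```

Each returned value corresponds to one page file name (without its extension) that would be looked up in the IMAGES folder.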
2) What title should Nextpoint give the document?
Nextpoint needs to know what to call the document, or what value to place in the subject/title field. This is accomplished by using a title column header.
3) Which OCR text file corresponds to each document?
Nextpoint needs to know which text file to grab and apply to each document image as the OCR text. This is accomplished by using a text_file column header, which contains the path to and name of the text file.
4) Which Native file corresponds to each document (where applicable)?
Nextpoint needs to know which native file to grab and apply to each document image that requires an accompanying native. This is accomplished by using a native_file column header, which contains the path to and name of the native file.
5) Make sure that no OCR text (.txt) files are used as image page files:
Occasionally, text files are mixed into the same folder with image files. Since they are all named by bates number, that means the only difference between an OCR text file and an image page is the file extension. Therefore you need to tell Nextpoint to use only files with a certain extension as document image pages. This is easily accomplished with a column header called image_extension, containing the value tif|jpg in every row.
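The effect of an image_extension value such as tif|jpg can be sketched in Python as a simple filter over file names. This is only an illustration of the idea, not how Nextpoint processes the column. The function name select_image_pages and the file names are hypothetical, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path


def select_image_pages(file_names, image_extension="tif|jpg"):
    """Keep only the files whose extension appears in the pipe-separated list."""
    # Assumption: extensions are compared case-insensitively.
    allowed = {ext.strip().lower() for ext in image_extension.split("|")}
    return sorted(
        name
        for name in file_names
        if Path(name).suffix.lower().lstrip(".") in allowed
    )


# Hypothetical folder where OCR text files sit next to the page images.
mixed_folder = ["ABC000001.tif", "ABC000001.txt", "ABC000002.jpg", "ABC000002.txt"]
image_pages = select_image_pages(mixed_folder)
```

With the hypothetical folder above, the two .txt files are excluded and only the .tif and .jpg files remain as candidate page images.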
So, to import a set of single-page image files into Nextpoint, your load file needs the following layout: row 1 holds the column headers, and rows 2-5 hold the coding values that correspond to each column header for the 4 documents being imported. These are the minimum required load file commands for such a document import:
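As a rough sketch, a load file using the column headers described in this section could be generated with Python's csv module. Several things here are assumptions rather than documented Nextpoint behavior: the comma-separated layout, the helper name build_load_file, the relative path style used for text_file, and every bates value and title, which are made up for illustration.

```python
import csv
import io

# Column headers named in this section.
HEADERS = [
    "bates_start",
    "bates_end",
    "title",
    "text_file",
    "native_file",
    "image_extension",
]


def build_load_file(documents):
    """Return load file text: one header row, then one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    for doc in documents:
        # image_extension carries the same value in every row;
        # columns a document does not supply are left blank.
        writer.writerow({"image_extension": "tif|jpg", **doc})
    return buffer.getvalue()


# Hypothetical documents; none of these values come from a real data set.
load_file_text = build_load_file([
    {
        "bates_start": "ABC000001",
        "bates_end": "ABC000003",
        "title": "Sample letter",
        "text_file": "TEXT/ABC000001.txt",
    },
    {
        "bates_start": "ABC000004",
        "bates_end": "ABC000004",
        "title": "Sample memo",
        "text_file": "TEXT/ABC000004.txt",
    },
])
```

The native_file column is left blank in these rows because a native file applies only to some documents.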
In this example, you would save the load file and place it in the same location as the IMAGES, NATIVES, and TEXT folders in your data set. When you select the folder "Image_Import" as your upload, the information in the bates_start column of your load file tells the Nextpoint application how to find each document image page (IMAGES), native file (NATIVES), and OCR text file (TEXT).