How does Nextpoint obtain a document's search text?
Nextpoint uses the Tesseract OCR engine, which was originally developed by Hewlett-Packard, released as open source in 2005, and has been sponsored by Google since 2006. If you would like further information on the specifics of the Tesseract OCR engine, this Wikipedia article may be helpful.
During processing, Nextpoint adds text to the database in one of three ways, tried in the following order:
- First, if search text is provided in a load file, that text is prioritized and mapped to each document.
- Next, if no search text is provided in a load file, but the document has embedded text that can be extracted (e.g. PDFs with embedded text), that text is added to the database.
- Lastly, if no text can be extracted directly from the file, Nextpoint will OCR individual pages for their search text.
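The priority order above can be sketched as a simple fallback chain. This is only an illustration, not Nextpoint's actual implementation: the `Document` fields, the `resolve_search_text` function, and the `ocr_page` callback are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Document:
    # All fields are hypothetical; they stand in for whatever the
    # processing pipeline actually knows about a document.
    loadfile_text: Optional[str] = None   # search text supplied in a load file
    embedded_text: Optional[str] = None   # text extracted from the file itself
    page_images: Tuple[str, ...] = ()     # page identifiers to OCR as a last resort


def resolve_search_text(
    doc: Document, ocr_page: Callable[[str], str]
) -> Tuple[str, str]:
    """Return (source, text), trying each text source in priority order."""
    # 1. Text provided in a load file takes priority.
    if doc.loadfile_text:
        return "loadfile", doc.loadfile_text
    # 2. Otherwise use embedded text extracted from the file.
    if doc.embedded_text:
        return "embedded", doc.embedded_text
    # 3. Otherwise OCR each page individually and join the results.
    return "ocr", "\n".join(ocr_page(page) for page in doc.page_images)
```

In this sketch, a document with both load file text and embedded text resolves to the load file text, and OCR runs only when neither of the other sources is available.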
What does OCR mean?
OCR is short for Optical Character Recognition, a technology used to recognize text inside images, such as scanned documents and photos. Once the text has been recognized (OCR'd), it becomes editable, searchable data.
How does Nextpoint handle foreign language text?
Nextpoint supports foreign language text for files whose text is provided in a load file and/or files with existing embedded text that can be extracted (the first two methods above). The following languages are supported:
- Chinese - Simplified
- Chinese - Traditional
- Math / equation detection
Currently, OCR is supported for English only, but Nextpoint can support additional languages on a custom basis. Please contact your Account Director or email@example.com for further information.
Need OCR for a foreign language not on the above list, or have further questions related to OCR?
Please feel free to contact our support team at firstname.lastname@example.org.