Task
You want to specify the type of agent that you use when performing optical character recognition (OCR) on files, such as images and PDFs.
This task works only with the Unstructured open source library.
Approach
Set the environment variable named OCR_AGENT to one of the following supported values:
- unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract to use Tesseract OCR. This is the default if not otherwise specified.
- unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle to use Paddle OCR.
- unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision to use Google Cloud Vision OCR.

Each OCR agent has its own dependencies:
- For Tesseract OCR, see the dependency list.
- For Paddle OCR, see the dependency list.
- For Google Cloud Vision OCR, see the dependency list.
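One way to set the variable from Python is sketched below. The short names (tesseract, paddle, google_vision) and the helper select_ocr_agent are hypothetical conveniences for this illustration and are not part of the Unstructured library; only the OCR_AGENT variable name and the three class paths come from the list above. The sketch assumes the variable is set before any partitioning call runs.

```python
import os

# Hypothetical short names mapped to the supported OCR_AGENT values.
OCR_AGENTS = {
    "tesseract": "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
    "paddle": "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    "google_vision": "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision",
}


def select_ocr_agent(name: str) -> str:
    """Set OCR_AGENT for the current process and return the value that was set."""
    if name not in OCR_AGENTS:
        supported = ", ".join(sorted(OCR_AGENTS))
        raise ValueError(f"Unknown OCR agent {name!r}; expected one of: {supported}")
    os.environ["OCR_AGENT"] = OCR_AGENTS[name]
    return OCR_AGENTS[name]


# Choose the agent before calling any partitioning function.
select_ocr_agent("tesseract")
```

Rejecting unknown names up front keeps a typo from surfacing later as a failure inside the partitioning call.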
Example code
This example uses Tesseract OCR on a PNG file that contains a combination of English and Korean text. Language codes differ depending on the OCR agent you use:
- For Tesseract OCR, see the language codes list.
- For Paddle OCR, see the language codes list and language names list.
- For Google Cloud Vision OCR, see the language codes list.
Python
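The following is a minimal sketch of such an example, not a definitive implementation. It assumes the Unstructured library is installed with its image and Tesseract dependencies, that eng and kor are the Tesseract language codes for English and Korean, and that partition_image accepts the strategy and languages arguments as shown. The function name partition_png and the file path in the usage note are placeholders introduced for illustration.

```python
import os

# Select Tesseract OCR explicitly; it is also the default agent.
os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
)


def partition_png(filename, languages=("eng", "kor")):
    """Run OCR on an image file and return the extracted elements."""
    # Imported here so that OCR_AGENT is already set when the library loads.
    from unstructured.partition.image import partition_image

    return partition_image(
        filename=filename,
        strategy="hi_res",
        languages=list(languages),
    )
```

To try it, call partition_png with the path to your own file, for example partition_png("english-and-korean.png"), and print the text attribute of each returned element. If you switch to a different OCR agent, replace the language codes with the ones that agent expects.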