Task

You want to specify the type of agent that you use when performing optical character recognition (OCR) on files, such as images and PDFs.

This task works only with the Unstructured open source library.

Approach

Set the environment variable named OCR_AGENT to one of the following supported values:

  • unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract to use Tesseract OCR. This is the default if not otherwise specified.
  • unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle to use Paddle OCR.
  • unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision to use Google Cloud Vision OCR.
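
As a minimal sketch, the variable can also be set from Python for the current process instead of in the shell. The short labels and the select_ocr_agent helper below are hypothetical names used only for illustration and are not part of the Unstructured library. The sketch assumes the variable should be set before any partition function is called.

```python
import os

# Fully qualified class names from the list above, keyed by short labels.
# The labels and this helper are illustrative only, not part of the library.
OCR_AGENTS = {
    "tesseract": "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
    "paddle": "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    "google_vision": "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision",
}


def select_ocr_agent(label: str) -> str:
    """Set OCR_AGENT for the current process and return the value that was set."""
    if label not in OCR_AGENTS:
        raise ValueError(f"Unknown OCR agent label: {label!r}")
    os.environ["OCR_AGENT"] = OCR_AGENTS[label]
    return OCR_AGENTS[label]


# Assumption: call this before any partition function runs,
# so that the setting is in place when OCR starts.
select_ocr_agent("paddle")
print(os.environ["OCR_AGENT"])
```

Setting the variable this way affects only the running Python process and any child processes it starts. To apply the setting more broadly, set OCR_AGENT in your shell or deployment environment instead.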

Also, be sure to install the corresponding OCR agent and its dependencies, if you have not already done so.

Example code

This example uses Tesseract OCR to process a PNG file that contains both English and Korean text.

Language codes differ depending on the OCR agent that you use.

Python
import json

from unstructured.partition.image import partition_image

# Source: https://github.com/Unstructured-IO/unstructured-ingest/blob/main/example-docs/img/english-and-korean.png
# Path to the local file to process, relative to the current working directory.
filename = "local-ingest-png/english-and-korean.png"

elements = partition_image(
    filename=filename,
    strategy="ocr_only",
    languages=["eng", "kor"],  # Language codes differ by the OCR agent used.
)

# Convert the list of returned elements into a list of dictionaries for printing or saving.
element_dicts = [element.to_dict() for element in elements]

# Print the list.
print(json.dumps(element_dicts, indent=2))

# Or, save the list locally:
#
# output_path = "local-ingest-output/english-and-korean.json"
#
# with open(output_path, "w") as output_file:
#     json.dump(element_dicts, output_file, indent=2)