Basic usage:

elements = partition(filename=filename,
                     strategy="hi_res",
                     hi_res_model_name="yolox")
  • To use any model with the partition, set the strategy to hi_res as shown above.

  • To maintain the consistency between the unstructured and unstructured-api libraries, we are deprecating the model_name parameter. Please use hi_res_model_name parameter when specifing a model.

List of Available Models in the Partitions:

  • detectron2_onnx is a Computer Vision model by Facebook AI that provides object detection and segmentation algorithms with ONNX Runtime. It is the fastest model with the hi_res strategy.

  • yolox is a single-stage real-time object detector that modifies YOLOv3 with a DarkNet53 backbone.

  • yolox_quantized: runs faster than YoloX and its speed is closer to Detectron2.

Using a Non-Default Model

Unstructured will download the model specified in UNSTRUCTURED_HI_RES_MODEL_NAME environment variable. If not defined, it will download the default model.

There are three ways you can use the non-default model as follows:

  1. Store the model name in the environment variable
import os
from unstructured.partition.pdf import partition_pdf

os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"] = "yolox"

out_yolox = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

  1. Pass the model name in the partition function.
filename = "example-docs/layout-parser-paper-fast.pdf"

elements = partition(filename=filename,
                     strategy="hi_res",
                     hi_res_model_name="yolox")

  1. Use unstructured-inference library.
from unstructured_inference.models.base import get_model
from unstructured_inference.inference.layout import DocumentLayout

model = get_model("yolox")
layout = DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf", detection_model=model)