Task

You want to use the fastest, highest-precision, yet most cost-effective partitioning strategy overall, but you are not sure which one to specify for your particular files.

Approach

The --strategy command option (CLI) or strategy parameter (Python/JavaScript/TypeScript) specifies the partitioning strategy to use.

Use the following decision-maker to help you determine what to set --strategy or strategy to.

1

How many of the files to be processed are image files, or PDF files with embedded images or tables in them?

  • If All, you can increase precision for processing images and tables by specifying hi_res for --strategy or strategy. For additional benefits, skip ahead to Step 2.
  • If Some, you can increase precision for processing images and tables by specifying auto for --strategy or strategy. This will leave the decision up to Unstructured on a file-by-file basis about the best partitioning strategy to use. You have completed this decision-maker. See also Auto partitioning strategy logic.
  • If None, you can increase performance and decrease cost by specifying fast for --strategy or strategy. You have completed this decision-maker.
For OCR, you can also set --strategy or strategy to ocr_only. Unstructured will use the Tesseract OCR agent as applicable. You cannot specify any other OCR agent to be used instead.
2

For object detection, do you know which particular high-resolution object detection model you want to use?

  • If No, then specify auto for --strategy or strategy. This will decrease performance compared to answering Yes. You have completed this decision-maker. See also Auto partitioning strategy logic.
  • If Yes, then set --hi-res-model-name (CLI), hi_res_model_name (Python), or hiResModelName(JavaScript/TypeScript) to the available model’s name. Learn about the available models.

Code example

See Changing partition strategy for a PDF.

Auto partitioning strategy logic

Setting --strategy or strategy to auto leaves the decision up to Unstructured on a file-by-file basis about which partitioning strategy to use. Specifically:

  • If the file is an image, the hi_res stategy is used for that file. The layout_v1.0.0 high-resolution object detection model is used.

  • If the file is a PDF, the local processing logic or Unstructured tries to detect whether there are any embedded tables or images in that file.

    • If no embedded tables or images are detected, the fast strategy is used for that file. No high-resolution object detection model is used.
    • If at least one embedded table or image is found, the hi_res strategy is used for that file. The layout_v1.0.0 high-resolution object detection model is used.
  • If --strategy or strategy is not specified, the auto strategy is used by default.