Partitioning extracts content from raw unstructured files and outputs that content as structured document elements.

For specific file types, such as image files and PDF files, the Unstructured Platform offers special partitioning strategies. Each strategy trades off processing speed, cost, and output quality.

PDF files, for example, vary in quality and complexity. In simple cases, traditional natural language processing (NLP) extraction techniques might be enough to extract all of the text from a document. In other cases, advanced image-to-text models are required to process a PDF file.

Some strategies implement rule-based workflows. These can be faster and cheaper because they always extract content in the same way, but they might sometimes produce lower-quality results. Other strategies implement model-based workflows. These can be slower and costlier because they require a model that performs inference, but they can produce higher-quality results. For example, the Fast strategy can be about 100 times faster than leading image-to-text models. When you choose a partitioning strategy for your files, keep these speed, cost, and quality trade-offs in mind.

To choose one of these strategies, select one of the following Partition Strategy options in the Partitioner node of a workflow. You can change a workflow’s predefined strategy only through Custom workflow settings.

  • Fast: This strategy is ideal for simple, text-based documents.
  • Hi-Res: This strategy is best for PDFs, images, and complex file types.
  • VLM: This strategy is for your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs). Files that are not PDFs or images are processed with the Hi-Res strategy instead and are charged at the Hi-Res rate.
  • Auto: This strategy examines each file before processing it. If the file is an image, or if the file is a PDF that contains at least one embedded table or image, that file is processed with Hi-Res and charged at the Hi-Res rate. Otherwise, that file is processed with Fast and charged at the Fast rate.
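
The per-file routing described for Auto and VLM can be summarized as a small decision function. The following Python sketch is illustrative only: `FileInfo`, `effective_strategy`, and the strategy strings are hypothetical names chosen for this example and are not part of the Unstructured Platform's API. It assumes the file's type and embedded content have already been detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical summary of one file (not a platform type)."""
    is_image: bool = False
    is_pdf: bool = False
    # True if at least one embedded table or image was found in the file.
    has_embedded_table_or_image: bool = False


def effective_strategy(selected: str, info: FileInfo) -> str:
    """Return the strategy assumed to be applied, and billed, for one file.

    `selected` is one of "fast", "hi_res", "vlm", or "auto"
    (illustrative labels for the four Partition Strategy options).
    """
    if selected == "auto":
        # Auto: images, and PDFs with an embedded table or image, go to Hi-Res.
        if info.is_image or (info.is_pdf and info.has_embedded_table_or_image):
            return "hi_res"
        return "fast"
    if selected == "vlm" and not (info.is_pdf or info.is_image):
        # VLM: files that are not PDFs or images fall back to Hi-Res.
        return "hi_res"
    # Fast and Hi-Res are applied as selected.
    return selected


if __name__ == "__main__":
    text_only_pdf = FileInfo(is_pdf=True)
    print(effective_strategy("auto", text_only_pdf))
```

Because the result also determines the rate charged for the file, a function like this could be used to estimate how a batch of files would be billed under each option before running a workflow.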