Partitioning
Partitioning extracts content from raw unstructured files and outputs that content as structured document elements.
For specific file types, such as image files and PDF files, the Unstructured Platform offers special partitioning strategies. Each strategy trades off processing speed, cost, and output quality.
PDF files, for example, vary in quality and complexity. In simple cases, traditional natural language processing (NLP) extraction techniques might be enough to extract all of the text from a document. In other cases, advanced image-to-text models are required to process a PDF file.
Some strategies implement rule-based workflows, which can be faster and cheaper because they always extract in the same way, but they might sometimes produce lower-quality output. Other strategies implement model-based workflows, which can be slower and costlier because they require a model that performs inference, but they can produce higher-quality output. When you choose a partitioning strategy for your files, be mindful of these speed, cost, and quality trade-offs. For example, the Fast strategy can be about 100 times faster than leading image-to-text models.
To choose one of these strategies, select one of the Partition Strategy options in the Partitioner node of a workflow:
- Fast: This strategy is ideal for simple, text-based documents.
- High Res: This strategy is best for PDFs, images, and complex file types. During High Res processing, any detected text-based files are processed and billed at the Fast rate instead.
- VLM: This strategy is for your most challenging documents, including scanned and handwritten content. During VLM processing, any detected files that are not PDFs or images are processed and billed at a lower rate instead: text-based files at the Fast rate, and all other files at the High Res rate.
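The fallback rules in the list above can be summarized as a small lookup. The following Python sketch is illustrative only: `Strategy`, `FileKind`, and `effective_rate` are hypothetical names, not part of any Unstructured API, and the sketch assumes each file falls into exactly one of four kinds (PDF, image, text-based, or other). How the platform actually detects a file's kind is not described here.

```python
from enum import Enum


class Strategy(Enum):
    """Partition Strategy options named in this page."""
    FAST = "Fast"
    HIGH_RES = "High Res"
    VLM = "VLM"


class FileKind(Enum):
    """Hypothetical, mutually exclusive file categories (an assumption)."""
    PDF = "pdf"
    IMAGE = "image"
    TEXT_BASED = "text-based"
    OTHER = "other"


def effective_rate(selected: Strategy, kind: FileKind) -> Strategy:
    """Return the rate a file is processed and billed at, per the rules above."""
    # Fast is the lowest tier, so there is nothing to fall back to.
    if selected is Strategy.FAST:
        return Strategy.FAST
    # Under High Res and VLM, text-based files drop to the Fast rate.
    if kind is FileKind.TEXT_BASED:
        return Strategy.FAST
    # Under High Res, everything that is not text-based stays at High Res.
    if selected is Strategy.HIGH_RES:
        return Strategy.HIGH_RES
    # Under VLM, only PDFs and images stay at the VLM rate.
    if kind in (FileKind.PDF, FileKind.IMAGE):
        return Strategy.VLM
    # Remaining non-PDF, non-image files drop to the High Res rate.
    return Strategy.HIGH_RES
```

For example, under these assumptions `effective_rate(Strategy.VLM, FileKind.OTHER)` returns `Strategy.HIGH_RES`, while a scanned PDF submitted with VLM stays at the VLM rate.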
When you use the VLM strategy with embeddings on PDF files of 200 or more pages, you might see some processing errors. These errors typically occur when these larger PDF files contain many tables and high-resolution images.