Partitioning

Partitioning extracts content from raw unstructured files and outputs that content as structured document elements.

For specific file types, such as image files and PDF files, Unstructured offers special strategies to partition them. Each of these strategies has trade-offs in output speed, cost, and quality. PDF files, for example, vary in quality and complexity. In simple cases, traditional natural language processing (NLP) extraction techniques might be enough to extract all the text from a document. In other cases, advanced image-to-text models are required to process a PDF file.

Some of these strategies implement rule-based workflows, which can be faster and cheaper because they always extract in the same way, but they might sometimes produce lower-quality results. Other strategies implement model-based workflows, which can be slower and costlier because they require a model that performs inference, but they can produce higher-quality results.

When you choose a partitioning strategy for your files, be mindful of these speed, cost, and quality trade-offs. For example, the Fast strategy can be about 100 times faster than leading image-to-text models.

To choose one of these strategies, select one of the following four Partition Strategy options in the Partitioner node of a workflow.

You can change a workflow’s preconfigured strategy only through Custom workflow settings.

Unstructured recommends that you choose the Auto partitioning strategy in most cases. With Auto, Unstructured does all the heavy lifting, optimizing at runtime for the highest quality at the lowest cost on a page-by-page basis. Consider the following additional strategies only if you are absolutely sure that your documents are all of the same type. Each of the following strategies is best suited for specific situations. Choosing a strategy other than Auto for sets of documents of different types could produce undesirable results, including reduced transformation quality.

VLM: For the highest-quality transformation of these file types: .bmp, .gif, .heic, .jpeg, .jpg, .pdf, .png, .tiff, and .webp.
High Res: For all other supported file types, and for the generation of bounding box coordinates.
Fast: For text-only documents.

The Auto partitioning strategy routes each file as a complete unit to the appropriate partitioning strategy (VLM, High Res, or Fast) based on the preceding file types. Additionally, for .pdf files, the Auto partitioning strategy routes these files’ pages on a page-by-page basis, as follows:

A page is routed to Fast when it contains only embedded text and no images or tables are detected.
All other kinds of pages are routed to VLM or High Res, depending on the complexity of a page’s content. Unstructured constantly optimizes its proprietary algorithm for routing to VLM or High Res in these cases.
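
The routing rules above can be summarized in a minimal sketch. The names `PageInfo`, `route_page`, and `route_file` are hypothetical, and the `is_complex` flag is only a stand-in for the proprietary VLM-versus-High Res decision, which is not described here. This illustrates the documented behavior and is not Unstructured's actual implementation.

```python
from dataclasses import dataclass
from pathlib import Path

# File types listed above as best suited to the VLM strategy.
VLM_EXTENSIONS = {
    ".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp",
}


@dataclass
class PageInfo:
    """Hypothetical summary of what was detected on one PDF page."""
    has_embedded_text: bool
    has_images: bool
    has_tables: bool
    is_complex: bool = False  # placeholder for the undocumented complexity check


def route_page(page: PageInfo) -> str:
    """Pick a strategy for a single PDF page, following the rules above."""
    if page.has_embedded_text and not (page.has_images or page.has_tables):
        return "Fast"
    # Assumption: the real choice between VLM and High Res is proprietary.
    return "VLM" if page.is_complex else "High Res"


def route_file(filename: str, text_only: bool = False) -> str:
    """Pick a strategy for a whole file by its type.

    PDF pages would additionally be routed one by one with route_page.
    """
    if text_only:
        return "Fast"
    suffix = Path(filename).suffix.lower()
    return "VLM" if suffix in VLM_EXTENSIONS else "High Res"
```

Keeping the page-level and file-level decisions in separate functions mirrors the description above: every file is routed as a unit, and only .pdf files get the additional page-by-page pass.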

Images and tables in PDF files

The differences between the various partitioning strategies can be more clearly demonstrated by the way each of these strategies handles images and tables within PDF files. For example, the Fast partitioning strategy skips processing images altogether in PDF files:

The Fast strategy skips processing images in PDF files

For tables, the Fast strategy interprets table cells in PDF files as a mixture of title, list, and uncategorized text elements:

The Fast strategy interprets table cells in PDF files as text

The High Res strategy, by itself, processes images in PDF files, sometimes with limited output:

The High Res strategy processes images in PDF files with limited results

However, when combined with the image description enrichment, the High Res strategy can process images in PDF files with better results:

The High Res strategy with image description produces better results

For tables, the High Res strategy processes tables in PDF files with the table’s text and an HTML representation of the table as output:

The High Res strategy processes tables in PDF files with the table's text and HTML as output
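
To work with the HTML representation downstream, you can convert it back into rows of cells. The following is a minimal sketch that uses only Python's standard library. It assumes that a table element is available as a dictionary with the HTML stored under `metadata.text_as_html`; the `sample` element is invented for illustration, and the names `TableRows` and `table_rows` are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_rows(element: dict) -> list[list[str]]:
    """Return the table's cells as rows of stripped strings."""
    parser = TableRows()
    # Assumption: the HTML representation lives in metadata.text_as_html.
    parser.feed(element["metadata"]["text_as_html"])
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Invented sample element, shaped like a table element with text and HTML.
sample = {
    "type": "Table",
    "text": "Item Qty Apples 3",
    "metadata": {
        "text_as_html": "<table><tr><th>Item</th><th>Qty</th></tr>"
                        "<tr><td>Apples</td><td>3</td></tr></table>"
    },
}
print(table_rows(sample))  # [['Item', 'Qty'], ['Apples', '3']]
```

The plain text keeps the cell contents but loses the row and column boundaries, which is why the HTML representation is the more useful of the two for reconstructing the table.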

When combined with the table description and tables to HTML enrichments, the High Res strategy can process tables in PDF files with even richer output:

The High Res strategy with table summarization and table-to-HTML enrichments produces better results

The VLM strategy processes images in PDF files with image summaries and text as HTML elements as output. The following example shows GPT-4o by OpenAI being used. If the Auto strategy is selected in this example, Unstructured will route to the VLM strategy for processing:

The VLM strategy processes images in PDF files with image summaries and text as HTML

For tables, the VLM strategy processes tables in PDF files with the table’s text and an HTML representation of the table as output, similar to the High Res strategy. The following example shows GPT-4o by OpenAI being used. If the Auto strategy is selected in this example, Unstructured will route to the VLM strategy for processing:

The VLM strategy processes tables in PDF files with table summaries and text as HTML

Handwriting and multilanguage characters in PDF files

The differences between the various partitioning strategies can also be demonstrated by the way each of these strategies handles handwriting and multilanguage characters within PDF files. For example, the Fast partitioning strategy skips processing handwriting altogether in PDF files. The Fast strategy processes multilanguage characters in PDF files with limited output, depending on the language. In the following example, Japanese hiragana characters are processed as text, but the output can be very difficult to work with:

The Fast strategy produces cryptic CID codes for hiragana characters
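
One way to spot this kind of output programmatically is to look for CID placeholders in the extracted text. The following is a minimal sketch. It assumes that the codes take the form `(cid:N)`, which is how PDF text extractors such as pdfminer render glyphs that cannot be mapped to Unicode. The names `cid_ratio` and `needs_reprocessing` and the 0.2 threshold are hypothetical and are not part of Unstructured.

```python
import re

# Assumption: unmapped glyphs appear in the text as "(cid:123)" placeholders.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the fraction of characters in text that belong to CID codes."""
    if not text:
        return 0.0
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(text))
    return cid_chars / len(text)


def needs_reprocessing(text: str, threshold: float = 0.2) -> bool:
    """Flag text that is mostly CID codes and may need another strategy."""
    return cid_ratio(text) > threshold
```

A check like this could be used to decide whether a document that was partitioned with Fast should be sent through a strategy that reads the rendered page instead of the embedded text.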

For handwriting, the High Res strategy typically produces unusable results, for example:

The High Res strategy typically produces unusable results for handwriting

For multilanguage characters, the High Res strategy also typically produces unusable results. In the following example, it fails to recognize Japanese hiragana characters:

The High Res strategy typically produces unusable results for multilanguage characters

The VLM strategy can produce high-quality results for handwriting, as in this example that uses GPT-4o by OpenAI:

The VLM strategy can process handwriting well

The VLM strategy also has strong support for recognizing multilanguage characters, as in this example that uses GPT-4o by OpenAI to recognize Japanese hiragana characters:

The VLM strategy can process Japanese hiragana well

Supported languages

Fast partitioning accepts any text input, though automatic language detection of that input is limited to what the langdetect library can detect.

High Res partitioning uses Tesseract OCR. For the list of languages that Tesseract supports, see: Languages/Scripts supported in different versions of Tesseract.

Language support for VLM depends on the model used. The list of supported languages for a particular model is maintained by that model’s provider. For the list of languages that each model supports, see the following, where provided:

Anthropic
- Claude 3.5 Sonnet: Arabic, Bengali, Chinese (Simplified), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil), Spanish, Swahili, and Yoruba are mentioned. (Source)
OpenAI
- GPT-4o: Arabic, Chinese, English, French, German, Gujarati, Hindi, Italian, Japanese, Korean, Marathi, Persian, Portuguese, Russian, Spanish, Tamil, Telugu, Turkish, Urdu, and Vietnamese are mentioned. (Source)
Amazon Bedrock
- Claude 3.5 Sonnet: “English, Spanish, Japanese, and multiple other languages” (Source)
- Claude 3 Opus: “English, Spanish, Japanese, and multiple other languages” (Source)
- Claude 3 Haiku: “English, Spanish, Japanese, and multiple other languages” (Source)
- Claude 3 Sonnet: “English, Spanish, Japanese, and multiple other languages” (Source)
- Amazon Nova Pro: “200+ languages” (Source)
- Amazon Nova Lite: “200+ languages” (Source)
- Meta Llama 3.2 90B Instruct: “English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai” (Source)
- Meta Llama 3.2 11B Instruct: “English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai” (Source)
