Unstructured concepts overview

Document elements and metadata

Document elements are the structured pieces of content Unstructured extracts from a file: a title, a paragraph, a table, an image, and so on. Each element also carries metadata that describes its origin and processing context. Preserving semantic structure lets you filter, route, and transform content by type, rather than treating a document as a single blob of text. Metadata gives your downstream applications the context they need to trace results back to their source. Learn about document elements and metadata.
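As a minimal sketch of filtering by type and tracing results back to their source, the example below treats elements as plain dictionaries. The field names (`type`, `text`, `metadata`, `filename`, `page_number`), the sample values, and the helper functions are illustrative assumptions, not a definitive description of Unstructured's output.

```python
from collections import defaultdict

# Hypothetical sample elements; field names and values are assumptions for illustration.
elements = [
    {"type": "Title", "text": "Quarterly report",
     "metadata": {"filename": "report.pdf", "page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter.",
     "metadata": {"filename": "report.pdf", "page_number": 1}},
    {"type": "Table", "text": "Region Revenue North 10 South 12",
     "metadata": {"filename": "report.pdf", "page_number": 2}},
]


def group_by_type(items):
    """Group elements by their type so each kind can be routed separately."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["type"]].append(item)
    return dict(grouped)


def source_of(element):
    """Build a human-readable source reference from an element's metadata."""
    meta = element["metadata"]
    return f'{meta["filename"]}, page {meta["page_number"]}'


grouped = group_by_type(elements)
tables = grouped.get("Table", [])
table_sources = [source_of(table) for table in tables]
```

Here `table_sources` lists where each table came from, which is the kind of context a downstream application could show next to a search result.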

Partitioning

Partitioning is the first processing step. It reads a raw file and extracts its content as structured document elements. Unstructured supports multiple partitioning strategies, ranging from fast, rule-based text extraction to vision language model (VLM) processing for complex layouts, handwriting, and multilingual content. Choosing the right strategy lets you balance speed, cost, and output quality for your use case. Learn about partitioning.

Structured data extraction

The structured data extractor lets you define a target JSON schema and have Unstructured extract values from your documents directly into that shape. This is especially useful for recurring document types such as forms, invoices, or contracts, where you need specific fields in a consistent, application-ready format. Learn about structured data extraction.
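To make the idea of a target schema concrete, here is a sketch of what a JSON schema for an invoice might look like, together with a small check that an extracted record contains the required fields. The schema, the field names, the sample record, and the `missing_required` helper are all hypothetical. They are not taken from Unstructured's documentation.

```python
import json

# Hypothetical target schema for an invoice, written in JSON Schema style.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
}


def missing_required(schema, record):
    """Return the required field names that are absent from the record."""
    return [name for name in schema["required"] if name not in record]


# Hypothetical extractor output, parsed from a JSON string.
extracted = json.loads('{"invoice_number": "INV-001", "total": 125.5}')
missing = missing_required(invoice_schema, extracted)
```

A check like this only confirms that required keys are present. Validating types and nested structure would need a full JSON Schema validator.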

Chunking

Chunking reorganizes partitioned elements into manageable pieces sized for embedding models and optimized for retrieval precision. Instead of embedding entire documents, you embed smaller chunks, so each piece of retrieved content stays focused and relevant to a user's query. Unstructured offers several chunking strategies (by character count, section, page, or semantic similarity) so you can tune chunk boundaries to match your content and retrieval goals. Learn about chunking.
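The character-count idea can be sketched as a simple greedy packer: element texts are appended to the current chunk until the next one would exceed the limit. This illustrates the general technique under simplifying assumptions and is not Unstructured's implementation. The function name, the separator, and the choice to keep an oversized element whole are all hypothetical.

```python
def chunk_by_characters(texts, max_characters):
    """Greedily pack element texts into chunks of at most max_characters.

    A single text longer than the limit is kept whole as its own chunk;
    a real chunker would likely split it further.
    """
    chunks = []
    current = ""
    for text in texts:
        candidate = text if not current else current + "\n\n" + text
        if len(candidate) <= max_characters or not current:
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


texts = ["Intro", "First paragraph.", "Second paragraph.", "End"]
chunks = chunk_by_characters(texts, max_characters=25)
# chunks == ["Intro\n\nFirst paragraph.", "Second paragraph.\n\nEnd"]
```

Because chunk boundaries fall only between elements, no element is cut in the middle. That is one reason to chunk partitioned elements rather than raw text.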

Enriching

Enriching adds AI-generated enhancements to processed elements. Enrichments include image descriptions, table descriptions, table-to-HTML conversion, named entity recognition (NER), and generative OCR for improving text accuracy in complex documents. Enrichment gives your downstream applications richer, more useful data from content that would otherwise be hard to work with, such as images, handwritten text, or dense tables. Learn about enriching.

Embedding

Embedding converts extracted text into numeric vectors using an embedding model. These vectors capture semantic meaning and are stored alongside the text, ready to be loaded into a vector store. Vector embeddings power similarity search in RAG applications. When a user submits a query, the application embeds the query, finds the chunks whose embeddings are closest to the query's embedding, and returns the most relevant results. Learn about embedding.
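The similarity search described above can be sketched with cosine similarity over toy vectors. The three-dimensional embeddings, the chunk texts, and the `top_k` helper are made up for illustration. Real embeddings come from an embedding model and have many more dimensions, and a vector store would normally perform this ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunks, k):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk["embedding"]),
        reverse=True,
    )
    return [chunk["text"] for chunk in ranked[:k]]


# Toy data: hypothetical chunks with hand-written embeddings.
chunks = [
    {"text": "Refund policy", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Shipping times", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Returns and refunds", "embedding": [0.8, 0.2, 0.1]},
]
query_vector = [1.0, 0.0, 0.0]
results = top_k(query_vector, chunks, k=2)
# results == ["Refund policy", "Returns and refunds"]
```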

Examples

For real-world examples of Unstructured output across a range of document types, see More examples.
