Overview
The Unstructured Platform is a pay-as-you-go platform with a no-code user interface for transforming your unstructured data into data that is ready for Retrieval Augmented Generation (RAG).
How does it work?
To get your data RAG-ready, the Unstructured Platform moves it through the following process:
Connect
The Unstructured Platform offers multiple source connectors to connect to your data in its existing location.
Route
Routing determines which strategy the Unstructured Platform uses to transform your documents into Unstructured’s canonical JSON schema. The Unstructured Platform provides these partitioning strategies for document transformation:
- Fast is ideal for simple, text-only documents.
- High Res is best for PDFs, images, and complex file types. During High Res processing, any detected text-based files are processed and billed at the Fast rate instead.
- VLM is for challenging documents, including scanned and handwritten content. During VLM processing, any detected files that are not PDFs or images are processed and billed at a different rate instead: text-based files at the Fast rate, and all other such files at the High Res rate.
- Auto automatically analyzes and processes files on a page-by-page basis for PDF files and on a document-by-document basis for everything else:
  - If the page or document has no images and likely does not have tables, Fast partitioning is used, and the page or document is billed at the Fast rate for processing.
  - If the page or document has only a few tables or images with standard layouts and languages, High Res partitioning is used, and the page or document is billed at the High Res rate for processing.
  - If the page or document has more than a few tables or images, VLM partitioning is used, and the page or document is billed at the VLM rate for processing.
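The routing and billing rules above can be sketched in a few lines of Python. This is an illustrative model only, not the platform's implementation: the `Unit` class, the `kind` labels, and the `FEW` threshold are all hypothetical (the rules above say only "a few"), and the "standard layouts and languages" condition is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical threshold: the documentation only says "a few" tables or images.
FEW = 3


@dataclass
class Unit:
    """A PDF page or a whole non-PDF document (illustrative model only)."""
    kind: str        # assumed labels: "pdf", "image", "text", or "other"
    tables: int = 0
    images: int = 0


def fixed_rate(strategy: str, unit: Unit) -> str:
    """Rate applied when Fast, High Res, or VLM is selected explicitly."""
    if strategy == "fast":
        return "fast"
    if unit.kind == "text":
        # Text-based files are billed at the Fast rate under High Res and VLM.
        return "fast"
    if strategy == "vlm" and unit.kind not in ("pdf", "image"):
        # Other non-PDF, non-image files are billed at the High Res rate under VLM.
        return "high_res"
    return strategy


def auto_rate(unit: Unit) -> str:
    """Rate applied per page (PDF) or per document (everything else) under Auto."""
    count = unit.tables + unit.images
    if count == 0:
        return "fast"
    if count <= FEW:
        return "high_res"
    return "vlm"
```

For example, under these assumptions `fixed_rate("vlm", Unit("text"))` resolves to the Fast rate, while `auto_rate(Unit("pdf", tables=2))` resolves to the High Res rate.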
Transform
Your source document is transformed into Unstructured’s canonical JSON schema. Regardless of the input document, this JSON schema gives you a standardized output. It contains more than 20 elements, such as Header, Footer, Title, NarrativeText, Table, Image, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
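To make the shape of this output more concrete, here is a minimal Python sketch that reads a small, made-up sample and groups it by element type. The field names (`type`, `element_id`, `text`, `metadata`) and the metadata keys are assumptions modeled on the output of Unstructured's open-source library; the exact schema produced by the platform may differ.

```python
import json
from collections import Counter

# Made-up sample for illustration; field names are assumptions, not a schema reference.
sample = """
[
  {"type": "Title", "element_id": "a1", "text": "Quarterly Report",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"], "page_number": 1}},
  {"type": "NarrativeText", "element_id": "b2", "text": "Revenue grew this quarter.",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"], "page_number": 1}},
  {"type": "Table", "element_id": "c3", "text": "Region Revenue North 10 South 12",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"], "page_number": 2}}
]
"""

elements = json.loads(sample)

# Count how many elements of each type the document produced.
counts = Counter(element["type"] for element in elements)

# Keep only the body text, for example to feed a chunker.
narrative = [e["text"] for e in elements if e["type"] == "NarrativeText"]

# Collect the pages on which tables were found.
table_pages = [e["metadata"]["page_number"] for e in elements if e["type"] == "Table"]
```

Because every element carries the same top-level fields regardless of the source file type, downstream code can filter and group elements without knowing whether the input was a PDF, an image, or a text file.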
Chunk
The Unstructured Platform provides these chunking strategies:
- Basic combines sequential elements up to specified size limits. Oversized elements are split, while tables are isolated and divided if necessary. Overlap between chunks is optional.
- By Title uses semantic chunking, understands the layout of the document, and makes intelligent splits.
- By Page attempts to preserve page boundaries when determining the chunks’ contents.
- By Similarity uses an embedding model to identify topically similar sequential elements and combines them into chunks.
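As a rough illustration of the Basic strategy described above, the following Python sketch combines sequential elements up to a size limit, splits oversized elements, and keeps tables in their own chunks. It is a simplified approximation rather than the platform's algorithm: the `Element` tuple, the `max_chars` parameter, and the character-based splitting are assumptions, and optional overlap is omitted.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text); an illustrative stand-in


def split_text(text: str, max_chars: int) -> List[str]:
    """Cut text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def basic_chunks(elements: List[Element], max_chars: int = 500) -> List[str]:
    """Combine sequential elements into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for kind, text in elements:
        # Tables and oversized elements are never merged with their neighbors.
        if kind == "Table" or len(text) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_text(text, max_chars))
            continue
        candidate = f"{current}\n\n{text}" if current else text
        if len(candidate) > max_chars:
            # Adding this element would exceed the limit: close the chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With a `max_chars` of 20, a short title and a short paragraph are merged into one chunk, a 25-character table is isolated and divided into two chunks, and the text that follows the table starts a new chunk.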
Enrich
Images and tables can be optionally summarized. This generates enriched content around the images or tables that were parsed during the transformation process.
Embed
The Unstructured Platform can optionally generate embeddings by using third-party embedding providers such as OpenAI.
Persist
The Unstructured Platform offers multiple destination connectors, including all major vector databases.
To simplify this process and provide it as a no-code solution, the Unstructured Platform brings together these key concepts:
Source Connectors
Source connectors ingest your data into the Unstructured Platform for transformation.
Destination Connectors
Destination connectors tell the Unstructured Platform where to write your transformed data.
Workflow
A workflow connects sources to destinations and provides chunking, embedding, and scheduling options.
Jobs
Jobs enable you to monitor data transformation progress.
What support is there for compliance?
The platform is designed for global reach, with SOC 2 Type 1, SOC 2 Type 2, and HIPAA compliance. It supports more than 50 languages.
How do I get started?
Skip ahead to the quickstart.
How do I get help?
Contact us directly, or join our Slack community.
For enterprise support, email Unstructured Sales at sales@unstructured.io.