Unstructured Platform
What is the Unstructured Platform?
The Unstructured Platform is a pay-as-you-go platform with a no-code user interface for transforming your unstructured data into data that is ready for Retrieval Augmented Generation (RAG).
How does it work?
To get your data RAG-ready, the Unstructured Platform moves it through the following process:
Connect
The Unstructured Platform offers multiple source connectors to connect to your data in its existing location.
Route
Routing determines which strategy the Unstructured Platform uses to transform your documents into Unstructured’s canonical JSON schema. There are four partitioning strategies for document transformation: Fast, Hi Res, OCR Only, and Auto. Of these:
- Fast works well when extractable text is available, such as in HTML files or Microsoft Office documents.
- Hi Res is best for PDFs and tables, and wherever accurate classification of document elements is critical.
- OCR Only is useful when dealing with image-based files or PDFs that do not have extractable text.
- If you’re unsure which strategy to use, choose Auto, and the Unstructured Platform will handle the decision for you.
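The guidance above can be sketched as a small decision function. This is only an illustration of the rules of thumb in the list, not the logic the platform's Auto strategy actually applies. The `DocumentInfo` class, the file-extension sets, and `pick_strategy` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical file-type groupings, chosen only for illustration.
TEXT_EXTRACTABLE = {".html", ".htm", ".docx", ".pptx", ".xlsx"}
IMAGE_BASED = {".png", ".jpg", ".jpeg", ".tiff"}


@dataclass
class DocumentInfo:
    extension: str              # file extension, e.g. ".pdf"
    has_extractable_text: bool  # True if a text layer can be read directly


def pick_strategy(doc: DocumentInfo) -> str:
    """Map a document to one of the four strategy names from the list above."""
    ext = doc.extension.lower()
    # Image-based files and PDFs without extractable text need OCR.
    if ext in IMAGE_BASED or (ext == ".pdf" and not doc.has_extractable_text):
        return "OCR Only"
    # PDFs with a text layer benefit from layout-aware processing.
    if ext == ".pdf":
        return "Hi Res"
    # Formats with readily extractable text can take the fast path.
    if ext in TEXT_EXTRACTABLE and doc.has_extractable_text:
        return "Fast"
    # Anything else: defer the decision to the platform.
    return "Auto"


if __name__ == "__main__":
    for doc in (DocumentInfo(".html", True), DocumentInfo(".pdf", False)):
        print(doc.extension, "->", pick_strategy(doc))
```

In practice, choosing Auto in the platform makes a sketch like this unnecessary; it is shown only to make the trade-offs between the strategies concrete.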
Transform
Your source document is transformed into Unstructured’s canonical JSON schema. Regardless of the input document, this JSON schema gives you a standardized output. It contains more than 20 elements, such as Header, Footer, Title, NarrativeText, Table, Image, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
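Because every document ends up in the same shape, downstream code can treat the output uniformly. The sketch below assumes each element is a JSON object with `type`, `element_id`, `text`, and `metadata` fields. The sample data, the specific metadata keys, and the helper names are illustrative assumptions, not real platform output.

```python
import json
from collections import Counter

# Illustrative sample only; field names and values are assumptions.
SAMPLE_OUTPUT = """
[
  {"type": "Title", "element_id": "a1", "text": "Quarterly Report",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"]}},
  {"type": "NarrativeText", "element_id": "a2", "text": "Revenue grew.",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"]}},
  {"type": "NarrativeText", "element_id": "a3", "text": "Costs fell.",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"]}},
  {"type": "Table", "element_id": "a4", "text": "Q1 10 Q2 12",
   "metadata": {"filename": "report.pdf", "filetype": "application/pdf",
                "languages": ["eng"]}}
]
"""


def count_by_type(elements):
    """Count how many elements of each type the document contains."""
    return Counter(el["type"] for el in elements)


def text_of_type(elements, element_type):
    """Collect the text of every element with the given type, in order."""
    return [el["text"] for el in elements if el["type"] == element_type]


elements = json.loads(SAMPLE_OUTPUT)

if __name__ == "__main__":
    print(count_by_type(elements))
    print(text_of_type(elements, "NarrativeText"))
```

The same two helpers would work unchanged whether the source was a PDF, an HTML page, or an image, which is the practical benefit of a standardized schema.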
Chunk
Initially, the Unstructured Platform provides two optional chunking strategies: Basic and By Title. Of these:
- Basic combines sequential elements up to specified size limits. Oversized elements are split, while tables are isolated and divided if necessary. Overlap between chunks is optional.
- By Title uses semantic chunking, understands the layout of the document, and makes intelligent splits.
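To make the Basic strategy more concrete, here is a minimal sketch of the behavior described above: sequential elements are combined up to a size limit, oversized elements are split, and tables are kept in their own chunks. It is an approximation written for illustration, not the platform's implementation. The function names, the `max_chars` and `overlap` parameters, and the element shape (`type` and `text` keys) are assumptions, and for simplicity the overlap here applies only when an oversized element is split.

```python
from typing import Dict, List

Element = Dict[str, str]  # assumed shape: {"type": ..., "text": ...}


def split_text(text: str, max_chars: int, overlap: int = 0) -> List[str]:
    """Split text into pieces of at most max_chars characters.

    Each piece after the first repeats the last `overlap` characters
    of the previous piece.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    pieces: List[str] = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last piece reached the end of the text
        start += step
    return pieces


def basic_chunk(elements: List[Element], max_chars: int = 500,
                overlap: int = 0) -> List[str]:
    """Combine sequential elements into chunks of at most max_chars."""
    chunks: List[str] = []
    buffer = ""

    def flush() -> None:
        nonlocal buffer
        if buffer:
            chunks.append(buffer)
            buffer = ""

    for el in elements:
        text = el["text"]
        if el["type"] == "Table" or len(text) > max_chars:
            # Tables are isolated; oversized elements are split.
            flush()
            chunks.extend(split_text(text, max_chars, overlap))
            continue
        candidate = f"{buffer}\n\n{text}" if buffer else text
        if len(candidate) > max_chars:
            # Adding this element would exceed the limit: start a new chunk.
            flush()
            buffer = text
        else:
            buffer = candidate
    flush()
    return chunks
```

A By Title variant could follow the same pattern but also call `flush()` whenever it encounters a `Title` element, so that chunks do not span section boundaries. That, too, is only a rough approximation of the layout-aware splitting described above.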
Enrich
Images and tables can be optionally summarized. This generates enriched content around the images or tables that were parsed during the transformation process.
Embed
The Unstructured Platform can optionally generate embeddings through third-party embedding providers such as OpenAI and AWS Bedrock.
Persist
The Unstructured Platform offers multiple destination connectors, including all major vector databases.
To simplify this process and provide it as a no-code solution, the Unstructured Platform brings together four key concepts:
Source Connector
Source connectors ingest your data into the Unstructured Platform for transformation.
Destination Connector
Destination connectors tell the Unstructured Platform where to write your transformed data.
Workflow
Workflows connect sources to destinations and provide chunking, embedding, and scheduling options.
Jobs
Jobs enable you to monitor data transformation progress.
What support is there for compliance?
The platform is designed for global reach, with SOC 2 Type 2 compliance and support for over 50 languages.
How do I get started?
Skip ahead to the quickstart.