Unstructured Platform
The platform is currently in private beta. Click here to join the waitlist.
What We Do
The Unstructured Platform is a no-code platform for transforming unstructured data into RAG-ready data.
To get your data RAG-ready, our platform moves it through the following process:
Connect
We offer multiple Source Connectors, so we can connect to your data in its existing location.
Route
Routing determines which strategy we employ to transform your document into our canonical JSON schema. There are three Partitioning Strategies for document transformation: fast, hires, and ocr_only. The fast strategy works well when extractable text is available, as in HTML files or Microsoft Office documents. The hires strategy is best for PDFs and tables, and wherever accurate classification of document elements is critical. The ocr_only strategy is useful for image-based files or PDFs that do not have extractable text. If you’re unsure, select auto and we’ll handle the decision for you.
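As a rough illustration of the kind of decision auto makes, here is a minimal Python sketch. It is not the platform's actual routing logic: the function name choose_strategy, the extension groups, and the has_extractable_text flag are all assumptions made for this example. Only the strategy names come from the description above.

```python
from pathlib import Path

# Hypothetical extension groups, chosen for illustration only.
TEXT_NATIVE = {".html", ".htm", ".docx", ".pptx", ".xlsx"}
IMAGE_BASED = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def choose_strategy(filename: str, has_extractable_text: bool) -> str:
    """Pick a partitioning strategy name for one file (illustrative heuristic)."""
    suffix = Path(filename).suffix.lower()

    # Image-based files have no text layer, so OCR is the only option.
    if suffix in IMAGE_BASED:
        return "ocr_only"

    # PDFs without extractable text need OCR; other PDFs get the
    # layout-aware strategy, which the text recommends for PDFs and tables.
    if suffix == ".pdf":
        return "hires" if has_extractable_text else "ocr_only"

    # HTML and Office formats with a usable text layer take the fast path.
    if suffix in TEXT_NATIVE and has_extractable_text:
        return "fast"

    # Anything else falls back to the most thorough strategy.
    return "hires"
```

For example, choose_strategy("report.html", True) selects fast, while a scanned PDF with no text layer is routed to ocr_only.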
Transform
Your source document is transformed to our canonical JSON schema. Irrespective of the input document, the JSON schema we provide gives you a standardized output to code against. It contains 20+ elements, such as Header, Footer, Title, NarrativeText, Table, Image, and more. Each document is wrapped in extensive metadata so you can understand languages, file_type, source, hierarchy, and much more.
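To show what coding against a standardized output can look like, the sketch below parses a small, made-up payload in Python. The payload is illustrative only: the field names type, text, and metadata, and every value shown, are assumptions for this example rather than real platform output. The element names and the languages and file_type metadata keys are taken from the description above.

```python
import json
from collections import Counter

# Made-up payload for illustration; not actual platform output.
SAMPLE_OUTPUT = """
[
  {"type": "Title", "text": "Quarterly Report",
   "metadata": {"file_type": "application/pdf", "languages": ["eng"]}},
  {"type": "NarrativeText", "text": "Revenue grew in the third quarter.",
   "metadata": {"file_type": "application/pdf", "languages": ["eng"]}},
  {"type": "Table", "text": "Region Revenue North 10 South 12",
   "metadata": {"file_type": "application/pdf", "languages": ["eng"]}},
  {"type": "Footer", "text": "Page 1",
   "metadata": {"file_type": "application/pdf", "languages": ["eng"]}}
]
"""


def load_elements(raw: str) -> list:
    """Parse the JSON array of elements."""
    return json.loads(raw)


def count_by_type(elements: list) -> Counter:
    """Count how many elements of each type the document contains."""
    return Counter(el["type"] for el in elements)


def body_text(elements: list, skip=("Header", "Footer")) -> str:
    """Join element text, leaving out page furniture such as headers and footers."""
    return "\n".join(el["text"] for el in elements if el["type"] not in skip)
```

Because every input format maps to the same element types, helpers like body_text can be written once and reused across PDFs, HTML pages, and Office documents.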
Chunk
The platform initially comes with two chunking strategies:
- Basic: Combines sequential elements up to specified size limits. Oversized elements are split, while tables are isolated and divided if necessary. Overlap between chunks is optional.
- By Title: Semantic chunking that understands the layout of the document and makes intelligent splits.
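The Basic strategy described above can be approximated in a few lines of Python. This is a simplified sketch, not the platform's implementation: the function names, the character-based size limit, the blank-line separator, and the choice to apply overlap only when splitting an oversized element are all assumptions made for the example.

```python
def split_text(text: str, max_chars: int, overlap: int = 0) -> list:
    """Split one oversized text into pieces of at most max_chars characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the piece just added reached the end of the text
        start += max_chars - overlap
    return pieces


def basic_chunks(elements: list, max_chars: int = 500, overlap: int = 0) -> list:
    """Combine sequential elements into chunks no longer than max_chars."""
    chunks = []
    buffer = ""

    def flush():
        nonlocal buffer
        if buffer:
            chunks.append(buffer)
            buffer = ""

    for el in elements:
        text = el["text"]
        if el["type"] == "Table":
            # Tables are isolated: never merged with neighbouring elements,
            # and divided only if they exceed the limit.
            flush()
            chunks.extend(split_text(text, max_chars))
        elif len(text) > max_chars:
            # Oversized elements are split on their own, with optional overlap.
            flush()
            chunks.extend(split_text(text, max_chars, overlap))
        else:
            candidate = f"{buffer}\n\n{text}" if buffer else text
            if len(candidate) > max_chars:
                flush()
                buffer = text
            else:
                buffer = candidate
    flush()
    return chunks
```

A By Title variant would additionally start a new chunk whenever a Title element appears, so that chunks follow the document's section boundaries.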
Embed
The platform calls out to third-party embedding providers: OpenAI, AWS Bedrock, and OctoML.
Persist
We offer multiple Destination Connectors, including all major vector databases.
How We Do It
To simplify this process and provide it as a no-code solution, the platform is built around four key concepts:
- Source Connectors to ingest your data.
- Destination Connectors tell our system where to write your transformed data to.
- Workflows connect sources to destinations and provide chunking, embedding, and scheduling options.
- Jobs allow you to monitor data transformation progress.
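One way to picture how these four concepts relate is as a small data model. The Python sketch below is purely illustrative: the platform is no-code, and every class name, field, and default value here is an assumption made for this example, not a published API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class SourceConnector:
    name: str
    location: str  # where the data already lives (hypothetical field)


@dataclass
class DestinationConnector:
    name: str
    location: str  # where transformed data is written (hypothetical field)


@dataclass
class Workflow:
    """Connects a source to a destination and carries the processing options."""
    source: SourceConnector
    destination: DestinationConnector
    chunking: str = "basic"
    embedding_provider: Optional[str] = None
    schedule: str = "manual"


@dataclass
class Job:
    """One run of a workflow, used to monitor transformation progress."""
    workflow: Workflow
    status: JobStatus = JobStatus.SCHEDULED
    documents_total: int = 0
    documents_done: int = 0

    def progress(self) -> float:
        """Fraction of documents processed so far, between 0.0 and 1.0."""
        if self.documents_total == 0:
            return 0.0
        return self.documents_done / self.documents_total
```

In this model a workflow owns the configuration, while each job is a single execution of that workflow whose progress can be tracked separately.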
Compliance
The platform is designed for global reach, with SOC 2 Type 2 compliance and support for over 50 languages.
Sign-Up
You can sign up for our private beta here.