To start using the Unstructured open source library right away, skip ahead to the quickstart.
The Unstructured open source library (GitHub, PyPI) is a toolkit
designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents
such as PDFs, HTML files, Word documents, and more. With a focus on optimizing data workflows for Large Language Models (LLMs),
the Unstructured open source library provides modular functions and connectors that work seamlessly together. This cohesive system ensures
efficient transformation of unstructured data into structured formats, while also offering adaptability to various platforms
and use cases.
Precise document extraction: Unstructured extracts elements and metadata from documents, covering a variety of document element types and metadata fields. Learn more about Document elements and metadata.
Robust file support: The library supports a wide array of file types, including PDFs, images, HTML, and many more. Detailed information on supported file types can be found here.
Robust core functionality: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
Partitioning: The partitioning functions in Unstructured enable the extraction of structured content from raw, unstructured documents. This feature is crucial for transforming unorganized data into usable formats, aiding in efficient data processing and analysis.
Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.
Extracting: This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.
Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).
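The four core steps above might be combined as in the following minimal sketch. It assumes the unstructured package is installed and that partition_text, clean_extra_whitespace, extract_email_address, and chunk_by_title are available at the import paths shown; verify these against your installed version. The helper name partition_clean_chunk and the sample text are hypothetical and used only for illustration.

```python
from typing import Dict, List


def partition_clean_chunk(raw_text: str, max_characters: int = 500) -> Dict[str, List[str]]:
    """Partition, clean, extract from, and chunk a plain-text string.

    Hypothetical helper for illustration; it is not part of the library.
    """
    # Imported inside the function so this sketch loads even where the
    # `unstructured` package is not installed (import paths are assumptions
    # to check against your installed version).
    from unstructured.chunking.title import chunk_by_title
    from unstructured.cleaners.core import clean_extra_whitespace
    from unstructured.cleaners.extract import extract_email_address
    from unstructured.partition.text import partition_text

    # Partitioning: split the raw text into document elements.
    elements = partition_text(text=raw_text)

    # Cleaning: collapse runs of whitespace in each element's text.
    for element in elements:
        element.text = clean_extra_whitespace(element.text)

    # Extracting: collect email addresses found in any element.
    emails: List[str] = []
    for element in elements:
        emails.extend(extract_email_address(element.text))

    # Chunking: group elements into chunks of at most max_characters.
    chunks = chunk_by_title(elements, max_characters=max_characters)

    return {"emails": emails, "chunks": [chunk.text for chunk in chunks]}


if __name__ == "__main__":
    sample = "Dear team,   please   contact   admin@example.com for access\n\nThanks for your help"
    print(partition_clean_chunk(sample))
```

Chunking runs last here so that the chunks are built from already-cleaned element text; other orderings may suit other workflows.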
Calls to the Unstructured open source library that are routed to Unstructured’s software-as-a-service (SaaS) for processing (for example, by calling the partition_via_api or partition_multiple_via_api functions with an Unstructured API key and an Unstructured SaaS URL) require an Unstructured account for billing purposes.
Unstructured offers several account types with different pricing plans:
Starter - A single user, with a single workspace, hosted alongside other accounts on Unstructured’s cloud infrastructure.
Team - Multiple users and workspaces, hosted alongside other accounts on Unstructured’s cloud infrastructure.
Enterprise - Multiple users and workspaces, isolated from all other accounts, with two hosting options for additional security and control:
Dedicated instance - Hosted within a virtual private cloud (VPC) running inside Unstructured’s cloud infrastructure.
In-VPC - Hosted within your own VPC on your own cloud infrastructure.
Enterprise accounts also allow for robust customization of Unstructured’s features for your unique needs.
For more details, see the Unstructured Pricing page.
To upgrade your account from Starter to Team, or from Team to Enterprise, email Unstructured Sales at sales@unstructured.io.
Some of these plans have billing details that are determined on a per-page basis. Unstructured calculates a page as follows:
For these file types, a page is a page, slide, or image: .pdf, .pptx, and .tiff.
For .docx files that have page metadata, Unstructured calculates the number of pages based on that metadata.
For all other file types, Unstructured calculates the number of pages as the file’s size divided by 100 KB.
For non-file data, Unstructured calculates a page as 100 KB of incoming data to be processed.
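The page rules above can be illustrated with a short sketch. This is not Unstructured's billing code: the function name estimate_billable_pages and its parameters are hypothetical, the sketch assumes 1 KB means 1,000 bytes (the text does not say whether it is 1,000 or 1,024), it applies no rounding because none is specified, and it treats a .docx file without page metadata as falling under "all other file types".

```python
from pathlib import PurePath
from typing import Optional

# Assumption: the source does not say whether 1 KB is 1,000 or 1,024 bytes.
BYTES_PER_KB = 1000
PAGE_SIZE_BYTES = 100 * BYTES_PER_KB

# File types for which a page is a native page, slide, or image.
NATIVE_PAGE_TYPES = {".pdf", ".pptx", ".tiff"}


def estimate_billable_pages(
    size_bytes: int,
    filename: Optional[str] = None,
    native_page_count: Optional[int] = None,
) -> float:
    """Estimate the page count for a file or for non-file data.

    Hypothetical helper: `native_page_count` is the number of pages, slides,
    or images (or the .docx page metadata), supplied by the caller.
    Pass `filename=None` for non-file data.
    """
    if filename is not None:
        suffix = PurePath(filename).suffix.lower()
        if suffix in NATIVE_PAGE_TYPES:
            if native_page_count is None:
                raise ValueError(f"{suffix} files need a native page count")
            return float(native_page_count)
        # .docx files use their page metadata when it is present.
        if suffix == ".docx" and native_page_count is not None:
            return float(native_page_count)
    # All other file types, and non-file data: size divided by 100 KB.
    # No rounding is applied because the source does not specify any.
    return size_bytes / PAGE_SIZE_BYTES


if __name__ == "__main__":
    print(estimate_billable_pages(5_000_000, "report.pdf", native_page_count=12))
    print(estimate_billable_pages(250_000, "notes.txt"))
    print(estimate_billable_pages(250_000))  # non-file data
```

Under these assumptions, a 250,000-byte .txt file counts as 2.5 pages, while a 12-page PDF counts as 12 pages regardless of its size.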