Task

Approach

To run this example

Code

Extract images and tables from documents

Unstructured

Overview

Supported file types

Supported connectors

Quickstarts

Unstructured UI quickstarts

Workflows

Jobs

Document elements and metadata

Partitioning

Chunking

Embedding

POST request

Process an individual file by making a direct POST request

Python SDK

JavaScript/TypeScript SDK

Endpoint parameters

Partition Endpoint parameters

This section details the structure of HTTP validation errors returned by the Unstructured Partition Endpoint.

Endpoint validation errors

This page provides some examples of accessing the Unstructured Partition Endpoint via different methods.

Examples

Partitioning strategies

Chunking strategies

Speed up processing of large files and batches

Get element contents

Extract tables as HTML

Get chunked elements

Transform a JSON file into a different schema

Generate a JSON schema for a file

Pipeline

Follow these steps to deploy the Unstructured API service into your AWS account.

Unstructured API on AWS

Unstructured API on Azure

API keys and URLs

Troubleshooting Unstructured API keys and URLs

Notebooks contain complete working sample code for end-to-end solutions.

Notebooks

Vector database ingestion

Delta Table source connector

Multi-file API processing

This section describes two methods for extracting tables from PDF files.

Table extraction from PDF

Query processed PDF with HuggingChat

Langflow

VectorShift

MCP for Workflow Endpoint

Model Context Protocol (MCP) Hands-On Walkthrough for the Unstructured Workflow Endpoint

MCP for Partition Endpoint

Model Context Protocol (MCP) Hands-On Walkthrough for the Unstructured Partition Endpoint

Streamlit in Snowflake

CrewAI

Ingest CLI

Process files in batches by using the Unstructured Ingest CLI

Ingest Python library

Process files in batches by using the Unstructured Ingest Python library

Ingest dependencies

Unstructured Open Source

Quickstart

The `unstructured` library includes functions to partition, chunk, clean, and stage raw source documents. These functions serve as the primary public interfaces within the library.

Core Functionality

Partitioning functions in `unstructured` allow users to extract structured content from a raw unstructured document. These functions break a document down into elements such as `Title`, `NarrativeText`, and `ListItem`, enabling users to decide what content they’d like to keep for their particular application. If you’re training a summarization model, for example, you may only be interested in `NarrativeText`.

As part of data preparation for an NLP model, it’s common to need to clean up your data prior to passing it into the model. If there’s unwanted content in your output, for example, it could impact the quality of your NLP model. To help with this, the `unstructured` library includes cleaning functions to help users sanitize output before sending it to downstream applications.

Cleaning

Extracting

Staging

Chunking functions in `unstructured` use metadata and document elements detected with `partition` functions to post-process elements into more useful "chunks" for use cases such as Retrieval Augmented Generation (RAG).

Set the OCR agent

Depending on your needs, `Unstructured` provides OCR-based and Transformer-based models for detecting elements in documents. These models are useful for detecting complex document layouts and predicting element types.

Models

Integrate your model development pipeline with your favorite machine learning frameworks and libraries, and prepare your data for ingestion into downstream systems. Most of our integrations come in the form of [staging functions](/open-source/core-functionality/staging), which take a list of `Element` objects as input and return formatted dictionaries as output.

Integrations

Self-hosting Unstructured

Security and compliance overview

Self-hosting Unstructured on Amazon Web Services (AWS)

Onboarding

AWS self-hosted onboarding checklist

Self-hosting Unstructured on Azure

Azure self-hosted onboarding checklist

Self-hosting Unstructured on Google Cloud Platform (GCP)

GCP self-hosted onboarding checklist

Frequently asked questions

The Unstructured open source library offers flexible installation options.

Full Installation

The instructions below show how to use the `unstructured` library inside a Docker container.

Docker Installation

Unstructured UI quickstart

Azure

Confluence

Couchbase

Databricks Volumes

Dropbox

Elasticsearch

Google Cloud Storage

Google Drive

Jira

Kafka

MongoDB

OneDrive

Outlook

PostgreSQL

Salesforce

SharePoint

Snowflake

Zendesk

Astra DB

Azure AI Search

Delta Tables in Amazon S3

Delta Tables in Databricks

Milvus

MotherDuck

Neo4j

Pinecone

Qdrant

Redis

Weaviate

Accounts overview

Billing

Account billing

Usage

Account usage

Account API keys and API URLs

Image descriptions

Table descriptions

Tables to HTML

Named entity recognition (NER)

Local

Platform Endpoint parameters

Free Unstructured API

Processor configuration

Read configuration

Partition configuration

Permissions configuration

Retry strategy configuration

Chunking configuration

Embedding configuration

Fsspec configuration

Airtable

Biomed

Delta Table

Discord

GitHub

GitLab

HubSpot

Notion

OpenSearch

Reddit

SFTP

SingleStore

Slack

SQLite

Wikipedia

Chroma

Clarifai

DuckDB

KDB.AI

LanceDB

Vectara

This page provides some examples of accessing Unstructured by using the Unstructured Ingest CLI and the Unstructured Ingest Python library.

Process a subset of files

Set embedding behavior

Ingestion
