Notebooks

Historical research about MLK with the Unstructured API

This notebook explores how you can use Unstructured to gather and process declassified historical records surrounding the assassination of Dr. Martin Luther King, Jr. These processed documents can then be analyzed by using Elasticsearch and RAG.
Unstructured API Workflows S3 VLM NER Elasticsearch MLK National Archives

Create a S3 to Qdrant Pipeline using the Unstructured API

This notebook walks through using the Unstructured Workflow Endpoint to set up a complete pipeline that pulls documents from S3, processes them using Unstructured, and stores the resulting embeddings in Qdrant for fast vector search and retrieval.
Unstructured API Workflows S3 Qdrant VLM Embeddings

Create a S3 to MongoDB Pipeline using the Unstructured API

Learn how to build an end-to-end pipeline that processes PDFs from S3 and stores the structured results in MongoDB. The pipeline features VLM-powered partitioning, semantic chunking, and vector embeddings using the Unstructured Workflows API.
Unstructured API Workflows S3 MongoDB VLM Embeddings

Getting Started with Unstructured API and IBM watsonx.data

Learn how to create data processing workflows with the Unstructured API and its Python SDK to preprocess unstructured data from Azure Blob Storage and load it into your IBM watsonx.data instance.
Unstructured API Workflows Azure Blob Storage IBM watsonx.data

Using Unstructured with Snowflake Cortex Search for RAG

Use Snowflake Cortex and RAG to do natural-language searches across a Snowflake table that contains data provided by Unstructured. Additional Snowflake Cortex functions are also explored.
Unstructured API Snowflake Cortex RAG Search Workflows S3

Agentic RAG with LangGraph and Together AI

Build agentic RAG with LangGraph and Together AI, and compare the results with vanilla RAG in pure Python.
Unstructured API Workflows Agents LangGraph Together AI Astra DB

Getting Started with Unstructured API and Snowflake

Learn how to create data processing workflows with the Unstructured API and its Python SDK to preprocess unstructured data from Azure Blob Storage and load it into a Snowflake table.
Unstructured API Workflows Azure Blob Storage Snowflake

Building Graph-Based RAG Applications

Learn how to use the Unstructured API to create a Graph RAG-based workflow that writes data enriched with named entity recognition (NER) to Astra DB.
Unstructured API Workflows Graph RAG NER Astra DB

Getting Started with Unstructured API and Delta Tables in Databricks

Learn how to create data processing workflows with the Unstructured API and its Python SDK to preprocess unstructured data and load it into a Delta table.
Unstructured API Workflows Databricks S3

RAG for Online Documentation

Crawl websites with Firecrawl and build a RAG workflow powered by Unstructured and MongoDB Atlas vector search.
Unstructured API Workflows MongoDB

Unstructured Workflow Endpoint Quickstart

Build an end-to-end workflow in Unstructured programmatically by using the Unstructured Workflow Endpoint.
Unstructured API Workflows S3

RAG with Databricks Vector Search with Context from Multiple Sources

Build RAG with Databricks Vector Search, using context that Unstructured preprocesses from multiple sources.
Databricks Introductory notebook

Agentic RAG with Hugging Face smolagents vs Vanilla RAG

Build agentic RAG with the smolagents library, and compare the results with vanilla RAG in pure Python.
GPT-4o smolagents Agents DataStax S3 Advanced notebook

LLama3.2 RAG evaluation on unstructured text

Evaluate Llama 3.2 for your RAG system with Unstructured, GPT-4o, Ragas, and LangChain.
GPT-4o Ragas LangChain Llama3.2 Pinecone S3 Advanced notebook

Multimodal RAG: Enhancing RAG outputs with image results

Process a file in S3 with Unstructured and return images in your RAG output.
S3 FAISS GPT-4o-mini Advanced notebook

Quantitative Reasoning with tables inside PDFs

From pixels to insights: extract and visualize table data with Unstructured and Hex.
Unstructured API Hex Advanced notebook

PII removal with GLiNER in unstructured data ETL

Remove personally identifiable information (PII) as part of unstructured data preprocessing.
Unstructured API PII GLiNER Advanced notebook

Custom metadata extraction and self-querying retrieval

Extract custom metadata and enable metadata pre-filtering in your RAG system.
Unstructured API MongoDB Metadata Advanced notebook

Selecting an embedding model for custom data

An end-to-end data processing pipeline that uses the Unstructured Serverless API.
Unstructured API Hugging Face Advanced notebook

RAG with PDFs, LangChain and Llama 3

A RAG system with the Llama 3 model from Hugging Face.
Unstructured API 🤗 Hugging Face LangChain Llama 3 Introductory notebook

Unstructured data ETL from S3 to SingleStore DB

Learn to ingest, partition, chunk, embed, and load data from an S3 bucket into SingleStore DB.
Unstructured API SingleStoreDB AWS S3 Introductory notebook

Google Drive to DataStax Astra DB

Embed your Google Drive docs in an Astra DB vector database with the Unstructured Serverless API.
Unstructured API Google DataStax Introductory notebook

Weaviate RAG quickstart

Embed your local documents in a Weaviate vector database with the Unstructured Serverless API.
Unstructured API OpenAI Weaviate Introductory notebook

Preprocess PDFs in AWS S3, load into Elasticsearch

Ingest PDF documents from an S3 bucket, transform them into normalized JSON with the Unstructured Serverless API, then chunk, embed, and load them into Elasticsearch.
Unstructured API AWS S3 Elasticsearch Introductory notebook

Preprocess documents in Google Drive, load into Databricks Volume

Preprocess documents from Google Drive with the Unstructured Serverless API and load them into a Databricks volume.
Unstructured API Google Drive Databricks Introductory notebook

Source references in RAG responses

Add document source references to RAG responses based on document metadata.
Unstructured API RAG LangChain Intermediate notebook

Query processed PDF with HuggingChat

Send a PDF to Unstructured for processing, and send a subset of the returned PDF’s processed text to HuggingChat for chatbot-style querying.
Unstructured API 🤗 Hugging Face 🤗 HuggingChat Introductory notebook

Llama 3 Local RAG with emails

Build a local RAG app for your emails with Unstructured, LangChain, and Ollama.
Unstructured API LangChain Ollama Llama 3 Introductory notebook

Building RAG With PowerPoint presentations

A RAG solution that is based on PowerPoint files.
Unstructured API 🤗 Hugging Face LangChain Llama 3 Introductory notebook

Synthetic test dataset generation

Build a synthetic test dataset for your RAG system in 5 easy steps.
Unstructured API GPT-4o Ragas LangChain Advanced notebook
