Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning.

Product offerings

  • Unstructured Platform - No-code UI. Production-ready. Pay as you go.
  • Unstructured Serverless API services - Use scripts or code. Production-ready. Pay as you go. (There is also a non-production, free edition with limits.)

Learn more about these products:

Unstructured Serverless API services


Use scripts or code to call the Unstructured Ingest CLI or Ingest Python library, to get all of your data RAG-ready.

Unstructured Serverless API services have a Serverless pay-as-you-go edition and a Free limited edition, both of which process data on Unstructured-hosted compute resources.

If you need to use compute resources that you host instead, there are also Azure pay-as-you-go and AWS pay-as-you-go editions; these editions process data by using the Unstructured API installed on compute resources hosted in your own Azure or AWS account.

Try the quickstart.

Learn more.

  Read the launch announcement.


Supported file types

Unstructured supports processing of the following file types:

By file extension:

File extension
.bmp
.csv
.doc
.docx
.eml
.epub
.heic
.html
.jpeg
.md
.msg
.odt
.org
.p7s
.pdf
.png
.ppt
.pptx
.rst
.rtf
.tiff
.txt
.tsv
.xls
.xlsx
.xml

By file type:

Category            File types
CSV                 .csv
E-mail              .eml, .msg, .p7s
EPUB                .epub
Excel               .xls, .xlsx
HTML                .html
Image               .bmp, .heic, .jpeg, .png, .tiff
Markdown            .md
Org Mode            .org
Open Office         .odt
PDF                 .pdf
Plain text          .txt
PowerPoint          .ppt, .pptx
reStructuredText    .rst
Rich Text           .rtf
TSV                 .tsv
Word                .doc, .docx
XML                 .xml
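
To pre-filter a local folder before sending files for processing, it can help to check each file name against the extensions listed above. The following is a minimal sketch and is not part of any Unstructured library: the names `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical, and the set simply mirrors the list on this page, so it needs updating whenever that list changes.

```python
from pathlib import Path

# Extensions copied from the list on this page.
# Hypothetical helper data, not an Unstructured API.
SUPPORTED_EXTENSIONS = {
    ".bmp", ".csv", ".doc", ".docx", ".eml", ".epub", ".heic", ".html",
    ".jpeg", ".md", ".msg", ".odt", ".org", ".p7s", ".pdf", ".png",
    ".ppt", ".pptx", ".rst", ".rtf", ".tiff", ".txt", ".tsv", ".xls",
    ".xlsx", ".xml",
}


def is_supported(path):
    """Return True if the file's extension (case-insensitive) is in the list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def supported_files(directory):
    """Return the supported files found anywhere under a directory, sorted."""
    return sorted(
        p for p in Path(directory).rglob("*") if p.is_file() and is_supported(p)
    )
```

For example, `is_supported("report.PDF")` is true because the comparison ignores case, while `is_supported("archive.zip")` is false.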

Quickstart: Unstructured Platform

If you want to use your local machine for either your source (input) files or the destination (output) location where Unstructured delivers the processed data, you cannot use this quickstart. You must run code on your local machine instead: skip to Quickstart: Unstructured Serverless API, later in this article.

This quickstart uses a no-code, point-and-click user interface in your web browser to get all of your data RAG-ready. Data is processed on Unstructured-hosted compute resources.

You will need:

Step 2: Sign in

  1. Use the sign-in URL in the welcome email that Unstructured sends you.
  2. Click Google or GitHub to sign in with your Google or GitHub account. Or, enter your email address and then click Sign In.
  3. If you entered your email address, check your email inbox for a message from Unstructured. In that email, click the Sign In link.
  4. The first time you sign in, read the terms and conditions, and then click Accept.

Step 3: Set the source (input) location

  1. In the sidebar, click Sources.
  2. Click New Source.
  3. In the Type dropdown list, select the source location type that matches yours.
  4. Fill in the rest of the fields with the appropriate settings. Learn more.
  5. Click Save and Test.
  6. Click Close.

Step 4: Set the destination (output) location

  1. In the sidebar, click Destinations.
  2. Click New Destination.
  3. In the Type dropdown list, select the destination location type that matches yours.
  4. Fill in the rest of the fields with the appropriate settings. Learn more.
  5. Click Save and Test.
  6. Click Close.

Step 5: Define the workflow

  1. In the sidebar, click Workflows.

  2. Click New Workflow.

  3. Enter a Name for the new workflow.

  4. In the Connectors section, in the Sources dropdown list, select your source location from Step 3.

  5. In the Destination dropdown list, select your destination location from Step 4.

    You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.
  6. In the Workflow Settings section, choose one of these predefined workflow settings groups:

    • Basic is a good choice if you have text-only documents that have no images or tables in them.
    • Advanced is a good choice if you have complex documents that have images or tables or both in them.

    Learn about the predefined settings for Basic and Advanced.

    If neither the Basic nor Advanced predefined settings meet your needs, click Custom to define different settings. If Custom is not available, click Request Access, and wait for Unstructured to enable it. Learn how to define Custom workflow settings.

  7. If you want to run this workflow on a regular basis, select one of the time periods in the Schedule Type list.

  8. Click Save.

Step 6: Process the documents

  1. If you did not choose to run this workflow on a regular basis in Step 5, you can run the workflow now: on the sidebar, click Jobs.
  2. Click Run Job.
  3. In the Select a Workflow dropdown list, select your workflow from Step 5.
  4. Click Run.

Step 7: Monitor the processing job

  1. In the list of Jobs, click the ID link for the job that you want to monitor.
  2. Wait for the Status to change to Completed.
  3. If Failed at the top of the screen equals 0 (zero), the workflow was fully successful. Go to the next step.
  4. If Failed at the top of the screen equals 1 (one) or greater, the workflow was not fully successful.

Step 8: View the processed data

Go to your destination location to view the processed data.

Learn more about the Unstructured Platform.


Quickstart: Unstructured Serverless API

This quickstart uses your local machine, with the Unstructured Ingest Python library installed. It preprocesses source (input) files on your local machine, and it uses the Unstructured Serverless API to deliver the processed data to a destination (output) location, also on your local machine. Data is processed on Unstructured-hosted compute resources.

You will need:

  • Python installed on your local machine.
  • Compatible files on your local machine to be processed. See the list of supported file types. If you do not have any files available, you can download some from the example-docs folder in the Unstructured repo on GitHub.

Step 1: Sign up

Sign up for a 14-day trial of the Unstructured Serverless API at https://app.unstructured.io.

Step 2: Sign in

  1. Enter your email address and then click Sign In to receive a magic link to sign in to your personalized dashboard, or authenticate yourself with your Google or GitHub account.
  2. Once you authenticate with the magic link or with your Google or GitHub account, your dashboard appears.

Step 3: Get your API key and API URL

  1. On the sidebar, click API Keys, if it is not already selected.
  2. To get your API key, click the copy icon in the Actions column for your API key, and then click Key Only. Store your copied API key in a secure location. Do not share it with others.
  3. To get your API URL, click the copy icon beside the URL shown for API URL. Store your copied API URL in a secure location. Do not share it with others.

Step 4: Set environment variables

  1. Set an environment variable named UNSTRUCTURED_API_KEY to the value of your Unstructured API key.
  2. Set another environment variable named UNSTRUCTURED_API_URL to the value of your Unstructured API URL.
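
Before running any code, it can be useful to confirm that both variables are visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and is not part of the Unstructured Ingest library.

```python
import os

# The two variable names set in this step.
REQUIRED_VARS = ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names from `names` that are unset or empty.

    `environ` defaults to the current process environment; passing a plain
    dict makes the check easy to try out without changing real settings.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

The sample code later in this quickstart also reads LOCAL_FILE_INPUT_DIR and LOCAL_FILE_OUTPUT_DIR, so those names can be passed in `names` as well.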

Step 5: Install the Ingest Python library

Run the following command:

pip install unstructured-ingest

You might also need to install additional dependencies, depending on your needs. Learn more.
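
To confirm that the installation succeeded in the Python environment you intend to use, you can look up the installed distribution's version. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="unstructured-ingest"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("unstructured-ingest is not installed in this environment.")
    else:
        print("unstructured-ingest version:", found)
```

A result of `None` usually means that `pip` installed the package into a different Python environment than the one running the check.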

Step 6: Run the code

Run the following code. It reads the input and output locations from environment variables, so set these first:

  • LOCAL_FILE_INPUT_DIR to the source (input) path to the directory on your local machine that contains the compatible files for Unstructured to process on its hosted compute resources.
  • LOCAL_FILE_OUTPUT_DIR to the destination (output) path to the directory on your local machine that will contain the processed data that Unstructured returns from its hosted compute resources.
Python Ingest v2
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

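if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Source: read files from the local directory named in LOCAL_FILE_INPUT_DIR.
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Partitioning: send the files to the Unstructured API instead of processing them locally.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            # Split PDFs into page batches, send them concurrently,
            # and continue even if some batches fail.
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        # Destination: write the processed data to the local directory named in LOCAL_FILE_OUTPUT_DIR.
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()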

Step 7: View the processed data

Go to your destination location to view the processed data.
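
For a quick summary of what was produced, you can tally the element types in the output directory. The following is a minimal sketch that assumes each output file is a JSON array of element objects carrying a `type` field; check this against your own output before relying on it. The helper name `count_element_types` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_element_types(output_dir):
    """Tally the "type" field across all JSON files under output_dir."""
    counts = Counter()
    for json_path in Path(output_dir).rglob("*.json"):
        with json_path.open(encoding="utf-8") as handle:
            elements = json.load(handle)
        # Assumption: each file holds a JSON array of element objects.
        # Files with any other top-level shape are skipped.
        if not isinstance(elements, list):
            continue
        for element in elements:
            if isinstance(element, dict):
                counts[element.get("type", "unknown")] += 1
    return counts
```

Calling `count_element_types(os.getenv("LOCAL_FILE_OUTPUT_DIR"))` after the pipeline finishes returns a `Counter` that maps each element type to the number of times it appears.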

Learn more about the Unstructured Serverless API.


Get in touch

If you can’t find the information you’re looking for in the documentation, or if you need help, get in touch with our Support team at support@unstructured.io, or join our Slack where our team and community can help you.
