Unstructured

Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning.

This 60-second video describes more about what Unstructured does and its benefits:

This 40-second video demonstrates a simple use case that Unstructured helps solve:

Unstructured offers the Unstructured Platform user interface (UI) and the Unstructured Platform API. Read on to learn more.

Unstructured Platform user interface (UI)

No-code UI. Production-ready. Pay as you go. Learn more.

Here is a screenshot of the Unstructured Platform UI Start page:

This 90-second video provides a brief overview of the Unstructured Platform UI:

To start using the Unstructured Platform UI right away, skip ahead to the quickstart.

Unstructured Platform API

Use scripts or code. Production-ready. Pay as you go. Learn more.

The Unstructured Platform API consists of two parts:

The Unstructured Platform Workflow Endpoint enables a full range of partitioning, chunking, embedding, and enrichment options for your files and data. It is designed to batch-process files and data in remote locations; send processed results to various storage, databases, and vector stores; and use the latest and highest-performing models on the market today. It has built-in logic to deliver the highest quality results at the lowest cost. Learn more.
The Unstructured Platform Partition Endpoint is intended for rapid prototyping of Unstructured’s various partitioning strategies, with limited support for chunking. It processes local files only, one file at a time. For production-level scenarios, use the Unstructured Platform Workflow Endpoint instead: it handles batch processing, files and data in remote locations, embedding generation, post-transform enrichments, and the latest and highest-performing models, and it delivers the highest quality results at the lowest cost. Learn more.

Here is a screenshot of some Python code that calls the Unstructured Platform Workflow Endpoint:

To start using the Unstructured Platform Workflow Endpoint right away, skip ahead to the quickstart.

Supported file types

Unstructured supports processing of the following file types:

By file extension:

File extension
`.bmp`
`.csv`
`.doc`
`.docx`
`.eml`
`.epub`
`.heic`
`.html`
`.jpeg`
`.md`
`.msg`
`.odt`
`.org`
`.p7s`
`.pdf`
`.png`
`.ppt`
`.pptx`
`.rst`
`.rtf`
`.tiff`
`.txt`
`.tsv`
`.xls`
`.xlsx`
`.xml`

By file type:

Category	File types
CSV	`.csv`
E-mail	`.eml`, `.msg`, `.p7s`
EPUB	`.epub`
Excel	`.xls`, `.xlsx`
HTML	`.html`
Image	`.bmp`, `.heic`, `.jpeg`, `.png`, `.tiff`
Markdown	`.md`
Org Mode	`.org`
Open Office	`.odt`
PDF	`.pdf`
Plain text	`.txt`
PowerPoint	`.ppt`, `.pptx`
reStructuredText	`.rst`
Rich Text	`.rtf`
TSV	`.tsv`
Word	`.doc`, `.docx`
XML	`.xml`
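As a minimal sketch, the table above can be turned into a lookup that checks whether a file is worth uploading before you add it to a source location. The names `SUPPORTED_FILE_TYPES`, `EXTENSION_TO_CATEGORY`, and `file_category` are illustrative only and are not part of any Unstructured library; the check looks only at the file extension, not at the file's contents.

```python
from pathlib import PurePath

# Categories and extensions follow the "By file type" table above.
SUPPORTED_FILE_TYPES = {
    "CSV": [".csv"],
    "E-mail": [".eml", ".msg", ".p7s"],
    "EPUB": [".epub"],
    "Excel": [".xls", ".xlsx"],
    "HTML": [".html"],
    "Image": [".bmp", ".heic", ".jpeg", ".png", ".tiff"],
    "Markdown": [".md"],
    "Org Mode": [".org"],
    "Open Office": [".odt"],
    "PDF": [".pdf"],
    "Plain text": [".txt"],
    "PowerPoint": [".ppt", ".pptx"],
    "reStructuredText": [".rst"],
    "Rich Text": [".rtf"],
    "TSV": [".tsv"],
    "Word": [".doc", ".docx"],
    "XML": [".xml"],
}

# Reverse index: file extension -> category.
EXTENSION_TO_CATEGORY = {
    ext: category
    for category, extensions in SUPPORTED_FILE_TYPES.items()
    for ext in extensions
}


def file_category(filename):
    """Return the table category for a filename, or None if its extension is not listed."""
    suffix = PurePath(filename).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_CATEGORY.get(suffix)
```

For example, `file_category("report.PDF")` returns `"PDF"`, and `file_category("archive.zip")` returns `None`.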

Quickstart: Unstructured Platform UI

This quickstart uses a no-code, point-and-click user interface in your web browser to get all of your data RAG-ready. Data is processed on Unstructured-hosted compute resources.

The requirements are as follows:

A compatible source (input) location that contains your data for Unstructured to process. See the list of supported source types.
For document-based source locations, compatible files in that location. See the list of supported file types. If you do not have any files available, you can download some from the example-docs folder in the Unstructured repo on GitHub.
A compatible destination (output) location for Unstructured to put the processed data. See the list of supported destination types.

To sign up for the Unstructured Platform, go to the For Developers page and choose one of the following plans:

Sign up for a pay-per-page plan.
Save money by signing up for a subscribe-and-save plan instead.

If you’re not sure which plan to sign up for, start with a pay-per-page plan. You can always switch plans later.

If you choose a pay-per-page plan, your account is billed at Unstructured’s standard service usage rates after your first 14 days of usage, or after you exceed 1000 processed pages per day, whichever comes first. To keep using the service after that point, you must provide Unstructured with your payment details.

To save money by switching from a pay-per-page to a subscribe-and-save plan, go to the Unstructured Subscribe & Save page and complete the on-screen instructions. To save even more money by making a long-term billing commitment, stop here and sign up through the For Enterprise page instead.

When you sign up for a pay-per-page or subscribe-and-save plan, your Unstructured account runs within the context of the Unstructured Platform on Unstructured’s own hosted cloud resources. If you would rather run the Unstructured Platform within the context of your own virtual private cloud (VPC), stop here and sign up through the For Enterprise page instead.

If you initially signed up for a subscribe-and-save plan instead of a pay-per-page plan, wait to complete this step until after you receive confirmation from Unstructured that your plan is activated. Then go to the Unstructured home page at https://unstructured.io and click Login.

If you signed up through the For Enterprise page instead, your sign-in process will be different. For enterprise sign-in guidance, contact Unstructured Sales at sales@unstructured.io.

After you have signed up for a pay-per-page plan, the Unstructured Platform sign-in page appears.
Click Google or GitHub to sign in with the Google or GitHub account that you signed up with. Or, enter the email address that you signed up with, and then click Sign In.
If you entered your email address, check your email inbox for a message from Unstructured. In that email, click the Sign In link.
The first time you sign in, read the terms and conditions, and then click Accept.

After you have signed in for the first time, you can sign in again at any time by going to the Unstructured home page at https://unstructured.io and clicking Login.

Set the source (input) location

From your Unstructured Platform dashboard, in the sidebar, click Connectors.
Click Sources.
Click New or Create Connector.
For Name, enter a unique name for this connector.
In the Provider area, click the source location type that matches yours.
Click Continue.
Fill in the fields with the appropriate settings. Learn more.
If a Continue button appears, click it, and fill in any additional settings fields.
Click Save and Test.

Set the destination (output) location

In the sidebar, click Connectors.
Click Destinations.
Click New or Create Connector.
For Name, enter a unique name for this connector.
In the Provider area, click the destination location type that matches yours.
Click Continue.
Fill in the fields with the appropriate settings. Learn more.
If a Continue button appears, click it, and fill in any additional settings fields.
Click Save and Test.

Define the workflow

In the sidebar, click Workflows.
Click New Workflow.
Next to Build it for Me, click Create Workflow.

If a radio button appears instead of Build it for Me, select it, and then click Continue.
For Workflow Name, enter a unique name for this workflow.
In the Sources dropdown list, select the source location that you set earlier.
In the Destinations dropdown list, select the destination location that you set earlier.

You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.
Click Continue.
The Reprocess All box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box limits future runs to documents that were added to the source location, or whose contents or titles were changed, since the last workflow run. Other previously processed documents are not processed again.
Click Continue.
If you want this workflow to run on a schedule, in the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select Don’t repeat.
Click Complete.
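The effect of the Reprocess All box can be sketched as a selection over two snapshots of the source location. This is an illustration of the behavior described above, not Unstructured's implementation: `DocState`, `select_documents`, and the use of a content hash to detect changed contents are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocState:
    """Hypothetical snapshot of one document in a blob storage source."""
    title: str
    content_hash: str  # assumed stand-in for "contents changed"


def select_documents(current, previous, reprocess_all):
    """Return the sorted keys of the documents to process on this run.

    current and previous map a document key to its DocState; previous is
    the snapshot taken at the last workflow run.
    """
    if reprocess_all:
        # Box checked: every document in the source is processed again.
        return sorted(current)
    # Box unchecked: only new documents, or documents whose contents or
    # title changed since the last run, are processed.
    return sorted(
        key for key, state in current.items()
        if previous.get(key) != state
    )


previous = {"a.pdf": DocState("A", "h1"), "b.pdf": DocState("B", "h2")}
current = {
    "a.pdf": DocState("A", "h1"),          # unchanged
    "b.pdf": DocState("B", "h2-edited"),   # contents changed
    "c.pdf": DocState("C", "h3"),          # newly added
}
print(select_documents(current, previous, reprocess_all=False))  # ['b.pdf', 'c.pdf']
print(select_documents(current, previous, reprocess_all=True))   # ['a.pdf', 'b.pdf', 'c.pdf']
```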

Process the documents

If you did not choose to run this workflow on a schedule when you defined it, you can run the workflow now: in the sidebar, click Workflows.
Next to the workflow that you defined, click Run.

Monitor the processing job

In the sidebar, click Jobs.
In the list of jobs, wait for the job’s Status to change to Finished.
Click the row for the job.
After Overview displays Finished, go to the next step.
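If you later automate this wait instead of watching the Jobs page, the same idea is a polling loop. The sketch below is generic and makes no claim about Unstructured's API: `wait_for_job` is a hypothetical helper, and `get_status` is a hypothetical callable that you would supply to return the job's current status string. Only the Finished status comes from the steps above.

```python
import time


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns "Finished" or the timeout elapses.

    get_status is a hypothetical zero-argument callable that returns the
    job's current status string; how you obtain that status is up to you.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status == "Finished":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists so that the loop can be exercised without real delays. A job that ends in any status other than Finished keeps the loop running until the timeout, so add explicit handling for failure statuses once you know which ones your jobs can report.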

View the processed data

Go to your destination location to view the processed data.

Learn more about the Unstructured Platform UI.

Quickstart: Unstructured Platform Workflow Endpoint

This quickstart uses the Unstructured Python SDK to call the Unstructured Platform Workflow Endpoint to get your data RAG-ready. The Python code for this quickstart is in a remotely hosted Google Colab notebook. Data is processed on Unstructured-hosted compute resources.

The requirements are as follows:

A compatible source (input) location that contains your data for Unstructured to process. See the list of supported source types. This quickstart uses an Amazon S3 bucket as the source location. If you use a different source type, you will need to modify the quickstart notebook accordingly.
For document-based source locations, compatible files in that location. See the list of supported file types. If you do not have any files available, you can download some from the example-docs folder in the Unstructured-IO/unstructured-ingest repository on GitHub.
A compatible destination (output) location for Unstructured to put the processed data. See the list of supported destination types. For this quickstart’s destination location, a different folder in the same Amazon S3 bucket as the source location is used. If you use a different destination S3 bucket or a different destination type, you will need to modify the quickstart notebook accordingly.

To sign up for the Unstructured Platform, go to the For Developers page and choose one of the following plans:

Sign up for a pay-per-page plan.
Save money by signing up for a subscribe-and-save plan instead.

If you’re not sure which plan to sign up for, start with a pay-per-page plan. You can always switch plans later.

If you signed up through the For Enterprise page instead, your sign-in process will be different. For enterprise sign-in guidance, contact Unstructured Sales at sales@unstructured.io.

After you have signed up for a pay-per-page plan, the Unstructured Platform sign-in page appears.
Click Google or GitHub to sign in with the Google or GitHub account that you signed up with. Or, enter the email address that you signed up with, and then click Sign In.
If you entered your email address, check your email inbox for a message from Unstructured. In that email, click the Sign In link.
The first time you sign in, read the terms and conditions, and then click Accept.

After you have signed in for the first time, you can sign in again at any time by going to the Unstructured home page at https://unstructured.io and clicking Login.

Get your API key

Sign in to your Unstructured account at https://platform.unstructured.io.
At the bottom of the sidebar, click your user icon, and then click Account Settings.
On the API Keys tab, click Generate New Key.
Enter a descriptive name for the API key, and then click Save.
Click the Copy icon for your new API key. The API key’s value is copied to your system’s clipboard.
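Once the key is on your clipboard, a common practice is to store it in an environment variable rather than pasting it into code. The following is a minimal sketch of reading it back in Python; the helper `load_api_key` is hypothetical, and the variable name `UNSTRUCTURED_API_KEY` is an assumption chosen for this example, so use whatever name your notebook or scripts expect.

```python
import os


def load_api_key(env_var="UNSTRUCTURED_API_KEY"):
    """Read the API key from an environment variable and fail fast if it is missing.

    The default variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Unstructured API key."
        )
    return api_key
```

Failing early with a clear message avoids sending requests with an empty key and then having to debug an authentication error later.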

Create and set up the S3 bucket

This quickstart uses an Amazon S3 bucket as both the source location and the destination location. (You can use other source and destination types that are supported by Unstructured. If you use a different source or destination type, or if you use a different S3 bucket for the destination location, you will need to modify the quickstart notebook accordingly.)

Inside of the S3 bucket, a folder named input represents the source location. This is where your files to be processed will be stored. The S3 URI to the source location will be s3://<your-bucket-name>/input.

Inside of the same S3 bucket, a folder named output represents the destination location. This is where Unstructured will put the processed data. The S3 URI to the destination location will be s3://<your-bucket-name>/output.
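The two S3 URIs described above follow the same pattern, so they can be derived from the bucket name. This is a small sketch with hypothetical helper names (`s3_folder_uri`, `quickstart_locations`); it only builds strings and does not contact Amazon S3 or check that the bucket exists.

```python
def s3_folder_uri(bucket_name, folder):
    """Build the S3 URI for a folder in a bucket, for example s3://my-bucket/input."""
    bucket = bucket_name.strip().strip("/")
    if not bucket:
        raise ValueError("bucket_name must not be empty")
    return f"s3://{bucket}/{folder.strip('/')}"


def quickstart_locations(bucket_name):
    """Return the source and destination URIs that this quickstart describes."""
    return {
        "source": s3_folder_uri(bucket_name, "input"),
        "destination": s3_folder_uri(bucket_name, "output"),
    }
```

For a bucket named `my-bucket`, `quickstart_locations("my-bucket")` returns `s3://my-bucket/input` as the source and `s3://my-bucket/output` as the destination.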

Learn how to create an S3 bucket and set it up for Unstructured. (Do not run the Python SDK code or REST commands at the end of those setup instructions.)

Run the quickstart notebook

After your S3 bucket is created and set up, follow the instructions in this quickstart notebook.

View the processed data

After you run the quickstart notebook, go to your destination location to view the processed data.

Learn more about the Unstructured Platform API.

Get in touch

If you can’t find the information you’re looking for in the documentation, or if you need help, contact us directly, or join our Slack where our team and community can help you.