Unstructured
Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning.
Product offerings
Unstructured Platform - No-code UI. Production-ready. Pay as you go.
Unstructured Serverless API services - Use scripts or code. Production-ready. Pay as you go. (There is also a non-production, free edition with limits.)
Learn more about these products:
Unstructured Platform
No-code user interface, pay-as-you-go platform to get all of your data RAG-ready.
Data is processed on Unstructured-hosted compute resources.
Try the quickstart.
Learn more.
Read the announcement.
Unstructured Serverless API services
Use scripts or code to call the Unstructured Ingest CLI or Ingest Python library, to get all of your data RAG-ready.
Unstructured Serverless API services have a Serverless pay-as-you-go edition and a Free limited edition that process data on Unstructured-hosted compute resources.
If you need to use compute resources that you host instead, there are also Azure pay-as-you-go and AWS pay-as-you-go editions; these editions process data by using the Unstructured API installed on compute resources hosted in your own Azure or AWS account.
Try the quickstart.
Learn more.
Read the launch announcement.
Supported file types
Unstructured supports processing of the following file types:
By file extension:
File extension |
---|
.bmp |
.csv |
.doc |
.docx |
.eml |
.epub |
.heic |
.html |
.jpeg |
.md |
.msg |
.odt |
.org |
.p7s |
.pdf |
.png |
.ppt |
.pptx |
.rst |
.rtf |
.tiff |
.txt |
.tsv |
.xls |
.xlsx |
.xml |
By file type:
Category | File types |
---|---|
CSV | .csv |
Email | .eml , .msg , .p7s |
EPUB | .epub |
Excel | .xls , .xlsx |
HTML | .html |
Image | .bmp , .heic , .jpeg , .png , .tiff |
Markdown | .md |
Org Mode | .org |
Open Office | .odt |
PDF | .pdf |
Plain text | .txt |
PowerPoint | .ppt , .pptx |
reStructured Text | .rst |
Rich Text | .rtf |
TSV | .tsv |
Word | .doc , .docx |
XML | .xml |
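Before pointing Unstructured at a folder, it can help to check which files have a supported extension. The following is a minimal sketch that uses only the Python standard library. The extension set is copied from the table above; the function names and the case-insensitive comparison are assumptions made for this illustration and are not part of any Unstructured API.

```python
from pathlib import Path

# Extensions copied from the "By file extension" table above.
SUPPORTED_EXTENSIONS = frozenset({
    ".bmp", ".csv", ".doc", ".docx", ".eml", ".epub", ".heic", ".html",
    ".jpeg", ".md", ".msg", ".odt", ".org", ".p7s", ".pdf", ".png",
    ".ppt", ".pptx", ".rst", ".rtf", ".tiff", ".txt", ".tsv", ".xls",
    ".xlsx", ".xml",
})


def is_supported(path):
    """Return True if the file's extension appears in the supported list.

    ASSUMPTION: extensions are compared case-insensitively, so "A.PDF" counts.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_unsupported(directory):
    """Return the files under `directory` whose extension is not in the list."""
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and not is_supported(p)
    )
```

For example, `find_unsupported("<path/to/input>")` returns the files in your input directory that fall outside the list, so you can set them aside before a run.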
Quickstart: Unstructured Platform
This quickstart uses a no-code, point-and-click user interface in your web browser to get all of your data RAG-ready. Data is processed on Unstructured-hosted compute resources.
The requirements are as follows.
- A compatible source (input) location that contains your data for Unstructured to process. See the list of supported source types.
- For document-based source locations, compatible files in that location. See the list of supported file types. If you do not have any files available, you can download some from the example-docs folder in the Unstructured repo on GitHub.
- A compatible destination (output) location for Unstructured to put the processed data. See the list of supported destination types.
Sign up
To sign up for the Unstructured Platform, go to the For Developers page and choose one of the following plans:
- Sign up for a pay-per-page plan.
- Save money by signing up for a subscribe-and-save plan instead.
If you’re not sure which plan to sign up for, start with a pay-per-page plan. You can always switch plans later.
If you choose a pay-per-page plan, after your first 14 days of usage or more than 1000 processed pages per day, whichever comes first, your account is then billed at Unstructured’s standard service usage rates. To keep using the service, you must provide Unstructured with your payment details.
To save money by switching from a pay-per-page to a subscribe-and-save plan, go to the Unstructured Subscribe & Save page and complete the on-screen instructions. To save even more money by making a long-term billing commitment, stop here and sign up through the For Enterprise page instead.
By signing up for a pay-per-page or subscribe-and-save plan, your Unstructured account will run within the context of the Unstructured Platform on Unstructured’s own hosted cloud resources. If you would rather run the Unstructured Platform within the context of your own virtual private cloud (VPC), stop here and sign up through the For Enterprise page instead.
Sign in
If you initially signed up for a subscribe-and-save plan instead of a pay-per-page plan, wait to complete this step until after you receive confirmation from Unstructured that your plan is activated. Then go to the Unstructured home page at https://unstructured.io and click Login.
If you signed up through the For Enterprise page instead, your sign-in process will be different. For enterprise sign-in guidance, contact Unstructured Sales at sales@unstructured.io.
- After you have signed up for a pay-per-page plan, the Unstructured Platform sign-in page appears.
- Click Google or GitHub to sign in with the Google or GitHub account that you signed up with. Or, enter the email address that you signed up with, and then click Sign In.
- If you entered your email address, check your email inbox for a message from Unstructured. In that email, click the Sign In link.
- The first time you sign in, read the terms and conditions, and then click Accept.
After you have signed in for the first time, you can sign in again later by going to the Unstructured home page at https://unstructured.io and clicking Login.
Set the source (input) location
- From your Unstructured Platform dashboard, in the sidebar, click Connectors.
- Click Sources.
- Click New or Create Connector.
- For Name, enter a unique name for this connector.
- In the Provider area, click the source location type that matches yours.
- Click Continue.
- Fill in the fields with the appropriate settings. Learn more.
- If a Continue button appears, click it, and fill in any additional settings fields.
- Click Save and Test.
Set the destination (output) location
- In the sidebar, click Connectors.
- Click Destinations.
- Click New or Create Connector.
- For Name, enter a unique name for this connector.
- In the Provider area, click the destination location type that matches yours.
- Click Continue.
- Fill in the fields with the appropriate settings. Learn more.
- If a Continue button appears, click it, and fill in any additional settings fields.
- Click Save and Test.
Define the workflow
- In the sidebar, click Workflows.
- Click New Workflow.
- Next to Build it with Me, click Create Workflow. If a radio button appears instead of Build it with Me, select it, and then click Continue.
- For Workflow Name, enter a unique name for this workflow.
- In the Sources dropdown list, select your source location from Step 3.
- In the Destinations dropdown list, select your destination location from Step 4. You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.
- Click Continue.
- In the Optimize for section, select one of these preconfigured groups of workflow settings:
  - Basic: Ideal for simple, text-only documents.
  - Advanced: Best for PDFs, images, and complex file types.
  - Platinum: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs). During processing, files that are not PDFs or images are processed by using the Advanced strategy and are charged at the Advanced rate instead. When you use the Platinum strategy for PDF files of 200 or more pages, you might notice some errors when these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
- The Reprocess All box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
  - Checking this box reprocesses all documents in the source location on every workflow run.
  - Unchecking this box limits future runs to documents that were added to the source location, or whose contents or titles changed, since the last workflow run. Other previously processed documents are not processed again.
- Click Continue.
- If you want this workflow to run on a schedule, in the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select Don’t repeat.
- Click Complete.
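To make the Reprocess All behavior described above concrete, here is a small illustrative sketch of the selection rule: with the box checked, every document is selected; with it unchecked, only documents that are new or whose title or contents changed since the last run are selected. This is not how Unstructured implements the feature. The `Document` class, the `fingerprint` helper, and the `select_for_run` function are hypothetical names invented for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a file in a blob storage source location."""
    title: str
    content: bytes


def fingerprint(doc: Document) -> str:
    """Hash the title and contents so that a change to either is detectable."""
    digest = hashlib.sha256()
    digest.update(doc.title.encode("utf-8"))
    digest.update(b"\0")  # separator so title and content cannot run together
    digest.update(doc.content)
    return digest.hexdigest()


def select_for_run(
    documents: Dict[str, Document],
    previous: Dict[str, str],
    reprocess_all: bool,
) -> List[str]:
    """Return the IDs of the documents that a workflow run would process.

    `previous` maps document IDs to the fingerprints recorded on the last run.
    """
    selected = []
    for doc_id, doc in documents.items():
        is_new_or_changed = previous.get(doc_id) != fingerprint(doc)
        if reprocess_all or is_new_or_changed:
            selected.append(doc_id)
    return selected
```

With `reprocess_all=False` and an unchanged source location, `select_for_run` returns an empty list, which mirrors the statement that previously processed documents are not processed again.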
Process the documents
- If you did not choose to run this workflow on a schedule in Step 5, you can run the workflow now: in the sidebar, click Workflows.
- Next to your workflow from Step 5, click Run.
Monitor the processing job
- In the sidebar, click Jobs.
- In the list of jobs, wait for the job’s Status to change to Finished.
- Click the row for the job.
- After Overview displays Finished, go to the next step.
View the processed data
Go to your destination location to view the processed data.
Learn more about the Unstructured Platform.
Quickstart: Unstructured Serverless API
This quickstart uses your local machine, with the Unstructured Ingest Python library installed. It preprocesses source (input) files on your local machine, and it uses the Unstructured Serverless API to deliver the processed data to a destination (output) location, also on your local machine. Data is processed on Unstructured-hosted compute resources.
The requirements are as follows.
- Python installed on your local machine.
- Compatible files on your local machine to be processed. See the list of supported file types. If you do not have any files available, you can download some from the example-docs folder in the Unstructured repo on GitHub.
Sign up
Sign up for an Unstructured account by going to the For Developers page and choosing one of the following plans:
- Sign up for a pay-per-page plan.
- Save money by signing up for a subscribe-and-save plan instead.
If you’re not sure which plan to sign up for, start with a pay-per-page plan. You can always switch plans later.
By signing up through the For Developers page, your Unstructured account will run within the context of the Unstructured Platform on Unstructured’s own hosted cloud resources.
If you choose a pay-per-page plan, after your first 14 days of usage or more than 1000 processed pages per day, whichever comes first, your account is then billed at Unstructured’s standard service usage rates. You must provide Unstructured with your payment details to keep using the service.
To save money by switching from a pay-per-page to a subscribe-and-save plan, go to the Unstructured Subscribe & Save page and complete the on-screen instructions.
If you would rather use Unstructured within the context of your own virtual private cloud (VPC), or you want to make a long-term billing commitment in exchange for deeply discounted service usage rates, stop here and sign up through the For Enterprise page instead.
Sign in
- After you have signed up through the For Developers page, the Unstructured account sign-in page appears. If you signed up through the For Enterprise page instead, your sign-in process will be different. For enterprise sign-in guidance, contact Unstructured Sales at sales@unstructured.io.
- Click Google or GitHub to sign in with the Google or GitHub account that you signed up with through the For Developers page. Or, enter the email address that you signed up with, and then click Sign In.
- If you entered your email address, check your email inbox for a message from Unstructured. In that email, click the Sign In link.
- The first time you sign in, read the terms and conditions, and then click Accept.
Get your API key and API URL
- At the bottom of the sidebar, click your user icon, and then click Account Settings.
- Click the API Keys tab, if it is not already selected.
- To get your API key, click the copy icon in the Actions column for your API key, and then click Key Only. Store your copied API key in a secure location. Do not share it with others.
- To get your API URL, click the copy icon beside the URL shown for API URL. Store your copied API URL in a secure location. Do not share it with others.
Set environment variables
- Set an environment variable named UNSTRUCTURED_API_KEY to the value of your Unstructured API key.
- Set another environment variable named UNSTRUCTURED_API_URL to the value of your Unstructured API URL.
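A script can confirm that both variables are set before it calls the service, which gives a clearer error than a failed request would. The following minimal sketch uses only the standard library. The function name `load_unstructured_settings` is an assumption made for this example; only the two variable names come from the steps above.

```python
import os

REQUIRED_VARIABLES = ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")


def load_unstructured_settings(env=None):
    """Return (api_key, api_url), or raise if either variable is missing or empty.

    `env` defaults to the process environment; pass a dict to test the logic.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["UNSTRUCTURED_API_KEY"], env["UNSTRUCTURED_API_URL"]
```

Calling `load_unstructured_settings()` at the top of a script fails fast and names the variable that still needs to be set.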
Install the Ingest Python library
Use pip to install the unstructured-ingest package from PyPI.
You might also need to install additional dependencies, depending on your needs. Learn more.
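If you prefer to drive the installation from Python, for example in a notebook or a setup script, the sketch below builds the pip invocation for the current interpreter and installs the package only when it cannot already be imported. It assumes that the PyPI package unstructured-ingest is imported as the module unstructured_ingest; the helper names are invented for this example.

```python
import importlib.util
import subprocess
import sys
from typing import List


def pip_install_args(package: str) -> List[str]:
    """Build the argument list that installs `package` with this interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", package]


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(package: str, module_name: str) -> None:
    """Install `package` with pip unless `module_name` is already importable."""
    if not is_installed(module_name):
        subprocess.check_call(pip_install_args(package))
```

For example, `ensure_installed("unstructured-ingest", "unstructured_ingest")` does nothing when the library is already present.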
Run the code
Run the following code, replacing:
- <path/to/input> with the source (input) path to the directory on your local machine that contains the compatible files for Unstructured to process on its hosted compute resources.
- <path/to/output> with the destination (output) path to the directory on your local machine that will contain the processed data that Unstructured returns from its hosted compute resources.
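A minimal sketch of such a script follows. It assumes the v2 module layout of the unstructured-ingest package; import paths and configuration fields have changed between releases, so treat every import and keyword argument in `run_local_pipeline` as an assumption to check against your installed version. The two function names are invented for this example.

```python
import os


def build_partition_settings(api_key, api_url):
    """Collect the partitioner settings as plain data so they are easy to inspect.

    ASSUMPTION: these keys match the fields of PartitionerConfig in your version.
    """
    return {
        "partition_by_api": True,       # send files to the hosted API
        "api_key": api_key,
        "partition_endpoint": api_url,
    }


def run_local_pipeline(input_path, output_dir):
    """Process files from a local directory and write the results locally."""
    # Imported here so this module loads even before the library is installed.
    # ASSUMPTION: v2 module layout of unstructured-ingest; paths vary by release.
    from unstructured_ingest.v2.interfaces import ProcessorConfig
    from unstructured_ingest.v2.pipeline.pipeline import Pipeline
    from unstructured_ingest.v2.processes.connectors.local import (
        LocalConnectionConfig,
        LocalDownloaderConfig,
        LocalIndexerConfig,
        LocalUploaderConfig,
    )
    from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

    settings = build_partition_settings(
        os.environ["UNSTRUCTURED_API_KEY"],
        os.environ["UNSTRUCTURED_API_URL"],
    )
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=input_path),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(**settings),
        uploader_config=LocalUploaderConfig(output_dir=output_dir),
    ).run()
```

To start a run, call `run_local_pipeline("<path/to/input>", "<path/to/output>")` after making the two replacements described above.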
View the processed data
Go to your destination location to view the processed data.
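A quick way to inspect the results is to count the element types in the output files. The sketch below assumes that each output file is a JSON array of element objects and that each element has a "type" field; adjust it if your output differs. The function name is invented for this example.

```python
import json
from collections import Counter
from pathlib import Path


def count_element_types(output_dir):
    """Count elements by their "type" field across all JSON files in a directory.

    ASSUMPTION: each *.json file holds a list of element dictionaries.
    """
    counts = Counter()
    for json_file in sorted(Path(output_dir).rglob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            elements = json.load(handle)
        for element in elements:
            counts[element.get("type", "Unknown")] += 1
    return counts
```

For example, `count_element_types("<path/to/output>").most_common()` lists the element types from most to least frequent.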
Learn more about the Unstructured Serverless API.
Get in touch
If you can’t find the information you’re looking for in the documentation, or if you need help, contact us directly, or join our Slack where our team and community can help you.