Workflows
Workflows dashboard
To view the workflows dashboard, on the sidebar, click Workflows.
A workflow in the Unstructured Platform is a defined sequence of processes that automates data handling from source to destination. It lets you configure how and when data is ingested, processed, and stored.
Workflows are crucial for establishing a systematic approach to managing data flows within the platform, ensuring consistency, efficiency, and adherence to specific data processing requirements.
Create a workflow
The Unstructured Platform provides two types of workflow builders:
- Automatic workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
- Custom workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
All Unstructured accounts can create automatic workflows.
To create custom workflows, you must first ask Unstructured to enable them for your account.
Create an automatic workflow
You must first have an existing source connector and destination connector to add to the workflow.
If you do not have an existing connector for either your target source (input) or destination (output) location, create the source connector, create the destination connector, and then return here.
To see your existing connectors, on the sidebar, click Connectors, and then click Sources or Destinations.
To create an automatic workflow:
-
On the sidebar, click Workflows.
-
Click New Workflow.
-
Next to Build it with me, click Create Workflow.
If a radio button appears instead of Build it with me, select it, and then click Continue.
-
For Workflow Name, enter a unique name for this workflow.
-
In the Sources dropdown list, select your source location.
-
In the Destinations dropdown list, select your destination location.
You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.
-
Click Continue.
-
In the Optimize for section, select one of the preconfigured workflow settings groups (Basic, Advanced, or Platinum).
-
The Reprocess all box applies only to the Amazon S3 and Azure Blob Storage source connectors:
- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents’ contents change.
-
If you want to retry processing any documents that failed to process, check the Retry Failed Documents box.
-
Click Continue.
-
If you want this workflow to run on a schedule, in the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select Don’t repeat.
-
Click Complete.
-
If you did not set the workflow to run on a schedule, you can run the workflow now.
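The Reprocess all and Retry Failed Documents boxes described in the steps above decide which documents a run picks up. The following is a minimal sketch of that selection logic, not the platform's implementation. The function name, the idea of tracking documents by file name, and the way the two boxes combine are all assumptions made for illustration.

```python
def documents_to_process(source_docs, previously_processed, previously_failed,
                         reprocess_all, retry_failed):
    """Return the sorted file names a run would pick up (illustrative only).

    All names and the name-based tracking are hypothetical; the platform's
    real bookkeeping is not described in this document.
    """
    if reprocess_all:
        # Box checked: every document in the source location is processed again.
        return sorted(source_docs)

    # Box unchecked: only documents not seen in an earlier run are selected.
    # A changed document with the same name is NOT selected again.
    seen = set(previously_processed) | set(previously_failed)
    selected = {doc for doc in source_docs if doc not in seen}

    if retry_failed:
        # Assumption: retrying adds back documents that failed earlier
        # and still exist in the source location.
        selected |= {doc for doc in source_docs if doc in previously_failed}

    return sorted(selected)


source = {"a.pdf", "b.pdf", "c.pdf"}
processed = {"a.pdf"}
failed = {"b.pdf"}

print(documents_to_process(source, processed, failed, True, False))
print(documents_to_process(source, processed, failed, False, False))
print(documents_to_process(source, processed, failed, False, True))
```

In this sketch, unchecking Reprocess all means an edited copy of a.pdf is skipped on later runs, which matches the behavior described for previously processed documents whose contents change.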
Create a custom workflow
You must first have an existing source connector and destination connector to add to the workflow.
If you do not have an existing connector for either your target source (input) or destination (output) location, create the source connector, create the destination connector, and then return here.
To see your existing connectors, on the sidebar, click Connectors, and then click Sources or Destinations.
There are two ways to create a custom workflow:
- Through Build it with me > Custom. This option enables you to fine-tune the kinds of settings that are in Basic, Advanced, and Platinum.
- Through Build it myself. This option offers a visual workflow designer with even more fine-tuning than the Custom option.
Build it with me - Custom
-
On the sidebar, click Workflows.
-
Click New Workflow.
-
Next to Build it with me, click Create Workflow.
If a radio button appears instead of Build it with me, select it, and then click Continue.
-
For Workflow Name, enter a unique name for this workflow.
-
In the Sources dropdown list, select your source location.
-
In the Destinations dropdown list, select your destination location.
You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.
-
Click Continue.
-
In the Optimize for section, click the Custom option, and then click Continue.
If the Custom option is disabled, inside the Custom option click Notify me, and follow the on-screen directions to complete the request. Unstructured will notify you when your account has been enabled with the Custom option. After you receive this notification, click the Custom option, and then click Continue.
-
In the Strategy area, choose one of the following:
-
Fast: Ideal for simple, text-only documents.
-
High Res: Best for PDFs, images, and complex file types.
During High Res processing, any detected text-based files are processed and billed at the Fast rate instead.
-
VLM: For your most challenging documents, including scanned and handwritten content.
During VLM processing, any detected files that are not PDFs or images are processed and billed at a different rate instead: text-based files at the Fast rate, and all other such files at the High Res rate.
When you use the VLM strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
-
In the Image Summarizer drop-down list, choose one of the following:
- None: Do not provide summaries for any detected images in any of the files.
- GPT-4o: Use GPT-4o to provide summaries for any detected images in any of the files.
- Claude 3.5 Sonnet: Use Claude 3.5 Sonnet to provide summaries for any detected images in any of the files.
-
In the Table Summarizer drop-down list, choose one of the following:
- None: Do not provide summaries for any detected tables in any of the files.
- GPT-4o: Use GPT-4o to provide summaries for any detected tables in any of the files.
- Claude 3.5 Sonnet: Use Claude 3.5 Sonnet to provide summaries for any detected tables in any of the files.
-
Check the Include Page Breaks box to include page breaks in the output, if the file type supports it.
-
Check the Infer Table Structure box to extract any detected table elements in PDF files as HTML format into a metadata output field named text_as_html.
-
In the Elements to Exclude drop-down list, select any element types to exclude from the output.
-
In the Chunk area, for Chunker Type, select one of the following:
-
None: Do not apply special chunking rules to the output.
-
Chunk by Character (also known as basic chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following:
- Include Original Elements: Check this box to output the elements that were used to form a chunk, to appear in the metadata field’s orig_elements field for that chunk. By default, this box is unchecked.
- Max Characters: Cut off new sections after reaching a length of this many characters. The default is 2048.
- New After n chars: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is 1500.
- Overlap: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is 160.
- Overlap all: Check this box to apply overlap to “normal” chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
-
Chunk by Page: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
- Include Original Elements: Check this box to output the elements that were used to form a chunk, to appear in the metadata field’s orig_elements field for that chunk. By default, this box is unchecked.
- Max Characters: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is 500.
- New After n Characters: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is 50.
- Overlap: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is 30.
- Overlap all: Check this box to apply overlap to “normal” chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
-
Chunk by Title: Preserve section boundaries and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
- Combine Text Under n Chars: Combine elements until a section reaches a length of this many characters. The default is 0.
- Include Original Elements: Check this box to output the elements that were used to form a chunk, to appear in the metadata field’s orig_elements field for that chunk. By default, this box is unchecked.
- Max Characters: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is 2048.
- Multipage Sections: Check this box to allow sections to span multiple pages. By default, this box is unchecked.
- New After n Characters: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is 1500.
- Overlap: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is 160.
- Overlap all: Check this box to apply overlap to “normal” chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
-
Chunk by Similarity: Use the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following:
- Include Original Elements: Check this box to output the elements that were used to form a chunk, to appear in the metadata field’s orig_elements field for that chunk. By default, this box is unchecked.
- Max Characters: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is 500.
- Similarity Threshold: Specify a threshold between 0 and 1, exclusive (0.01 to 0.99, inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors. Take into consideration the trade-off between precision (a higher threshold) and recall (a lower threshold). The default is 0.5.
-
In the Embed area, for Provider, choose one of the following:
-
None: Do not generate embeddings.
-
OpenAI: Use OpenAI to generate embeddings. Also, choose the model to use:
- text-embedding-3-small, with 1536 dimensions.
- text-embedding-3-large, with 3072 dimensions.
- Ada 002 (Text), with 1536 dimensions.
-
The Reprocess all box applies only to the Amazon S3 and Azure Blob Storage source connectors:
- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents’ contents change.
-
Check the Retry Failed Documents box if you want to retry processing any documents that failed to process.
-
Click Continue.
-
If you want this workflow to run on a schedule, in the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select Don’t repeat.
-
Click Complete.
-
If you did not set the workflow to run on a schedule, you can run the workflow now.
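The Chunk by Character settings in the steps above (Max Characters, New After n chars, and Overlap) interact in a way that is easier to see in code. The following is a rough sketch of that interaction under stated assumptions, not Unstructured's actual chunker. The function name, the blank-line separator between combined elements, and the exact flush rules are hypothetical.

```python
def chunk_by_character(elements, max_characters=2048,
                       new_after_n_chars=1500, overlap=160):
    """Combine sequential element texts into chunks (illustrative only).

    - max_characters is treated as a strict limit: no chunk exceeds it.
    - new_after_n_chars is treated as a soft limit: a chunk is closed
      as soon as it reaches this length.
    - overlap applies only to pieces split from one oversized element.
    """
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")

    chunks = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # Oversized element: close the open chunk, then text-split.
            if current:
                chunks.append(current)
                current = ""
            start = 0
            step = max_characters - overlap
            while True:
                chunks.append(text[start:start + max_characters])
                if start + max_characters >= len(text):
                    break
                # The next piece starts `overlap` characters before the
                # end of the piece that was just emitted.
                start += step
            continue

        # Assumption: combined elements are joined with a blank line.
        candidate = text if not current else current + "\n\n" + text
        if current and len(candidate) > max_characters:
            # Adding this element would break the strict limit.
            chunks.append(current)
            current = text
        else:
            current = candidate

        if len(current) >= new_after_n_chars:
            # Soft limit reached: close the chunk early.
            chunks.append(current)
            current = ""

    if current:
        chunks.append(current)
    return chunks


long_text = "0123456789" * 5  # 50 characters, larger than the limit below
result = chunk_by_character(["a" * 10, "b" * 10, long_text],
                            max_characters=30, new_after_n_chars=20, overlap=5)
for chunk in result:
    print(len(chunk), repr(chunk))
```

With these small limits, the first two elements are combined and closed once the soft limit is reached, and the 50-character element is split into two pieces where the second piece begins with the last 5 characters of the first.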
Build it myself
-
On the sidebar, click Workflows.
-
Click New Workflow.
-
Click the Build it myself option, and then click Continue.
If the Build it myself option is disabled, inside the Build it myself option click Notify me, and follow the on-screen directions to complete the request. Unstructured will notify you when your account has been enabled with the Build it myself option. After you receive this notification, click the Build it myself option, and then click Continue.
-
In the This workflow pane, click the Details button.
-
Next to Name, click the pencil icon, enter a unique name for this workflow, and then click the check mark icon.
-
If you want this workflow to run on a schedule, click the Schedule button. In the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings.
-
To overwrite any previously processed files, or to retry any documents that fail to process, click the Settings button, and check either or both of the boxes.
The Reprocess all box applies only to the Amazon S3 and Azure Blob Storage source connectors:
- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents’ contents change.
-
The workflow begins with a default layout. Other workflow layouts are also valid.
-
In the pipeline designer, click the Source node. In the Source pane, select the source location. Then click Save.
-
Click the Destination node. In the Destination pane, select the destination location. Then click Save.
-
As needed, add more nodes by clicking the plus icon (recommended) or Add Node button:
- Click Connect to add another Source or Destination node. You can add multiple source and destination locations. Files will be ingested from all of the source locations, and the processed data will be delivered to all of the destination locations.
- Click Enrich to add a Chunker or Enrichment node.
- Click Transform to add a Partitioner or Embedder node.
Make sure to add nodes in the correct order. If you are unsure, see the usage hints in the blue note that appears in the node’s settings pane.
To edit a node, click that node, and then change its settings.
To delete a node, click that node, and then click the trash can icon above it.
-
Click Save.
-
If you did not set the workflow to run on a schedule, you can run the workflow now.
Custom workflow node types
Edit, delete, or run a workflow
To run a workflow once, manually:
- On the sidebar, click Workflows.
- In the list of workflows, click Run in the row for the workflow that you want to run.
For each workflow on the Workflows list page, the following actions are available by clicking the ellipsis (the three dots) in that workflow's row:
- Edit via Form: Changes the existing configuration of your workflow.
- Delete: Removes the workflow from the platform. Use this action cautiously, as it will permanently delete the workflow and its configurations.
- Open: Opens the workflow’s settings page.
Pause a scheduled workflow
To stop running a workflow that is set to run on a repeating schedule:
- On the sidebar, click Workflows.
- In the list of workflows, turn off the Status toggle in the row for the workflow that you want to stop running on a repeated schedule.
Turning off the Status toggle also disables the workflow’s Run button, which prevents that workflow from being run manually as well.
To resume running the workflow on its original repeating schedule, as well as enable the workflow to be run manually as needed, turn on the workflow’s Status toggle.