Workflows dashboard

To view the workflows dashboard, on the sidebar, click Workflows.

A workflow in the Unstructured Platform is a defined sequence of processes that automates data handling from source to destination. It lets you configure how and when data is ingested, processed, and stored.

Workflows give you a systematic, repeatable way to manage data flows within the platform, which helps ensure consistency, efficiency, and adherence to specific data processing requirements.

Create a workflow

The Unstructured Platform provides two types of workflow builders:

  • Automatic or Build it for Me workflows, which use sensible default workflow settings to help you get good-quality results faster.
  • Custom or Build it Myself workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.

Create an automatic workflow

You must first have an existing source connector and destination connector to add to the workflow.

If you do not have an existing connector for either your target source (input) or destination (output) location, create the source connector, create the destination connector, and then return here.

To see your existing connectors, on the sidebar, click Connectors, and then click Sources or Destinations.

To create an automatic workflow:

  1. On the sidebar, click Workflows.

  2. Click New Workflow.

  3. Next to Build it for Me, click Create Workflow.

    If a radio button appears instead of Build it for Me, select it, and then click Continue.
  4. For Workflow Name, enter a unique name for this workflow.

  5. In the Sources dropdown list, select your source location.

  6. In the Destinations dropdown list, select your destination location.

    You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.
  7. Click Continue.

  8. The Reprocess All box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:

    • Checking this box reprocesses all documents in the source location on every workflow run.
    • Unchecking this box causes future runs to process only documents that were added to the source location since the last workflow run. Previously processed documents are not processed again, even if their contents change.
  9. Click Continue.

  10. If you want this workflow to run on a schedule, in the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select Don’t repeat.

  11. Click Complete.
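
The effect of the Reprocess All box in step 8 can be sketched as a small selection function. This is only an illustration of the behavior described above, not the platform's implementation: `select_documents`, its parameters, and the use of file names as document identifiers are all hypothetical.

```python
def select_documents(source_docs, already_processed, reprocess_all):
    """Return the documents a run would process, in source order.

    source_docs: names of documents currently in the source location.
    already_processed: set of names handled by earlier runs.
    reprocess_all: state of the Reprocess All box (hypothetical flag).
    """
    if reprocess_all:
        # Checked: every document is processed again on every run.
        return list(source_docs)
    # Unchecked: only documents not seen by earlier runs are processed.
    # A changed document keeps its name, so it is skipped here.
    return [doc for doc in source_docs if doc not in already_processed]


docs = ["a.pdf", "b.pdf", "c.pdf"]
seen = {"a.pdf", "b.pdf"}
print(select_documents(docs, seen, reprocess_all=True))
print(select_documents(docs, seen, reprocess_all=False))
```

With the box unchecked, only `c.pdf` is selected, because the other two documents were handled by an earlier run.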

By default, this workflow partitions, chunks, and generates embeddings as follows:

  • Partitioner: Auto strategy

    Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

    • If the page or document has no images and likely does not have tables, Fast partitioning is used, and the page or document is billed at the Fast rate for processing.
    • If the page or document has only a few tables or images with standard layouts and languages, High Res partitioning is used, and the page or document is billed at the High Res rate for processing.
    • If the page or document has more than a few tables or images, VLM partitioning is used, and the page or document is billed at the VLM rate for processing.

    Learn about partitioning strategies.

  • Chunker: Chunk by Title strategy

    • Contextual Chunking: No (unchecked)
    • Combine Text Under N Characters: 3000
    • Include Original Elements: Yes (checked)
    • Max Characters: 5500
    • Multipage Sections: Yes (checked)
    • New After N Characters: 3500
    • Overlap: 350
    • Overlap All: Yes (checked)

    Learn about chunking strategies.

  • Embedder:

    • Provider: Azure OpenAI
    • Model: text-embedding-3-large, with 3072 dimensions

    Learn about embedding providers and models.

  • Enrichments:

    This workflow contains no enrichments.

    Learn about available enrichments.
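
The routing performed by the Auto partitioning strategy described above can be sketched roughly as follows. The cutoff `FEW`, the `Unit` type, and `choose_strategy` are hypothetical: this page does not say how many tables or images count as "a few", and the sketch ignores the layout and language checks that also influence the High Res decision.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "only a few" tables or images; the real
# threshold used by the platform is not stated in this document.
FEW = 3


@dataclass
class Unit:
    """One PDF page, or one whole non-PDF document."""
    tables: int  # number of (likely) tables detected
    images: int  # number of images detected


def choose_strategy(unit: Unit) -> str:
    """Pick the partitioning strategy, which also sets the billing rate."""
    visual = unit.tables + unit.images
    if visual == 0:
        return "Fast"
    if visual <= FEW:
        return "High Res"
    return "VLM"


pages = [Unit(0, 0), Unit(1, 1), Unit(4, 6)]
print([choose_strategy(p) for p in pages])
```

Because the decision is made per page for PDF files, a single PDF can be billed at a mix of Fast, High Res, and VLM rates.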

After this workflow is created, you can change any or all of its settings, including the workflow’s source connector, destination connector, partitioning, chunking, and embedding settings. You can also add enrichments to the workflow.
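
Before changing any of these settings, it can help to see the defaults listed above in one place. The sketch below records them as plain data and runs a few sanity checks. The key names and the checks are assumptions made for illustration; only the values come from this page, and the platform may enforce different rules.

```python
# Default automatic-workflow settings, copied from the list above.
# The key names are hypothetical; only the values come from this page.
DEFAULTS = {
    "partitioner": {"strategy": "Auto"},
    "chunker": {
        "strategy": "Chunk by Title",
        "contextual_chunking": False,
        "combine_text_under_n_chars": 3000,
        "include_original_elements": True,
        "max_characters": 5500,
        "multipage_sections": True,
        "new_after_n_chars": 3500,
        "overlap": 350,
        "overlap_all": True,
    },
    "embedder": {
        "provider": "Azure OpenAI",
        "model": "text-embedding-3-large",
        "dimensions": 3072,
    },
    "enrichments": [],
}


def check_chunker(chunker: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks pass.

    These are assumed sanity checks, not rules enforced by the platform.
    """
    problems = []
    limit = chunker["max_characters"]
    for key in ("new_after_n_chars", "combine_text_under_n_chars", "overlap"):
        if chunker[key] > limit:
            problems.append(f"{key} exceeds max_characters")
    return problems


print(check_chunker(DEFAULTS["chunker"]))
```

The default values pass these checks: each of the three character thresholds is below the Max Characters value of 5500.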

To change the workflow’s default settings or to add enrichments:

  1. On the sidebar, click Workflows.
  2. In the list of available workflows, click the workflow that was just created. This opens a visual designer that shows your workflow as a directed acyclic graph (DAG). This DAG contains a node representing each step in the workflow. There is one node for the partitioning step, another node for the chunking step, and so on.
  3. To learn how to change a node’s settings or to add enrichment nodes, click the FAQ button in the flyout pane in the workflow DAG designer.
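
To make the DAG idea concrete, here is a minimal sketch that models the default automatic workflow as a dependency graph and derives a valid execution order with Python's standard `graphlib` module. The node names mirror the steps described on this page; the dictionary layout is an illustrative assumption, not the format the designer uses internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a node; its value is the set of nodes it depends on.
# Node names mirror the default automatic workflow described on this page.
workflow = {
    "Partitioner": {"Source"},
    "Chunker": {"Partitioner"},
    "Embedder": {"Chunker"},
    "Destination": {"Embedder"},
}

# static_order() raises graphlib.CycleError if the graph has a cycle,
# which is what "acyclic" rules out.
order = list(TopologicalSorter(workflow).static_order())
print(" -> ".join(order))
```

Adding an enrichment node in this model would mean adding one more key whose dependencies place it between two existing nodes.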

If you did not previously set the workflow to run on a schedule, you can run the workflow now.

Create a custom workflow

If you already have an existing workflow that you want to change, do the following:

  1. On the sidebar, click Workflows.
  2. Click the name of the workflow that you want to change.
  3. Skip ahead to Step 11 in the following procedure.

You must first have an existing source connector and destination connector to add to the workflow.

If you do not have an existing connector for either your target source (input) or destination (output) location, create the source connector, create the destination connector, and then return here.

To see your existing connectors, on the sidebar, click Connectors, and then click Sources or Destinations.

To create a custom workflow:

  1. On the sidebar, click Workflows.

  2. Click New Workflow.

  3. Click the Build it Myself option, and then click Continue.

  4. In the This workflow pane, click the Details button.

  5. Next to Name, click the pencil icon, enter a unique name for this workflow, and then click the check mark icon.

  6. If you want this workflow to run on a schedule, click the Schedule button. In the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings.

  7. To overwrite any previously processed files, or to retry any documents that fail to process, click the Settings button, and check either or both of the boxes.

    The Reprocess All box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:

    • Checking this box reprocesses all documents in the source location on every workflow run.
    • Unchecking this box causes future runs to process only new documents that were added to the source location since the last workflow run, along with existing documents whose contents or titles have changed since then. Other previously processed documents are not processed again.
  8. The workflow begins with the following layout:

    The following workflow layouts are also valid:

  9. In the pipeline designer, click the Source node. In the Source pane, select the source location. Then click Save.

  10. Click the Destination node. In the Destination pane, select the destination location. Then click Save.

  11. As needed, add more nodes by clicking the plus icon (recommended) or Add Node button:

    • Click Connect to add another Source or Destination node. You can add multiple source and destination locations. Files will be ingested from all of the source locations, and the processed data will be delivered to all of the destination locations. Learn more.
    • Click Enrich to add a Chunker or Enrichment node. Learn more.
    • Click Transform to add a Partitioner or Embedder node. Learn more.

    Make sure to add nodes in the correct order. If you are unsure, see the usage hints in the blue note that appears in the node’s settings pane.

    To edit a node, click that node, and then change its settings.

    To delete a node, click that node, and then click the trash can icon above it.

  12. Click Save.

  13. If you did not set the workflow to run on a schedule, you can run the workflow now.
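
The multiple-source, multiple-destination behavior mentioned in step 11 can be sketched as a simple fan-in and fan-out loop. Everything here is hypothetical and exists only for illustration: `process` is a stand-in for the partitioning, chunking, and embedding nodes, and sources and destinations are modeled as in-memory dictionaries rather than real connectors.

```python
def process(name: str, content: str) -> dict:
    """Stand-in for the partition, chunk, and embed steps."""
    return {"source_file": name, "text": content.strip()}


def run_once(sources: dict, destinations: dict) -> None:
    """Ingest every file from every source; deliver results everywhere.

    sources: source name -> {file name: file content}
    destinations: destination name -> list that receives results
    """
    for files in sources.values():
        for name, content in files.items():
            result = process(name, content)
            for delivered in destinations.values():
                delivered.append(result)


sources = {
    "s3-bucket": {"a.txt": " alpha "},
    "azure-container": {"b.txt": "beta"},
}
destinations = {"vector-db": [], "archive": []}
run_once(sources, destinations)
print({name: len(items) for name, items in destinations.items()})
```

Each destination ends up with one result per ingested file, regardless of which source the file came from.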

Custom workflow node types

Edit, delete, or run a workflow

To run a workflow once, manually:

  1. On the sidebar, click Workflows.
  2. In the list of workflows, click Run in the row for the workflow that you want to run.

For each workflow on the Workflows list page, the following actions are available when you click the ellipsis (the three dots) in that workflow’s row:

  • Edit via Form: Changes the existing configuration of your workflow.
  • Delete: Removes the workflow from the platform. Use this action cautiously, as it will permanently delete the workflow and its configurations.
  • Open: Opens the workflow’s settings page.

Pause a scheduled workflow

To stop running a workflow that is set to run on a repeating schedule:

  1. On the sidebar, click Workflows.
  2. In the list of workflows, turn off the Status toggle in the row for the workflow that you want to pause.

Turning off the Status toggle also disables the workflow’s Run button, which prevents that workflow from being run manually as well.

To resume running the workflow on its original repeating schedule, as well as enable the workflow to be run manually as needed, turn on the workflow’s Status toggle.
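
The relationship between the Status toggle, the schedule, and the Run button can be summarized in a small state model. `WorkflowStatus` and its attributes are hypothetical names used for illustration and do not correspond to a platform API.

```python
class WorkflowStatus:
    """Hypothetical model of the Status toggle; not a platform API."""

    def __init__(self, scheduled: bool, enabled: bool = True):
        self.scheduled = scheduled  # has a repeating schedule
        self.enabled = enabled      # state of the Status toggle

    def can_run_manually(self) -> bool:
        # The Run button is disabled whenever the toggle is off.
        return self.enabled

    def will_run_on_schedule(self) -> bool:
        # The schedule itself is kept; it just does not fire while off.
        return self.scheduled and self.enabled


wf = WorkflowStatus(scheduled=True)
wf.enabled = False  # pause
print(wf.can_run_manually(), wf.will_run_on_schedule())
wf.enabled = True   # resume on the original schedule
print(wf.can_run_manually(), wf.will_run_on_schedule())
```

The model keeps `scheduled` unchanged while the toggle is off, which reflects the statement above that turning the toggle back on resumes the original schedule.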