Create a new workflow, either custom or auto, and configure its settings.

`workflow_type` accepts the following values:

- `auto` uses sensible default workflow settings to enable you to get good-quality results faster.
- `custom` enables you to fine-tune the workflow settings to get very specific results.
- `advanced`, `basic`, and `platinum` are non-operational and will be removed in a future release.

Workflows that do not specify `source_id` use a local file source. Local-source workflows must set `workflow_type` to `custom`, cannot be set to run on a repeating schedule, and cannot be run from the Unstructured UI (though they can be run via the API or Python SDK).

Each workflow node requires `id` (string, UUID) and `node_type` (string), and supports optional `node_subtype` (string), `config` (object), and `params` (object). For more information on workflow nodes, see Workflow nodes.

A repeating schedule takes one of the following values:

| Value | cron | Description |
|---|---|---|
| every 15 minutes | `*/15 * * * *` | Every 15 minutes. |
| every hour | `0 * * * *` | At the first minute of every hour. |
| every 2 hours | `0 */2 * * *` | At the first minute of every second hour. |
| every 4 hours | `0 */4 * * *` | At the first minute of every fourth hour. |
| every 6 hours | `0 */6 * * *` | At the first minute of every sixth hour. |
| every 8 hours | `0 */8 * * *` | At the first minute of every eighth hour. |
| every 10 hours | `0 */10 * * *` | At the first minute of every tenth hour. |
| every 12 hours | `0 */12 * * *` | At the first minute of every twelfth hour. |
| daily | `0 0 * * *` | At the first minute of every day. |
| weekly | `0 0 * * 0` | At the first minute of every Sunday. |
| monthly | `0 0 1 * *` | At the first minute of the first day of every month. |
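
The settings rules and schedule values above can be sketched as a small pre-flight check. This is a minimal sketch, not part of the Unstructured API or Python SDK: `check_workflow_settings` is a hypothetical helper, the `schedule` and `workflow_nodes` key names are assumptions, and it assumes that a workflow without `source_id` is a local-source workflow.

```python
# Schedule values and their cron equivalents, copied from the table above.
SCHEDULE_TO_CRON = {
    "every 15 minutes": "*/15 * * * *",
    "every hour": "0 * * * *",
    "every 2 hours": "0 */2 * * *",
    "every 4 hours": "0 */4 * * *",
    "every 6 hours": "0 */6 * * *",
    "every 8 hours": "0 */8 * * *",
    "every 10 hours": "0 */10 * * *",
    "every 12 hours": "0 */12 * * *",
    "daily": "0 0 * * *",
    "weekly": "0 0 * * 0",
    "monthly": "0 0 1 * *",
}


def check_workflow_settings(settings: dict) -> list:
    """Return a list of problems found in a settings dict (empty if none).

    Hypothetical helper: the "schedule" and "workflow_nodes" keys are
    assumed names, not confirmed request fields.
    """
    problems = []

    workflow_type = settings.get("workflow_type")
    if workflow_type not in ("custom", "auto"):
        problems.append("workflow_type should be custom or auto")

    schedule = settings.get("schedule")
    if schedule is not None and schedule not in SCHEDULE_TO_CRON:
        problems.append(f"unknown schedule value: {schedule!r}")

    # Assumption: no source_id means the workflow uses a local file source.
    if settings.get("source_id") is None:
        if workflow_type != "custom":
            problems.append("local-source workflows must set workflow_type to custom")
        if schedule is not None:
            problems.append("local-source workflows cannot run on a repeating schedule")

    # Each node requires id and node_type; the other node fields are optional.
    for node in settings.get("workflow_nodes", []):
        for field in ("id", "node_type"):
            if field not in node:
                problems.append(f"workflow node is missing required field {field!r}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass before sending anything to the API.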

`reprocess_all` defaults to `false`. If `true`, the workflow reprocesses all documents in the source location on every run. If `false`, the workflow excludes from future processing any files Unstructured determines are unchanged since the last time the workflow ran.

Unstructured determines whether a document has changed based on the document version. For each workflow, Unstructured maintains a record of the documents (and their versions, if present) processed by that workflow. Each document record consists of:

- A `record_id` derived from the document name and path.
- A `record_version` derived from either the document ETag (if the source provider generates one) or the source provider's native version identifier.

When you set `reprocess_all` to `false` for a source connector that supports `reprocess_all`, Unstructured uses this list of records to determine whether or not to process each document:

- If the `record_id` does not exist in the workflow records, Unstructured processes the document.
- If the `record_id` exists, but the `record_version` has changed, or there is no `record_version`, Unstructured processes the document.

The following table lists the possible `record_id` and `record_version` combinations, and the action Unstructured takes in each case:

| `record_id` | `record_version` | Action |
|---|---|---|
| Exists | Unchanged | Do not process file |
| Exists | Changed | Process file |
| Exists | (none) | Process file |
| New | (Does not apply) | Process file |
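
The decision table above can be expressed as a single function. This is a minimal sketch of the documented rules only: `should_process` and the `records` mapping (from `record_id` to the last processed `record_version`) are hypothetical names, and Unstructured's actual record storage is not described here.

```python
from typing import Dict, Optional


def should_process(
    records: Dict[str, Optional[str]],
    record_id: str,
    record_version: Optional[str],
    reprocess_all: bool = False,
) -> bool:
    """Decide whether a document is processed on this workflow run."""
    if reprocess_all:
        return True  # every document is reprocessed on every run
    if record_id not in records:
        return True  # New: no record for this document yet
    if record_version is None:
        return True  # Exists, but there is no version to compare
    # Exists: process only if the version differs from the recorded one.
    return records[record_id] != record_version


# Records left behind by a previous run (illustrative values).
records = {"reports/q1.pdf": "etag-1"}

print(should_process(records, "reports/q1.pdf", "etag-1"))  # False
print(should_process(records, "reports/q1.pdf", "etag-2"))  # True
print(should_process(records, "reports/q1.pdf", None))      # True
print(should_process(records, "reports/q2.pdf", "etag-1"))  # True
```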

Renaming a document changes its `record_id`; Unstructured will then reprocess the renamed document when the workflow runs.

The following table lists the source connectors that support the `reprocess_all` setting. The `record_version` base column specifies the versioning information Unstructured uses to generate the corresponding record version for each processed document. Source connectors that do not support `reprocess_all` reprocess every document in the source location each time the workflow runs.

| Connector | `record_version` base |
|---|---|
| Amazon S3 | ETag |
| Azure Blob Storage | ETag |
| Box | Provider version ID |
| Dropbox | Provider version ID |
| Elasticsearch | Provider version ID |
| Google Cloud Storage | ETag |
| Google Drive | Provider version ID |
| Microsoft OneDrive | Provider version ID |
| Microsoft SharePoint | Provider version ID |
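
The connector table above implies a fallback: for any connector that is not listed, every document is reprocessed regardless of the setting. A minimal sketch of that rule follows; `RECORD_VERSION_BASE` and `effective_reprocess_all` are hypothetical names used for illustration, and the connector labels are the display names from the table rather than API identifiers.

```python
# Connectors that support reprocess_all, with their record_version base,
# copied from the table above.
RECORD_VERSION_BASE = {
    "Amazon S3": "ETag",
    "Azure Blob Storage": "ETag",
    "Box": "Provider version ID",
    "Dropbox": "Provider version ID",
    "Elasticsearch": "Provider version ID",
    "Google Cloud Storage": "ETag",
    "Google Drive": "Provider version ID",
    "Microsoft OneDrive": "Provider version ID",
    "Microsoft SharePoint": "Provider version ID",
}


def effective_reprocess_all(connector: str, reprocess_all: bool) -> bool:
    """Return whether a run will reprocess every document in the source."""
    if connector not in RECORD_VERSION_BASE:
        # Unsupported connectors reprocess everything on each run.
        return True
    return reprocess_all
```

For example, `effective_reprocess_all("Amazon S3", False)` returns `False`, while the same call with a connector name that is not in the table returns `True`.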
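
Note the following when you set `reprocess_all` to `false`:

- Uploading a document again under the same name and path keeps its `record_id`, but may result in a different `record_version` being generated. This is especially true of multipart uploads. This results in Unstructured reprocessing the document.
- Some operations at the source can result in the `record_version` of a document changing, and Unstructured reprocessing the document.
- To reprocess every document regardless of these records, set `reprocess_all` to `true` for at least the next workflow run.

The workflow type is `custom` or `auto`. The workflow status is `active`, `inactive`, or `paused`.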