You can use Databricks Volumes events, such as uploading files to Databricks Volumes, to automatically run Unstructured ETL+ workflows that rely on those Databricks Volumes as sources. This enables a no-touch approach to having Unstructured automatically process files as they are uploaded to Databricks Volumes.
This example shows how to automate this process by adding a custom job in Lakeflow Jobs for your Databricks workspace in AWS, Azure, or GCP. The job runs whenever a file upload event is detected in the specified Databricks Volume and uses a custom Databricks notebook to call the Unstructured Workflow Endpoint, which runs the corresponding Unstructured ETL+ workflow within your Unstructured account.
This example uses a custom job in Lakeflow Jobs and a custom Databricks notebook that you create and maintain. Any issues with file detection, timing, or job execution could be related to your custom job or notebook rather than to Unstructured. If you get unexpected or no results, be sure to check your custom job’s run logs first for any informational and error messages.
To use this example, you will need the following:
An Unstructured account, and an Unstructured API key for your account, as follows:
Sign in to your Unstructured account:
Get your Unstructured API key:
a. In the Unstructured UI, click API Keys on the sidebar.
b. Click Generate API Key.
c. Follow the on-screen instructions to finish generating the key.
d. Click the Copy icon next to your new key to add the key to your system’s clipboard. If you lose this key, simply return and click the Copy icon again.
The Unstructured Workflow Endpoint URL for your account, as follows:
A Databricks Volumes source connector in your Unstructured account. Learn how.
Any available destination connector in your Unstructured account.
A workflow that uses the preceding source and destination connectors. Learn how.
Sign in to the Databricks workspace within your Databricks account for AWS, Azure, or GCP that corresponds to the workspace you specified for your Databricks Volumes source connector.
On the sidebar, click + New > Notebook.
Click the notebook’s title and change it to something more descriptive, such as Unstructured Workflow Runner Notebook.
In the notebook’s first cell, add the following code:
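A minimal sketch of what this first cell could contain follows, assuming the second cell (described next) calls the Unstructured Workflow Endpoint with the requests library; it installs the library and restarts Python so that it can be imported:

```python
# Install the HTTP client used by the next cell. On most Databricks clusters,
# requests is already preinstalled, in which case this is effectively a no-op.
%pip install requests

# Restart Python so that newly installed libraries are importable in later cells.
dbutils.library.restartPython()
```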
Click Edit > Insert cell below.
In this second cell, add the following code:
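A hedged sketch of this second cell follows. It assumes the Workflow Endpoint exposes a POST /workflows/&lt;workflow-id&gt;/run route and accepts an unstructured-api-key header, so confirm both against the Unstructured API reference; the placeholders are the same ones described in the next step:

```python
import os
import requests

# Placeholders described in the next step.
UNSTRUCTURED_API_URL = "<unstructured-api-url>"  # Your Unstructured Workflow Endpoint URL.
WORKFLOW_ID = "<workflow-id>"                    # The ID of the workflow to run.

# Get the Unstructured API key, either from Databricks Secrets or from an
# environment variable named UNSTRUCTURED_API_KEY.
try:
    api_key = dbutils.secrets.get(scope="<scope>", key="<key>")
except Exception:
    api_key = os.environ["UNSTRUCTURED_API_KEY"]

# Call the Workflow Endpoint to start a run of the specified workflow.
# The request path and header name are assumptions for illustration; verify
# them in the Unstructured API documentation.
response = requests.post(
    f"{UNSTRUCTURED_API_URL}/workflows/{WORKFLOW_ID}/run",
    headers={
        "accept": "application/json",
        "unstructured-api-key": api_key,
    },
)
response.raise_for_status()
print(response.json())
```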
Replace the placeholders in this second cell as follows:
Replace <unstructured-api-url> with the value of the Unstructured Workflow Endpoint field that you noted earlier in the requirements.
Replace <workflow-id> with the ID of the workflow that you want to run.
For your Unstructured API key, do one of the following:
Replace <scope> and <key> with the scope and key names for the existing secret that you have already created in Databricks Secrets. Learn how to work with Databricks Secrets for AWS, Azure, or GCP.
Set the UNSTRUCTURED_API_KEY environment variable to your Unstructured API key value. Learn how for AWS, Azure, or GCP.
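If you have not created that secret yet, one way to do so is with the Databricks SDK for Python, as in the following sketch; the scope and key names here are hypothetical, and you would substitute your own values for <scope> and <key> in the notebook’s second cell:

```python
from databricks.sdk import WorkspaceClient

# Authenticates with your Databricks workspace using your existing Databricks
# configuration (for example, a configuration profile or environment variables).
w = WorkspaceClient()

# Hypothetical scope and key names; use your own values.
w.secrets.create_scope(scope="unstructured")
w.secrets.put_secret(
    scope="unstructured",
    key="api-key",
    string_value="<your-unstructured-api-key>",
)
```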
With your Databricks workspace still open from the previous step, on the sidebar, click Jobs & Pipelines.
Expand Create new, and then click Job.
Click the job’s title and change it to something more descriptive, such as Unstructured Workflow Runner Job.
On the Tasks tab, enter a Task name, such as Run_Unstructured_Workflow_Runner_Notebook.
With Notebook selected for Type, and with Workspace selected for Source, use the Path dropdown to select the notebook you created in the previous step.
For Cluster, select the cluster you want to use to run the notebook.
Click Create task.
In the Job details pane, under Schedules & Triggers, click Add trigger.
For Trigger type, select File arrival.
For Storage location, enter the path to the volume to monitor or, if you are monitoring a folder within that volume, the path to the folder. To get this path, do the following:
a. On the sidebar, click Catalog.
b. In the list of catalogs, expand the catalog that contains the volume you want to monitor.
c. In the list of schemas (formerly known as databases), expand the schema that contains the volume you want to monitor.
d. Expand Volumes.
e. Click the volume you want to monitor.
f. On the Overview tab, copy the path to the volume you want to monitor or, if you are monitoring a folder within that volume, click the path to the folder and then copy the path to that folder.
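The copied path typically takes the form /Volumes/<catalog>/<schema>/<volume> or, if you are monitoring a folder within the volume, /Volumes/<catalog>/<schema>/<volume>/<path-to-folder>.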
Click Save.
To stop triggering the job, with your job in Lakeflow Jobs still open from Step 4, in the Job details pane, under Schedules & Triggers, click Pause.