Batch process all your records to store structured outputs in Databricks Volumes.

You will need:

The Databricks Volumes prerequisites:

  • The Databricks compute resource’s host name. Get the host name for AWS, Azure, or GCP.
  • The Databricks authentication details. For more information, see the documentation for AWS, Azure, or GCP.
  • The Databricks catalog name for the Volume. Get the catalog name for AWS, Azure, or GCP.
  • The Databricks Volume name. Get the volume name for AWS, Azure, or GCP.

The Databricks Volumes connector dependencies:

For both the CLI and the Python library:
pip install "unstructured-ingest[databricks-volumes]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • DATABRICKS_HOST - The Databricks compute resource’s host name, represented by --host (CLI) or host (Python).
  • DATABRICKS_CATALOG - The Databricks catalog name for the Volume, represented by --catalog (CLI) or catalog (Python).
  • DATABRICKS_VOLUME - The Databricks Volume name, represented by --volume (CLI) or volume (Python).

Environment variables for your authentication type, depending on which authentication types your cloud provider supports. For example, for username and password authentication:

  • DATABRICKS_USERNAME - The Databricks account user’s name, represented by --username (CLI) or username (Python).
  • DATABRICKS_PASSWORD - The Databricks account user’s password, represented by --password (CLI) or password (Python).

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.

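Before you call the pipeline, you can optionally confirm that these variables are set. The following is a minimal Python sketch; it only checks the variables listed above and assumes username and password authentication.

Python
import os

# Environment variables expected by the Databricks Volumes destination
# connector and the Unstructured API, as listed above.
required = [
    "DATABRICKS_HOST",
    "DATABRICKS_CATALOG",
    "DATABRICKS_VOLUME",
    "DATABRICKS_USERNAME",  # Replace with the variables for your authentication type.
    "DATABRICKS_PASSWORD",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
]

missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
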
Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector; this example uses the local source connector.
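
The following is a minimal Python sketch, not a definitive recipe: it assumes the v2 pipeline API of the unstructured-ingest library, so verify the import paths, the config class names (for example, DatabricksNativeVolumesConnectionConfig), and which config object each setting belongs to against the version you installed. LOCAL_FILE_INPUT_DIR is a hypothetical environment variable naming the local directory to ingest.

Python
import os

from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.processes.connectors.databricks.volumes_native import (
    DatabricksNativeVolumesAccessConfig,
    DatabricksNativeVolumesConnectionConfig,
    DatabricksNativeVolumesUploaderConfig,
)
from unstructured_ingest.v2.processes.connectors.local import (
    LocalConnectionConfig,
    LocalDownloaderConfig,
    LocalIndexerConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Local source connector: index and read files from a local directory.
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Partition each record by calling the Unstructured API.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        ),
        # Databricks Volumes destination connector, using the username and
        # password authentication variables described above.
        destination_connection_config=DatabricksNativeVolumesConnectionConfig(
            access_config=DatabricksNativeVolumesAccessConfig(
                username=os.getenv("DATABRICKS_USERNAME"),
                password=os.getenv("DATABRICKS_PASSWORD"),
            ),
            host=os.getenv("DATABRICKS_HOST"),
        ),
        uploader_config=DatabricksNativeVolumesUploaderConfig(
            catalog=os.getenv("DATABRICKS_CATALOG"),
            volume=os.getenv("DATABRICKS_VOLUME"),
        ),
    ).run()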