Databricks Volumes

This article covers connecting Unstructured to Databricks Volumes.

For information about connecting Unstructured to Delta Tables in Databricks instead, see Delta Tables in Databricks.

Send processed data from Unstructured to Databricks Volumes.

The requirements are as follows.

A Databricks account on AWS, Azure, or GCP.
A workspace within the Databricks account for AWS, Azure, or GCP.
The workspace’s URL. Get the workspace URL for AWS, Azure, or GCP.

Examples:
- AWS: https://<workspace-id>.cloud.databricks.com
- Azure: https://adb-<workspace-id>.<random-number>.azuredatabricks.net
- GCP: https://<workspace-id>.<random-number>.gcp.databricks.com
The Databricks authentication details. For more information, see the documentation for AWS, Azure, or GCP.

For the Unstructured Platform, only Databricks OAuth machine-to-machine (M2M) authentication is supported for AWS, Azure, and GCP. You will need the Client ID (or UUID or Application ID) and OAuth Secret (client secret) values for the corresponding service principal. Note that for Azure, only Databricks managed service principals are supported; Microsoft Entra ID managed service principals are not.
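To check that a Client ID and OAuth Secret pair works before entering it into a connector, you can request a token from the workspace directly. The following is a minimal sketch, not part of Unstructured: the function name `build_m2m_token_request` is hypothetical, and it assumes the workspace-level token endpoint is `<workspace-url>/oidc/v1/token` with the `all-apis` scope, which you should confirm against the Databricks documentation for your cloud. It only builds the request and does not send it.

```python
import base64
import urllib.parse
import urllib.request


def build_m2m_token_request(
    host: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build (but do not send) an OAuth client-credentials token request."""
    # Assumption: the workspace-level token endpoint lives at <host>/oidc/v1/token.
    url = host.rstrip("/") + "/oidc/v1/token"

    # The Client ID and OAuth Secret are passed as HTTP Basic credentials.
    credentials = base64.b64encode(
        f"{client_id}:{client_secret}".encode()
    ).decode()

    # Assumption: the "all-apis" scope is what the service principal needs.
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()

    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. If the credentials are valid, the response body should be JSON that includes an access token.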


For Unstructured Ingest, the following Databricks authentication types are supported:
- For Databricks personal access token authentication for AWS, Azure, or GCP: The personal access token’s value.
- For username and password (basic) authentication (AWS only): The user’s name and password values.
- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
- For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
- For Azure managed identities (formerly Managed Service Identities (MSI)) authentication (Azure only): The client ID value for the corresponding managed identity.
- For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
- For Azure CLI authentication (Azure only): No additional values.
- For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
- For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account’s credentials file.
- For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account’s email address.
The name of the parent catalog in Unity Catalog for AWS, Azure, or GCP for the volume.
The name of the parent schema (formerly known as a database) in Unity Catalog for AWS, Azure, or GCP for the volume.
The name of the volume in Unity Catalog for AWS, Azure, or GCP, and optionally any path in that volume that you want to access directly, beginning with the volume’s root.
The Databricks workspace user or service principal must have the following minimum set of privileges to read from or write to the existing volume in Unity Catalog:
- USE CATALOG on the volume’s parent catalog in Unity Catalog.
- USE SCHEMA on the volume’s parent schema (formerly known as a database) in Unity Catalog.
- READ VOLUME and WRITE VOLUME on the volume.

Learn more about how to check and set Unity Catalog privileges for AWS, Azure, or GCP.
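The minimum privileges listed above can be granted with SQL `GRANT` statements. The helper below is a hedged sketch that only generates the statement text. The function name `volume_grant_statements` is hypothetical, and for a service principal the `principal` argument is assumed to be its application ID. Run the output in a Databricks SQL editor or notebook as a user who is allowed to grant these privileges.

```python
def volume_grant_statements(
    catalog: str, schema: str, volume: str, principal: str
) -> list[str]:
    """Return GRANT statements for the minimum volume privileges."""

    def quote(name: str) -> str:
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    cat, sch, vol, who = (quote(n) for n in (catalog, schema, volume, principal))
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {cat}.{sch} TO {who}",
        f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME {cat}.{sch}.{vol} TO {who}",
    ]
```

For example, `volume_grant_statements("main", "default", "landing", "my-app-id")` yields three statements, one per privilege level, using placeholder names that you would replace with your own.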

To create the destination connector:

On the sidebar, click Connectors.
Click Destinations.
Click New or Create Connector.
Give the connector a unique Name.
In the Provider area, click Databricks Volumes.
Click Continue.
Follow the on-screen instructions to fill in the fields as described later on this page.
Click Save and Test.

Fill in the following fields:

Name (required): A unique name for this connector.
Host (required): The Databricks workspace host URL.
Catalog (required): The name of the catalog to use.
Schema: The name of the associated schema. If not specified, the schema named default is used.
Volume (required): The name of the associated volume.
Volume Path: Any optional path to access within the volume.
Client Secret (required): The associated OAuth Secret value for the Databricks managed service principal that has the appropriate privileges to the volume.
Client ID (required): The Client ID (or UUID or Application ID) value for the Databricks managed service principal that has appropriate privileges to the volume.
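To see how these fields fit together, the sketch below mirrors them in a small Python class. It is illustrative only: `VolumeDestination` and its methods are hypothetical and not an Unstructured API. It assumes that files in a Unity Catalog volume are addressed as `/Volumes/<catalog>/<schema>/<volume>/<path>`, and it applies the `default` schema when Schema is left blank.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VolumeDestination:
    """Hypothetical container mirroring the connector fields."""

    name: str
    host: str
    catalog: str
    volume: str
    client_id: str
    client_secret: str
    schema: str = "default"  # used when Schema is not specified
    volume_path: Optional[str] = None

    def missing_required(self) -> list[str]:
        """List the required fields that are empty or whitespace only."""
        required = (
            "name", "host", "catalog", "volume", "client_id", "client_secret"
        )
        return [f for f in required if not getattr(self, f).strip()]

    def target_path(self) -> str:
        """Compose the volume location, beginning at the volume's root."""
        parts = ["", "Volumes", self.catalog, self.schema, self.volume]
        if self.volume_path:
            parts.append(self.volume_path.strip("/"))
        return "/".join(parts)
```

With catalog `main`, volume `landing`, and no Schema or Volume Path, `target_path()` resolves to the volume root under the `default` schema. Setting a Volume Path appends it beneath that root.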
