Ingest your files into Unstructured from Databricks Volumes.

The requirements are as follows.

Databricks personal access tokens (PATs) are supported only for Unstructured Ingest.

Databricks-managed service principals are supported by both the Unstructured Platform and Unstructured Ingest.

  • The Databricks workspace URL. Get the workspace URL for AWS, Azure, or GCP.

    Examples:

    • AWS: https://<workspace-id>.cloud.databricks.com
    • Azure: https://adb-<workspace-id>.<random-number>.azuredatabricks.net
    • GCP: https://<workspace-id>.<random-number>.gcp.databricks.com
  • The Databricks authentication details. For more information, see the documentation for AWS, Azure, or GCP.

    For the Unstructured Platform, only the following Databricks authentication type is supported:

    • For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal. Note that for Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.

    For Unstructured Ingest, the following Databricks authentication types are supported:

    • For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token’s value.
    • For username and password (basic) authentication (AWS only): The username and password values.
    • For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
    • For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
    • For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
    • For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
    • For Azure CLI authentication (Azure only): No additional values.
    • For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
    • For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account’s credentials file.
    • For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account’s email address.
  • The Databricks catalog name for the volume. Get the catalog name for AWS, Azure, or GCP.

  • The Databricks schema name for the volume. Get the schema name for AWS, Azure, or GCP.

  • The Databricks volume name, and optionally any path in that volume that you want to access directly. Get the volume information for AWS, Azure, or GCP.

  • Make sure that the target user or service principal has access to the target volume. To learn more, see the documentation for AWS, Azure, or GCP.
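
As a rough illustration of how the values above fit together, the following Python sketch guesses the cloud from the workspace URL's host suffix (using the example URL patterns listed earlier) and composes the path of a volume from its catalog, schema, and volume names. The helper names `guess_cloud` and `volume_path` are hypothetical and are not part of any Unstructured or Databricks library. The `/Volumes/<catalog>/<schema>/<volume>` layout is assumed from the usual Databricks convention for volume paths.

```python
from urllib.parse import urlparse

# Host suffixes taken from the example workspace URLs in this page.
_CLOUD_SUFFIXES = {
    ".azuredatabricks.net": "Azure",
    ".gcp.databricks.com": "GCP",
    ".cloud.databricks.com": "AWS",
}


def guess_cloud(workspace_url: str) -> str:
    """Return 'AWS', 'Azure', or 'GCP' based on the workspace host suffix."""
    host = urlparse(workspace_url).hostname or ""
    for suffix, cloud in _CLOUD_SUFFIXES.items():
        if host.endswith(suffix):
            return cloud
    raise ValueError(f"Unrecognized Databricks workspace host: {host!r}")


def volume_path(catalog: str, schema: str, volume: str, path: str = "") -> str:
    """Compose /Volumes/<catalog>/<schema>/<volume>[/<path>] (hypothetical helper)."""
    parts = ["Volumes", catalog, schema, volume]
    sub_path = path.strip("/")  # tolerate leading or trailing slashes
    if sub_path:
        parts.append(sub_path)
    return "/" + "/".join(parts)
```

For example, `volume_path("main", "default", "docs", "raw/pdfs/")` yields `/Volumes/main/default/docs/raw/pdfs`, where `main`, `default`, `docs`, and `raw/pdfs` are placeholder names.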

To create the source connector:

  1. On the sidebar, click Connectors.
  2. Click Sources.
  3. Click New or Create Connector.
  4. Give the connector a unique Name.
  5. In the Provider area, click Databricks Volumes.
  6. Click Continue.
  7. Follow the on-screen instructions to fill in the fields as described later on this page.
  8. Click Save and Test.

Fill in the following fields:

  • Name (required): A unique name for this connector.
  • Host (required): The Databricks workspace host URL.
  • Catalog (required): The name of the catalog to use.
  • Schema: The name of the associated schema. If not specified, the schema named default is used.
  • Volume (required): The name of the associated volume.
  • Volume Path: An optional path to access within the volume.
  • Client Secret (required): The associated OAuth secret value for the Databricks-managed service principal that has access to the volume.
  • Client ID (required): The application ID value for the Databricks-managed service principal that has access to the volume.
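
Before filling in the form, it can help to collect the field values in one place and check that nothing required is missing. The following Python sketch models the fields listed above; the class name `DatabricksVolumesSourceFields` and its method are hypothetical, for illustration only, and do not correspond to any Unstructured API.

```python
from dataclasses import dataclass


@dataclass
class DatabricksVolumesSourceFields:
    """Hypothetical container mirroring the connector fields listed above."""

    name: str
    host: str
    catalog: str
    volume: str
    client_id: str
    client_secret: str
    schema: str = "default"  # used when no schema is specified
    volume_path: str = ""    # optional path within the volume

    def missing_required(self) -> list[str]:
        """Return the names of required fields that are empty or whitespace."""
        required = ("name", "host", "catalog", "volume", "client_id", "client_secret")
        return [field for field in required if not getattr(self, field).strip()]
```

An empty list from `missing_required()` means every field marked as required above has a value.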

To learn how to create a Databricks-managed service principal, get its application ID, and generate an associated OAuth secret, see the documentation for AWS, Azure, or GCP.

For Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.
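
One way to sanity-check a client ID and OAuth secret before entering them is to request a workspace-level access token yourself. The sketch below only builds the HTTP request and does not send it. It assumes the workspace token endpoint `/oidc/v1/token` and the `all-apis` scope as described in the Databricks OAuth M2M documentation; verify both against the documentation for your cloud. The function name `build_token_request` is hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request.

    Assumption: the workspace token endpoint is <host>/oidc/v1/token and the
    scope is "all-apis".
    """
    url = host.rstrip("/") + "/oidc/v1/token"
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode("ascii")
    # The client ID and OAuth secret are sent as HTTP basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would perform the exchange; a JSON response containing an access token suggests that the credentials are valid.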

To learn how to grant a Databricks-managed service principal access to a volume, see the documentation for AWS, Azure, or GCP.
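
As a minimal sketch of what granting access can look like, the following Python helper generates Unity Catalog SQL statements that give a service principal read access to a volume. It assumes that the principal is identified by its application ID and that the `USE CATALOG`, `USE SCHEMA`, and `READ VOLUME` privileges are sufficient for a source connector; check the Databricks documentation for your cloud before running the statements. The function name `grant_statements` is hypothetical.

```python
def grant_statements(catalog: str, schema: str, volume: str, principal: str) -> list[str]:
    """Return SQL GRANT statements for read access to one volume (a sketch)."""

    def quote(identifier: str) -> str:
        # Backtick-quote identifiers, doubling any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    cat, sch, vol, who = (quote(x) for x in (catalog, schema, volume, principal))
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {cat}.{sch} TO {who}",
        f"GRANT READ VOLUME ON VOLUME {cat}.{sch}.{vol} TO {who}",
    ]
```

The generated statements can be run in a Databricks SQL editor or notebook by a user who is allowed to grant privileges on the catalog, schema, and volume.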