
# Databricks Volumes

Connect Databricks Volumes to your preprocessing pipeline, and use the Unstructured Ingest CLI or the Unstructured Ingest Python library to batch process all your documents and store structured outputs locally on your filesystem.

The requirements are as follows.

* A Databricks account on [AWS](https://docs.databricks.com/getting-started/free-trial.html),
  [Azure](https://learn.microsoft.com/azure/databricks/getting-started/), or
  [GCP](https://docs.gcp.databricks.com/getting-started/index.html).

* A workspace within the Databricks account for [AWS](https://docs.databricks.com/admin/workspace/index.html),
  [Azure](https://learn.microsoft.com/azure/databricks/admin/workspace/), or
  [GCP](https://docs.gcp.databricks.com/admin/workspace/index.html).

* The workspace's URL. Get the workspace URL for
  [AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids),
  [Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids),
  or [GCP](https://docs.gcp.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids).

  Examples:

  * AWS: `https://<workspace-id>.cloud.databricks.com`
  * Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
  * GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`

  <Note>
    Do not add a trailing slash (`/`) to the workspace URL.
  </Note>

* The Databricks authentication details. For more information, see the documentation for
  [AWS](https://docs.databricks.com/dev-tools/auth/index.html),
  [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/),
  or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).

  For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), the following Databricks authentication types are supported:

  * Databricks OAuth machine-to-machine (M2M) authentication for\
    [AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
    [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-m2m), or
    [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).

    You will need the **Client ID** (or **UUID** or **Application ID**) and OAuth **Secret** (client secret) values for the corresponding service principal.
    Note that for Azure, only Databricks managed service principals are supported. Microsoft Entra ID managed service principals are not supported.

    The following video shows how to create a Databricks managed service principal:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/wBmqv5DaA1E" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  * Databricks personal access token authentication for
    [AWS](https://docs.databricks.com/dev-tools/auth/pat.html),
    [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat), or
    [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html).

    You will need the personal access token's value.

    The following video shows how to create a Databricks personal access token:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/OzEU2miAS6I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  For [Unstructured Ingest](/open-source/ingestion/overview), the following Databricks authentication types are supported:

  * For Databricks personal access token authentication ([AWS](https://docs.databricks.com/dev-tools/auth/pat.html),
    [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat), or
    [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html)): The personal access token's value.

    The following video shows how to create a Databricks personal access token:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/OzEU2miAS6I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  * For username and password (basic) authentication ([AWS](https://docs.databricks.com/archive/dev-tools/basic.html) only): The user's name and password values.

  * For OAuth machine-to-machine (M2M) authentication ([AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
    [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-m2m), and
    [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html)): The client ID and OAuth secret values for the corresponding service principal.

  * For OAuth user-to-machine (U2M) authentication ([AWS](https://docs.databricks.com/dev-tools/auth/oauth-u2m.html),
    [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-u2m), and
    [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-u2m.html)): No additional values.

  * For Azure managed identities (formerly Managed Service Identities (MSI) authentication) ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/azure-mi) only): The client ID value for the corresponding managed identity.

  * For Microsoft Entra ID service principal authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/azure-sp) only): The tenant ID, client ID, and client secret values for the corresponding service principal.

  * For Azure CLI authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/azure-cli) only): No additional values.

  * For Microsoft Entra ID user authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/user-aad-token) only): The Entra ID token for the corresponding Entra ID user.

  * For Google Cloud Platform credentials authentication ([GCP](https://docs.gcp.databricks.com/dev-tools/auth/gcp-creds.html) only): The local path to the corresponding Google Cloud service account's credentials file.

  * For Google Cloud Platform ID authentication ([GCP](https://docs.gcp.databricks.com/dev-tools/auth/gcp-id.html) only): The Google Cloud service account's email address.

* The name of the parent catalog in Unity Catalog for
  [AWS](https://docs.databricks.com/catalogs/create-catalog.html),
  [Azure](https://learn.microsoft.com/azure/databricks/catalogs/create-catalog), or
  [GCP](https://docs.gcp.databricks.com/catalogs/create-catalog.html) for the volume.

* The name of the parent schema (formerly known as a database) in Unity Catalog for
  [AWS](https://docs.databricks.com/schemas/create-schema.html),
  [Azure](https://learn.microsoft.com/azure/databricks/schemas/create-schema), or
  [GCP](https://docs.gcp.databricks.com/schemas/create-schema.html) for the volume.

* The name of the volume in Unity Catalog for [AWS](https://docs.databricks.com/tables/managed.html),
  [Azure](https://learn.microsoft.com/azure/databricks/tables/managed), or
  [GCP](https://docs.gcp.databricks.com/tables/managed.html), and optionally any path in that volume that you want to access directly, beginning with the volume's root.

* The Databricks workspace user or service principal must have the following *minimum* set of privileges to read from or write to the
  existing volume in Unity Catalog:

  * `USE CATALOG` on the volume's parent catalog in Unity Catalog.
  * `USE SCHEMA` on the volume's parent schema (formerly known as a database) in Unity Catalog.
  * `READ VOLUME` and `WRITE VOLUME` on the volume.

  The following videos show how to create and set privileges for a catalog, schema (formerly known as a database), and volume in Unity Catalog.

  <iframe width="560" height="315" src="https://www.youtube.com/embed/yF9DJphhQQc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  Learn more about how to check and set Unity Catalog privileges for
  [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
  [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
  [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).
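  These minimum privileges map onto standard Unity Catalog `GRANT` statements. As a minimal sketch (the catalog, schema, volume, and principal names below are placeholders), the following composes the SQL you would run in a Databricks SQL editor:

  ```python
  # Placeholders -- substitute your own object names and principal.
  catalog, schema, volume = "my_catalog", "my_schema", "my_volume"
  principal = "someone@example.com"

  # One GRANT per level of the Unity Catalog hierarchy: catalog, schema, volume.
  grants = [
      f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
      f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
      f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME {catalog}.{schema}.{volume} TO `{principal}`;",
  ]
  print("\n".join(grants))
  ```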

Install the Databricks Volumes connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[databricks-volumes]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

Set the following environment variables:

* `DATABRICKS_HOST` - The Databricks host URL, represented by `--host` (CLI) or `host` (Python).

  <Note>
    Do not add a trailing slash (`/`) to the host URL.
  </Note>

* `DATABRICKS_CATALOG` - The Databricks catalog name for the Volume, represented by `--catalog` (CLI) or `catalog` (Python).

* `DATABRICKS_SCHEMA` - The Databricks schema name for the Volume, represented by `--schema` (CLI) or `schema` (Python). If not specified, `default` is used.

* `DATABRICKS_VOLUME` - The Databricks Volume name, represented by `--volume` (CLI) or `volume` (Python).

* `DATABRICKS_VOLUME_PATH` - Any optional path to access within the volume, specified by `--volume-path` (CLI) or `volume_path` (Python).
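As a sketch, these settings can be exported in one place before running the pipeline; every value below is a placeholder for your own workspace details:

```bash
#!/usr/bin/env bash
# Placeholder values only -- substitute your own workspace details.
export DATABRICKS_HOST="https://<workspace-id>.cloud.databricks.com"  # no trailing slash
export DATABRICKS_CATALOG="my_catalog"
export DATABRICKS_SCHEMA="my_schema"       # "default" is used if unset
export DATABRICKS_VOLUME="my_volume"
export DATABRICKS_VOLUME_PATH="some/path"  # optional
```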

Also set the following environment variables, based on your authentication type and cloud provider:

* For Databricks personal access token authentication (AWS, Azure, and GCP):

  * `DATABRICKS_TOKEN` - The personal access token, represented by `--token` (CLI) or `token` (Python).

* For username and password (basic) authentication (AWS only):

  * `DATABRICKS_USERNAME` - The user's name, represented by `--username` (CLI) or `username` (Python).
  * `DATABRICKS_PASSWORD` - The user's password, represented by `--password` (CLI) or `password` (Python).

* For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP):

  * `DATABRICKS_CLIENT_ID` - The client ID value for the corresponding service principal, represented by `--client-id` (CLI) or `client_id` (Python).
  * `DATABRICKS_CLIENT_SECRET` - The OAuth secret (client secret) value for the corresponding service principal, represented by `--client-secret` (CLI) or `client_secret` (Python).

* For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional environment variables.

* For Azure managed identities (MSI) authentication (Azure only):

  * `ARM_CLIENT_ID` - The client ID value for the corresponding managed identity, represented by `--azure-client-id` (CLI) or `azure_client_id` (Python).
  * If the target identity has not already been added to the workspace, then you must also specify the
    `DATABRICKS_AZURE_RESOURCE_ID`, represented by `--azure-workspace-resource-id` (CLI) or `azure_workspace_resource_id` (Python).

* For Microsoft Entra ID service principal authentication (Azure only):

  * `ARM_TENANT_ID` - The tenant ID value for the corresponding service principal, represented by `--azure-tenant-id` (CLI) or `azure_tenant_id` (Python).
  * `ARM_CLIENT_ID` - The client ID value for the corresponding service principal, represented by `--azure-client-id` (CLI) or `azure_client_id` (Python).
  * `ARM_CLIENT_SECRET` - The client secret value for the corresponding service principal, represented by `--azure-client-secret` (CLI) or `azure_client_secret` (Python).
  * If the service principal has not already been added to the workspace, then you must also specify the
    `DATABRICKS_AZURE_RESOURCE_ID`, represented by `--azure-workspace-resource-id` (CLI) or `azure_workspace_resource_id` (Python).

* For Azure CLI authentication (Azure only): No additional environment variables.

* For Microsoft Entra ID user authentication (Azure only):

  * `DATABRICKS_TOKEN` - The Entra ID token for the corresponding Entra ID user, represented by `--token` (CLI) or `token` (Python).

* For Google Cloud Platform credentials authentication (GCP only):

  * `GOOGLE_CREDENTIALS` - The local path to the corresponding Google Cloud service account's credentials file, represented by `--google-credentials` (CLI) or `google_credentials` (Python).

* For Google Cloud Platform ID authentication (GCP only):

  * `GOOGLE_SERVICE_ACCOUNT` - The Google Cloud service account's email address, represented by `--google-service-account` (CLI) or `google_service_account` (Python).

* Alternatively, you can store the preceding settings in a local
  [Databricks configuration profile](https://docs.databricks.com/en/dev-tools/auth/config-profiles.html) and then just
  refer to the profile's name:

  * `DATABRICKS_PROFILE` - The name of the Databricks configuration profile, represented by `--profile` (CLI) or `profile` (Python).
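  For example, a minimal personal access token profile in `~/.databrickscfg` might look like the following (the profile name, host, and token are placeholders):

  ```ini
  # Sketch of a ~/.databrickscfg profile; values are placeholders.
  [my-profile]
  host  = https://<workspace-id>.cloud.databricks.com
  token = <personal-access-token>
  ```

  You would then set `DATABRICKS_PROFILE` to `my-profile` (or pass `--profile my-profile`) instead of setting the individual host and authentication variables.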

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported destination connector; this example uses the local destination connector.

This example sends data to Unstructured for processing by default. To process data locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  unstructured-ingest \
    databricks-volumes \
      --profile $DATABRICKS_PROFILE \
      --host $DATABRICKS_HOST \
      --catalog $DATABRICKS_CATALOG \
      --schema $DATABRICKS_SCHEMA \
      --volume $DATABRICKS_VOLUME \
      --volume-path $DATABRICKS_VOLUME_PATH \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --strategy hi_res \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
    local \
      --output-dir $LOCAL_FILE_OUTPUT_DIR
  ```

  ```python Python Ingest theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  # For all supported Databricks authentication types, you can import this:
  from unstructured_ingest.processes.connectors.databricks.volumes import (
      DatabricksVolumesIndexerConfig,
      DatabricksVolumesConnectionConfig,
      DatabricksVolumesAccessConfig,
      DatabricksVolumesDownloaderConfig
  )

  # Alternatively, for supported Databricks on AWS authentication types only, you can import this:
  # from unstructured_ingest.processes.connectors.databricks.volumes_aws import (
  #     DatabricksAWSVolumesIndexerConfig,
  #     DatabricksAWSVolumesConnectionConfig,
  #     DatabricksAWSVolumesAccessConfig,
  #     DatabricksAWSVolumesDownloaderConfig
  # )

  # Alternatively, for supported Azure Databricks authentication types only, you can import this:
  # from unstructured_ingest.processes.connectors.databricks.volumes_azure import (
  #     DatabricksAzureVolumesIndexerConfig,
  #     DatabricksAzureVolumesConnectionConfig,
  #     DatabricksAzureVolumesAccessConfig,
  #     DatabricksAzureVolumesDownloaderConfig
  # )

  # Alternatively, for supported Databricks on Google Cloud authentication types only, you can import this:
  # from unstructured_ingest.processes.connectors.databricks.volumes_gcp import (
  #     DatabricksGoogleVolumesIndexerConfig,
  #     DatabricksGoogleVolumesConnectionConfig,
  #     DatabricksGoogleVolumesAccessConfig,
  #     DatabricksGoogleVolumesDownloaderConfig
  # )

  from unstructured_ingest.processes.connectors.local import (
      LocalConnectionConfig,
      LocalUploaderConfig
  )
  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  # Chunking and embedding are optional.

  if __name__ == "__main__":
      Pipeline.from_configs(
          context=ProcessorConfig(reprocess=True),
          indexer_config=DatabricksVolumesIndexerConfig(recursive=True),
          # For specifying a Databricks configuration profile:
          downloader_config=DatabricksVolumesDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
          source_connection_config=DatabricksVolumesConnectionConfig(
              access_config=DatabricksVolumesAccessConfig(profile=os.getenv("DATABRICKS_PROFILE")),
              host=os.getenv("DATABRICKS_HOST"),
              catalog=os.getenv("DATABRICKS_CATALOG"),
              schema=os.getenv("DATABRICKS_SCHEMA"),
              volume=os.getenv("DATABRICKS_VOLUME"),
              volume_path=os.getenv("DATABRICKS_VOLUME_PATH")
          ),
          # Other examples:
          #
          # For Databricks on AWS, with Databricks personal access token authentication:
          # downloader_config=DatabricksAWSVolumesDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
          # source_connection_config=DatabricksAWSVolumesConnectionConfig(
          #     access_config=DatabricksAWSVolumesAccessConfig(token=os.getenv("DATABRICKS_TOKEN")),
          #     host=os.getenv("DATABRICKS_HOST")
          # ),
          #
          # For Azure Databricks, with Microsoft Entra ID service principal authentication:
          # downloader_config=DatabricksAzureVolumesDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
          # source_connection_config=DatabricksAzureVolumesConnectionConfig(
          #     access_config=DatabricksAzureVolumesAccessConfig(
          #         azure_client_id=os.getenv("ARM_CLIENT_ID"),
          #         azure_client_secret=os.getenv("ARM_CLIENT_SECRET"),
          #         azure_tenant_id=os.getenv("ARM_TENANT_ID")   
          #     ),
          #     host=os.getenv("DATABRICKS_HOST")
          # ),
          #
          # For Databricks on Google Cloud, with Google Cloud Platform credentials authentication:
          # downloader_config=DatabricksGoogleVolumesDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
          # source_connection_config=DatabricksGoogleVolumesConnectionConfig(
          #     access_config=DatabricksGoogleVolumesAccessConfig(
          #         google_credentials=os.getenv("GOOGLE_CREDENTIALS")
          #     ),
          #     host=os.getenv("DATABRICKS_HOST")
          # ),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
      ).run()
  ```
</CodeGroup>
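When the pipeline finishes, the local destination connector writes one JSON file of elements per processed document into the output directory. A minimal sketch for inspecting those files (assuming the same `LOCAL_FILE_OUTPUT_DIR` placeholder as above):

```python
import json
import os
from pathlib import Path

# Each output file is a JSON array of elements, where each element carries
# "type", "text", and "metadata" keys. The directory path is a placeholder.
output_dir = Path(os.getenv("LOCAL_FILE_OUTPUT_DIR", "output"))

counts = {}
for path in sorted(output_dir.glob("*.json")):
    elements = json.loads(path.read_text())
    counts[path.name] = len(elements)

for name, n in counts.items():
    print(f"{name}: {n} elements")
```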

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or the `partition_by_api` parameter (Python) to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  This option also requires an Unstructured API key and API URL, which you provide by adding the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>
