
Batch process all your records to store structured outputs in Databricks Volumes.

You will need:

The Databricks Volumes prerequisites:

  • The Databricks workspace URL. Get the workspace URL for AWS, Azure, or GCP.

    Examples:

    • AWS: https://<workspace-id>.cloud.databricks.com
    • Azure: https://adb-<workspace-id>.<random-number>.azuredatabricks.net
    • GCP: https://<workspace-id>.<random-number>.gcp.databricks.com
  • The Databricks compute resource’s ID. Get the compute resource ID for AWS, Azure, or GCP.

  • The Databricks authentication details. For more information, see the documentation for AWS, Azure, or GCP.

    More specifically, you will need:

    • For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token’s value.
    • For username and password (basic) authentication (AWS only): The user’s name and password values.
    • For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
    • For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
    • For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
    • For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
    • For Azure CLI authentication (Azure only): No additional values.
    • For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
    • For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account’s credentials file.
    • For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account’s email address.
  • The Databricks catalog name for the Volume. Get the catalog name for AWS, Azure, or GCP.

  • The Databricks schema name for the Volume. Get the schema name for AWS, Azure, or GCP.

  • The Databricks Volume name, and optionally any path in that Volume that you want to access directly. Get the Volume information for AWS, Azure, or GCP.
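
As a quick sanity check on the workspace URL, the following minimal sketch guesses the cloud provider from the host name. It relies only on the three example URL patterns listed above; the function name `guess_cloud` is hypothetical and is not part of the Unstructured Ingest library or any Databricks tooling. Workspaces with other host names simply return `None`.

```python
from typing import Optional
from urllib.parse import urlparse

# Host suffixes taken from the example URLs above. This is an assumption of
# the sketch: workspaces on other domains will not match any entry.
_CLOUD_BY_SUFFIX = {
    ".cloud.databricks.com": "aws",
    ".azuredatabricks.net": "azure",
    ".gcp.databricks.com": "gcp",
}


def guess_cloud(workspace_url: str) -> Optional[str]:
    """Return 'aws', 'azure', or 'gcp' based on the URL's host, else None."""
    host = (urlparse(workspace_url).hostname or "").lower()
    for suffix, cloud in _CLOUD_BY_SUFFIX.items():
        if host.endswith(suffix):
            return cloud
    return None
```

Knowing the cloud provider narrows down which of the authentication types listed above apply to your workspace.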

The Databricks Volumes connector dependencies:

CLI, Python
pip install "unstructured-ingest[databricks-volumes]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • DATABRICKS_HOST - The Databricks host URL, represented by --host (CLI) or host (Python).
  • DATABRICKS_CLUSTER_ID - The Databricks compute resource ID, represented by --cluster-id (CLI) or cluster_id (Python).
  • DATABRICKS_CATALOG - The Databricks catalog name for the Volume, represented by --catalog (CLI) or catalog (Python).
  • DATABRICKS_SCHEMA - The Databricks schema name for the Volume, represented by --schema (CLI) or schema (Python). If not specified, the schema named default is used.
  • DATABRICKS_VOLUME - The Databricks Volume name, represented by --volume (CLI) or volume (Python).
  • DATABRICKS_VOLUME_PATH - Any optional path to access within the Volume, represented by --volume-path (CLI) or volume_path (Python).
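
To illustrate how these variables fit together, here is a minimal sketch that reads them into a small settings object and builds the resulting Volume path. The names `VolumeSettings` and `load_volume_settings` are hypothetical and are not part of the Unstructured Ingest library. The choice of which variables are treated as required is an assumption of this sketch. The path layout `/Volumes/<catalog>/<schema>/<volume>/<path>` follows the Unity Catalog convention for Volumes.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class VolumeSettings:
    host: str
    catalog: str
    volume: str
    schema: str = "default"  # used when DATABRICKS_SCHEMA is not set
    cluster_id: Optional[str] = None
    volume_path: Optional[str] = None

    def path(self) -> str:
        """Build /Volumes/<catalog>/<schema>/<volume>[/<volume_path>]."""
        parts = ["", "Volumes", self.catalog, self.schema, self.volume]
        if self.volume_path:
            parts.append(self.volume_path.strip("/"))
        return "/".join(parts)


def load_volume_settings(env: Mapping[str, str] = os.environ) -> VolumeSettings:
    """Read the DATABRICKS_* variables described above from a mapping."""
    # Assumption of this sketch: host, catalog, and Volume name are required.
    required = ("DATABRICKS_HOST", "DATABRICKS_CATALOG", "DATABRICKS_VOLUME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return VolumeSettings(
        host=env["DATABRICKS_HOST"],
        catalog=env["DATABRICKS_CATALOG"],
        volume=env["DATABRICKS_VOLUME"],
        schema=env.get("DATABRICKS_SCHEMA") or "default",
        cluster_id=env.get("DATABRICKS_CLUSTER_ID") or None,
        volume_path=env.get("DATABRICKS_VOLUME_PATH") or None,
    )
```

Calling `load_volume_settings()` with no arguments reads from the current process environment; passing a plain dictionary instead makes the function easy to check without touching real settings.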

Environment variables based on your authentication type, depending on your cloud provider:

  • For Databricks personal access token authentication (AWS, Azure, and GCP):

    • DATABRICKS_TOKEN - The personal access token, represented by --token (CLI) or token (Python).
  • For username and password (basic) authentication (AWS only):

    • DATABRICKS_USERNAME - The user’s name, represented by --username (CLI) or username (Python).
    • DATABRICKS_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).
  • For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP):

    • DATABRICKS_CLIENT_ID - The client ID value for the corresponding service principal, represented by --client-id (CLI) or client_id (Python).
    • DATABRICKS_CLIENT_SECRET - The OAuth secret value for the corresponding service principal, represented by --client-secret (CLI) or client_secret (Python).
  • For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional environment variables.

  • For Azure managed identities (MSI) authentication (Azure only):

    • ARM_CLIENT_ID - The client ID value for the corresponding managed identity, represented by --azure-client-id (CLI) or azure_client_id (Python).
    • If the target identity has not already been added to the workspace, then you must also specify the DATABRICKS_AZURE_RESOURCE_ID, represented by --azure-workspace-resource-id (CLI) or azure_workspace_resource_id (Python).
  • For Microsoft Entra ID service principal authentication (Azure only):

    • ARM_TENANT_ID - The tenant ID value for the corresponding service principal, represented by --azure-tenant-id (CLI) or azure_tenant_id (Python).
    • ARM_CLIENT_ID - The client ID value for the corresponding service principal, represented by --azure-client-id (CLI) or azure_client_id (Python).
    • ARM_CLIENT_SECRET - The client secret value for the corresponding service principal, represented by --azure-client-secret (CLI) or azure_client_secret (Python).
    • If the service principal has not already been added to the workspace, then you must also specify the DATABRICKS_AZURE_RESOURCE_ID, represented by --azure-workspace-resource-id (CLI) or azure_workspace_resource_id (Python).
  • For Azure CLI authentication (Azure only): No additional environment variables.

  • For Microsoft Entra ID user authentication (Azure only):

    • DATABRICKS_TOKEN - The Entra ID token for the corresponding Entra ID user, represented by --token (CLI) or token (Python).
  • For Google Cloud Platform credentials authentication (GCP only):

    • GOOGLE_CREDENTIALS - The local path to the corresponding Google Cloud service account’s credentials file, represented by --google-credentials (CLI) or google_credentials (Python).
  • For Google Cloud Platform ID authentication (GCP only):

    • GOOGLE_SERVICE_ACCOUNT - The Google Cloud service account’s email address, represented by --google-service-account (CLI) or google_service_account (Python).
  • Alternatively, you can store the preceding settings in a local Databricks configuration profile and then just refer to the profile’s name:

    • DATABRICKS_PROFILE - The name of the Databricks configuration profile, represented by --profile (CLI) or profile (Python).
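
Because several authentication types share variable names (for example, ARM_CLIENT_ID appears under both managed identity and service principal authentication), it can help to check which combination is actually set before running a job. The following is a minimal sketch of such a check. The function name `detect_auth`, the labels, and the order of the rules are all hypothetical choices for illustration; they do not describe how the Unstructured Ingest library or the Databricks SDK resolves credentials.

```python
from typing import Mapping, Optional

# (label, environment variables that must all be set), checked in order.
# The order is an assumption of this sketch. More specific combinations come
# first so that ARM_CLIENT_ID alone is only reported as managed identity.
_AUTH_RULES = [
    ("profile", ("DATABRICKS_PROFILE",)),
    ("oauth-m2m", ("DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET")),
    ("basic", ("DATABRICKS_USERNAME", "DATABRICKS_PASSWORD")),
    ("entra-service-principal",
     ("ARM_TENANT_ID", "ARM_CLIENT_ID", "ARM_CLIENT_SECRET")),
    ("azure-msi", ("ARM_CLIENT_ID",)),
    ("gcp-credentials", ("GOOGLE_CREDENTIALS",)),
    ("gcp-id", ("GOOGLE_SERVICE_ACCOUNT",)),
    # DATABRICKS_TOKEN covers both personal access tokens and Entra ID tokens.
    ("token", ("DATABRICKS_TOKEN",)),
]


def detect_auth(env: Mapping[str, str]) -> Optional[str]:
    """Return the label of the first rule whose variables are all non-empty."""
    for label, names in _AUTH_RULES:
        if all(env.get(name) for name in names):
            return label
    # No match: this may be OAuth U2M or Azure CLI authentication, which need
    # no additional variables, or the settings may simply be missing.
    return None
```

For example, `detect_auth(os.environ)` (after `import os`) reports which of the combinations above is present in the current environment.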

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector. This example uses the local source connector.

This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

  • To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

    Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
  • To send files to Unstructured API services for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

    Unstructured API services also require an Unstructured API key and API URL, which you provide by adding the following:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.

    Get an API key and API URL.
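
The two processing modes above can be summarized as a small helper that produces the partition-related CLI options. This is a minimal sketch: the function name `partition_args` is hypothetical, and only the option and variable names that appear on this page are used.

```python
import os
from typing import List, Mapping


def partition_args(by_api: bool, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the CLI options that control where files are processed."""
    if not by_api:
        # Local processing: omit --partition-by-api, the API key, and the URL.
        return []
    missing = [
        name
        for name in ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")
        if not env.get(name)
    ]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return [
        "--partition-by-api",
        "--api-key", env["UNSTRUCTURED_API_KEY"],
        "--partition-endpoint", env["UNSTRUCTURED_API_URL"],
    ]
```

The returned list can be appended to the rest of a CLI invocation. With `by_api=False` it is empty, which matches the local-processing case where none of these options are needed.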