
# Delta Tables in Databricks

<Note>
  This article covers connecting Unstructured to Delta Tables in Databricks.

  For information about connecting Unstructured to Delta Tables in Amazon S3 instead, see
  [Delta Tables in Amazon S3](/open-source/ingestion/destination-connectors/delta-table).

  For information about connecting Unstructured to Databricks Volumes instead, see
  [Databricks Volumes](/open-source/ingestion/destination-connectors/databricks-volumes).
</Note>

Batch process all your records to store structured outputs in a Delta Table in Databricks.

The Delta Tables in Databricks connector provides two implementations:

* The *SQL-based* implementation enables you to have Unstructured write its processed data directly to a table in Unity Catalog.
  If you use this implementation, you do not need to provide a volume as described in the following requirements.
* The *volume-based* implementation enables you to have Unstructured write its processed data as a file to a volume in Unity Catalog. Unstructured then
  uses that file in the volume to write the file's data to a table in Unity Catalog. The file remains in the volume.

The requirements are as follows.

* A Databricks account on [AWS](https://docs.databricks.com/getting-started/free-trial.html),
  [Azure](https://learn.microsoft.com/azure/databricks/getting-started/), or
  [GCP](https://docs.gcp.databricks.com/getting-started/index.html).

* A workspace within the Databricks account for [AWS](https://docs.databricks.com/admin/workspace/index.html),
  [Azure](https://learn.microsoft.com/azure/databricks/admin/workspace/), or
  [GCP](https://docs.gcp.databricks.com/admin/workspace/index.html).

* One of the following compute resources within the workspace:

  * A SQL warehouse for [AWS](https://docs.databricks.com/compute/sql-warehouse/create.html),
    [Azure](https://learn.microsoft.com/azure/databricks/compute/sql-warehouse/create), or
    [GCP](https://docs.gcp.databricks.com/compute/sql-warehouse/create.html).

    The following video shows how to create a SQL warehouse if you do not already have one available, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the warehouse's owner to use it:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/N-Aw9-U3_fE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  * An all-purpose cluster for [AWS](https://docs.databricks.com/compute/use-compute.html),
    [Azure](https://learn.microsoft.com/azure/databricks/compute/use-compute), or
    [GCP](https://docs.gcp.databricks.com/compute/use-compute.html).

    The following video shows how to create an all-purpose cluster if you do not already have one available, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the cluster's owner to use it:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/apgibaelVY0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* The SQL warehouse's or cluster's **Server Hostname** and **HTTP Path** values for [AWS](https://docs.databricks.com/integrations/compute-details.html),
  [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details), or
  [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).

* Unity Catalog enabled in the workspace for [AWS](https://docs.databricks.com/data-governance/unity-catalog/get-started.html),
  [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/get-started), or
  [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/get-started.html).

* Within Unity Catalog:

  * A catalog
    for [AWS](https://docs.databricks.com/catalogs/create-catalog.html),
    [Azure](https://learn.microsoft.com/azure/databricks/catalogs/create-catalog), or
    [GCP](https://docs.gcp.databricks.com/catalogs/create-catalog.html).
  * A schema (formerly known as a database)
    for [AWS](https://docs.databricks.com/schemas/create-schema.html),
    [Azure](https://learn.microsoft.com/azure/databricks/schemas/create-schema), or
    [GCP](https://docs.gcp.databricks.com/schemas/create-schema.html)
    within that catalog.
  * A table
    for [AWS](https://docs.databricks.com/tables/managed.html),
    [Azure](https://learn.microsoft.com/azure/databricks/tables/managed), or
    [GCP](https://docs.gcp.databricks.com/tables/managed.html)
    within that schema (formerly known as a database).

    You can have the connector attempt to create a table for you automatically at run time. To do this, in the connector settings as described later in this article,
    do one of the following:

    * Specify the name of the table that you want the connector to attempt to create within the specified catalog and schema (formerly known as a database).
    * Leave the table name blank. The connector will attempt to create a table within the specified catalog and schema (formerly known as a database).
      For the [Unstructured UI](/ui/overview) and [Unstructured API](/api-reference/overview), the table is named `u<short-workflow-id>`.
      For the [Unstructured Ingest CLI and Ingest Python library](/open-source/ingestion/overview), the table is named `unstructuredautocreated`.

    The connector will attempt to create the table on behalf of the related Databricks workspace user or Databricks managed service principal that is referenced in the connector settings, as described later in these requirements.
    If successful, the table's owner is set as the related Databricks workspace user or Databricks managed service principal. The owner will have all Unity Catalog
    privileges on the table by default. No other Databricks workspace users or Databricks managed service principals will have any privileges on the table by default.

    <Warning>
      If the table's parent schema (formerly known as a database) is not owned by the same Databricks workspace user or Databricks managed service principal that is
      referenced in the connector settings, then you should grant the new table's owner the `CREATE TABLE` privilege on that parent schema (formerly known as a database)
      before the connector attempts to create the table. Otherwise, table creation could fail.
    </Warning>

    <Note>
      Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), and tables might cause isolated issues with the connector. It is
      recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, and tables.
    </Note>

  The following video shows how to create a catalog, schema (formerly known as a database), and a table in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/ffNnq-6bpd4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  If you want to use an existing table or create one yourself beforehand, this table must contain at minimum the following column names and their data types:

  ```sql theme={null}
  CREATE TABLE IF NOT EXISTS <catalog_name>.<schema_name>.<table_name> (
      id STRING NOT NULL PRIMARY KEY,
      record_id STRING NOT NULL,
      element_id STRING NOT NULL,
      text STRING,
      embeddings ARRAY<FLOAT>,
      type STRING,
      metadata VARIANT
  );
  ```

  <Info>
    In Databricks, a table's *schema* (its column names and data types) is different from a *schema* (formerly known as a database) in a catalog-schema object relationship in Unity Catalog.
  </Info>
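
  If you create the table from your own code, the minimum schema shown above can be generated programmatically. The following is a minimal sketch; the helper name is illustrative, and executing the returned DDL requires your own Databricks SQL connection, which is not shown here:

  ```python
  # Build the minimum required DDL for the destination table.
  # The helper name is illustrative; run the returned statement with your
  # own Databricks SQL connection (not shown here).

  MINIMUM_COLUMNS = (
      "id STRING NOT NULL PRIMARY KEY, "
      "record_id STRING NOT NULL, "
      "element_id STRING NOT NULL, "
      "text STRING, "
      "embeddings ARRAY<FLOAT>, "
      "type STRING, "
      "metadata VARIANT"
  )

  def build_create_table_ddl(catalog: str, schema: str, table: str) -> str:
      # Underscores are recommended over dashes in these names.
      return f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} ({MINIMUM_COLUMNS})"

  print(build_create_table_ddl("my_catalog", "my_schema", "my_table"))
  ```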

* Within Unity Catalog, a volume
  for [AWS](https://docs.databricks.com/volumes/utility-commands.html),
  [Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands),
  or [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html). The volume can be in the same
  schema (formerly known as a database) as the table, or the volume and table can be in separate schemas. In either case, both of these
  schemas must share the same parent catalog.

  <Note>
    Using dashes (`-`) in the names of volumes might cause isolated issues with the connector. It is
    recommended to use underscores (`_`) instead of dashes in the names of volumes.
  </Note>

  The following video shows how to create a catalog, schema (formerly known as a database), and a volume in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/yF9DJphhQQc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:

  * A Databricks managed service principal.
    This service principal must have the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
  * The service principal's **UUID** (or **Client ID** or **Application ID**) value.
  * The OAuth **Secret** value for the service principal.

  To get this information, see Steps 1-3 of the instructions for [AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
  [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-m2m), or
  [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).

  <Note>
    For Azure Databricks, this connector only supports Databricks managed service principals for authentication.
    Microsoft Entra ID managed service principals are not supported.
  </Note>

  The following video shows how to create a Databricks managed service principal if you do not already have one available:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/wBmqv5DaA1E" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* For Databricks personal access token authentication to the workspace, the
  Databricks personal access token value for
  [AWS](https://docs.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users),
  [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat#azure-databricks-personal-access-tokens-for-workspace-users), or
  [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users).
  This token must be for the workspace user who
  has the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.

  The following video shows how to create a Databricks personal access token if you do not already have one available:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/OzEU2miAS6I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* The Databricks workspace user or Databricks managed service principal must have the following *minimum* set of permissions and privileges to write to an
  existing volume or table in Unity Catalog. If that user or service principal owns these resources, then
  they have all of the necessary permissions and privileges by default. If the owner is someone else, then the following permissions and privileges must be
  explicitly granted before using the connector:

  * To use an all-purpose cluster for access, `Can Restart` permission on that cluster. Learn how to check and set cluster permissions for
    [AWS](https://docs.databricks.com/compute/clusters-manage.html#compute-permissions),
    [Azure](https://learn.microsoft.com/azure/databricks/compute/clusters-manage#cluster-level-permissions), or
    [GCP](https://docs.gcp.databricks.com/compute/clusters-manage.html#compute-permissions).

  * To use a SQL warehouse for access, `Can use` permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for
    [AWS](https://docs.databricks.com/compute/sql-warehouse/create.html#manage-a-sql-warehouse),
    [Azure](https://learn.microsoft.com/azure/databricks/compute/sql-warehouse/create#manage), or
    [GCP](https://docs.gcp.databricks.com/compute/sql-warehouse/create.html#manage-a-sql-warehouse).

  * To access a Unity Catalog volume, the following privileges:

    * `USE CATALOG` on the volume's parent catalog in Unity Catalog.
    * `USE SCHEMA` on the volume's parent schema (formerly known as a database) in Unity Catalog.
    * `READ VOLUME` and `WRITE VOLUME` on the volume.

    Learn how to check and set Unity Catalog privileges for
    [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
    [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
    [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).

  * To access a Unity Catalog table, the following privileges:

    * `USE CATALOG` on the table's parent catalog in Unity Catalog.
    * `USE SCHEMA` on the table's parent schema (formerly known as a database) in Unity Catalog.
    * To create a new table, `CREATE TABLE` on the table's parent schema (formerly known as a database) in Unity Catalog.
    * If the table already exists, `MODIFY` and `SELECT` on the table.

    Learn how to check and set Unity Catalog privileges for
    [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
    [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
    [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).
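
  As an illustration, the preceding minimum privileges can be granted in a Databricks SQL editor with statements such as the following. All of the catalog, schema, volume, table, and principal names are placeholders:

  ```sql theme={null}
  -- Replace the object names and the principal with your own values.
  GRANT USE CATALOG ON CATALOG my_catalog TO `someone@example.com`;
  GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `someone@example.com`;

  -- For volume access:
  GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `someone@example.com`;

  -- For an existing table:
  GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.my_table TO `someone@example.com`;

  -- To let the connector create a new table:
  GRANT CREATE TABLE ON SCHEMA my_catalog.my_schema TO `someone@example.com`;
  ```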

Install the Delta Tables in Databricks connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[databricks-delta-tables]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

Set the following environment variables:

* `DATABRICKS_HOST` - The Databricks cluster's or SQL warehouse's **Server Hostname** value, represented by `--server-hostname` (CLI) or `server_hostname` (Python).

* `DATABRICKS_HTTP_PATH` - The cluster's or SQL warehouse's **HTTP Path** value, represented by `--http-path` (CLI) or `http_path` (Python).

* `DATABRICKS_TOKEN` - For Databricks personal access token authentication, the token's value, represented by `--token` (CLI) or `token` (Python).

* `DATABRICKS_CLIENT_ID` - For Databricks managed service principal authentication, the service principal's **UUID** (or **Client ID** or **Application ID**) value, represented by `--client-id` (CLI) or `client_id` (Python).

* `DATABRICKS_CLIENT_SECRET` - For Databricks managed service principal authentication, the service principal's OAuth **Secret** value, represented by `--client-secret` (CLI) or `client_secret` (Python).

* `DATABRICKS_CATALOG` - The name of the catalog in Unity Catalog, represented by `--catalog` (CLI) or `catalog` (Python).

* `DATABRICKS_DATABASE` - The name of the schema (formerly known as a database) inside of the catalog for the target table, represented by `--database` (CLI) or `database` (Python). The default is `default` if not otherwise specified.

  If you are also using a volume, and the target table and volume are in the same schema (formerly known as a database), then `DATABRICKS_DATABASE` and `DATABRICKS_SCHEMA` will have the same values.

* `DATABRICKS_TABLE` - The name of the table inside of the schema (formerly known as a database), represented by `--table-name` (CLI) or `table_name` (Python).

  * If a table name is specified, but a table with that name does not exist within the specified schema (formerly known as a database), the connector attempts to create a table with that name within that schema.
  * If no table name is specified, the connector attempts to create a table named `u<short-workflow-id>` within the specified schema (formerly known as a database).

  See the beginning of this article for additional technical requirements before having the connector attempt to create a table.

<Note>
  Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), tables, and volumes might cause isolated issues with the connector. It is
  recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, tables, and volumes.
</Note>

For the SQL-based implementation, also add the following:

* `DATABRICKS_RECORD_ID_KEY` - The name of the column that uniquely identifies each record in the table, represented by `--record-id-key` (CLI) or `record_id_key` (Python).
* Optionally, the maximum number of records to transmit per batch, represented by `--batch-size` (CLI) or `batch_size` (Python). The default is `50` if not otherwise specified.

For the volume-based implementation, add these environment variables:

* `DATABRICKS_SCHEMA` - The name of the schema (formerly known as a database) inside of the catalog for the target volume, represented by `--schema` (CLI) or `schema` (Python). The default is `default` if not otherwise specified.

  If the target volume and table are in the same schema (formerly known as a database), then `DATABRICKS_SCHEMA` and `DATABRICKS_DATABASE` will have the same values.

* `DATABRICKS_VOLUME` - The name of the volume inside of the schema (formerly known as a database), represented by `--volume` (CLI) or `volume` (Python).

* `DATABRICKS_VOLUME_PATH` - Optionally, a specific path in the volume to start accessing from, relative to the volume's root, represented by `--volume-path` (CLI) or `volume_path` (Python). If not otherwise specified, access starts from the volume's root.
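
For example, you might set these environment variables in your shell before running the CLI or Python examples. All values shown here are placeholders:

```bash theme={null}
# Placeholder values; replace them with your own.
export DATABRICKS_HOST="dbc-1234567890abcdef.cloud.databricks.com"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/1234567890abcdef"

# For Databricks personal access token authentication:
export DATABRICKS_TOKEN="<personal-access-token>"

# Or, for Databricks managed service principal authentication:
# export DATABRICKS_CLIENT_ID="<service-principal-uuid>"
# export DATABRICKS_CLIENT_SECRET="<oauth-secret>"

export DATABRICKS_CATALOG="my_catalog"
export DATABRICKS_DATABASE="my_schema"
export DATABRICKS_TABLE="my_table"

# SQL-based implementation only:
export DATABRICKS_RECORD_ID_KEY="record_id"

# Volume-based implementation only:
export DATABRICKS_SCHEMA="my_schema"
export DATABRICKS_VOLUME="my_volume"
export DATABRICKS_VOLUME_PATH="my/path"
```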

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the supported source connectors. This example uses the local source connector.

This example sends files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI (SQL-Based) theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  # For authenticating with Databricks personal access tokens:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    databricks-delta-tables \
      --server-hostname $DATABRICKS_HOST \
      --http-path $DATABRICKS_HTTP_PATH \
      --token $DATABRICKS_TOKEN \
      --catalog $DATABRICKS_CATALOG \
      --database $DATABRICKS_DATABASE \
      --table-name $DATABRICKS_TABLE \
      --record-id-key $DATABRICKS_RECORD_ID_KEY \
      --batch-size 50

  # For authenticating with Databricks managed service principals:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    databricks-delta-tables \
      --server-hostname $DATABRICKS_HOST \
      --http-path $DATABRICKS_HTTP_PATH \
      --client-id $DATABRICKS_CLIENT_ID \
      --client-secret $DATABRICKS_CLIENT_SECRET \
      --catalog $DATABRICKS_CATALOG \
      --database $DATABRICKS_DATABASE \
      --table-name $DATABRICKS_TABLE \
      --record-id-key $DATABRICKS_RECORD_ID_KEY \
      --batch-size 50
  ```

  ```bash CLI (Volume-Based) theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  # For authenticating with Databricks personal access tokens:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    databricks-volume-delta-tables \
      --server-hostname $DATABRICKS_HOST \
      --http-path $DATABRICKS_HTTP_PATH \
      --token $DATABRICKS_TOKEN \
      --catalog $DATABRICKS_CATALOG \
      --schema $DATABRICKS_SCHEMA \
      --database $DATABRICKS_DATABASE \
      --volume $DATABRICKS_VOLUME \
      --volume-path $DATABRICKS_VOLUME_PATH \
      --table-name $DATABRICKS_TABLE
    
  # For authenticating with Databricks managed service principals:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    databricks-volume-delta-tables \
      --server-hostname $DATABRICKS_HOST \
      --http-path $DATABRICKS_HTTP_PATH \
      --client-id $DATABRICKS_CLIENT_ID \
      --client-secret $DATABRICKS_CLIENT_SECRET \
      --catalog $DATABRICKS_CATALOG \
      --schema $DATABRICKS_SCHEMA \
      --database $DATABRICKS_DATABASE \
      --volume $DATABRICKS_VOLUME \
      --volume-path $DATABRICKS_VOLUME_PATH \
      --table-name $DATABRICKS_TABLE
  ```

  ```python Python Ingest (SQL-Based) theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  from unstructured_ingest.processes.connectors.sql.databricks_delta_tables import (
      DatabricksDeltaTablesConnectionConfig,
      DatabricksDeltaTablesAccessConfig,
      DatabricksDeltaTablesUploadStagerConfig,
      DatabricksDeltaTablesUploaderConfig
  )

  from unstructured_ingest.processes.connectors.local import (
      LocalIndexerConfig,
      LocalConnectionConfig,
      LocalDownloaderConfig
  )

  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  # Chunking and embedding are optional.

  if __name__ == "__main__":

      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
          downloader_config=LocalDownloaderConfig(),
          source_connection_config=LocalConnectionConfig(),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          
          # For authenticating with Databricks personal access tokens.
          destination_connection_config=DatabricksDeltaTablesConnectionConfig(
              access_config=DatabricksDeltaTablesAccessConfig(
                  token=os.getenv("DATABRICKS_TOKEN"),
              ),
              server_hostname=os.getenv("DATABRICKS_HOST"),
              http_path=os.getenv("DATABRICKS_HTTP_PATH")
          ),

          # For authenticating with Databricks managed service principals.
          # destination_connection_config=DatabricksDeltaTablesConnectionConfig(
          #     access_config=DatabricksDeltaTablesAccessConfig(
          #         client_id=os.getenv("DATABRICKS_CLIENT_ID"),
          #         client_secret=os.getenv("DATABRICKS_CLIENT_SECRET")
          #     ),
          #     server_hostname=os.getenv("DATABRICKS_HOST"),
          #     http_path=os.getenv("DATABRICKS_HTTP_PATH")
          # ),

          stager_config=DatabricksDeltaTablesUploadStagerConfig(),
          uploader_config=DatabricksDeltaTablesUploaderConfig(
              catalog=os.getenv("DATABRICKS_CATALOG"),
              database=os.getenv("DATABRICKS_DATABASE"),
              table_name=os.getenv("DATABRICKS_TABLE"),
              record_id_key=os.getenv("DATABRICKS_RECORD_ID_KEY"),
              batch_size=50    
          )
      ).run()
  ```

  ```python Python Ingest (Volume-Based) theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  from unstructured_ingest.processes.connectors.sql.databricks_delta_tables import (
      DatabricksDeltaTablesConnectionConfig,
      DatabricksDeltaTablesAccessConfig,
      DatabricksDeltaTablesUploadStagerConfig
  )

  from unstructured_ingest.processes.connectors.databricks.volumes_table import (
      DatabricksVolumeDeltaTableUploaderConfig
  )

  from unstructured_ingest.processes.connectors.local import (
      LocalIndexerConfig,
      LocalConnectionConfig,
      LocalDownloaderConfig
  )

  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  if __name__ == "__main__":

      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
          downloader_config=LocalDownloaderConfig(),
          source_connection_config=LocalConnectionConfig(),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          
          # For authenticating with Databricks personal access tokens.
          destination_connection_config=DatabricksDeltaTablesConnectionConfig(
              access_config=DatabricksDeltaTablesAccessConfig(
                  token=os.getenv("DATABRICKS_TOKEN"),
              ),
              server_hostname=os.getenv("DATABRICKS_HOST"),
              http_path=os.getenv("DATABRICKS_HTTP_PATH")
          ),

          # For authenticating with Databricks managed service principals.
          # destination_connection_config=DatabricksDeltaTablesConnectionConfig(
          #     access_config=DatabricksDeltaTablesAccessConfig(
          #         client_id=os.getenv("DATABRICKS_CLIENT_ID"),
          #         client_secret=os.getenv("DATABRICKS_CLIENT_SECRET")
          #     ),
          #     server_hostname=os.getenv("DATABRICKS_HOST"),
          #     http_path=os.getenv("DATABRICKS_HTTP_PATH")
          # ),

          stager_config=DatabricksDeltaTablesUploadStagerConfig(),
          uploader_config=DatabricksVolumeDeltaTableUploaderConfig(
              catalog=os.getenv("DATABRICKS_CATALOG"),
              schema=os.getenv("DATABRICKS_SCHEMA"),
              database=os.getenv("DATABRICKS_DATABASE"),
              volume=os.getenv("DATABRICKS_VOLUME"),
              volume_path=os.getenv("DATABRICKS_VOLUME_PATH"),
              table_name=os.getenv("DATABRICKS_TABLE")
          )
      ).run()
  ```
</CodeGroup>

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  Unstructured also requires an Unstructured API key and API URL. Add the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>
