# Delta Tables in Databricks

> Send processed data from Unstructured to a Delta Table in Databricks.

<Tip>
  This article covers connecting Unstructured to Delta Tables in Databricks.

  For information about connecting Unstructured to Delta Tables in Amazon S3 instead, see
  [Delta Tables in Amazon S3](/api-reference/workflow/destinations/delta-table).

  For information about connecting Unstructured to Databricks Volumes instead, see
  [Databricks Volumes](/api-reference/workflow/destinations/databricks-volumes).
</Tip>

<Note>
  First time creating a connector? [Read this first](/api-reference/workflow/connector-first-time-reqs).
</Note>

## Requirements

You will need:

* [Accounts and resources](#accounts-and-resources)
* [Unity Catalog resources](#unity-catalog-resources)
* [Authentication and permissions](#authentication-and-permissions)

### Accounts and resources

* A Databricks account on [Amazon Web Services (AWS)](https://docs.databricks.com/getting-started/free-trial.html), [Microsoft Azure (Azure)](https://learn.microsoft.com/azure/databricks/getting-started/), or [Google Cloud Platform (GCP)](https://docs.gcp.databricks.com/getting-started/index.html).

* A workspace within the Databricks account for [AWS](https://docs.databricks.com/admin/workspace/index.html), [Azure](https://learn.microsoft.com/azure/databricks/admin/workspace/), or [GCP](https://docs.gcp.databricks.com/admin/workspace/index.html).

* One of the following compute resources within the workspace:

  * A SQL warehouse for [AWS](https://docs.databricks.com/compute/sql-warehouse/create.html), [Azure](https://learn.microsoft.com/azure/databricks/compute/sql-warehouse/create), or [GCP](https://docs.gcp.databricks.com/compute/sql-warehouse/create.html).

    The following video shows how to create a SQL warehouse if you do not already have one available, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the warehouse's owner to use it:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/N-Aw9-U3_fE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  * An all-purpose cluster for [AWS](https://docs.databricks.com/compute/use-compute.html), [Azure](https://learn.microsoft.com/azure/databricks/compute/use-compute), or [GCP](https://docs.gcp.databricks.com/compute/use-compute.html).

    The following video shows how to create an all-purpose cluster if you do not already have one available, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the cluster's owner to use it:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/apgibaelVY0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* The SQL warehouse's or cluster's **Server Hostname** and **HTTP Path** values for [AWS](https://docs.databricks.com/integrations/compute-details.html),
  [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details), or
  [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).

* Unity Catalog enabled in the workspace for [AWS](https://docs.databricks.com/data-governance/unity-catalog/get-started.html),
  [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/get-started), or
  [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/get-started.html).

### Unity Catalog resources

Within Unity Catalog:

* A catalog for [AWS](https://docs.databricks.com/catalogs/create-catalog.html), [Azure](https://learn.microsoft.com/azure/databricks/catalogs/create-catalog), or [GCP](https://docs.gcp.databricks.com/catalogs/create-catalog.html).
* A schema for [AWS](https://docs.databricks.com/schemas/create-schema.html), [Azure](https://learn.microsoft.com/azure/databricks/schemas/create-schema), or [GCP](https://docs.gcp.databricks.com/schemas/create-schema.html) within that catalog.

  <Note>
    A *schema* was formerly known as a *database* in Unity Catalog.
  </Note>
* A table for [AWS](https://docs.databricks.com/tables/managed.html), [Azure](https://learn.microsoft.com/azure/databricks/tables/managed), or [GCP](https://docs.gcp.databricks.com/tables/managed.html) within that schema.

  How you create and specify the table depends on whether you want document metadata stored as nested JSON in a single column, or want Unstructured to flatten the metadata and store each field as a separate column. For more information, see [Storing document metadata](#storing-document-metadata) later in this topic.

  * If you choose to store document metadata as a JSON blob, you can specify a table, or have the connector attempt to create a table for you automatically at run time.

    ***To specify an existing table***

    In the connector settings as described later in this article, specify the name of an existing table.

    The following video shows how to create a catalog, schema, and a table in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

    <iframe width="560" height="315" src="https://www.youtube.com/embed/ffNnq-6bpd4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

    If you want to use an existing table or create one yourself beforehand, the table must contain at least the following columns and data types:

    ```text theme={null}
    CREATE TABLE IF NOT EXISTS <catalog_name>.<schema_name>.<table_name> (
        id STRING NOT NULL PRIMARY KEY,
        record_id STRING NOT NULL,
        element_id STRING NOT NULL,
        text STRING,
        embeddings ARRAY<FLOAT>,
        type STRING,
        metadata VARIANT
    );
    ```

    <Info>
      In Databricks, a table's *schema* (its column definitions) is different from a *schema* in the catalog-schema object hierarchy in Unity Catalog.
    </Info>

    <Info>
      The `record_id`, `element_id`, and `id` fields are closely related, but each has a distinct purpose. For more information, see [How connectors use record IDs, element IDs, and IDs](/api-reference/record-element-id).
    </Info>

    ***To have the connector create a table***

    You can have the connector attempt to create a table for you automatically at run time. In the connector settings as described later in this article, do one of the following:

    * Specify the name of the table that you want the connector to attempt to create within the specified catalog and schema.
    * Leave the table name blank. The connector will attempt to create a table within the specified catalog and schema.

      For [Unstructured Pipelines](/pipelines/overview) and [Unstructured API](/api-reference/overview), the table is named `u<short-workflow-id>`.
      For the [Unstructured Ingest CLI and Ingest Python library](/open-source/ingestion/overview), the table is named `unstructuredautocreated`.

    The connector will attempt to create the table on behalf of the related Databricks workspace user or Databricks managed service principal that is referenced in the connector settings, as described later in these requirements.
    If successful, the table's owner is set as the related Databricks workspace user or Databricks managed service principal. The owner will have all Unity Catalog
    privileges on the table by default. No other Databricks workspace users or Databricks managed service principals will have any privileges on the table by default.

    <Warning>
      If the table's parent schema is not owned by the same Databricks workspace user or Databricks managed service principal that is
      referenced in the connector settings, then you should grant the new table's owner the `CREATE TABLE` privilege on that parent schema
      before the connector attempts to create the table. Otherwise, table creation could fail.
    </Warning>

    <Note>
      Using dashes (`-`) in the names of catalogs, schemas, and tables might cause isolated issues with the connector. It is
      recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, and tables.
    </Note>

  * If you choose to flatten the document metadata, you must create the table before you configure your connector. For more information, see [Storing document metadata](#storing-document-metadata) later in this topic.

### Authentication and permissions

* For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:

  * A Databricks managed service principal.
    This service principal must have the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.
  * The service principal's **UUID** (or **Client ID** or **Application ID**) value.
  * The OAuth **Secret** value for the service principal.

  To get this information, see Steps 1-3 of the instructions for [AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
  [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-m2m), or
  [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).

  <Note>
    For Azure Databricks, this connector only supports Databricks managed service principals for authentication.
    Microsoft Entra ID managed service principals are not supported.
  </Note>

  The following video shows how to create a Databricks managed service principal if you do not already have one available:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/wBmqv5DaA1E" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* For Databricks personal access token authentication to the workspace, the
  Databricks personal access token value for
  [AWS](https://docs.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users),
  [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat#azure-databricks-personal-access-tokens-for-workspace-users), or
  [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users).
  This token must be for the workspace user who
  has the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.

  The following video shows how to create a Databricks personal access token if you do not already have one available:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/OzEU2miAS6I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* The Databricks workspace user or Databricks managed service principal must have the following *minimum* set of permissions and privileges to write to an
  existing volume or table in Unity Catalog. If that workspace user or service principal owns these resources, then
  it has all necessary permissions and privileges by default. If someone else owns them, then the following permissions and privileges must be
  explicitly granted to the workspace user or service principal before using the connector:

  * To use an all-purpose cluster for access, `Can Restart` permission on that cluster. Learn how to check and set cluster permissions for
    [AWS](https://docs.databricks.com/compute/clusters-manage.html#compute-permissions),
    [Azure](https://learn.microsoft.com/azure/databricks/compute/clusters-manage#cluster-level-permissions), or
    [GCP](https://docs.gcp.databricks.com/compute/clusters-manage.html#compute-permissions).

  * To use a SQL warehouse for access, `Can use` permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for
    [AWS](https://docs.databricks.com/compute/sql-warehouse/create.html#manage-a-sql-warehouse),
    [Azure](https://learn.microsoft.com/azure/databricks/compute/sql-warehouse/create#manage), or
    [GCP](https://docs.gcp.databricks.com/compute/sql-warehouse/create.html#manage-a-sql-warehouse).

  * To access a Unity Catalog volume, the following privileges:

    * `USE CATALOG` on the volume's parent catalog in Unity Catalog.
    * `USE SCHEMA` on the volume's parent schema in Unity Catalog.
    * `READ VOLUME` and `WRITE VOLUME` on the volume.

    Learn how to check and set Unity Catalog privileges for
    [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
    [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
    [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).

  * To access a Unity Catalog table, the following privileges:

    * `USE CATALOG` on the table's parent catalog in Unity Catalog.
    * `USE SCHEMA` on the table's parent schema in Unity Catalog.
    * To create a new table, `CREATE TABLE` on the table's parent schema in Unity Catalog.
    * If the table already exists, `MODIFY` and `SELECT` on the table.

    Learn how to check and set Unity Catalog privileges for
    [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
    [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
    [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).

* Within Unity Catalog, a volume
  for [AWS](https://docs.databricks.com/volumes/utility-commands.html),
  [Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands),
  or [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html). The volume can be in the same
  schema as the table, or the volume and table can be in separate schemas. In either case, both of these
  schemas must share the same parent catalog.

  <Note>
    Using dashes (`-`) in the names of volumes might cause isolated issues with the connector. It is
    recommended to use underscores (`_`) instead of dashes in the names of volumes.
  </Note>

  The following video shows how to create a catalog, schema, and a volume in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/yF9DJphhQQc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />
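
The Unity Catalog privileges listed above for an existing table and volume can be granted with SQL `GRANT` statements, run by the owner of each object. The following Python sketch only builds the statement strings; it does not connect to Databricks. The function name `build_grants`, its parameters, and the sample names are hypothetical, and it assumes that the table and volume already exist and share the same parent catalog.

```python
def build_grants(principal, catalog, table_schema, table, volume_schema, volume):
    """Return GRANT statements for the minimum privileges on an existing table and volume.

    All names here are illustrative; this helper is not part of the
    Unstructured or Databricks SDKs.
    """

    def q(name):
        # Backtick-quote each identifier (principals such as application IDs need this).
        return f"`{name}`"

    to = f"TO {q(principal)}"
    statements = [f"GRANT USE CATALOG ON CATALOG {q(catalog)} {to}"]

    # A set removes the duplicate when the table and volume share one schema.
    for schema in sorted({table_schema, volume_schema}):
        statements.append(f"GRANT USE SCHEMA ON SCHEMA {q(catalog)}.{q(schema)} {to}")

    statements.append(
        f"GRANT MODIFY, SELECT ON TABLE {q(catalog)}.{q(table_schema)}.{q(table)} {to}"
    )
    statements.append(
        "GRANT READ VOLUME, WRITE VOLUME ON VOLUME "
        f"{q(catalog)}.{q(volume_schema)}.{q(volume)} {to}"
    )
    return statements


# Hypothetical names; replace them with your own before running the SQL.
for statement in build_grants(
    principal="<service-principal-application-id>",
    catalog="my_catalog",
    table_schema="my_schema",
    table="my_table",
    volume_schema="my_schema",
    volume="my_volume",
):
    print(statement + ";")
```

To let the connector create the table instead, the table-level grant would be replaced by `CREATE TABLE` on the table's parent schema, as described above.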

## Storing document metadata

  Unstructured offers the following options for storing [document metadata](/concepts/document-elements#metadata) in the destination table:

  * Store the metadata as a single nested JSON field.
  * *Flatten* the metadata by writing each metadata field as its own typed, queryable column.

  Each of these approaches has its advantages. In general, storing the metadata as a JSON blob works for most use cases, including when:

  * Performing dot.notation queries on the stored JSON is sufficient for your needs.
  * Document metadata schemas vary across file sources. When flattening document metadata, Unstructured drops fields that do not match existing columns in the schema.
  * You want the connector to automatically generate the destination table. This option is not supported when flattening document metadata.

  Flattening the metadata and storing it in multiple columns is most useful when:

  * You want to query individual metadata fields directly using standard SQL, without JSON parsing.
  * The business intelligence or analytics tools you are using require columnar data.

  Storing metadata as a JSON blob is the default.

  To store metadata as a JSON blob, clear **Flatten Metadata** when configuring the connector in Unstructured Pipelines, or set `flatten_metadata` to `false` in the Unstructured API. To flatten the metadata, select **Flatten Metadata**, or set `flatten_metadata` to `true`.

### How Unstructured flattens metadata

  If you choose to have Unstructured flatten document metadata, you must create the Delta Table to use as the destination, and specify it when creating the connector. In order to prevent possible data loss, Unstructured will not automatically create a new table.

  Considerations to keep in mind when creating the Delta Table:

  * The table must contain a column for each metadata field you want to store. Any metadata field that does not have a corresponding column in the table is dropped without raising an error, although the event is written to the logs. For more information, see [Logging and monitoring](/business/security-compliance/overview#logging-and-monitoring).
  * Do not declare metadata columns as `NOT NULL`. Missing metadata values are written as `NULL`.
  * Unstructured passes values through as their JSON-native types: strings, numbers, booleans, and so on. For example, no special formatting is applied to timestamp values.
  * Metadata fields that are lists are not further flattened. Lists remain single columns.
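
The column-matching rules above can be pictured with a short Python sketch. This is not Unstructured's implementation; the function name `fit_to_columns` and the sample column list are hypothetical. It only illustrates the described behavior: fields with no matching column are dropped and logged, and columns with no matching field receive `NULL` (`None` in Python).

```python
import logging

logger = logging.getLogger("flatten_metadata_sketch")


def fit_to_columns(flat_metadata, table_columns):
    """Shape one flattened metadata record to a fixed list of table columns."""
    # Fields without a matching column are dropped and recorded in the log.
    dropped = sorted(set(flat_metadata) - set(table_columns))
    if dropped:
        logger.warning("Dropping metadata fields with no column: %s", dropped)

    # Columns without a matching field are filled with None (SQL NULL),
    # which is why metadata columns should not be declared NOT NULL.
    row = {column: flat_metadata.get(column) for column in table_columns}
    return row, dropped


row, dropped = fit_to_columns(
    {
        "category_depth": 1,
        "languages": ["eng"],
        "data_source_url": "s3://my-source-bucket/path/chart-and-table.pdf",
    },
    ["category_depth", "languages", "filename"],
)
print(row)
print(dropped)
```

In this example, `data_source_url` has no column and is dropped, while the `filename` column has no value and is set to `None`. The list in `languages` is kept whole in a single column.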

### Metadata flattening example

  The following example demonstrates how Unstructured flattens metadata into separate columns. Consider the following metadata:

  ```json theme={null}
  {
    "metadata": {
      "category_depth": 1,
      "data_source": {
        "url": "s3://my-source-bucket/path/chart-and-table.pdf",
        "version": "864ae40b0719e976e98ba0a7b9fcba92",
        "record_locator": {
          "protocol": "s3",
          "remote_file_path": "s3://my-source-bucket/path/"
        }
      },
      "languages": ["eng"]
    }
  }
  ```

  When flattening metadata, Unstructured generates each field name from the full path to that field within the metadata structure, from the outermost object to the field itself, joined by underscores. For example, `protocol`, which is in the `record_locator` object, which is in turn within `data_source`, becomes `data_source_record_locator_protocol`:

  ```json theme={null}
  {
    "category_depth": 1,
    "data_source_url": "s3://my-source-bucket/path/chart-and-table.pdf",
    "data_source_version": "864ae40b0719e976e98ba0a7b9fcba92",
    "data_source_record_locator_protocol": "s3",
    "data_source_record_locator_remote_file_path": "s3://my-source-bucket/path/",
    "languages": ["eng"]
  }
  ```
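
The naming rule above can be reproduced with a few lines of Python, which may help when working out which columns a table needs. This is a minimal sketch of the rule as described in this article, not Unstructured's own code; the function name `flatten_metadata_fields` and its parameters are hypothetical.

```python
def flatten_metadata_fields(metadata, parent_key="", separator="_"):
    """Flatten nested objects into one level, joining key paths with underscores."""
    flat = {}
    for key, value in metadata.items():
        path = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested objects contribute their own fields under the longer path.
            flat.update(flatten_metadata_fields(value, path, separator))
        else:
            # Scalars and lists are kept as-is; lists are not flattened further.
            flat[path] = value
    return flat


element = {
    "metadata": {
        "category_depth": 1,
        "data_source": {
            "url": "s3://my-source-bucket/path/chart-and-table.pdf",
            "version": "864ae40b0719e976e98ba0a7b9fcba92",
            "record_locator": {
                "protocol": "s3",
                "remote_file_path": "s3://my-source-bucket/path/",
            },
        },
        "languages": ["eng"],
    }
}

flattened = flatten_metadata_fields(element["metadata"])
for column_name in flattened:
    print(column_name)
```

The printed names are the column names that a destination table would need in order to keep every field from this sample.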

## Examples

For more information on working with destination connectors using the Unstructured API, see [Destination endpoints](/api-reference/api/destination/destination-apis).

<CodeGroup>
  ```python Python SDK theme={null}
  import os

  from unstructured_client import UnstructuredClient
  from unstructured_client.models.operations import CreateDestinationRequest
  from unstructured_client.models.shared import CreateDestinationConnector

  with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
      response = client.destinations.create_destination(
          request=CreateDestinationRequest(
              create_destination_connector=CreateDestinationConnector(
                  name="<name>",
                  type="databricks_volume_delta_tables",
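                  config={
                      # Connection details for the cluster or SQL warehouse.
                      "server_hostname": "<server-hostname>",
                      "http_path": "<http-path>",
                      # For personal access token (PAT) authentication.
                      "token": "<token>",
                      # For OAuth M2M authentication.
                      "client_id": "<client-id>",
                      "client_secret": "<client-secret>",
                      # Target volume in Unity Catalog; "schema" is the volume's schema.
                      "volume": "<volume>",
                      "catalog": "<catalog>",
                      "volume_path": "<volume_path>",
                      "schema": "<schema>",
                      # Target table in Unity Catalog; "database" is the table's schema.
                      "database": "<database>",
                      "table_name": "<table_name>"
                  }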
              )
          )
      )

      print(response.destination_connector_information)
  ```

  ```bash curl theme={null}
  curl --request 'POST' --location \
  "$UNSTRUCTURED_API_URL/destinations" \
  --header 'accept: application/json' \
  --header "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  --header 'content-type: application/json' \
  --data \
  '{
      "name": "<name>",
      "type": "databricks_volume_delta_tables",
      "config": {
          "server_hostname": "<server-hostname>",
          "http_path": "<http-path>",
          "token": "<token>",
          "client_id": "<client-id>",
          "client_secret": "<client-secret>",
          "volume": "<volume>",
          "catalog": "<catalog>",
          "volume_path": "<volume_path>",
          "schema": "<schema>",
          "database": "<database>",
          "table_name": "<table_name>"
      }
  }'
  ```
</CodeGroup>

## Configuration settings

Replace the preceding placeholders as follows:

<ParamField body="name" type="string" required>
  A unique name for this connector.
</ParamField>

<ParamField body="server_hostname" type="string" required>
  The target Databricks cluster's or SQL warehouse's **Server Hostname** value.
</ParamField>

<ParamField body="http_path" type="string" required>
  The cluster's or SQL warehouse's **HTTP Path** value.
</ParamField>

<ParamField body="token" type="string">
  For Databricks personal access token (PAT) authentication, the target Databricks user's PAT value. Required for PAT authentication.
</ParamField>

<ParamField body="client_id" type="string">
  For Databricks OAuth machine-to-machine (M2M) authentication, the Databricks managed service principal's **UUID** (or **Client ID** or **Application ID**) value. Required for OAuth authentication.
</ParamField>

<ParamField body="client_secret" type="string">
  For Databricks OAuth M2M authentication, the Databricks managed service principal's OAuth **Secret** (client secret) value. Required for OAuth authentication.
</ParamField>

<ParamField body="catalog" type="string" required>
  The name of the catalog in Unity Catalog for the target volume and table in the Databricks workspace.
</ParamField>

<ParamField body="database" type="string" default="&#x22;default&#x22;">
  The name of the schema (formerly known as a database) in Unity Catalog for the target table. If the target table and volume are in the same schema, then `database` and `schema` have the same value.
</ParamField>

<ParamField body="table_name" type="string">
  The name of the target table in Unity Catalog. If specified but the table does not exist, the connector attempts to create it. If not specified, the connector attempts to create a table named `u<short-workflow-id>`. See the beginning of this article for additional technical requirements.
</ParamField>

<ParamField body="schema" type="string" default="&#x22;default&#x22;">
  The name of the schema (formerly known as a database) in Unity Catalog for the target volume. If the target volume and table are in the same schema, then `schema` and `database` have the same value.
</ParamField>

<ParamField body="volume" type="string" required>
  The name of the target volume in Unity Catalog.
</ParamField>

<ParamField body="volume_path" type="string">
  Any target folder path inside of the volume to use instead of the volume's root. If not otherwise specified, processing occurs at the volume's root.
</ParamField>

<Note>
  Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), tables, and volumes might cause isolated issues with the connector. It is
  recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, tables, and volumes.
</Note>

<ParamField body="flatten_metadata" type="boolean" default="&#x22;false&#x22;">
  Set to `true` to have Unstructured flatten the metadata and store each field as a separate column, or `false` to store document metadata as nested JSON in a single column. For more information, see [Storing document metadata](#storing-document-metadata).
</ParamField>
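
Before sending the request, it can help to check a `config` dictionary against the settings described above. The following Python sketch is one hypothetical way to do that. `check_config` is not part of the Unstructured SDK, and treating PAT and OAuth M2M as alternative authentication methods is an assumption based on the descriptions of `token`, `client_id`, and `client_secret`.

```python
REQUIRED = ("server_hostname", "http_path", "catalog", "volume")
NAMED_OBJECTS = ("catalog", "schema", "database", "table_name", "volume")


def check_config(config):
    """Return a list of human-readable problems found in a connector config."""
    problems = [
        f"missing required setting: {key}" for key in REQUIRED if not config.get(key)
    ]

    # Assumption: one complete authentication method is enough.
    has_pat = bool(config.get("token"))
    has_oauth = bool(config.get("client_id")) and bool(config.get("client_secret"))
    if not (has_pat or has_oauth):
        problems.append("set token (PAT) or both client_id and client_secret (OAuth M2M)")

    # Dashes in object names might cause issues; underscores are recommended.
    for key in NAMED_OBJECTS:
        if "-" in str(config.get(key, "")):
            problems.append(f"{key} contains a dash; underscores are recommended")

    # Flattening requires a table that you created and specified yourself.
    if config.get("flatten_metadata") and not config.get("table_name"):
        problems.append("flatten_metadata requires an existing table_name")

    return problems


# Hypothetical values for illustration only.
example = {
    "server_hostname": "<server-hostname>",
    "http_path": "<http-path>",
    "catalog": "my-catalog",
    "volume": "my_volume",
    "flatten_metadata": True,
}
for problem in check_config(example):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets you fix everything in a single pass before creating the connector.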

## Learn more

* <Icon icon="blog" />  [Integration Highlight: Databricks Delta Tables](https://unstructured.io/blog/integration-highlight-databricks-delta-tables)
