This article covers connecting Unstructured to Delta Tables in Databricks.

For information about connecting Unstructured to Delta Tables in Amazon S3 instead, see Delta Tables in Amazon S3.

For information about connecting Unstructured to Databricks Volumes instead, see Databricks Volumes.

Send processed data from Unstructured to a Delta Table in Databricks.

The requirements are as follows:

  • A Databricks account on AWS, Azure, or GCP.

  • A workspace within the Databricks account for AWS, Azure, or GCP.

  • One of the following compute resources within the workspace:

    • A SQL warehouse for AWS, Azure, or GCP.

      The following video shows how to create a SQL warehouse if you do not already have one available, get its Server Hostname and HTTP Path values, and set permissions for someone other than the warehouse’s owner to use it:

    • An all-purpose cluster for AWS, Azure, or GCP.

      The following video shows how to create an all-purpose cluster if you do not already have one available, get its Server Hostname and HTTP Path values, and set permissions for someone other than the cluster’s owner to use it:

  • The SQL warehouse’s or cluster’s Server Hostname and HTTP Path values for AWS, Azure, or GCP.

  • Unity Catalog enabled in the workspace for AWS, Azure, or GCP.

  • Within Unity Catalog:

    • A catalog for AWS, Azure, or GCP.
    • A schema (formerly known as a database) for AWS, Azure, or GCP within that catalog.
    • A table for AWS, Azure, or GCP within that schema (formerly known as a database).

    The following video shows how to create a catalog, schema (formerly known as a database), and a table in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

    This table must contain the following column names and their data types:

    CREATE TABLE IF NOT EXISTS `<catalog-name>`.`<schema-name>`.elements (
        id STRING NOT NULL PRIMARY KEY,
        record_id STRING,
        element_id STRING,
        text STRING,
        embeddings ARRAY<FLOAT>,
        type STRING,
        date_created TIMESTAMP,
        date_modified TIMESTAMP,
        date_processed TIMESTAMP,
        permissions_data STRING,
        filesize_bytes FLOAT,
        url STRING,
        version STRING,
        record_locator STRING,
        category_depth INT,
        parent_id STRING,
        attached_filename STRING,
        filetype STRING,
        last_modified TIMESTAMP,
        file_directory STRING,
        filename STRING,
        languages ARRAY<STRING>,
        page_number STRING,
        links STRING,
        page_name STRING,
        link_urls STRING,
        link_texts STRING,
        sent_from STRING,
        sent_to STRING,
        subject STRING,
        section STRING,
        header_footer_type STRING,
        emphasized_text_contents STRING,
        emphasized_text_tags STRING,
        text_as_html STRING,
        regex_metadata STRING,
        detection_class_prob FLOAT,
        is_continuation BOOLEAN,
        orig_elements STRING,
        coordinates_points STRING,
        coordinates_system STRING,
        coordinates_layout_width FLOAT,
        coordinates_layout_height FLOAT
    );
    

    In Databricks, a table’s schema (its column names and data types) is different from a schema (formerly known as a database) in the catalog-schema object hierarchy in Unity Catalog.

  • Within Unity Catalog, a volume for AWS, Azure, or GCP. The volume can be in the same schema (formerly known as a database) as the table, or the volume and table can be in separate schemas. In either case, both of these schemas must share the same parent catalog.

    The following video shows how to create a catalog, schema (formerly known as a database), and a volume in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

  • For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:

    • A Databricks managed service principal. This service principal must have the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
    • The service principal’s UUID (or Client ID or Application ID) value.
    • The OAuth Secret value for the service principal.

    To get this information, see Steps 1-3 of the instructions for AWS, Azure, or GCP.

    For Azure Databricks, this connector only supports Databricks managed service principals for authentication. Microsoft Entra ID managed service principals are not supported.

    The following video shows how to create a Databricks managed service principal if you do not already have one available:

  • For Databricks personal access token authentication to the workspace, the Databricks personal access token value for AWS, Azure, or GCP. This token must be for the workspace user who has the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.

    The following video shows how to create a Databricks personal access token if you do not already have one available:

  • The Databricks workspace user or Databricks managed service principal must have the following minimum set of permissions and privileges to write to an existing volume or table in Unity Catalog:

    • To use an all-purpose cluster for access, Can Restart permission on that cluster. Learn how to check and set cluster permissions for AWS, Azure, or GCP.

    • To use a SQL warehouse for access, Can use permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for AWS, Azure, or GCP.

    • To access a Unity Catalog volume, the following privileges:

      • USE CATALOG on the volume’s parent catalog in Unity Catalog.
      • USE SCHEMA on the volume’s parent schema (formerly known as a database) in Unity Catalog.
      • READ VOLUME and WRITE VOLUME on the volume.

      Learn how to check and set Unity Catalog privileges for AWS, Azure, or GCP.

    • To access a Unity Catalog table, the following privileges:

      • USE CATALOG on the table’s parent catalog in Unity Catalog.
      • USE SCHEMA on the table’s parent schema (formerly known as a database) in Unity Catalog.
      • MODIFY and SELECT on the table.

      Learn how to check and set Unity Catalog privileges for AWS, Azure, or GCP.
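As a sketch, the Unity Catalog privileges listed above can be granted with SQL GRANT statements. The object names here (my_catalog, my_schema, my_volume, elements) and the grantee are placeholders; substitute your own catalog, schema, volume, and table names, and the target workspace user’s email address or the service principal’s application ID:

```sql
-- Placeholder names throughout; replace with your own values.
GRANT USE CATALOG ON CATALOG my_catalog TO `someone@example.com`;
GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `someone@example.com`;

-- Volume access:
GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `someone@example.com`;

-- Table access:
GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.elements TO `someone@example.com`;
```

If the volume and table are in different schemas under the same catalog, grant USE SCHEMA on each parent schema.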

To create the destination connector:

  1. On the sidebar, click Connectors.
  2. Click Destinations.
  3. Click New or Create Connector.
  4. Give the connector a unique Name.
  5. In the Provider area, click Delta Tables in Databricks.
  6. Click Continue.
  7. Follow the on-screen instructions to fill in the fields as described later on this page.
  8. Click Save and Test.

Fill in the following fields:

  • Name (required): A unique name for this connector.

  • Server Hostname (required): The target Databricks cluster’s or SQL warehouse’s Server Hostname value.

  • HTTP Path (required): The cluster’s or SQL warehouse’s HTTP Path value.

  • Token (required for PAT authentication): For Databricks personal access token (PAT) authentication, the target Databricks user’s PAT value.

  • UUID and OAuth Secret (required for OAuth authentication): For Databricks OAuth machine-to-machine (M2M) authentication, the Databricks managed service principal’s UUID (or Client ID or Application ID) and OAuth Secret (client secret) values.

  • Catalog (required): The name of the catalog in Unity Catalog for the target volume and table in the Databricks workspace.

  • Database: The name of the schema (formerly known as a database) in Unity Catalog for the target table. The default is default if not otherwise specified.

    If the target table and volume are in the same schema (formerly known as a database), then Database and Schema will have the same names.

  • Table Name (required): The name of the target table in Unity Catalog.

  • Schema: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is default if not otherwise specified.

    If the target volume and table are in the same schema (formerly known as a database), then Schema and Database will have the same names.

  • Volume (required): The name of the target volume in Unity Catalog.

  • Volume Path: Any target folder path inside the volume to use instead of the volume’s root. If not otherwise specified, processing occurs at the volume’s root.
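After the connector runs, one quick way to confirm that processed elements landed in the target table is to query it by the three-level name built from the Catalog, Database, and Table Name fields above. The names used here (my_catalog, default, elements, my_volume) are placeholders:

```sql
-- Placeholders: substitute the Catalog, Database, and Table Name
-- values that you entered in the connector's fields.
SELECT type, count(*) AS element_count
FROM my_catalog.default.elements
GROUP BY type;

-- The target volume's contents can also be listed by its /Volumes path,
-- built from the Catalog, Schema, and Volume fields:
LIST '/Volumes/my_catalog/default/my_volume/';
```

If Volume Path was specified, append that folder path to the /Volumes path in the LIST statement.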