This article covers connecting Unstructured to Delta Tables in Databricks. For information about connecting Unstructured to Delta Tables in Amazon S3 instead, see Delta Tables in Amazon S3. For information about connecting Unstructured to Databricks Volumes instead, see Databricks Volumes.
If you’re new to Unstructured, read this note first. Before you can create a destination connector, you must first sign in to your Unstructured account:
- If you do not already have an Unstructured account, sign up for free. After you sign up, you are automatically signed in to your new Unstructured Starter account, at https://platform.unstructured.io. To sign up for a Team or Enterprise account instead, contact Unstructured Sales, or learn more.
- If you already have an Unstructured Starter or Team account and are not already signed in, sign in to your account at https://platform.unstructured.io. For an Enterprise account, see your Unstructured account administrator for instructions, or email Unstructured Support at support@unstructured.io.
- After you sign in to your Unstructured Starter account, click API Keys on the sidebar. For a Team or Enterprise account, before you click API Keys, make sure you have selected the organizational workspace you want to create an API key for. Each API key works with one and only one organizational workspace. Learn more.
- Click Generate API Key.
- Follow the on-screen instructions to finish generating the key.
- Click the Copy icon next to your new key to add the key to your system’s clipboard. If you lose this key, simply return and click the Copy icon again.
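The copied key can then be supplied to the Unstructured Python SDK. The following is a minimal sketch; the environment variable name `UNSTRUCTURED_API_KEY` is this article's convention, not a requirement:

```python
import os

from unstructured_client import UnstructuredClient

# Assumes the key copied from the UI was exported as UNSTRUCTURED_API_KEY.
# Reading it from the environment avoids hard-coding the secret in source.
client = UnstructuredClient(
    api_key_auth=os.environ["UNSTRUCTURED_API_KEY"],
)
```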
- A Databricks account on AWS, Azure, or GCP.
- A workspace within the Databricks account for AWS, Azure, or GCP.
- One of the following compute resources within the workspace:
- A SQL warehouse for AWS, Azure, or GCP. The following video shows how to create a SQL warehouse if you do not already have one available, get its Server Hostname and HTTP Path values, and set permissions for someone other than the warehouse’s owner to use it:
- An all-purpose cluster for AWS, Azure, or GCP. The following video shows how to create an all-purpose cluster if you do not already have one available, get its Server Hostname and HTTP Path values, and set permissions for someone other than the cluster’s owner to use it:
- The SQL warehouse’s or cluster’s Server Hostname and HTTP Path values for AWS, Azure, or GCP. A connectivity check that uses these two values is sketched after this list.
- Unity Catalog enabled in the workspace for AWS, Azure, or GCP.
- Within Unity Catalog:
- A catalog for AWS, Azure, or GCP.
- A schema (formerly known as a database) for AWS, Azure, or GCP within that catalog.
- A table for AWS, Azure, or GCP within that schema (formerly known as a database).
  You can have the connector attempt to create a table for you automatically at run time. To do this, in the connector settings as described later in this article, do one of the following:
  - Specify the name of the table that you want the connector to attempt to create within the specified catalog and schema (formerly known as a database).
  - Leave the table name blank. The connector will attempt to create a table within the specified catalog and schema (formerly known as a database). For the Unstructured UI and Unstructured API, the table is named `u<short-workflow-id>`. For the Unstructured Ingest CLI and Ingest Python library, the table is named `unstructuredautocreated`.
  If the table’s parent schema (formerly known as a database) is not owned by the same Databricks workspace user or Databricks managed service principal that is referenced in the connector settings, then you should grant the new table’s owner the `CREATE TABLE` privilege on that parent schema (formerly known as a database) before the connector attempts to create the table. Otherwise, table creation could fail.
  Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), and tables might cause isolated issues with the connector. It is recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, and tables.
  In Databricks, a table’s schema is different from a schema (formerly known as a database) in a catalog-schema object relationship in Unity Catalog.
- Within Unity Catalog, a volume for AWS, Azure, or GCP. The volume can be in the same schema (formerly known as a database) as the table, or the volume and table can be in separate schemas. In either case, both of these schemas must share the same parent catalog.
  The following video shows how to create a catalog, schema (formerly known as a database), and a volume in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:
  Using dashes (`-`) in the names of volumes might cause isolated issues with the connector. It is recommended to use underscores (`_`) instead of dashes in the names of volumes.
- For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:
  - A Databricks managed service principal. This service principal must have the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
  - The service principal’s UUID (or Client ID or Application ID) value.
  - The OAuth Secret value for the service principal.
  The following video shows how to create a Databricks managed service principal if you do not already have one available:
  For Azure Databricks, this connector only supports Databricks managed service principals for authentication. Microsoft Entra ID managed service principals are not supported.
- For Databricks personal access token authentication to the workspace, the Databricks personal access token value for AWS, Azure, or GCP. This token must be for the workspace user who has the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse. The following video shows how to create a Databricks personal access token if you do not already have one available:
- The Databricks workspace user or Databricks managed service principal must have the following minimum set of permissions and privileges to write to an existing volume or table in Unity Catalog. If the owner of these resources is that Databricks workspace user or Databricks managed service principal, then they will have all necessary permissions and privileges by default. If the owner is someone else, then the following permissions and privileges must be explicitly granted to them before using the connector (a sketch of the corresponding `GRANT` statements appears after this list):
  - To use an all-purpose cluster for access, Can Restart permission on that cluster. Learn how to check and set cluster permissions for AWS, Azure, or GCP.
  - To use a SQL warehouse for access, Can use permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for AWS, Azure, or GCP.
  - To access a Unity Catalog volume, the following privileges:
    - `USE CATALOG` on the volume’s parent catalog in Unity Catalog.
    - `USE SCHEMA` on the volume’s parent schema (formerly known as a database) in Unity Catalog.
    - `READ VOLUME` and `WRITE VOLUME` on the volume.
  - To access a Unity Catalog table, the following privileges:
    - `USE CATALOG` on the table’s parent catalog in Unity Catalog.
    - `USE SCHEMA` on the table’s parent schema (formerly known as a database) in Unity Catalog.
    - To create a new table, `CREATE TABLE` on the table’s parent schema (formerly known as a database) in Unity Catalog.
    - If the table already exists, `MODIFY` and `SELECT` on the table.
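To verify the Server Hostname and HTTP Path values mentioned earlier, and to confirm that your credentials can reach the compute resource, you can open a short-lived connection with the `databricks-sql-connector` Python package. This is a minimal sketch, not part of the connector itself; the hostname and path shown are placeholders:

```python
import os

# Requires: pip install databricks-sql-connector
from databricks import sql

# All three values below are placeholders; substitute your own. The Server
# Hostname and HTTP Path come from the SQL warehouse's or cluster's
# connection details in the Databricks workspace UI.
with sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token=os.environ["DATABRICKS_TOKEN"],  # a personal access token
) as connection:
    with connection.cursor() as cursor:
        # A trivial query proves the compute resource is reachable and the
        # credentials are valid.
        cursor.execute("SELECT current_catalog()")
        print(cursor.fetchone())
```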
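If the target volume or table is owned by someone other than the workspace user or service principal referenced in the connector settings, the owner (or a metastore admin) can grant the privileges listed above in SQL. The following sketch reuses the same connection pattern; the catalog, schema, volume, table, and principal names are all placeholders:

```python
import os

from databricks import sql

# The principal receiving the grants: a workspace user's email address or a
# managed service principal's Application ID (placeholder value shown).
principal = "someone@example.com"

# Placeholder object names; substitute your own.
grants = [
    "GRANT USE CATALOG ON CATALOG my_catalog TO `{p}`",
    "GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `{p}`",
    "GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `{p}`",
    # Needed only if the connector will create the table at run time:
    "GRANT CREATE TABLE ON SCHEMA my_catalog.my_schema TO `{p}`",
    # Needed only if the table already exists:
    "GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.my_table TO `{p}`",
]

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        for statement in grants:
            cursor.execute(statement.format(p=principal))
```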
- `<name>` (required): A unique name for this connector.
- `<server-hostname>` (required): The target Databricks cluster’s or SQL warehouse’s Server Hostname value.
- `<http-path>` (required): The cluster’s or SQL warehouse’s HTTP Path value.
- `<token>` (required for PAT authentication): For Databricks personal access token (PAT) authentication, the target Databricks user’s PAT value.
- `<client-id>` and `<client-secret>` (required for OAuth authentication): For Databricks OAuth machine-to-machine (M2M) authentication, the Databricks managed service principal’s UUID (or Client ID or Application ID) and OAuth Secret (client secret) values.
- `<catalog>` (required): The name of the catalog in Unity Catalog for the target volume and table in the Databricks workspace.
- `<database>`: The name of the schema (formerly known as a database) in Unity Catalog for the target table. The default is `default` if not otherwise specified. If the target table and volume are in the same schema (formerly known as a database), then `<database>` and `<schema>` will have the same values.
- `<table_name>`: The name of the target table in Unity Catalog.
  - If a table name is specified, but a table with that name does not exist within the specified schema (formerly known as a database), the connector attempts to create a table with that name within that schema.
  - If no table name is specified, the connector attempts to create a table named `u<short-workflow-id>` within the specified schema (formerly known as a database).
- `<schema>`: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is `default` if not otherwise specified. If the target volume and table are in the same schema (formerly known as a database), then `<schema>` and `<database>` will have the same values.
- `<volume>` (required): The name of the target volume in Unity Catalog.
- `<volume_path>`: Any target folder path inside the volume to use instead of the volume’s root. If not otherwise specified, processing occurs at the volume’s root.
Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), tables, and volumes might cause isolated issues with the connector. It is recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, tables, and volumes.