Delta Tables in Databricks
This article covers connecting Unstructured to Delta Tables in Databricks.
For information about connecting Unstructured to Delta Tables in Amazon S3 instead, see Delta Tables in Amazon S3.
For information about connecting Unstructured to Databricks Volumes instead, see Databricks Volumes.
Batch process all your records to store structured outputs in a Delta Table in Databricks.
The Delta Tables in Databricks connector provides two implementations:
- The SQL-based implementation enables you to have Unstructured write its processed data directly to a table in Unity Catalog. If you use this implementation, you do not need to provide a volume as described in the following requirements.
- The volume-based implementation enables you to have Unstructured write its processed data as a file to a volume in Unity Catalog. Unstructured then uses that file in the volume to write the file’s data to a table in Unity Catalog. The file remains in the volume.
The requirements are as follows.
- A workspace within the Databricks account for AWS, Azure, or GCP.
- One of the following compute resources within the workspace: a SQL warehouse or a cluster.
- The SQL warehouse’s or cluster’s Server Hostname and HTTP Path values for AWS, Azure, or GCP. (These values are also used in the connectivity check sketched after this list.)
- Unity Catalog enabled in the workspace for AWS, Azure, or GCP.
- Within Unity Catalog:
  - A catalog for AWS, Azure, or GCP.
  - A schema for AWS, Azure, or GCP within that catalog.
  - A table for AWS, Azure, or GCP within that schema. This table must contain the required column names and their data types.
- Within Unity Catalog, a volume for AWS, Azure, or GCP within the same schema as the table. (A volume is needed only for the volume-based implementation.)
- For Databricks personal access token authentication to the workspace, the Databricks personal access token value for AWS, Azure, or GCP. This token must be for the workspace user who has the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.
- For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:
  - A Databricks managed service principal. This service principal must have the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.
  - The service principal’s UUID value.
  - The OAuth Secret value for the service principal.

  To get this information, see Steps 1-3 of the instructions for AWS, Azure, or GCP.

  For Azure Databricks, this connector only supports Databricks managed service principals. Microsoft Entra ID managed service principals are not supported.
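Before moving on, you can optionally confirm that these values work together. The following is a minimal connectivity check, not part of the connector itself; it assumes Databricks personal access token authentication, that the databricks-sql-connector package is installed, and that the environment variables defined later on this page are already set.

```python
import os

# Optional connectivity check (not part of the connector): verify that the
# Server Hostname, HTTP Path, and personal access token can reach the target
# catalog, schema, and table. Assumes the databricks-sql-connector package
# is installed.
from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_HOST"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        # Confirm the table exists and is reachable with this identity.
        cursor.execute(
            f"DESCRIBE TABLE {os.environ['DATABRICKS_CATALOG']}."
            f"{os.getenv('DATABRICKS_DATABASE', 'default')}."
            f"{os.getenv('DATABRICKS_TABLE', 'elements')}"
        )
        for row in cursor.fetchall():
            print(row)
```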
Install the Delta Tables in Databricks connector dependencies. You might also need to install additional dependencies, depending on your needs. Learn more.
Set the following environment variables:
- DATABRICKS_HOST - The Databricks cluster’s or SQL warehouse’s Server Hostname value, represented by --server-hostname (CLI) or server_hostname (Python).
- DATABRICKS_HTTP_PATH - The cluster’s or SQL warehouse’s HTTP Path value, represented by --http-path (CLI) or http_path (Python).
- DATABRICKS_TOKEN - For Databricks personal access token authentication, the token’s value, represented by --token (CLI) or token (Python).
- DATABRICKS_CLIENT_ID - For Databricks managed service principal authentication, the service principal’s UUID value, represented by --client-id (CLI) or client_id (Python).
- DATABRICKS_CLIENT_SECRET - For Databricks managed service principal authentication, the service principal’s OAuth Secret value, represented by --client-secret (CLI) or client_secret (Python).
- DATABRICKS_CATALOG - The name of the catalog in Unity Catalog, represented by --catalog (CLI) or catalog (Python).
- DATABRICKS_DATABASE - The name of the schema (database) inside of the catalog, represented by --database (CLI) or database (Python). The default is default if not otherwise specified.
- DATABRICKS_TABLE - The name of the table inside of the schema (database), represented by --table-name (CLI) or table_name (Python). The default is elements if not otherwise specified.
For the SQL-based implementation, add these environment variables:
- DATABRICKS_RECORD_ID_KEY - The name of the column that uniquely identifies each record in the table, represented by --record-id-key (CLI) or record_id_key (Python).
- The maximum number of records to be transmitted per batch, represented by --batch-size (CLI) or batch_size (Python). The default is 50 if not otherwise specified.
For the volume-based implementation, add these environment variables:
- DATABRICKS_SCHEMA - The name of the schema (database) inside of the catalog, represented by --schema (CLI) or schema (Python). The name of this schema (database) must be the same as the value of the DATABRICKS_DATABASE environment variable and is required for compatibility. The default is default if not otherwise specified.
- DATABRICKS_VOLUME - The name of the volume inside of the schema (database), represented by --volume (CLI) or volume (Python).
- DATABRICKS_VOLUME_PATH - Optionally, a specific path inside of the volume that you want to start accessing from, starting from the volume’s root, represented by --volume-path (CLI) or volume_path (Python). The default is to start accessing from the volume’s root if not otherwise specified.
Also set these environment variables:

- UNSTRUCTURED_API_KEY - Your Unstructured API key value.
- UNSTRUCTURED_API_URL - Your Unstructured API URL.
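As a consolidated reference, the following sketch gathers the settings above in Python and groups them by the parameter names they represent. It is illustrative only; the variable names, parameter names, and defaults come from the lists above.

```python
import os

# Illustrative only: collect the settings described above, grouped by the
# Python parameter names they represent. Defaults match the ones stated above.

# Connection settings (personal access token or OAuth M2M, whichever you use).
connection_settings = {
    "server_hostname": os.environ["DATABRICKS_HOST"],
    "http_path": os.environ["DATABRICKS_HTTP_PATH"],
    "token": os.getenv("DATABRICKS_TOKEN"),                 # personal access token auth
    "client_id": os.getenv("DATABRICKS_CLIENT_ID"),         # managed service principal auth
    "client_secret": os.getenv("DATABRICKS_CLIENT_SECRET"),
}

# Destination table settings (both implementations).
table_settings = {
    "catalog": os.environ["DATABRICKS_CATALOG"],
    "database": os.getenv("DATABRICKS_DATABASE", "default"),
    "table_name": os.getenv("DATABRICKS_TABLE", "elements"),
}

# SQL-based implementation only.
sql_settings = {
    "record_id_key": os.getenv("DATABRICKS_RECORD_ID_KEY"),
    "batch_size": 50,  # or the value passed via --batch-size / batch_size
}

# Volume-based implementation only.
volume_settings = {
    "schema": os.getenv("DATABRICKS_SCHEMA", "default"),  # must match DATABRICKS_DATABASE
    "volume": os.getenv("DATABRICKS_VOLUME"),
    "volume_path": os.getenv("DATABRICKS_VOLUME_PATH"),   # defaults to the volume's root
}

# Unstructured API settings.
unstructured_settings = {
    "api_key": os.environ["UNSTRUCTURED_API_KEY"],
    "api_url": os.environ["UNSTRUCTURED_API_URL"],
}
```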
Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector:
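The following is a minimal sketch of such a pipeline with the Unstructured Ingest Python library, using the local source connector and the SQL-based implementation with personal access token authentication. The destination-side module path and configuration class names (DatabricksDeltaTablesConnectionConfig, DatabricksDeltaTablesAccessConfig, DatabricksDeltaTablesUploadStagerConfig, and DatabricksDeltaTablesUploaderConfig), as well as the LOCAL_FILE_INPUT_DIR environment variable, are assumptions for illustration; confirm the exact names and parameters in the Unstructured Ingest documentation before running this.

```python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

# Assumed import path and class names for the Delta Tables in Databricks
# destination; confirm them in the Unstructured Ingest connector reference.
from unstructured_ingest.v2.processes.connectors.databricks.volumes_table import (
    DatabricksDeltaTablesConnectionConfig,
    DatabricksDeltaTablesAccessConfig,
    DatabricksDeltaTablesUploadStagerConfig,
    DatabricksDeltaTablesUploaderConfig,
)

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Local source connector: read files from a local input directory.
        # LOCAL_FILE_INPUT_DIR is a placeholder environment variable.
        indexer_config=LocalIndexerConfig(input_path=os.environ["LOCAL_FILE_INPUT_DIR"]),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Partition the files by calling the Unstructured API.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.environ["UNSTRUCTURED_API_KEY"],
            partition_endpoint=os.environ["UNSTRUCTURED_API_URL"],
        ),
        # Delta Tables in Databricks destination (SQL-based implementation,
        # Databricks personal access token authentication).
        destination_connection_config=DatabricksDeltaTablesConnectionConfig(
            access_config=DatabricksDeltaTablesAccessConfig(
                token=os.environ["DATABRICKS_TOKEN"],
            ),
            server_hostname=os.environ["DATABRICKS_HOST"],
            http_path=os.environ["DATABRICKS_HTTP_PATH"],
        ),
        stager_config=DatabricksDeltaTablesUploadStagerConfig(),
        uploader_config=DatabricksDeltaTablesUploaderConfig(
            catalog=os.environ["DATABRICKS_CATALOG"],
            database=os.getenv("DATABRICKS_DATABASE", "default"),
            table_name=os.getenv("DATABRICKS_TABLE", "elements"),
            record_id_key=os.environ["DATABRICKS_RECORD_ID_KEY"],
        ),
    ).run()
```

For the volume-based implementation, you would instead pass the schema, volume, and optional volume_path settings described above to the destination configuration.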