Delta Tables in Databricks
This article covers connecting Unstructured to Delta Tables in Databricks.
For information about connecting Unstructured to Delta Tables in Amazon S3 instead, see Delta Tables in Amazon S3.
For information about connecting Unstructured to Databricks Volumes instead, see Databricks Volumes.
Batch process all your records to store structured outputs in a Delta Table in Databricks.
The Delta Tables in Databricks connector provides two implementations:
- The SQL-based implementation enables you to have Unstructured write its processed data directly to a table in Unity Catalog. If you use this implementation, you do not need to provide a volume as described in the following requirements.
- The volume-based implementation enables you to have Unstructured write its processed data as a file to a volume in Unity Catalog. Unstructured then uses that file in the volume to write the file’s data to a table in Unity Catalog. The file remains in the volume.
The requirements are as follows.
- A workspace within the Databricks account for AWS, Azure, or GCP.
- One of the following compute resources within the workspace:
  - A SQL warehouse for AWS, Azure, or GCP.
  - An all-purpose cluster for AWS, Azure, or GCP.
- The SQL warehouse’s or cluster’s Server Hostname and HTTP Path values for AWS, Azure, or GCP.
- Unity Catalog enabled in the workspace for AWS, Azure, or GCP.
- Within Unity Catalog:
  - A catalog for AWS, Azure, or GCP.
  - A schema (formerly known as a database) for AWS, Azure, or GCP within that catalog.
  - A table for AWS, Azure, or GCP within that schema (formerly known as a database).
  This table must contain the column names and data types that Unstructured expects for its processed output.
  In Databricks, a table’s schema is different from a schema (formerly known as a database) in the catalog-schema object relationship in Unity Catalog.
- Within Unity Catalog, a volume for AWS, Azure, or GCP. The volume can be in the same schema (formerly known as a database) as the table, or the volume and table can be in separate schemas. In either case, both of these schemas must share the same parent catalog.
- For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:
  - A Databricks managed service principal. This service principal must have the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
  - The service principal’s UUID (or Client ID or Application ID) value.
  - The OAuth Secret value for the service principal.
  To get this information, see Steps 1-3 of the instructions for AWS, Azure, or GCP.
  For Azure Databricks, this connector only supports Databricks managed service principals for authentication. Microsoft Entra ID managed service principals are not supported.
- For Databricks personal access token authentication to the workspace, the Databricks personal access token value for AWS, Azure, or GCP. This token must be for the workspace user who has the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
- The Databricks workspace user or Databricks managed service principal must have the following minimum set of permissions and privileges to write to an existing volume or table in Unity Catalog:
  - To use an all-purpose cluster for access, Can Restart permission on that cluster. Learn how to check and set cluster permissions for AWS, Azure, or GCP.
  - To use a SQL warehouse for access, Can use permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for AWS, Azure, or GCP.
  - To access a Unity Catalog volume, the following privileges:
    - `USE CATALOG` on the volume’s parent catalog in Unity Catalog.
    - `USE SCHEMA` on the volume’s parent schema (formerly known as a database) in Unity Catalog.
    - `READ VOLUME` and `WRITE VOLUME` on the volume.
    Learn how to check and set Unity Catalog privileges for AWS, Azure, or GCP.
  - To access a Unity Catalog table, the following privileges:
    - `USE CATALOG` on the table’s parent catalog in Unity Catalog.
    - `USE SCHEMA` on the table’s parent schema (formerly known as a database) in Unity Catalog.
    - `MODIFY` and `SELECT` on the table.
    Learn how to check and set Unity Catalog privileges for AWS, Azure, or GCP.
- The Delta Tables in Databricks connector dependencies (a sample install command is sketched below). You might also need to install additional dependencies, depending on your needs. Learn more.
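For example, a minimal install sketch, assuming you are installing the Unstructured Ingest library from PyPI; the extra name `databricks-delta-tables` is an assumption, so confirm the exact extra for this connector in the Unstructured documentation for your version:

```bash
# Install the Unstructured Ingest library with the Delta Tables in Databricks extra.
# The extra name below is an assumption; confirm it in the Unstructured documentation.
pip install "unstructured-ingest[databricks-delta-tables]"
```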
Set the following environment variables:
- `DATABRICKS_HOST` - The Databricks cluster’s or SQL warehouse’s Server Hostname value, represented by `--server-hostname` (CLI) or `server_hostname` (Python).
- `DATABRICKS_HTTP_PATH` - The cluster’s or SQL warehouse’s HTTP Path value, represented by `--http-path` (CLI) or `http_path` (Python).
- `DATABRICKS_TOKEN` - For Databricks personal access token authentication, the token’s value, represented by `--token` (CLI) or `token` (Python).
- `DATABRICKS_CLIENT_ID` - For Databricks managed service principal authentication, the service principal’s UUID (or Client ID or Application ID) value, represented by `--client-id` (CLI) or `client_id` (Python).
- `DATABRICKS_CLIENT_SECRET` - For Databricks managed service principal authentication, the service principal’s OAuth Secret value, represented by `--client-secret` (CLI) or `client_secret` (Python).
- `DATABRICKS_CATALOG` - The name of the catalog in Unity Catalog, represented by `--catalog` (CLI) or `catalog` (Python).
- `DATABRICKS_DATABASE` - The name of the schema (formerly known as a database) inside of the catalog for the target table, represented by `--database` (CLI) or `database` (Python). The default is `default` if not otherwise specified. If you are also using a volume, and the target table and volume are in the same schema (formerly known as a database), then `DATABRICKS_DATABASE` and `DATABRICKS_SCHEMA` will have the same values.
- `DATABRICKS_TABLE` - The name of the table inside of the schema (formerly known as a database), represented by `--table-name` (CLI) or `table_name` (Python). The default is `elements` if not otherwise specified.
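For example, a minimal Bash sketch for setting these shared variables, using placeholder values (shown here in a typical AWS workspace and SQL warehouse format) that you would replace with your own:

```bash
# Databricks SQL warehouse or all-purpose cluster connection details.
export DATABRICKS_HOST="<workspace-instance-name>.cloud.databricks.com"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/<warehouse-id>"

# Authentication: set either a personal access token...
export DATABRICKS_TOKEN="<personal-access-token>"
# ...or a Databricks managed service principal's OAuth M2M credentials (not both).
# export DATABRICKS_CLIENT_ID="<service-principal-uuid>"
# export DATABRICKS_CLIENT_SECRET="<oauth-secret>"

# Target location in Unity Catalog.
export DATABRICKS_CATALOG="<catalog-name>"
export DATABRICKS_DATABASE="default"
export DATABRICKS_TABLE="elements"
```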
For the SQL-based implementation, add these environment variables:
- `DATABRICKS_RECORD_ID_KEY` - The name of the column that uniquely identifies each record in the table, represented by `--record-id-key` (CLI) or `record_id_key` (Python).
- The maximum number of records to be transmitted per batch, represented by `--batch-size` (CLI) or `batch_size` (Python). The default is `50` if not otherwise specified.
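For example, continuing the Bash sketch above for the SQL-based implementation (the column name is a placeholder):

```bash
# Column that uniquely identifies each record in the target table.
export DATABRICKS_RECORD_ID_KEY="<record-id-column-name>"
```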
For the volume-based implementation, add these environment variables:
- `DATABRICKS_SCHEMA` - The name of the schema (formerly known as a database) inside of the catalog for the target volume, represented by `--schema` (CLI) or `schema` (Python). The default is `default` if not otherwise specified. If the target volume and table are in the same schema (formerly known as a database), then `DATABRICKS_SCHEMA` and `DATABRICKS_DATABASE` will have the same values.
- `DATABRICKS_VOLUME` - The name of the volume inside of the schema (formerly known as a database), represented by `--volume` (CLI) or `volume` (Python).
- `DATABRICKS_VOLUME_PATH` - Optionally, a specific path inside of the volume that you want to start accessing from, starting from the volume’s root, represented by `--volume-path` (CLI) or `volume_path` (Python). The default is to start accessing from the volume’s root if not otherwise specified.
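For example, continuing the Bash sketch above for the volume-based implementation (the volume name and path are placeholders):

```bash
# Target volume in Unity Catalog for the volume-based implementation.
export DATABRICKS_SCHEMA="default"
export DATABRICKS_VOLUME="<volume-name>"
# Optional: a path inside the volume to write to, relative to the volume's root.
export DATABRICKS_VOLUME_PATH="<path/inside/volume>"
```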
Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector; a sketch of the call appears after the following notes.
This example sends files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.
For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:
- To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).
  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
  - `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  - `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  - The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`
- To send files to the Unstructured Platform Partition Endpoint for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).
  Unstructured also requires an Unstructured API key and API URL, which you provide by adding the following:
  - `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  - `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  - The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.
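For example, a minimal CLI sketch for the volume-based implementation, using the local source connector and the environment variables set above. The `local` and `databricks-delta-tables` subcommand names and the `--input-path` option follow the Ingest CLI's general source-to-destination pattern but are assumptions here; run `unstructured-ingest --help` to confirm the exact command names and the full set of options for your version:

```bash
# Read files from a local directory, send them to Unstructured for processing,
# and write the processed data to a Delta Table in Databricks via a Unity Catalog volume.
# Subcommand names and option spellings are assumptions; verify with --help.
unstructured-ingest \
  local \
    --input-path "<path/to/local/files>" \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
  databricks-delta-tables \
    --server-hostname "$DATABRICKS_HOST" \
    --http-path "$DATABRICKS_HTTP_PATH" \
    --token "$DATABRICKS_TOKEN" \
    --catalog "$DATABRICKS_CATALOG" \
    --database "$DATABRICKS_DATABASE" \
    --table-name "$DATABRICKS_TABLE" \
    --schema "$DATABRICKS_SCHEMA" \
    --volume "$DATABRICKS_VOLUME" \
    --volume-path "$DATABRICKS_VOLUME_PATH"
```

The Python equivalent passes the same values through the corresponding parameters noted above (for example, `server_hostname`, `http_path`, and `table_name`).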