First time creating a connector? Read this first.
Requirements
You will need:
Accounts and resources
- A Databricks account on Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud Platform (GCP).
- A workspace within the Databricks account for AWS, Azure, or GCP.
-
One of the following compute resources within the workspace:
-
A SQL warehouse for AWS, Azure, or GCP.
The following video shows how to create a SQL warehouse if you do not already have one available, get its Server Hostname and HTTP Path values, and set permissions for someone other than the warehouse’s owner to use it:
-
An all-purpose cluster for AWS, Azure, or GCP.
The following video shows how to create an all-purpose cluster if you do not already have one available, get its Server Hostname and HTTP Path values, and set permissions for someone other than the cluster’s owner to use it:
- The SQL warehouse’s or cluster’s Server Hostname and HTTP Path values for AWS, Azure, or GCP.
- Unity Catalog enabled in the workspace for AWS, Azure, or GCP.
Unity Catalog resources
Within Unity Catalog:
- A catalog for AWS, Azure, or GCP.
-
A schema for AWS, Azure, or GCP within that catalog.
A schema was formerly known as a database in Unity Catalog.
-
A table for AWS, Azure, or GCP within that schema.
How you create and specify the table depends on whether you want document metadata stored as nested JSON in a single column, or want Unstructured to flatten the metadata and store each field as a separate column. For more information, see Storing document metadata later in this topic.
-
If you choose to store document metadata as a JSON blob, you can specify a table, or have the connector attempt to create a table for you automatically at run time.
To specify an existing table
In the connector settings as described later in this article, specify the name of an existing table.
The following video shows how to create a catalog, schema, and a table in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:
If you want to use an existing table or create one yourself beforehand, this table must contain at minimum the following column names and their data types:
In Databricks, a table’s schema is different than a schema in a catalog-schema object relationship in Unity Catalog.
The record_id, element_id, and id fields are closely related, but each has a distinct purpose. For more information, see How connectors use record IDs, element IDs, and IDs.
To have the connector create a table
You can have the connector attempt to create a table for you automatically at run time. In the connector settings as described later in this article, do one of the following:
- Specify the name of the table that you want the connector to attempt to create within the specified catalog and schema.
- Leave the table name blank. The connector will attempt to create a table within the specified catalog and schema.
  For Unstructured Pipelines and Unstructured API, the table is named u<short-workflow-id>. For the Unstructured Ingest CLI and Ingest Python library, the table is named unstructuredautocreated.
Using dashes (-) in the names of catalogs, schemas, and tables might cause isolated issues with the connector. It is recommended to use underscores (_) instead of dashes in the names of catalogs, schemas, and tables.
- If you choose to flatten the document metadata, you must create the table before you configure your connector. For more information, see Storing document metadata later in this topic.
Authentication and permissions
-
For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:
- A Databricks managed service principal. This service principal must have the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.
- The service principal’s UUID (or Client ID or Application ID) value.
- The OAuth Secret value for the service principal.
The following video shows how to create a Databricks managed service principal if you do not already have one available:
For Azure Databricks, this connector only supports Databricks managed service principals for authentication. Microsoft Entra ID managed service principals are not supported.
-
For Databricks personal access token authentication to the workspace, the Databricks personal access token value for AWS, Azure, or GCP. This token must be for the workspace user who has the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.
The following video shows how to create a Databricks personal access token if you do not already have one available:
-
The Databricks workspace user or Databricks managed service principal must have the following minimum set of permissions and privileges to write to an
existing volume or table in Unity Catalog. If the owner of these is that Databricks workspace user or Databricks managed service principal, then
they will have all necessary permissions and privileges by default. If the owner is someone else, then the following permissions and privileges must be
explicitly granted to them before using the connector:
- To use an all-purpose cluster for access, Can Restart permission on that cluster. Learn how to check and set cluster permissions for AWS, Azure, or GCP.
- To use a SQL warehouse for access, Can use permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for AWS, Azure, or GCP.
- To access a Unity Catalog volume, the following privileges:
  - USE CATALOG on the volume’s parent catalog in Unity Catalog.
  - USE SCHEMA on the volume’s parent schema in Unity Catalog.
  - READ VOLUME and WRITE VOLUME on the volume.
- To access a Unity Catalog table, the following privileges:
  - USE CATALOG on the table’s parent catalog in Unity Catalog.
  - USE SCHEMA on the table’s parent schema in Unity Catalog.
  - To create a new table, CREATE TABLE on the table’s parent schema in Unity Catalog.
  - If the table already exists, MODIFY and SELECT on the table.
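The Unity Catalog privileges listed above can be written out as SQL GRANT statements. The following is a minimal sketch that only builds the statement text, assuming the standard Unity Catalog form GRANT <privileges> ON <securable> TO <principal>. The function name, parameters, and example values are hypothetical, and the output should be checked against the Databricks documentation for your cloud before running it. Cluster and SQL warehouse permissions (Can Restart, Can use) are not covered here because they are set separately from Unity Catalog privileges.

```python
def quote(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_grants(principal, catalog, table_schema, table,
                 volume_schema, volume, table_exists=True):
    """Return GRANT statements for the minimum privileges listed above.

    All argument values are illustrative; nothing is sent to Databricks.
    """
    who = quote(principal)
    cat = quote(catalog)
    t_schema = f"{cat}.{quote(table_schema)}"
    v_schema = f"{cat}.{quote(volume_schema)}"

    statements = [f"GRANT USE CATALOG ON CATALOG {cat} TO {who}"]
    # The table and volume may share a schema, so de-duplicate while keeping order.
    for schema in dict.fromkeys([t_schema, v_schema]):
        statements.append(f"GRANT USE SCHEMA ON SCHEMA {schema} TO {who}")
    statements.append(
        f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME {v_schema}.{quote(volume)} TO {who}"
    )
    if table_exists:
        statements.append(
            f"GRANT MODIFY, SELECT ON TABLE {t_schema}.{quote(table)} TO {who}"
        )
    else:
        # The connector will create the table, so it needs CREATE TABLE on the schema.
        statements.append(f"GRANT CREATE TABLE ON SCHEMA {t_schema} TO {who}")
    return statements


if __name__ == "__main__":
    for stmt in build_grants("my_service_principal", "main", "default",
                             "elements", "default", "files"):
        print(stmt + ";")
```

Generating the statements as text keeps the sketch independent of any particular client library; you could paste the result into a SQL editor as a user who is allowed to grant these privileges.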
-
Within Unity Catalog, a volume for AWS, Azure, or GCP. The volume can be in the same schema as the table, or the volume and table can be in separate schemas. In either case, both of these schemas must share the same parent catalog.
The following video shows how to create a catalog, schema, and a volume in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:
Using dashes (-) in the names of volumes might cause isolated issues with the connector. It is recommended to use underscores (_) instead of dashes in the names of volumes.
Storing document metadata
Unstructured offers the following options for storing document metadata in the destination table:
- Store the metadata as a single nested JSON field.
- Flatten the metadata by writing each metadata field as its own typed, queryable column.
Storing the metadata as nested JSON is a good fit when:
- Performing dot.notation queries on the stored JSON is sufficient for your needs.
- Document metadata schemas vary across file sources. When flattening document metadata, Unstructured drops fields that do not match existing columns in the schema.
- You want the connector to automatically generate the destination table. This option is not supported when flattening document metadata.
Flattening the metadata is a good fit when:
- You want to query individual metadata fields directly using standard SQL, without JSON parsing.
- The business intelligence or analytics tools you are using require columnar data.
To store the metadata as nested JSON, uncheck Flatten Metadata, or set flatten_metadata to false (in the Unstructured API). To flatten the metadata, check Flatten Metadata, or set flatten_metadata to true.
How Unstructured flattens metadata
If you choose to have Unstructured flatten document metadata, you must create the Delta Table to use as the destination, and specify it when creating the connector. To prevent possible data loss, Unstructured will not automatically create a new table. Considerations to keep in mind when creating the Delta Table:
- The table must contain a column for each metadata field you want to store. Any metadata field that does not have a corresponding column in the table is dropped without raising an error, although the event is written to the logs. For more information, see Logging and monitoring.
- Do not declare metadata columns as NOT NULL. Missing metadata values are written as NULL.
- Unstructured passes values through as their JSON-native type: strings, numbers, booleans, and so on. For example, no special formatting is applied to timestamp values.
- Metadata fields that are lists are not further flattened. Lists remain single columns.
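The considerations above describe how flattened fields are matched against the table's columns. The following is a minimal sketch of that matching logic as described in this section, not Unstructured's actual implementation. The function name and the example field and column names are hypothetical.

```python
def fit_to_columns(flat_metadata: dict, table_columns: list) -> tuple:
    """Match flattened metadata fields to the destination table's columns.

    Returns (row, dropped):
      row     - one value per table column; None stands in for SQL NULL
                when the metadata has no value for that column.
      dropped - names of metadata fields that have no matching column.
    """
    # Values pass through unchanged, including lists, which stay in one column.
    row = {column: flat_metadata.get(column) for column in table_columns}
    dropped = sorted(name for name in flat_metadata if name not in table_columns)
    return row, dropped


if __name__ == "__main__":
    metadata = {
        "filename": "report.pdf",
        "page_number": 3,
        "languages": ["eng"],
        "extra_field": True,  # no matching column, so it is dropped
    }
    columns = ["filename", "page_number", "languages", "filetype"]
    row, dropped = fit_to_columns(metadata, columns)
    print(row)
    print("dropped:", dropped)
```

Because a missing value becomes NULL in this model, a column declared as NOT NULL would reject the row, which is why the list above advises against that constraint.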
Metadata flattening example
The following example demonstrates how Unstructured flattens metadata into separate columns. Consider the following metadata:
When flattening metadata, Unstructured generates a field name composed of the full path to that field within the metadata structure, from the outermost object to the field itself. For example, protocol, which is included in the record_locator object, which is in turn within data_source, becomes data_source_record_locator_protocol.
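The naming rule can be sketched in a few lines of Python. This is an illustration of the rule as described in this section (nested object keys joined with underscores, lists left as single values), not Unstructured's own code. The function name and the sample metadata values are hypothetical.

```python
def flatten_fields(metadata: dict, prefix: str = "") -> dict:
    """Flatten nested objects into path-based field names joined by underscores.

    Lists are not flattened further; each list stays a single value.
    """
    flat = {}
    for key, value in metadata.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, carrying the path so far.
            flat.update(flatten_fields(value, name))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    # Hypothetical sample metadata for illustration only.
    sample = {
        "filename": "report.pdf",
        "languages": ["eng"],
        "data_source": {"record_locator": {"protocol": "s3"}},
    }
    for column, value in flatten_fields(sample).items():
        print(column, "=", value)
```

Each key in the returned dictionary corresponds to a column name that the destination table would need in order to keep that field.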
Creating the connector
To create the destination connector:
- On the sidebar, click Connectors.
- Click Destinations.
- Click New or Create Connector.
- Give the connector a unique Name.
- In the Provider area, click Delta Tables in Databricks.
- Click Continue.
- Follow the on-screen instructions to fill in the fields as described later on this page.
- Click Save and Test.
- Name (required): A unique name for this connector.
- Server Hostname (required): The target Databricks cluster’s or SQL warehouse’s Server Hostname value.
- HTTP Path (required): The cluster’s or SQL warehouse’s HTTP Path value.
- Token (required for PAT authentication): For Databricks personal access token (PAT) authentication, the target Databricks user’s PAT value.
- UUID and OAuth Secret (required for OAuth authentication): For Databricks OAuth machine-to-machine (M2M) authentication, the Databricks managed service principal’s UUID (or Client ID or Application ID) and OAuth Secret (client secret) values.
- Catalog (required): The name of the catalog in Unity Catalog for the target volume and table in the Databricks workspace.
- Database: The name of the schema (formerly known as a database) in Unity Catalog for the target table. The default is default if not otherwise specified. If the target table and volume are in the same schema (formerly known as a database), then Database and Schema will have the same names.
- Table Name: The name of the target table in Unity Catalog.
  - If a table name is specified, but a table with that name does not exist within the specified schema (formerly known as a database), the connector attempts to create a table with that name within that schema.
  - If no table name is specified, the connector attempts to create a table named u<short-workflow-id> within the specified schema (formerly known as a database).
- Schema: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is default if not otherwise specified. If the target volume and table are in the same schema (formerly known as a database), then Schema and Database will have the same names.
- Volume (required): The name of the target volume in Unity Catalog.
- Volume Path: Any target folder path inside of the volume to use instead of the volume’s root. If not otherwise specified, processing occurs at the volume’s root.
Using dashes (-) in the names of catalogs, schemas (formerly known as databases), tables, and volumes might cause isolated issues with the connector. It is recommended to use underscores (_) instead of dashes in the names of catalogs, schemas, tables, and volumes.
- Flatten Metadata: Check to have Unstructured flatten the metadata and store each field as a separate column. Uncheck to store document metadata as nested JSON in a single column. For more information, see Storing document metadata.
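Before filling in the fields, it can help to sanity-check your values against the rules described on this page. The following is a minimal sketch of such a check. Apart from flatten_metadata, every dictionary key used here (name, server_hostname, http_path, token, client_id, client_secret, catalog, database, table_name, schema, volume) is a hypothetical name chosen for illustration and is not confirmed to match the Unstructured API's parameter names.

```python
# Hypothetical keys standing in for the required fields listed above.
REQUIRED = ("name", "server_hostname", "http_path", "catalog", "volume")
# Hypothetical keys for the fields that hold Unity Catalog object names.
NAMED_OBJECTS = ("catalog", "database", "schema", "table_name", "volume")


def check_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED:
        if not settings.get(key):
            problems.append(f"missing required field: {key}")

    # Either a personal access token, or both OAuth M2M values, is needed.
    has_pat = bool(settings.get("token"))
    has_oauth = bool(settings.get("client_id")) and bool(settings.get("client_secret"))
    if not (has_pat or has_oauth):
        problems.append("provide a token, or both client_id and client_secret")

    # Flattening requires a table you created beforehand, so a name must be given.
    if settings.get("flatten_metadata") and not settings.get("table_name"):
        problems.append("flatten_metadata requires the name of an existing table")

    # Dashes in object names might cause isolated issues; underscores are recommended.
    for key in NAMED_OBJECTS:
        if "-" in (settings.get(key) or ""):
            problems.append(f"{key} contains a dash; consider underscores instead")
    return problems


if __name__ == "__main__":
    example = {
        "name": "my_destination",
        "server_hostname": "<server-hostname>",
        "http_path": "<http-path>",
        "token": "<personal-access-token>",
        "catalog": "main",
        "volume": "my-volume",
        "flatten_metadata": True,
    }
    for problem in check_settings(example):
        print(problem)
```

The check only inspects the values you pass in. It does not contact Databricks, so it cannot confirm that the table, volume, or credentials actually exist.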

