If you’re new to Unstructured, read this note first. Before you can create a source connector, you must first sign in to your Unstructured account:
- If you do not already have an Unstructured account, sign up for free. After you sign up, you are automatically signed in to your new Unstructured Starter account at https://platform.unstructured.io. To sign up for a Team or Enterprise account instead, contact Unstructured Sales.
- If you already have an Unstructured Starter or Team account and are not already signed in, sign in to your account at https://platform.unstructured.io. For an Enterprise account, see your Unstructured account administrator for instructions, or email Unstructured Support at support@unstructured.io.
- After you sign in to your Unstructured Starter account, click API Keys on the sidebar. For a Team or Enterprise account, before you click API Keys, make sure you have selected the organizational workspace you want to create an API key for. Each API key works with one and only one organizational workspace.
- Click Generate API Key.
- Follow the on-screen instructions to finish generating the key.
- Click the Copy icon next to your new key to add the key to your system’s clipboard. If you lose this key, simply return and click the Copy icon again.
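After copying the key, a common pattern is to expose it to scripts through an environment variable rather than hard-coding it. A minimal sketch in Python, assuming an environment variable named UNSTRUCTURED_API_KEY (the variable name is a convention, not a requirement):

```python
import os

# Read the Unstructured API key from an environment variable so it never
# appears in source code or version control.
api_key = os.getenv("UNSTRUCTURED_API_KEY")
if api_key is None:
    raise RuntimeError("Set the UNSTRUCTURED_API_KEY environment variable first.")
```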
- A Databricks account on AWS, Azure, or GCP.
- A workspace within the Databricks account for AWS, Azure, or GCP.
- The workspace’s URL. Get the workspace URL for AWS, Azure, or GCP.
  Examples:

  - AWS: `https://<workspace-id>.cloud.databricks.com`
  - Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
  - GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`

  Do not add a trailing slash (`/`) to the workspace URL.
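Because a trailing slash is an easy mistake to make when the URL is pasted or assembled programmatically, it can be worth normalizing it up front. A small illustrative helper in Python (the function name is hypothetical):

```python
def normalize_workspace_url(url: str) -> str:
    """Return the Databricks workspace URL without a trailing slash."""
    url = url.strip()
    if not url.startswith("https://"):
        raise ValueError(f"Expected an https:// workspace URL, got: {url!r}")
    return url.rstrip("/")

# normalize_workspace_url("https://adb-1234567890123456.7.azuredatabricks.net/")
# -> "https://adb-1234567890123456.7.azuredatabricks.net"
```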
- The Databricks authentication details. For more information, see the documentation for AWS, Azure, or GCP.
For the Unstructured UI or the Unstructured API, the following Databricks authentication types are supported:
- Databricks OAuth machine-to-machine (M2M) authentication for AWS, Azure, or GCP. You will need the Client ID (or UUID or Application ID) and OAuth Secret (client secret) values for the corresponding service principal. Note that for Azure, only Databricks managed service principals are supported; Microsoft Entra ID managed service principals are not supported.
- Databricks personal access token authentication for AWS, Azure, or GCP. You will need the personal access token’s value.
For the Unstructured Ingest CLI or the Unstructured Ingest Python library, the following Databricks authentication types are supported:

- For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token’s value.
- For username and password (basic) authentication (AWS only): The user’s name and password values.
- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
- For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
- For Azure managed identities (formerly Managed Service Identities (MSI) authentication) (Azure only): The client ID value for the corresponding managed identity.
- For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
- For Azure CLI authentication (Azure only): No additional values.
- For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
- For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account’s credentials file.
- For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account’s email address.
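To see how two of these types differ in practice, here is a hedged sketch using the databricks-sdk Python package, which accepts the credentials described above as keyword arguments (the host and credential values are placeholders):

```python
from databricks.sdk import WorkspaceClient

# OAuth machine-to-machine (M2M): authenticate as a service principal.
m2m_client = WorkspaceClient(
    host="https://<workspace-id>.cloud.databricks.com",
    client_id="<service-principal-client-id>",
    client_secret="<service-principal-oauth-secret>",
)

# Personal access token (PAT): authenticate as a workspace user.
pat_client = WorkspaceClient(
    host="https://<workspace-id>.cloud.databricks.com",
    token="<personal-access-token>",
)
```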
- The name of the parent catalog in Unity Catalog for AWS, Azure, or GCP for the volume.
- The name of the parent schema (formerly known as a database) in Unity Catalog for AWS, Azure, or GCP for the volume.
- The name of the volume in Unity Catalog for AWS, Azure, or GCP, and optionally any path in that volume that you want to access directly, beginning with the volume’s root.
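Together, the catalog, schema, and volume names identify the volume’s path in Databricks, which follows the /Volumes/<catalog>/<schema>/<volume> convention. A small illustrative helper in Python (the function name is hypothetical):

```python
def volume_path(catalog: str, schema: str, volume: str, subpath: str = "") -> str:
    """Build a Unity Catalog volume path such as /Volumes/main/default/my_volume/raw."""
    path = f"/Volumes/{catalog}/{schema}/{volume}"
    if subpath:
        path = f"{path}/{subpath.strip('/')}"
    return path
```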
- The Databricks workspace user or service principal must have the following minimum set of privileges to read from or write to the existing volume in Unity Catalog:

  - `USE CATALOG` on the volume’s parent catalog in Unity Catalog.
  - `USE SCHEMA` on the volume’s parent schema (formerly known as a database) in Unity Catalog.
  - `READ VOLUME` and `WRITE VOLUME` on the volume.
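These privileges are typically granted with SQL GRANT statements, run in a notebook, the SQL editor, or through an API. A hedged sketch that issues them with the databricks-sdk statement-execution API (the catalog, schema, volume, principal, and warehouse ID are placeholders):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # Resolves authentication from the environment or config profile.

principal = "<service-principal-client-id-or-user-name>"  # placeholder
grants = [
    "GRANT USE CATALOG ON CATALOG main TO `{p}`",
    "GRANT USE SCHEMA ON SCHEMA main.default TO `{p}`",
    "GRANT READ VOLUME, WRITE VOLUME ON VOLUME main.default.my_volume TO `{p}`",
]

for statement in grants:
    w.statement_execution.execute_statement(
        statement=statement.format(p=principal),
        warehouse_id="<sql-warehouse-id>",  # placeholder: any running SQL warehouse
    )
```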
- `<name>` (required) - A unique name for this connector.
- `<host>` (required) - The Databricks workspace host URL. Do not add a trailing slash (`/`) to the workspace host URL.
- `<client-id>` (required) - For Databricks OAuth machine-to-machine (M2M) authentication, the Client ID (or UUID or Application ID) value for the Databricks managed service principal that has the appropriate privileges to the volume.
- `<client-secret>` (required) - For Databricks OAuth M2M authentication, the associated OAuth Secret value for the Databricks managed service principal that has the appropriate privileges to the volume.
- `<token>` (required) - For Databricks personal access token authentication, the personal access token’s value.
- `<catalog>` (required) - The name of the catalog to use.
- `<schema>` - The name of the associated schema. If not specified, `default` is used.
- `<volume>` (required) - The name of the associated volume.
- `<volume_path>` - Any optional path to access within the volume.
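Put together, a filled-in set of these settings might look like the following Python dict, using OAuth M2M credentials. This is illustrative only: the values are placeholders, and the exact request schema should be taken from the Unstructured API reference rather than from this sketch.

```python
databricks_volumes_source = {
    "name": "databricks-volumes-source",                    # <name>
    "host": "https://<workspace-id>.cloud.databricks.com",  # <host>, no trailing slash
    "client_id": "<service-principal-client-id>",           # <client-id> (OAuth M2M)
    "client_secret": "<service-principal-oauth-secret>",    # <client-secret> (OAuth M2M)
    "catalog": "main",                                      # <catalog>
    "schema": "default",                                    # <schema>; defaults to "default"
    "volume": "my_volume",                                  # <volume>
    "volume_path": "raw/pdfs",                              # <volume_path>, optional
}
```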