u<short-workflow-id>.
For the Unstructured Ingest CLI and Ingest Python library, the table is named unstructuredautocreated.
CREATE TABLE privilege on that parent schema (formerly known as a database) before the connector attempts to create the table. Otherwise, table creation could fail.
Dashes (-) in the names of catalogs, schemas (formerly known as databases), and tables might cause isolated issues with the connector. It is recommended to use underscores (_) instead of dashes in the names of catalogs, schemas, and tables.
Dashes (-) in the names of volumes might cause isolated issues with the connector. It is recommended to use underscores (_) instead of dashes in the names of volumes.
Can Restart permission on that cluster. Learn how to check and set cluster permissions for AWS, Azure, or GCP.
Can use permission on that SQL warehouse. Learn how to check and set SQL warehouse permissions for AWS, Azure, or GCP.
USE CATALOG on the volume’s parent catalog in Unity Catalog.
USE SCHEMA on the volume’s parent schema (formerly known as a database) in Unity Catalog.
READ VOLUME and WRITE VOLUME on the volume.
USE CATALOG on the table’s parent catalog in Unity Catalog.
USE SCHEMA on the table’s parent schema (formerly known as a database) in Unity Catalog.
CREATE TABLE on the table’s parent schema (formerly known as a database) in Unity Catalog.
MODIFY and SELECT on the table.
DATABRICKS_HOST - The Databricks cluster’s or SQL warehouse’s Server Hostname value, represented by --server-hostname (CLI) or server_hostname (Python).
DATABRICKS_HTTP_PATH - The cluster’s or SQL warehouse’s HTTP Path value, represented by --http-path (CLI) or http_path (Python).
DATABRICKS_TOKEN - For Databricks personal access token authentication, the token’s value, represented by --token (CLI) or token (Python).
DATABRICKS_CLIENT_ID - For Databricks managed service principal authentication, the service principal’s UUID (or Client ID or Application ID) value, represented by --client-id (CLI) or client_id (Python).
DATABRICKS_CLIENT_SECRET - For Databricks managed service principal authentication, the service principal’s OAuth Secret value, represented by --client-secret (CLI) or client_secret (Python).
DATABRICKS_CATALOG - The name of the catalog in Unity Catalog, represented by --catalog (CLI) or catalog (Python).
DATABRICKS_DATABASE - The name of the schema (formerly known as a database) inside of the catalog for the target table, represented by --database (CLI) or database (Python). The default is default if not otherwise specified.
If you are also using a volume, and the target table and volume are in the same schema (formerly known as a database), then DATABRICKS_DATABASE and DATABRICKS_SCHEMA will have the same values.
DATABRICKS_TABLE - The name of the table inside of the schema (formerly known as a database), represented by --table-name (CLI) or table_name (Python).
u<short-workflow-id> within the specified schema (formerly known as a database).
Dashes (-) in the names of catalogs, schemas (formerly known as databases), tables, and volumes might cause isolated issues with the connector. It is recommended to use underscores (_) instead of dashes in the names of catalogs, schemas, tables, and volumes.
DATABRICKS_RECORD_ID_KEY - The name of the column that uniquely identifies each record in the table, represented by --record-id-key (CLI) or record_id_key (Python).
--batch-size (CLI) or batch_size (Python). The default is 50 if not otherwise specified.
DATABRICKS_SCHEMA - The name of the schema (formerly known as a database) inside of the catalog for the target volume, represented by --schema (CLI) or schema (Python). The default is default if not otherwise specified.
If the target volume and table are in the same schema (formerly known as a database), then DATABRICKS_DATABASE and DATABRICKS_SCHEMA will have the same values.
DATABRICKS_VOLUME - The name of the volume inside of the schema (formerly known as a database), represented by --volume (CLI) or volume (Python).
DATABRICKS_VOLUME_PATH - Optionally, a specific path inside of the volume that you want to start accessing from, starting from the volume’s root, represented by --volume-path (CLI) or volume_path (Python). The default is to start accessing from the volume’s root if not otherwise specified.
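The environment variables above can be gathered and checked before a run. The following is a minimal sketch, not part of the Unstructured Ingest library: the `DatabricksSettings` class and the `load_settings` and `names_with_dashes` helpers are hypothetical names, and treating the host, HTTP path, catalog, and volume as required is an assumption made for illustration. It applies the documented defaults, requires either a token or a client ID and secret pair, and reports names that contain dashes.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class DatabricksSettings:
    """Hypothetical container for the environment variables listed above."""
    host: str
    http_path: str
    catalog: str
    volume: str
    database: str = "default"          # documented default for DATABRICKS_DATABASE
    schema: str = "default"            # documented default for DATABRICKS_SCHEMA
    table: Optional[str] = None
    volume_path: Optional[str] = None  # None means: start from the volume's root
    record_id_key: Optional[str] = None
    token: Optional[str] = None
    client_id: Optional[str] = None
    client_secret: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> DatabricksSettings:
    """Read the DATABRICKS_* variables and check that one auth method is complete."""
    # Assumption: this sketch treats these four variables as required.
    required = ["DATABRICKS_HOST", "DATABRICKS_HTTP_PATH",
                "DATABRICKS_CATALOG", "DATABRICKS_VOLUME"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))

    has_token = bool(env.get("DATABRICKS_TOKEN"))
    has_principal = bool(env.get("DATABRICKS_CLIENT_ID")) and bool(
        env.get("DATABRICKS_CLIENT_SECRET"))
    if not (has_token or has_principal):
        raise ValueError(
            "Set DATABRICKS_TOKEN, or both DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET")

    return DatabricksSettings(
        host=env["DATABRICKS_HOST"],
        http_path=env["DATABRICKS_HTTP_PATH"],
        catalog=env["DATABRICKS_CATALOG"],
        volume=env["DATABRICKS_VOLUME"],
        database=env.get("DATABRICKS_DATABASE") or "default",
        schema=env.get("DATABRICKS_SCHEMA") or "default",
        table=env.get("DATABRICKS_TABLE"),
        volume_path=env.get("DATABRICKS_VOLUME_PATH"),
        record_id_key=env.get("DATABRICKS_RECORD_ID_KEY"),
        token=env.get("DATABRICKS_TOKEN"),
        client_id=env.get("DATABRICKS_CLIENT_ID"),
        client_secret=env.get("DATABRICKS_CLIENT_SECRET"),
    )


def names_with_dashes(settings: DatabricksSettings) -> List[str]:
    """Return catalog, schema, table, and volume names that contain a dash."""
    candidates = [settings.catalog, settings.database, settings.schema,
                  settings.table, settings.volume]
    return [name for name in candidates if name and "-" in name]
```

Calling `load_settings()` with no arguments reads from the process environment. Passing a plain dictionary instead makes the check easy to exercise without touching real credentials.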
--partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:
--partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).
Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
--api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
--partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
--partition-by-api (CLI) or partition_by_api=True (Python).
Processing by Unstructured also requires an Unstructured API key and API URL. Add the following:
--api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
--partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.
https://api.unstructuredapp.io/general/v0/general, which is the API URL for the Unstructured Partition Endpoint. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
If you do not have an API key, get one now.
If the Unstructured API is self-hosted, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For details, contact Unstructured Sales at sales@unstructured.io.
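The choice between local processing and processing by Unstructured can be expressed as one small helper. This is a sketch only: `partitioner_kwargs` is a hypothetical function, not part of the Ingest Python library. It assumes the resulting dictionary is passed as keyword arguments to whatever partitioner configuration your pipeline uses. The keys `partition_by_api`, `api_key`, and `partition_endpoint` are the Python parameter names given above.

```python
import os
from typing import Any, Dict, Mapping


def partitioner_kwargs(partition_by_api: bool,
                       env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build the partitioning parameters for local or API-based processing."""
    if not partition_by_api:
        # Local processing: no API key or API URL is needed, so both are omitted.
        return {"partition_by_api": False}

    api_key = env.get("UNSTRUCTURED_API_KEY")
    api_url = env.get("UNSTRUCTURED_API_URL")
    if not api_key or not api_url:
        raise ValueError(
            "UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL must both be set "
            "when partition_by_api is True")

    return {
        "partition_by_api": True,
        "api_key": api_key,
        "partition_endpoint": api_url,
    }
```

Failing early when either variable is missing avoids starting a run that would otherwise stop at the first request to the Unstructured API.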