
Connect Astra DB to your preprocessing pipeline, and use the Unstructured Ingest CLI or the Unstructured Ingest Python library to batch process all your documents and store the structured outputs in Astra DB.

You will need:

The Astra DB connector prerequisites:

The Astra DB connector dependencies:

For both the CLI and Python:
pip install "unstructured-ingest[astradb]"

You might also need to install additional dependencies, depending on your needs.

These environment variables:

  • ASTRA_DB_API_ENDPOINT - The API endpoint for the Astra DB database, represented by --api-endpoint (CLI) or api_endpoint (Python). To get the endpoint, see the Database Details > API Endpoint value on your database’s Overview tab.
  • ASTRA_DB_APPLICATION_TOKEN - The database application token value for the database, represented by --token (CLI) or token (Python). To get the token, see the Database Details > Application Tokens box on your database’s Overview tab.
  • ASTRA_DB_NAMESPACE - The name of the namespace for the database, represented by --namespace (CLI) or namespace (Python).
  • ASTRA_DB_COLLECTION - The name of the collection for the namespace, represented by --collection-name (CLI) or collection_name (Python).
  • ASTRA_DB_EMBEDDING_DIMENSIONS - The number of dimensions of the embedding vectors stored in the collection, represented by --embedding-dimension (CLI) or embedding_dimension (Python).
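
As a minimal sketch (not part of the Unstructured Ingest library), the variables above could be collected into a dictionary keyed by the Python parameter names given in each bullet. The helper name `astra_settings_from_env` and the `ENV_TO_PARAM` mapping are hypothetical, and how these values are then passed to the connector depends on the version of the library you have installed.

```python
import os

# Maps each environment variable to the Python parameter name listed above.
ENV_TO_PARAM = {
    "ASTRA_DB_API_ENDPOINT": "api_endpoint",
    "ASTRA_DB_APPLICATION_TOKEN": "token",
    "ASTRA_DB_NAMESPACE": "namespace",
    "ASTRA_DB_COLLECTION": "collection_name",
    "ASTRA_DB_EMBEDDING_DIMENSIONS": "embedding_dimension",
}


def astra_settings_from_env(env=None):
    """Hypothetical helper: read the Astra DB settings from the environment.

    Raises KeyError if a variable is missing and ValueError if the
    embedding dimension is not an integer.
    """
    env = os.environ if env is None else env
    settings = {param: env[name] for name, param in ENV_TO_PARAM.items()}
    # Environment values are strings; the dimension count is a number.
    settings["embedding_dimension"] = int(settings["embedding_dimension"])
    return settings
```

Calling `astra_settings_from_env()` with no argument reads from `os.environ`. Passing a plain dictionary instead makes the helper easy to try out without touching your real environment.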

These environment variables for the Unstructured API:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.
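
Before running a pipeline, it can help to confirm that every variable listed on this page is set. The following is a sketch using only the Python standard library; `REQUIRED_VARS` and `missing_env_vars` are illustrative names, not part of any Unstructured package.

```python
import os

# Every environment variable named on this page.
REQUIRED_VARS = [
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_NAMESPACE",
    "ASTRA_DB_COLLECTION",
    "ASTRA_DB_EMBEDDING_DIMENSIONS",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
]


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

For example, a script could call `missing_env_vars()` at startup and exit with a message naming the missing variables if the returned list is not empty.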

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector. This example uses the local source connector: