Batch process all your records to store structured outputs in a Chroma account.

The requirements are as follows.

  • A Chroma server. See Deployment.

  • The Chroma server’s hostname or IP address, and the server’s port number.

  • If you are not connecting to the server through HTTP, the path to the server instance.

  • The name of the tenant that you want to access on the server.

  • The name of the database that you want to access in the tenant.

  • The name of the collection that you want to access in the database.
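
Before running an ingest job, it can help to confirm that the hostname and port from the list above actually accept connections. The following is a minimal sketch using only the Python standard library. The function name `chroma_port_is_reachable` is hypothetical, and the check only opens a TCP connection; it does not verify that the listener is a Chroma server or that the tenant, database, or collection exist.

```python
import socket


def chroma_port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the hostname and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `chroma_port_is_reachable("localhost", 8000)` checks a server on the local machine; both values are placeholders for your own host and port.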

The Chroma connector dependencies, for both the CLI and Python:

pip install "unstructured-ingest[chroma]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • CHROMA_HOST - The Chroma server’s hostname or IP address, represented by --host (CLI) or host (Python).
  • CHROMA_PORT - The Chroma server’s port number, represented by --port (CLI) or port (Python).
  • CHROMA_TENANT - The name of the tenant that you want to access on the Chroma server, represented by --tenant (CLI) or tenant (Python).
  • CHROMA_DATABASE - The name of the database that you want to access in the tenant, represented by --database (CLI) or database (Python).
  • CHROMA_COLLECTION - The name of the collection that you want to access in the database, represented by --collection-name (CLI) or collection_name (Python).

Additional settings include:

  • --path (CLI) or path (Python): The location where Chroma is persisted if you are not connecting through HTTP.
  • --settings (CLI) or settings (Python): A dictionary of settings to communicate with the Chroma server, for example: '{"persist_directory":"./chroma-persist"}'.
  • --headers (CLI) or headers (Python): A dictionary of headers to send to the Chroma server, for example: '{"Authorization":"Basic()"}'.
  • --ssl (CLI) or ssl (Python): True to use SSL for the connection.
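
The --settings and --headers values shown above are JSON strings. As a minimal sketch, assuming you prefer to generate those strings from Python dictionaries instead of typing the quoting by hand, the standard json module can produce them. The helper name `to_cli_json` is hypothetical and is not part of any Unstructured or Chroma API; the dictionary contents are copied from the examples above, including the placeholder Authorization value.

```python
import json


def to_cli_json(value: dict) -> str:
    """Serialize a dictionary as compact JSON suitable for a CLI argument."""
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    return json.dumps(value, separators=(",", ":"))


# Example values taken from the settings list above; replace with your own.
settings = {"persist_directory": "./chroma-persist"}
headers = {"Authorization": "Basic()"}  # placeholder value from the example

settings_arg = to_cli_json(settings)
headers_arg = to_cli_json(headers)
print(settings_arg)
print(headers_arg)
```

Generating the string this way guarantees valid JSON, which is easy to get wrong when nesting quotes inside a shell command.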

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.
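
Because the example that follows reads all of its connection details from the environment, a missing variable can produce a confusing failure partway through a run. The sketch below, which uses only the standard library, reports which of the variables named above are unset or empty. The names `REQUIRED_VARS` and `missing_env_vars` are hypothetical and chosen for illustration.

```python
import os
from typing import List, Mapping

# Variable names as listed in this section.
REQUIRED_VARS = (
    "CHROMA_HOST",
    "CHROMA_PORT",
    "CHROMA_TENANT",
    "CHROMA_DATABASE",
    "CHROMA_COLLECTION",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
)


def missing_env_vars(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```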

Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported. This example uses the local source connector:

#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path "$LOCAL_FILE_INPUT_DIR" \
    --output-dir "$LOCAL_FILE_OUTPUT_DIR" \
    --chunk-elements \
    --embedding-provider huggingface \
    --num-processes 2 \
    --verbose \
    --work-dir "$WORK_DIR" \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
    --strategy hi_res \
    --additional-partition-args='{"split_pdf_page":"true", "split_pdf_allow_failed":"true", "split_pdf_concurrency_level": 15}' \
  chroma \
    --host "$CHROMA_HOST" \
    --port "$CHROMA_PORT" \
    --collection-name "$CHROMA_COLLECTION" \
    --tenant "$CHROMA_TENANT" \
    --database "$CHROMA_DATABASE" \
    --settings '{"persist_directory":"./chroma-persist"}' \
    --headers '{"Authorization":"Basic()"}' \
    --batch-size 80 \
    --ssl