Batch process all your records to store structured outputs in Milvus.

The requirements are as follows.

  • For the Unstructured Platform, only Milvus cloud-based instances (such as Zilliz Cloud and Milvus on IBM watsonx.data) are supported.
  • For Unstructured Ingest, Milvus local and cloud-based instances are supported.

The following video shows how to fulfill the minimum set of requirements for Milvus cloud-based instances, using Milvus on IBM watsonx.data as the example.

  • For Zilliz Cloud, you will need:

    • A Zilliz Cloud account.

    • A Zilliz Cloud cluster.

    • The URI of the cluster, also known as the cluster’s public endpoint, which takes a format such as https://<cluster-id>.<cluster-type>.<cloud-provider>-<region>.cloud.zilliz.com. Get the cluster’s public endpoint.

    • The token to access the cluster. Get the cluster’s token.

    • The name of the database in the instance.

    • The name of the collection in the database.

      The collection must have a defined schema before Unstructured can write to it. The minimum viable schema for Unstructured contains only the fields element_id, embeddings, and record_id, as follows:

      | Field Name | Field Type | Max Length | Dimension | Index | Metric Type |
      | --- | --- | --- | --- | --- | --- |
      | element_id (primary key field) | VARCHAR | 200 | | | |
      | embeddings (vector field) | FLOAT_VECTOR | | 3072 | Yes (Checked) | Cosine |
      | record_id | VARCHAR | 200 | | | |
  • For Milvus on IBM watsonx.data, you will need:

    • A Milvus service running in an IBM watsonx.data instance.

    • The Milvus service's GRPC host and GRPC port, along with the username and password to access the service.

    • The name of the database in the instance.

    • The name of the collection in the database. As with Zilliz Cloud, the collection must have a defined schema before Unstructured can write to it (see the example code below).

  • For Milvus local, you will need:

    • A locally running Milvus instance.

    • The instance's URI.

    • The name of the database in the instance.

    • The name of the collection in the database, with a defined schema as described below.

All Milvus instances require the target collection to have a defined schema before Unstructured can write to it. The minimum viable schema for Unstructured contains only the fields element_id, embeddings, and record_id. The following example code uses the Python SDK for Milvus (pymilvus) to create a collection with this minimum viable schema, targeting Milvus on IBM watsonx.data. For the connections.connect arguments to use with other types of Milvus deployments, see your Milvus provider's documentation:

Python
import os
from pymilvus import (
    connections,
    FieldSchema,
    DataType,
    CollectionSchema,
    Collection,
)

# Connect to the target Milvus instance. These arguments target Milvus on
# IBM watsonx.data; adjust them for your deployment.
connections.connect(
    alias="default",
    host=os.getenv("MILVUS_GRPC_HOST"),
    port=os.getenv("MILVUS_GRPC_PORT"),
    user=os.getenv("MILVUS_USER"),
    password=os.getenv("MILVUS_PASSWORD"),
    secure=True
)

# Define the minimum viable schema fields: element_id, embeddings, and record_id.
primary_key = FieldSchema(
    name="element_id",
    dtype=DataType.VARCHAR,
    is_primary=True,
    max_length=200
)

vector = FieldSchema(
    name="embeddings",
    dtype=DataType.FLOAT_VECTOR,
    dim=3072
)

record_id = FieldSchema(
    name="record_id",
    dtype=DataType.VARCHAR,
    max_length=200
)

# enable_dynamic_field=True lets Milvus store any additional fields that
# Unstructured writes beyond the ones defined above.
schema = CollectionSchema(
    fields=[primary_key, vector, record_id],
    enable_dynamic_field=True
)

# Create the collection with this schema.
collection = Collection(
    name="my_collection",
    schema=schema,
    using="default"
)

# Create a vector index on the embeddings field, using the Cosine metric type
# shown in the minimum viable schema above.
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}

collection.create_index(
    field_name="embeddings",
    index_params=index_params
)
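
After the collection is created, you can optionally confirm that it exists and inspect its schema. This is a minimal check, assuming the "default" connection opened in the preceding example is still active and that the collection is named my_collection:

Python
from pymilvus import Collection, utility

# Confirm the collection exists on the "default" connection.
print(utility.has_collection("my_collection"))

# Inspect the collection's schema to verify the element_id, embeddings,
# and record_id fields.
print(Collection("my_collection").schema)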

Other approaches, such as creating collections instantly or setting nullable and default fields, have not been fully evaluated by Unstructured and might produce unexpected results.

Unstructured cannot provide a schema that is guaranteed to work in all circumstances. This is because these schemas will vary based on your source files’ types; how you want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

Install the Milvus connector dependencies:

CLI, Python
pip install "unstructured-ingest[milvus]"

You might also need to install additional dependencies, depending on your needs. Learn more.

Set the following environment variables, which the examples in this section use:

  • MILVUS_URI - The Milvus instance’s URI, represented by --uri (CLI) or uri (Python).
  • MILVUS_USER and MILVUS_PASSWORD, or MILVUS_TOKEN - The username and password, or token, to access the instance. This is represented by --user and --password, or --token (CLI); or user and password, or token (Python).
  • MILVUS_DB - The database’s name, represented by --db-name (CLI) or db_name (Python).
  • MILVUS_COLLECTION - The collection’s name, represented by --collection-name (CLI) or collection_name (Python).
  • MILVUS_FIELDS_TO_INCLUDE - A list of fields to include, specified as a comma-separated list (CLI) or an array of strings (Python), represented by --fields-to-include (CLI) or fields_to_include (Python).
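
Before running the examples, you can optionally confirm that these variables are set. This is an illustrative check only; adjust the list if you use MILVUS_TOKEN instead of MILVUS_USER and MILVUS_PASSWORD:

Python
import os

# Required environment variables for the examples in this section.
# Swap MILVUS_USER and MILVUS_PASSWORD for MILVUS_TOKEN if you use token-based access.
required = ["MILVUS_URI", "MILVUS_USER", "MILVUS_PASSWORD", "MILVUS_DB", "MILVUS_COLLECTION"]

missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")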

Additional settings include:

  • To emit the metadata field’s child fields directly into the output, include --flatten-metadata (CLI) or flatten_metadata=True (Python). This is the default if not specified.
  • To keep the metadata field with its child fields intact in the output, include --no-flatten-metadata (CLI) or flatten_metadata=False (Python).
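
To illustrate the difference, here is a sketch of how a single output record's shape changes (the field names shown are examples; the actual fields depend on your source files and settings):

Python
# With flatten_metadata=True (the default), metadata child fields such as
# filename and page_number are emitted as top-level fields in each record.
flattened_record = {
    "element_id": "abc123",
    "text": "Hello, world!",
    "filename": "report.pdf",
    "page_number": 1,
}

# With flatten_metadata=False, the metadata field is kept intact, with its
# child fields nested inside it.
nested_record = {
    "element_id": "abc123",
    "text": "Hello, world!",
    "metadata": {"filename": "report.pdf", "page_number": 1},
}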

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector:

This example sends files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  milvus \
    --uri $MILVUS_URI \
    --user $MILVUS_USER \
    --password $MILVUS_PASSWORD \
    --db-name $MILVUS_DB \
    --collection-name $MILVUS_COLLECTION \
    --fields-to-include type,element_id,text,embeddings
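
For reference, here is a sketch of an equivalent call with the Unstructured Ingest Python library, assuming its v2 Pipeline interface. The import paths and configuration class names (such as MilvusConnectionConfig and MilvusUploaderConfig) can vary between library versions, so check your installed version's documentation before relying on them:

Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig
from unstructured_ingest.v2.processes.connectors.milvus import (
    MilvusConnectionConfig,
    MilvusAccessConfig,
    MilvusUploadStagerConfig,
    MilvusUploaderConfig,
)

# Chunking and embedding are optional.
if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,  # Omit or set to False for local processing.
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15,
            },
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
        destination_connection_config=MilvusConnectionConfig(
            access_config=MilvusAccessConfig(
                password=os.getenv("MILVUS_PASSWORD"),
                # Or use token-based access instead of user and password.
            ),
            uri=os.getenv("MILVUS_URI"),
            user=os.getenv("MILVUS_USER"),
            db_name=os.getenv("MILVUS_DB"),
        ),
        stager_config=MilvusUploadStagerConfig(),
        # The fields_to_include setting (the Python counterpart of
        # --fields-to-include) is also available; its exact placement depends
        # on your library version.
        uploader_config=MilvusUploaderConfig(collection_name=os.getenv("MILVUS_COLLECTION")),
    ).run()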

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

  • To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

    Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
  • To send files to the Unstructured Platform Partition Endpoint for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

    Sending files to Unstructured for processing also requires an Unstructured API key and API URL, which you provide by adding the following:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.

    Get an API key and API URL.
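
As a sketch, and reusing the hypothetical PartitionerConfig class from the Python pipeline example earlier on this page, the two processing modes differ only in the partitioner configuration:

Python
import os

from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

# Local processing: no Unstructured API key or API URL is needed.
local_partitioner = PartitionerConfig(strategy="hi_res")

# Remote processing via the Unstructured Platform Partition Endpoint.
remote_partitioner = PartitionerConfig(
    partition_by_api=True,
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
    strategy="hi_res",
)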