Milvus
Batch process all your records to store structured outputs in Milvus.
The requirements are as follows.
- For the Unstructured Platform, only Milvus cloud-based instances (such as Zilliz Cloud and Milvus on IBM watsonx.data) are supported.
- For Unstructured Ingest, Milvus local and cloud-based instances are supported.
- For Zilliz Cloud, you will need:
  - The URI of the cluster, also known as the cluster’s public endpoint, which takes a format such as `https://<cluster-id>.<cluster-type>.<cloud-provider>-<region>.cloud.zilliz.com`. Get the cluster’s public endpoint.
  - The token to access the cluster. Get the cluster’s token.
  - The name of the database in the instance.
  - The name of the collection in the database.

    The collection must have a defined schema before Unstructured can write to the collection. The minimum viable schema for Unstructured contains only the fields `element_id`, `embeddings`, and `record_id`, as follows:

    | Field Name | Field Type | Max Length | Dimension | Index | Metric Type |
    |---|---|---|---|---|---|
    | `element_id` (primary key field) | VARCHAR | 200 | — | — | — |
    | `embeddings` (vector field) | FLOAT_VECTOR | — | 3072 | Yes (Checked) | Cosine |
    | `record_id` | VARCHAR | 200 | — | — | — |
- For Milvus on IBM watsonx.data, you will need:
  - An IBM Cloud account.
  - The IBM watsonx.data subscription plan.
  - A Milvus service instance in IBM watsonx.data.
  - The URI of the instance, which consists of `https://`, followed by the instance’s GRPC host, a colon, and the GRPC port, in the format `https://<host>:<port>`. Get the instance’s GRPC host and GRPC port.
  - The name of the database in the instance.
  - The name of the collection in the database. Note the collection requirements at the end of this section.
  - The username and password to access the instance. The username for Milvus on IBM watsonx.data is always `ibmlhapikey`. The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. Get the user API key.
- For Milvus local, you will need:
  - A Milvus instance.
  - The URI of the instance.
  - The name of the database in the instance.
  - The name of the collection in the database. Note the collection requirements at the end of this section.
  - The username and password, or the token, to access the instance.
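As an illustration of the `https://<host>:<port>` URI format described for Milvus on IBM watsonx.data, the following minimal sketch assembles the URI from a GRPC host and port. The function name `watsonx_milvus_uri` and the host `milvus.example.com` are hypothetical and used only for illustration; the validation rules are assumptions, not requirements stated by IBM or Unstructured.

```python
def watsonx_milvus_uri(grpc_host: str, grpc_port: int) -> str:
    """Build a URI of the form https://<host>:<port> (hypothetical helper)."""
    host = grpc_host.strip()
    # Assumption: the GRPC host is given as a bare host name, without a scheme.
    if not host or "://" in host:
        raise ValueError("grpc_host must be a bare host name without a scheme")
    port = int(grpc_port)
    # Reject values outside the valid TCP port range.
    if not 1 <= port <= 65535:
        raise ValueError("grpc_port must be between 1 and 65535")
    return f"https://{host}:{port}"


if __name__ == "__main__":
    # Hypothetical host and port, for illustration only.
    print(watsonx_milvus_uri("milvus.example.com", 30439))
```

The resulting string is the kind of value you would store in the `MILVUS_URI` environment variable.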
All Milvus instances require the target collection to have a defined schema before Unstructured can write to the collection. The minimum viable schema for Unstructured contains only the fields `element_id`, `embeddings`, and `record_id`, as follows. This example code demonstrates the use of the Python SDK for Milvus to create a collection with this minimum viable schema, targeting Milvus on IBM watsonx.data. For the `connections.connect` arguments to connect to other types of Milvus deployments, see your Milvus provider’s documentation:
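A minimal sketch of creating a collection with the minimum viable schema, using the `pymilvus` package, might look like the following. It is a sketch under stated assumptions, not a definitive implementation: the function name `create_minimum_collection`, the reuse of the `MILVUS_*` environment variables for the connection, the `secure=True` setting, and the `IVF_FLAT` index with `nlist=1024` are all assumptions chosen for illustration. Only the field names, types, lengths, dimension, and cosine metric come from the requirements above.

```python
import os

# Minimum viable schema, expressed as plain data.
MINIMUM_FIELDS = [
    {"name": "element_id", "dtype": "VARCHAR", "is_primary": True, "max_length": 200},
    {"name": "embeddings", "dtype": "FLOAT_VECTOR", "dim": 3072},
    {"name": "record_id", "dtype": "VARCHAR", "max_length": 200},
]


def create_minimum_collection(collection_name: str):
    """Create the collection and index the vector field (hypothetical helper)."""
    # Imported here so that pymilvus is only required when this function runs.
    from pymilvus import (
        Collection,
        CollectionSchema,
        DataType,
        FieldSchema,
        connections,
    )

    # Assumption: username/password authentication over TLS, as used by
    # Milvus on IBM watsonx.data. Other deployments may need other arguments.
    connections.connect(
        alias="default",
        uri=os.environ["MILVUS_URI"],
        user=os.environ["MILVUS_USER"],
        password=os.environ["MILVUS_PASSWORD"],
        db_name=os.environ.get("MILVUS_DB", "default"),
        secure=True,
    )

    fields = []
    for spec in MINIMUM_FIELDS:
        options = dict(spec)
        # Map the type name to the corresponding pymilvus DataType member.
        dtype = getattr(DataType, options.pop("dtype"))
        fields.append(FieldSchema(dtype=dtype, **options))

    schema = CollectionSchema(fields=fields)
    collection = Collection(name=collection_name, schema=schema, using="default")

    # Assumption: IVF_FLAT with nlist=1024; only the cosine metric is
    # specified by the schema requirements.
    collection.create_index(
        field_name="embeddings",
        index_params={
            "index_type": "IVF_FLAT",
            "metric_type": "COSINE",
            "params": {"nlist": 1024},
        },
    )
    return collection


if __name__ == "__main__":
    create_minimum_collection(os.environ["MILVUS_COLLECTION"])
```

The vector dimension of 3072 must match the output size of the embedding model you use; adjust `dim` if your model produces vectors of a different size.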
Other approaches, such as creating collections instantly or setting nullable and default fields, have not been fully evaluated by Unstructured and might produce unexpected results.
Unstructured cannot provide a schema that is guaranteed to work in all circumstances. This is because these schemas will vary based on your source files’ types; how you want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.
The Milvus connector dependencies: `pip install "unstructured-ingest[milvus]"`
You might also need to install additional dependencies, depending on your needs. Learn more.
The following environment variables:
- `MILVUS_URI` - The Milvus instance’s URI, represented by `--uri` (CLI) or `uri` (Python).
- `MILVUS_USER` and `MILVUS_PASSWORD`, or `MILVUS_TOKEN` - The username and password, or token, to access the instance. This is represented by `--user` and `--password`, or `--token` (CLI); or `user` and `password`, or `token` (Python).
- `MILVUS_DB` - The database’s name, represented by `--db-name` (CLI) or `db_name` (Python).
- `MILVUS_COLLECTION` - The collection’s name, represented by `--collection-name` (CLI) or `collection_name` (Python).
- `MILVUS_FIELDS_TO_INCLUDE` - The fields to include, given as a comma-separated list (CLI) or an array of strings (Python), represented by `--field-to-include` (CLI) or `fields_to_include` (Python).
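Before running a pipeline, it can help to check that these variables are set consistently. The following minimal sketch reads them into a dictionary keyed by the Python parameter names listed above. The function name `milvus_settings` is hypothetical, and treating the URI, database, and collection as mandatory is an assumption based on the requirements earlier in this section, not behavior taken from the connector itself.

```python
import os


def milvus_settings(env=os.environ) -> dict:
    """Collect Milvus settings from an environment-style mapping (hypothetical helper)."""
    settings = {
        "uri": env.get("MILVUS_URI"),
        "db_name": env.get("MILVUS_DB"),
        "collection_name": env.get("MILVUS_COLLECTION"),
    }
    # Assumption: all three values are required.
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing Milvus settings: {', '.join(missing)}")

    # Either a token, or a username and password pair, must be present.
    token = env.get("MILVUS_TOKEN")
    user = env.get("MILVUS_USER")
    password = env.get("MILVUS_PASSWORD")
    if token:
        settings["token"] = token
    elif user and password:
        settings["user"] = user
        settings["password"] = password
    else:
        raise ValueError(
            "Set MILVUS_TOKEN, or both MILVUS_USER and MILVUS_PASSWORD"
        )

    # The CLI-style comma-separated list becomes the Python-style list of strings.
    raw_fields = env.get("MILVUS_FIELDS_TO_INCLUDE", "")
    settings["fields_to_include"] = [
        field.strip() for field in raw_fields.split(",") if field.strip()
    ]
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without changing the real environment.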
Additional settings include:
- To emit the `metadata` field’s child fields directly into the output, include `--flatten-metadata` (CLI) or `flatten_metadata=True` (Python). This is the default if not specified.
- To keep the `metadata` field with its child fields intact in the output, include `--no-flatten-metadata` (CLI) or `flatten_metadata=False` (Python).
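To make the difference between the two settings concrete, the following sketch shows the general idea of flattening on a single element. It is an illustration only, not the connector’s actual implementation: the function name `flatten_metadata` and the sample element are hypothetical, and the handling of name collisions and of nested child fields is an assumption.

```python
def flatten_metadata(element: dict) -> dict:
    """Return a copy of the element with metadata child fields hoisted to the top level."""
    # Copy every top-level field except the metadata container itself.
    flattened = {key: value for key, value in element.items() if key != "metadata"}
    # Assumption: on a name collision, the metadata child field wins.
    flattened.update(element.get("metadata") or {})
    return flattened


if __name__ == "__main__":
    # Hypothetical element, for illustration only.
    element = {
        "element_id": "a1",
        "text": "Hello",
        "metadata": {"filename": "report.pdf", "page_number": 2},
    }
    print(flatten_metadata(element))
```

With flattening, fields such as `filename` become top-level fields that the collection schema can store directly; without it, they remain nested under `metadata`.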
These environment variables:
- `UNSTRUCTURED_API_KEY` - Your Unstructured API key value.
- `UNSTRUCTURED_API_URL` - Your Unstructured API URL.
Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported. This example uses the local source connector: