
Batch process all your records to store structured outputs in a Weaviate database.

You will need:

The Weaviate prerequisites:

  1. A Weaviate database instance. The following information assumes that you have a Weaviate Cloud (WCD) account with a Weaviate database cluster in that account. If you do not have these yet, create a WCD account and then create a database cluster. Other Weaviate database options are also available.

  2. The URL and API key for the database cluster. Get the URL and API key.

  3. A collection in the database cluster. Note the name of the collection, also known as the collection’s class name. Create a collection.

The Weaviate connector dependencies:

CLI, Python
pip install "unstructured-ingest[weaviate]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • WEAVIATE_URL - The REST endpoint for the Weaviate database cluster, represented by --host-url (CLI) or host_url (Python).

  • WEAVIATE_API_KEY - The API key for the database cluster, represented by --api-key (CLI) or api_key (Python).

    For the CLI, the --api-key option here is part of the weaviate command. For Python, the api_key parameter here is part of the WeaviateAccessConfig object.

  • WEAVIATE_COLLECTION_CLASS_NAME - The name of the collection in the database, represented by --class-name (CLI) or class_name (Python).

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.
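
Before running an ingest job, it can help to confirm that all five variables listed above are actually set. The following is a minimal sketch using only the Python standard library. The helper name missing_env_vars is hypothetical and is not part of the Unstructured Ingest library.

```python
import os

# Variable names taken from the two lists above.
REQUIRED_ENV_VARS = (
    "WEAVIATE_URL",
    "WEAVIATE_API_KEY",
    "WEAVIATE_COLLECTION_CLASS_NAME",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to the process environment, but any
    mapping can be passed in, which keeps the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments checks the current process environment. An empty list means every variable is present.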

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector:
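
As a rough illustration of how the pieces fit together, the sketch below assembles a CLI argument list from the environment variables described earlier, without executing anything. The weaviate options (--host-url, --api-key, --class-name) come from the variable list above. The local source option --input-path and the partitioning options --partition-by-api and --partition-endpoint are assumptions about the CLI and should be checked against your installed version. The function name build_ingest_command is hypothetical.

```python
import os
import shlex


def build_ingest_command(input_path, env=None):
    """Assemble an unstructured-ingest argument list; nothing is executed."""
    env = os.environ if env is None else env
    return [
        "unstructured-ingest",
        # Source connector: local files. --input-path is an assumed flag.
        "local",
        "--input-path", input_path,
        # Assumed flags for partitioning through the Unstructured API.
        "--partition-by-api",
        "--api-key", env["UNSTRUCTURED_API_KEY"],
        "--partition-endpoint", env["UNSTRUCTURED_API_URL"],
        # Destination connector: options named in the variable list above.
        "weaviate",
        "--host-url", env["WEAVIATE_URL"],
        "--api-key", env["WEAVIATE_API_KEY"],
        "--class-name", env["WEAVIATE_COLLECTION_CLASS_NAME"],
    ]


def format_command(args):
    """Render the argument list as a single shell-quoted string."""
    return shlex.join(args)
```

Note that --api-key appears twice: once after local for the Unstructured API key, and once after weaviate for the database cluster key. Printing format_command(build_ingest_command("./docs")) gives a command line you can review before running it yourself.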

Sample index schema

To make sure the schema of the collection (index) matches the data written to it, you can start from the following sample schema JSON:

{
    "class": "Elements",
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,
            "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
            "additions": null,
            "preset": "en",
            "removals": null
        }
    },
    "multiTenancyConfig": {
        "enabled": false
    },
    "properties": [
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "name": "element_id",
            "tokenization": "word"
        },
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "name": "text",
            "tokenization": "word"
        },
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "name": "type",
            "tokenization": "word"
        },
        {
            "dataType": [
                "object"
            ],
            "indexFilterable": true,
            "indexSearchable": false,
            "name": "metadata",
            "nestedProperties": [
                {
                    "dataType": [
                        "int"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": false,
                    "name": "category_depth"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "parent_id",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "attached_to_filename",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "filetype",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "date"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": false,
                    "name": "last_modified"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "file_directory",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "filename",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "object"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": false,
                    "name": "data_source",
                    "nestedProperties": [
                        {
                            "dataType": [
                                "text"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": true,
                            "name": "url",
                            "tokenization": "word"
                        },
                        {
                            "dataType": [
                                "text"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": true,
                            "name": "version",
                            "tokenization": "word"
                        },
                        {
                            "dataType": [
                                "date"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": false,
                            "name": "date_created"
                        },
                        {
                            "dataType": [
                                "date"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": false,
                            "name": "date_modified"
                        },
                        {
                            "dataType": [
                                "date"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": false,
                            "name": "date_processed"
                        },
                        {
                            "dataType": [
                                "text"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": true,
                            "name": "record_locator",
                            "tokenization": "word"
                        },
                        {
                            "dataType": [
                                "text"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": true,
                            "name": "permissions_data",
                            "tokenization": "word"
                        }
                    ]
                },
                {
                    "dataType": [
                        "object"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": false,
                    "name": "coordinates",
                    "nestedProperties": [
                        {
                            "dataType": [
                                "text"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": true,
                            "name": "system",
                            "tokenization": "word"
                        },
                        {
                            "dataType": [
                                "number"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": false,
                            "name": "layout_width"
                        },
                        {
                            "dataType": [
                                "number"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": false,
                            "name": "layout_height"
                        },
                        {
                            "dataType": [
                                "text"
                            ],
                            "indexFilterable": true,
                            "indexSearchable": true,
                            "name": "points",
                            "tokenization": "word"
                        }
                    ]
                },
                {
                    "dataType": [
                        "text[]"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "languages",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": false,
                    "name": "page_number"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "page_name",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "url",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "links",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text[]"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "link_urls",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text[]"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "link_texts",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "sent_from",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "sent_to",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "subject",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "section",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "header_footer_type",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text[]"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "emphasized_text_contents",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text[]"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "emphasized_text_tags",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "text_as_html",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": true,
                    "name": "regex_metadata",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "number"
                    ],
                    "indexFilterable": true,
                    "indexSearchable": false,
                    "name": "detection_class_prob"
                }
            ]
        }
    ],
    "replicationConfig": {
        "factor": 1
    },
    "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
    },
    "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efConstruction": 128,
        "ef": -1,
        "dynamicEfMin": 100,
        "dynamicEfMax": 500,
        "dynamicEfFactor": 8,
        "vectorCacheMaxObjects": 1000000000000,
        "flatSearchCutoff": 40000,
        "distance": "cosine",
        "pq": {
            "enabled": false,
            "bitCompression": false,
            "segments": 0,
            "centroids": 256,
            "trainingLimit": 100000,
            "encoder": {
                "type": "kmeans",
                "distribution": "log-normal"
            }
        }
    },
    "vectorIndexType": "hnsw",
    "vectorizer": "none"
}
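
Because the sample schema nests properties several levels deep (for example, under metadata and data_source), a flat listing can make it easier to compare against the data you plan to write. The following is a minimal sketch that relies only on the "properties", "name", "dataType", and "nestedProperties" keys shown above. The function names are hypothetical, and the sketch does not contact a Weaviate instance.

```python
import json


def flatten_properties(properties, prefix=""):
    """Yield (dotted_path, data_type) pairs, descending into object properties."""
    for prop in properties:
        path = prefix + prop["name"]
        # Each property in the sample lists exactly one data type.
        yield path, prop["dataType"][0]
        # Object properties carry their children under "nestedProperties".
        nested = prop.get("nestedProperties")
        if nested:
            yield from flatten_properties(nested, prefix=path + ".")


def summarize_schema(schema_json):
    """Parse a schema JSON string and map each property path to its data type."""
    schema = json.loads(schema_json)
    return dict(flatten_properties(schema.get("properties", [])))
```

Passing the sample schema text to summarize_schema would produce entries such as metadata.data_source.url mapped to text and metadata.last_modified mapped to date.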