
Batch process all your records to store structured outputs in Elasticsearch.

You will need:

The Elasticsearch prerequisites:

  • An Elasticsearch instance, such as an Elastic Cloud service instance…

    …or a self-managed Elasticsearch instance.

  • The name of the index on the instance. See Create index and Get index.

    The Elasticsearch index that you use must have a schema that is compatible with the schema of the documents that Unstructured produces for you. Unstructured cannot provide a schema that is guaranteed to work in all circumstances. This is because these schemas will vary based on your source files’ types; how you want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

    You can adapt the following index schema example for your own needs:

    {
        "settings": {
            "index": {
                "knn": true,
                "knn.algo_param.ef_search": 100
            }
        },
        "mappings": {
            "properties": {
                "element_id": {
                    "type": "keyword"
                },
                "text": {
                    "type": "text"
                },
                "embeddings": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": true,
                    "similarity": "cosine"
                },
                "metadata": {
                    "type": "object",
                    "properties": {
                        "parent_id": {
                            "type": "text"
                        },
                        "page_number": {
                            "type": "integer"
                        },
                        "is_continuation": {
                            "type": "boolean"
                        },
                        "orig_elements": {
                            "type": "text"
                        }
                    }
                }
            }
        }
    }
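
    Elasticsearch rejects a dense_vector value whose length differs from the field's dims setting, so it can help to check your documents against the schema before you write them. The following is a minimal sketch under stated assumptions, not part of Unstructured or of the Elasticsearch client: embedding_dims and check_document are hypothetical helper names, and INDEX_MAPPING is a trimmed copy of the preceding example.

```python
# Trimmed copy of the example index schema (only the fields checked here).
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "element_id": {"type": "keyword"},
            "text": {"type": "text"},
            "embeddings": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}


def embedding_dims(mapping, field="embeddings"):
    """Return the `dims` value declared for a dense_vector field (hypothetical helper)."""
    return mapping["mappings"]["properties"][field]["dims"]


def check_document(doc, mapping):
    """Return a list of problems found in `doc`; an empty list means none were found."""
    problems = []
    expected = embedding_dims(mapping)
    vector = doc.get("embeddings")
    # Only the vector length is checked; documents without embeddings pass.
    if vector is not None and len(vector) != expected:
        problems.append(
            f"embeddings has {len(vector)} values, but the mapping expects {expected}"
        )
    return problems
```

    If you change the embedding model, update dims in the index schema to match the new vector length before indexing.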
    


  • If you’re connecting to an Elastic Cloud instance, the Cloud ID and API key. To get these, see your Elasticsearch Service web console.

  • If you’re connecting to a self-managed instance, the instance’s hostname and port number. See Networking.

  • If you’re using basic authentication to the instance, the user’s name and password.

  • If you’re using token-based authentication to the instance, the bearer token or API key for the instance. See Token-based authentication services and Create API key.

  • If you’re using certificates, the path to the Certificate Authority (CA) file on the instance, and the certificate fingerprint. See SSL certificate API and Where can I see my Certificate Fingerprint?.

The Elasticsearch connector dependencies:

CLI, Python
pip install "unstructured-ingest[elasticsearch]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • ELASTICSEARCH_HOST - The hostname and port number, defined as <hostname>:<port-number> and represented by --hosts (CLI) or hosts (Python).
  • ELASTICSEARCH_INDEX_NAME - The name of the search index, represented by --index-name (CLI) or index_name (Python).

If you’re using Elastic Cloud to connect to Elasticsearch:

  • ELASTIC_CLOUD_ID - The instance’s Cloud ID, represented by --cloud-id (CLI) or cloud_id (Python).
  • ELASTIC_CLOUD_API_KEY - The corresponding Cloud API key value, represented by --api-key-id (CLI) or api_key_id (Python).

If you’re using basic authentication to the instance:

  • ELASTICSEARCH_USERNAME - The user’s name, represented by --username (CLI) or username (Python).
  • ELASTICSEARCH_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).

If you’re using token-based authentication to the instance instead:

  • ELASTICSEARCH_BEARER_TOKEN - The bearer token’s value, represented by --bearer-auth (CLI) or bearer_auth (Python).
  • ELASTIC_CLOUD_API_KEY_ID - The API key’s value, represented by --api-key (CLI) or api_key (Python).

If you’re using certificates:

  • ELASTICSEARCH_CA_CERTS - The path to the Certificate Authority (CA) file, represented by --ca-certs (CLI) or ca_certs (Python).
  • ELASTICSEARCH_SSL_ASSERT_FINGERPRINT - The certificate’s fingerprint, represented by --ssl-assert-fingerprint (CLI) or ssl_assert_fingerprint (Python).

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.
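
Before running a job, it can be useful to confirm that the variables listed above are set and to see which Python parameter each one feeds. The following is a minimal sketch, not part of the Unstructured SDK: collect_settings is a hypothetical helper, and treating ELASTICSEARCH_HOST, ELASTICSEARCH_INDEX_NAME, UNSTRUCTURED_API_KEY, and UNSTRUCTURED_API_URL as always required is an assumption based on how they are listed here. It only reads values; it does not call any connector.

```python
import os

# Environment variable -> Python parameter name, as listed on this page.
ENV_TO_PARAM = {
    "ELASTICSEARCH_HOST": "hosts",
    "ELASTICSEARCH_INDEX_NAME": "index_name",
    "ELASTIC_CLOUD_ID": "cloud_id",
    "ELASTIC_CLOUD_API_KEY": "api_key_id",
    "ELASTICSEARCH_USERNAME": "username",
    "ELASTICSEARCH_PASSWORD": "password",
    "ELASTICSEARCH_BEARER_TOKEN": "bearer_auth",
    "ELASTIC_CLOUD_API_KEY_ID": "api_key",
    "ELASTICSEARCH_CA_CERTS": "ca_certs",
    "ELASTICSEARCH_SSL_ASSERT_FINGERPRINT": "ssl_assert_fingerprint",
}

# Assumption: these four are needed regardless of the authentication method.
REQUIRED = (
    "ELASTICSEARCH_HOST",
    "ELASTICSEARCH_INDEX_NAME",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
)


def collect_settings(env=None):
    """Map the Elasticsearch variables that are set to their Python parameter names.

    Raises ValueError naming any required variable that is missing or empty.
    Values are returned as raw strings; how they are passed to the connector
    is not shown here.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {
        param: env[name] for name, param in ENV_TO_PARAM.items() if env.get(name)
    }
```

Calling collect_settings() with no argument reads from the current process environment; passing a plain dictionary instead makes the check easy to exercise in tests.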

Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported. This example uses the local source connector: