Batch process all your records to store structured outputs in Elasticsearch.

You will need:

The Elasticsearch prerequisites:

  • An Elasticsearch instance, such as an Elastic Cloud service instance or a self-managed Elasticsearch instance.

  • The name of the index on the instance. See Create index and Get index.

  • If you’re connecting to an Elastic Cloud instance, the Cloud ID and API key. To get these, see your Elasticsearch Service web console.

  • If you’re connecting to a self-managed instance, the instance’s hostname and port number. See Networking.

  • If you’re using basic authentication to the instance, the user’s name and password.

  • If you’re using token-based authentication to the instance, the bearer token or API key for the instance. See Token-based authentication services and Create API key.

  • If you’re using certificates, the path to the Certificate Authority (CA) file on the instance, and the certificate fingerprint. See SSL certificate API and Where can I see my Certificate Fingerprint?.

The Elasticsearch connector dependencies:

CLI, Python
pip install "unstructured-ingest[elasticsearch]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • ELASTICSEARCH_HOST - The hostname and port number, defined as <hostname>:<port-number> and represented by --hosts (CLI) or hosts (Python).
  • ELASTICSEARCH_INDEX_NAME - The name of the search index, represented by --index-name (CLI) or index_name (Python).

If you’re using Elastic Cloud to connect to Elasticsearch:

  • ELASTIC_CLOUD_ID - The instance’s Cloud ID, represented by --cloud-id (CLI) or cloud_id (Python).
  • ELASTIC_CLOUD_API_KEY - The corresponding Cloud API key value, represented by --api-key-id (CLI) or api_key_id (Python).

If you’re using basic authentication to the instance:

  • ELASTICSEARCH_USERNAME - The user’s name, represented by --username (CLI) or username (Python).
  • ELASTICSEARCH_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).

If you’re using token-based authentication to the instance instead:

  • ELASTICSEARCH_BEARER_TOKEN - The bearer token’s value, represented by --bearer-auth (CLI) or bearer_auth (Python).
  • ELASTIC_CLOUD_API_KEY_ID - The API key’s value, represented by --api-key (CLI) or api_key (Python).

If you’re using certificates:

  • ELASTICSEARCH_CA_CERTS - The path to the Certificate Authority (CA) file, represented by --ca-certs (CLI) or ca_certs (Python).
  • ELASTICSEARCH_SSL_ASSERT_FINGERPRINT - The certificate’s fingerprint, represented by --ssl-assert-fingerprint (CLI) or ssl_assert_fingerprint (Python).

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.
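
As a rough illustration of how the Elasticsearch-related variables listed above line up with the Python parameter names, the following sketch collects whichever of them are set into a dictionary keyed by parameter name. The helper collect_elasticsearch_settings is hypothetical and not part of unstructured-ingest, and the rule that an index name plus either a host or a Cloud ID must be present is an assumption made for this example.

```python
import os

# Environment variable -> Python parameter name, exactly as listed above.
ENV_TO_PARAM = {
    "ELASTICSEARCH_HOST": "hosts",
    "ELASTICSEARCH_INDEX_NAME": "index_name",
    "ELASTIC_CLOUD_ID": "cloud_id",
    "ELASTIC_CLOUD_API_KEY": "api_key_id",
    "ELASTICSEARCH_USERNAME": "username",
    "ELASTICSEARCH_PASSWORD": "password",
    "ELASTICSEARCH_BEARER_TOKEN": "bearer_auth",
    "ELASTIC_CLOUD_API_KEY_ID": "api_key",
    "ELASTICSEARCH_CA_CERTS": "ca_certs",
    "ELASTICSEARCH_SSL_ASSERT_FINGERPRINT": "ssl_assert_fingerprint",
}


def collect_elasticsearch_settings(env=None):
    """Return {parameter_name: value} for every listed variable that is set.

    Hypothetical helper for illustration; not part of unstructured-ingest.
    """
    env = os.environ if env is None else env
    # Assumption for this sketch: the index name is always needed.
    if not env.get("ELASTICSEARCH_INDEX_NAME"):
        raise KeyError("ELASTICSEARCH_INDEX_NAME is not set")
    # Assumption for this sketch: either a host or a Cloud ID identifies the instance.
    if not (env.get("ELASTICSEARCH_HOST") or env.get("ELASTIC_CLOUD_ID")):
        raise KeyError("Set ELASTICSEARCH_HOST or ELASTIC_CLOUD_ID")
    # Skip variables that are unset or empty.
    return {
        param: env[name]
        for name, param in ENV_TO_PARAM.items()
        if env.get(name)
    }
```

Calling collect_elasticsearch_settings() with no arguments reads from os.environ; passing a plain dictionary instead makes the helper easy to try out without touching the real environment.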

Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported. This example uses the local source connector:

Vector search sample mapping

To make sure the index’s schema matches the data being written to it, you can use a sample mapping such as the following JSON.

{
  "properties": {
    "element_id": {
      "type": "keyword"
    },
    "text": {
      "type": "text",
      "analyzer": "english"
    },
    "type": {
      "type": "text"
    },
    "embeddings": {
      "type": "dense_vector",
      "dims": 384
    },
    "metadata": {
      "type": "object",
      "properties": {
        "category_depth": {
          "type": "integer"
        },
        "parent_id": {
          "type": "keyword"
        },
        "attached_to_filename": {
          "type": "keyword"
        },
        "filetype": {
          "type": "keyword"
        },
        "last_modified": {
          "type": "date"
        },
        "file_directory": {
          "type": "keyword"
        },
        "filename": {
          "type": "keyword"
        },
        "data_source": {
          "type": "object",
          "properties": {
            "url": {
              "type": "text",
              "analyzer": "standard"
            },
            "version": {
              "type": "keyword"
            },
            "date_created": {
              "type": "date"
            },
            "date_modified": {
              "type": "date"
            },
            "date_processed": {
              "type": "date"
            },
            "record_locator": {
              "type": "keyword"
            },
            "permissions_data": {
              "type": "object"
            }
          }
        },
        "coordinates": {
          "type": "object",
          "properties": {
            "system": {
              "type": "keyword"
            },
            "layout_width": {
              "type": "float"
            },
            "layout_height": {
              "type": "float"
            },
            "points": {
              "type": "float"
            }
          }
        },
        "languages": {
          "type": "keyword"
        },
        "page_number": {
          "type": "integer"
        },
        "page_name": {
          "type": "keyword"
        },
        "url": {
          "type": "text",
          "analyzer": "standard"
        },
        "links": {
          "type": "object"
        },
        "link_urls": {
          "type": "text"
        },
        "link_texts": {
          "type": "text"
        },
        "sent_from": {
          "type": "text",
          "analyzer": "standard"
        },
        "sent_to": {
          "type": "text",
          "analyzer": "standard"
        },
        "subject": {
          "type": "text",
          "analyzer": "standard"
        },
        "section": {
          "type": "text",
          "analyzer": "standard"
        },
        "header_footer_type": {
          "type": "keyword"
        },
        "emphasized_text_contents": {
          "type": "text"
        },
        "emphasized_text_tags": {
          "type": "keyword"
        },
        "text_as_html": {
          "type": "text",
          "analyzer": "standard"
        },
        "regex_metadata": {
          "type": "object"
        },
        "detection_class_prob": {
          "type": "float"
        }
      }
    }
  }
}
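
One way to apply a mapping like this is to send it under a "mappings" key when creating the index with the Elasticsearch create index API (PUT /<index-name>). The sketch below uses only the Python standard library to build such a request without sending it, and to check that an embedding's length matches the dims value declared in the mapping. The function names, the trimmed mapping, the http://localhost:9200 URL, and the index name my-index are illustrative assumptions; authentication and TLS handling are omitted.

```python
import json
import urllib.request

# Trimmed version of the sample mapping above, kept short for illustration.
SAMPLE_MAPPING = {
    "properties": {
        "element_id": {"type": "keyword"},
        "text": {"type": "text", "analyzer": "english"},
        "embeddings": {"type": "dense_vector", "dims": 384},
    }
}


def build_create_index_request(base_url, index_name, mapping):
    """Build (but do not send) a PUT request that creates an index with a mapping."""
    body = json.dumps({"mappings": mapping}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def embedding_matches_mapping(embedding, mapping, field="embeddings"):
    """Return True if the vector length equals the field's declared dims."""
    return len(embedding) == mapping["properties"][field]["dims"]


# Placeholder URL and index name; replace with your own values.
request = build_create_index_request("http://localhost:9200", "my-index", SAMPLE_MAPPING)
# To actually create the index, you would send it: urllib.request.urlopen(request)
```

The dims check matters because the sample mapping declares 384 dimensions; if your embedding model produces vectors of a different length, the dims value in the mapping would need to change to match.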