
Batch process all your records to store structured outputs in OpenSearch.

You will need:

The OpenSearch prerequisites:

The OpenSearch connector dependencies:

CLI, Python
pip install "unstructured-ingest[opensearch]"

You might also need to install additional dependencies, depending on your needs.

The following environment variables:

  • OPENSEARCH_HOST - The hostname and port number, defined as <hostname>:<port-number> and represented by --hosts (CLI) or hosts (Python).
  • OPENSEARCH_INDEX_NAME - The name of the search index, represented by --index-name (CLI) or index_name (Python).

If you’re using basic authentication with the instance:

  • OPENSEARCH_USERNAME - The user’s name, represented by --username (CLI) or username (Python).
  • OPENSEARCH_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).

If you’re using certificates for authentication instead:

  • OPENSEARCH_CA_CERTS - The path to the Certificate Authority (CA) bundle, if you use intermediate CAs with your root CA. This is represented by --ca-certs (CLI) or ca_certs (Python).
  • OPENSEARCH_CLIENT_CERT - The path to the combined private key and certificate file, or the path to just the certificate file. This is represented by --client-cert (CLI) or client_cert (Python).
  • OPENSEARCH_CLIENT_KEY - The path to the private key file, if OPENSEARCH_CLIENT_CERT refers to just the certificate file. This is represented by --client-key (CLI) or client_key (Python).

Additional related settings include:

  • --use-ssl (CLI) or use_ssl=True (Python) to use SSL for the connection.
  • --verify-certs (CLI) or verify_certs=True (Python) to verify SSL certificates.
  • --ssl-show-warn (CLI) or ssl_show_warn=True (Python) to show a warning when verifying SSL certificates is disabled.
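
The variables above can be gathered and checked before a run. The following is a minimal sketch, not part of the Unstructured SDK: `opensearch_settings` is a hypothetical helper, the returned dictionary simply reuses the Python parameter names listed above, and wrapping the host in a list is an assumption about how `hosts` is passed.

```python
import os
from typing import Mapping


def opensearch_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect OpenSearch settings from environment variables (hypothetical helper)."""
    host = env["OPENSEARCH_HOST"]
    # The host is expected in the form <hostname>:<port-number>.
    hostname, sep, port = host.rpartition(":")
    if not sep or not hostname or not port.isdigit():
        raise ValueError("OPENSEARCH_HOST must look like <hostname>:<port-number>")

    # Assumption: hosts is passed as a list containing the single host string.
    settings = {"hosts": [host], "index_name": env["OPENSEARCH_INDEX_NAME"]}

    if "OPENSEARCH_USERNAME" in env:
        # Basic authentication.
        settings["username"] = env["OPENSEARCH_USERNAME"]
        settings["password"] = env["OPENSEARCH_PASSWORD"]
    else:
        # Certificate-based authentication; each path is optional.
        for var, key in (
            ("OPENSEARCH_CA_CERTS", "ca_certs"),
            ("OPENSEARCH_CLIENT_CERT", "client_cert"),
            ("OPENSEARCH_CLIENT_KEY", "client_key"),
        ):
            if var in env:
                settings[key] = env[var]
    return settings
```

Failing early on a malformed host keeps the error close to its cause instead of surfacing later as a connection failure.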

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.

Now call the Unstructured CLI or Python SDK. You can use any supported source connector. This example uses the local source connector:

Vector Search Sample Mapping

To make sure the index schema matches the data being written to it, you can create the index from a sample mapping such as the following JSON.

Object description
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "element_id": {
        "type": "keyword"
      },
      "text": {
        "type": "text",
        "analyzer": "english"
      },
      "type": {
        "type": "text"
      },
      "embeddings": {
        "type": "knn_vector",
        "dimension": 384
      },
      "metadata": {
        "type": "object",
        "properties": {
          "category_depth": {
            "type": "integer"
          },
          "parent_id": {
            "type": "keyword"
          },
          "attached_to_filename": {
            "type": "keyword"
          },
          "filetype": {
            "type": "keyword"
          },
          "last_modified": {
            "type": "date"
          },
          "file_directory": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword"
          },
          "data_source": {
            "type": "object",
            "properties": {
              "url": {
                "type": "text",
                "analyzer": "standard"
              },
              "version": {
                "type": "keyword"
              },
              "date_created": {
                "type": "date"
              },
              "date_modified": {
                "type": "date"
              },
              "date_processed": {
                "type": "date"
              },
              "record_locator": {
                "type": "keyword"
              },
              "permissions_data": {
                "type": "object"
              }
            }
          },
          "coordinates": {
            "type": "object",
            "properties": {
              "system": {
                "type": "keyword"
              },
              "layout_width": {
                "type": "float"
              },
              "layout_height": {
                "type": "float"
              },
              "points": {
                "type": "float"
              }
            }
          },
          "languages": {
            "type": "keyword"
          },
          "page_number": {
            "type": "integer"
          },
          "page_name": {
            "type": "keyword"
          },
          "url": {
            "type": "text",
            "analyzer": "standard"
          },
          "links": {
            "type": "object"
          },
          "link_urls": {
            "type": "text"
          },
          "link_texts": {
            "type": "text"
          },
          "sent_from": {
            "type": "text",
            "analyzer": "standard"
          },
          "sent_to": {
            "type": "text",
            "analyzer": "standard"
          },
          "subject": {
            "type": "text",
            "analyzer": "standard"
          },
          "section": {
            "type": "text",
            "analyzer": "standard"
          },
          "header_footer_type": {
            "type": "keyword"
          },
          "emphasized_text_contents": {
            "type": "text"
          },
          "emphasized_text_tags": {
            "type": "keyword"
          },
          "text_as_html": {
            "type": "text",
            "analyzer": "standard"
          },
          "regex_metadata": {
            "type": "object"
          },
          "detection_class_prob": {
            "type": "float"
          }
        }
      }
    }
  }
}
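
The `dimension` of the `embeddings` field has to agree with the length of the vectors being written, so it can help to check records before sending them. The following is a minimal sketch under assumptions: `wrong_dimension_ids` is a hypothetical helper, and the records are assumed to be plain dictionaries with `element_id` and `embeddings` keys, as in the mapping above.

```python
from typing import Iterable

# Trimmed copy of the sample mapping above; only the vector field matters here.
SAMPLE_MAPPING = {
    "mappings": {
        "properties": {
            "embeddings": {"type": "knn_vector", "dimension": 384},
        }
    }
}


def wrong_dimension_ids(
    elements: Iterable[dict], mapping: dict, field: str = "embeddings"
) -> list:
    """Return the element_id of every record whose vector length differs from the mapping."""
    expected = mapping["mappings"]["properties"][field]["dimension"]
    bad = []
    for element in elements:
        vector = element.get(field)
        # Records without a vector are skipped rather than reported.
        if vector is not None and len(vector) != expected:
            bad.append(element.get("element_id"))
    return bad
```

If the embedding model you use produces vectors of a different length, change `dimension` in the mapping to match before creating the index.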