OpenSearch
Batch process all your records to store structured outputs in OpenSearch.
You will need:
The OpenSearch prerequisites:
- An OpenSearch instance, such as an AWS OpenSearch instance or a local instance. In all cases, the OpenSearch version must be compatible with the client version of the opensearch-py package on PyPI.
- The instance’s hostname and port number.
- The name of the search index on the instance.
- If you’re using basic authentication to the instance, the user’s name and password.
- If you’re using certificates for authentication instead:
  - The path to the Certificate Authority (CA) bundle, if you use intermediate CAs with your root CA.
  - The path to the combined private key and certificate file, or
  - The paths to the separate private key and certificate files.
To learn more, see:
- For AWS OpenSearch instances, see Tutorial: Configure a domain with the internal user database and HTTP basic authentication and Creating a custom endpoint for Amazon OpenSearch Service.
- For local OpenSearch instances, see Authentication backends, HTTP basic authentication, and Client certificate authentication.
The OpenSearch connector dependencies:
pip install "unstructured-ingest[opensearch]"
You might also need to install additional dependencies, depending on your needs.
The following environment variables:
- OPENSEARCH_HOST - The hostname and port number, defined as <hostname>:<port-number> and represented by --hosts (CLI) or hosts (Python).
- OPENSEARCH_INDEX_NAME - The name of the search index, represented by --index-name (CLI) or index_name (Python).
If you’re using basic authentication to the instance:
- OPENSEARCH_USERNAME - The user’s name, represented by --username (CLI) or username (Python).
- OPENSEARCH_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).
If you’re using certificates for authentication instead:
- OPENSEARCH_CA_CERTS - The path to the Certificate Authority (CA) bundle, if you use intermediate CAs with your root CA. This is represented by --ca-certs (CLI) or ca_certs (Python).
- OPENSEARCH_CLIENT_CERT - The path to the combined private key and certificate file, or the path to just the certificate file. This is represented by --client-cert (CLI) or client_cert (Python).
- OPENSEARCH_CLIENT_KEY - The path to the private key file, if OPENSEARCH_CLIENT_CERT refers to just the certificate file. This is represented by --client-key (CLI) or client_key (Python).
Additional related settings include:
- --use-ssl (CLI) or use_ssl=True (Python) to use SSL for the connection.
- --verify-certs (CLI) or verify_certs=True (Python) to verify SSL certificates.
- --ssl-show-warn (CLI) or ssl_show_warn=True (Python) to show a warning when verifying SSL certificates is disabled.
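Before running a full ingest, it can help to confirm that the connection settings above are consistent. The following is a minimal sketch, not the connector's own implementation: the helper name build_client_kwargs is hypothetical, and it assumes OPENSEARCH_HOST contains no URL scheme, only <hostname>:<port-number>. It maps the environment variables onto the keyword arguments accepted by the opensearch-py client.

```python
import os
from typing import Any, Mapping


def build_client_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Map the OPENSEARCH_* variables onto opensearch-py keyword arguments.

    Hypothetical helper for illustration; assumes OPENSEARCH_HOST has the
    form <hostname>:<port-number> with no scheme prefix.
    """
    # Split on the last colon so the port is always the final component.
    hostname, _, port = env["OPENSEARCH_HOST"].rpartition(":")
    kwargs: dict[str, Any] = {
        "hosts": [{"host": hostname, "port": int(port)}],
        "use_ssl": True,        # corresponds to --use-ssl
        "verify_certs": True,   # corresponds to --verify-certs
        "ssl_show_warn": False,  # corresponds to --ssl-show-warn
    }
    # Basic authentication: user name and password travel together.
    if "OPENSEARCH_USERNAME" in env:
        kwargs["http_auth"] = (env["OPENSEARCH_USERNAME"], env["OPENSEARCH_PASSWORD"])
    # Certificate authentication: pass through only the paths that are set.
    for variable, argument in (
        ("OPENSEARCH_CA_CERTS", "ca_certs"),
        ("OPENSEARCH_CLIENT_CERT", "client_cert"),
        ("OPENSEARCH_CLIENT_KEY", "client_key"),
    ):
        if variable in env:
            kwargs[argument] = env[variable]
    return kwargs
```

Assuming opensearch-py is installed, passing the result to the client as OpenSearch(**build_client_kwargs()) and calling info() on it is one way to check that the host, credentials, and certificates are accepted by the instance.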
These environment variables:
- UNSTRUCTURED_API_KEY - Your Unstructured API key value.
- UNSTRUCTURED_API_URL - Your Unstructured API URL.
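A quick pre-flight check can catch a missing variable before any records are processed. This is a sketch under stated assumptions: the function name missing_variables is hypothetical, and it assumes that one of the two authentication methods described above is in use (an instance with security disabled would not need either).

```python
import os
from typing import Mapping

# Variables listed above that are needed regardless of authentication method.
REQUIRED = (
    "OPENSEARCH_HOST",
    "OPENSEARCH_INDEX_NAME",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
)
BASIC_AUTH = ("OPENSEARCH_USERNAME", "OPENSEARCH_PASSWORD")


def missing_variables(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of expected variables that are unset or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    # Assumption: if no client certificate is configured, basic
    # authentication is intended, so both of its variables are expected.
    if not env.get("OPENSEARCH_CLIENT_CERT"):
        missing += [name for name in BASIC_AUTH if not env.get(name)]
    return missing


if __name__ == "__main__":
    names = missing_variables()
    if names:
        raise SystemExit("Missing environment variables: " + ", ".join(names))
```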
Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported. This example uses the local source connector:
Vector Search Sample Mapping
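To make sure the schema of the index matches the data being written to it, you can create the index from a sample mapping such as the following JSON. The dimension of the embeddings field (384 in this sample) must match the output dimension of your embedding model.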
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "element_id": {
        "type": "keyword"
      },
      "text": {
        "type": "text",
        "analyzer": "english"
      },
      "type": {
        "type": "text"
      },
      "embeddings": {
        "type": "knn_vector",
        "dimension": 384
      },
      "metadata": {
        "type": "object",
        "properties": {
          "category_depth": {
            "type": "integer"
          },
          "parent_id": {
            "type": "keyword"
          },
          "attached_to_filename": {
            "type": "keyword"
          },
          "filetype": {
            "type": "keyword"
          },
          "last_modified": {
            "type": "date"
          },
          "file_directory": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword"
          },
          "data_source": {
            "type": "object",
            "properties": {
              "url": {
                "type": "text",
                "analyzer": "standard"
              },
              "version": {
                "type": "keyword"
              },
              "date_created": {
                "type": "date"
              },
              "date_modified": {
                "type": "date"
              },
              "date_processed": {
                "type": "date"
              },
              "record_locator": {
                "type": "keyword"
              },
              "permissions_data": {
                "type": "object"
              }
            }
          },
          "coordinates": {
            "type": "object",
            "properties": {
              "system": {
                "type": "keyword"
              },
              "layout_width": {
                "type": "float"
              },
              "layout_height": {
                "type": "float"
              },
              "points": {
                "type": "float"
              }
            }
          },
          "languages": {
            "type": "keyword"
          },
          "page_number": {
            "type": "integer"
          },
          "page_name": {
            "type": "keyword"
          },
          "url": {
            "type": "text",
            "analyzer": "standard"
          },
          "links": {
            "type": "object"
          },
          "link_urls": {
            "type": "text"
          },
          "link_texts": {
            "type": "text"
          },
          "sent_from": {
            "type": "text",
            "analyzer": "standard"
          },
          "sent_to": {
            "type": "text",
            "analyzer": "standard"
          },
          "subject": {
            "type": "text",
            "analyzer": "standard"
          },
          "section": {
            "type": "text",
            "analyzer": "standard"
          },
          "header_footer_type": {
            "type": "keyword"
          },
          "emphasized_text_contents": {
            "type": "text"
          },
          "emphasized_text_tags": {
            "type": "keyword"
          },
          "text_as_html": {
            "type": "text",
            "analyzer": "standard"
          },
          "regex_metadata": {
            "type": "object"
          },
          "detection_class_prob": {
            "type": "float"
          }
        }
      }
    }
  }
}
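One common mismatch between a mapping like this and the data is the vector length: a knn_vector field only accepts vectors of its declared dimension. The sketch below shows one way to check a record against the mapping before writing it. The function names embedding_dimension and check_record are hypothetical, and the mapping in the example is abbreviated to the one field being checked.

```python
from typing import Any, Mapping


def embedding_dimension(mapping: Mapping[str, Any], field: str = "embeddings") -> int:
    """Return the declared dimension of a knn_vector field in an index mapping."""
    spec = mapping["mappings"]["properties"][field]
    if spec.get("type") != "knn_vector":
        raise ValueError(f"{field!r} is not a knn_vector field")
    return int(spec["dimension"])


def check_record(mapping: Mapping[str, Any], record: Mapping[str, Any],
                 field: str = "embeddings") -> None:
    """Raise ValueError if the record's vector length differs from the mapping."""
    expected = embedding_dimension(mapping, field)
    actual = len(record[field])
    if actual != expected:
        raise ValueError(
            f"{field!r} has {actual} values, but the mapping declares {expected}"
        )


# Abbreviated version of the sample mapping, for illustration only.
sample_mapping = {
    "mappings": {
        "properties": {
            "embeddings": {"type": "knn_vector", "dimension": 384},
        }
    }
}

check_record(sample_mapping, {"embeddings": [0.0] * 384})  # passes silently
```

If you create the index yourself with opensearch-py, the full mapping can be passed as the body argument of client.indices.create(index=..., body=...), where client is an OpenSearch client you have already configured.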