Azure Cognitive Search
Batch process all your records to store structured outputs in an Azure Cognitive Search account.
You will need:
The Azure AI Search (formerly Azure Cognitive Search) prerequisites:
- The API key for Azure AI Search.
- The endpoint for Azure AI Search.
- The name of the index in Azure AI Search.
See the Azure AI Search documentation.
Make sure that you use a compatible index schema. For reference, see the sample index schema.
The Azure Cognitive Search connector dependencies:
pip install "unstructured-ingest[azure-cognitive-search]"
You might also need to install additional dependencies, depending on your needs. Learn more.
These environment variables:
- AZURE_SEARCH_ENDPOINT - The endpoint URL for Azure AI Search (formerly Azure Cognitive Search), represented by --endpoint (CLI) or endpoint (Python).
- AZURE_SEARCH_API_KEY - The API key for Azure AI Search, represented by --key (CLI) or key (Python).
- AZURE_SEARCH_INDEX - The name of the index for Azure AI Search, represented by --index (CLI) or index (Python).
These environment variables for the Unstructured API:
- UNSTRUCTURED_API_KEY - Your Unstructured API key value.
- UNSTRUCTURED_API_URL - Your Unstructured API URL.
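Because a missing or empty variable usually surfaces only as an authentication or connection error later in the run, it can help to check all five up front. The following is a minimal sketch using only the Python standard library; the variable names come from the lists above, while `REQUIRED_VARS` and `load_settings` are hypothetical names introduced for illustration and are not part of any Unstructured package.

```python
import os

# Variable names are taken from the lists above.
# REQUIRED_VARS and load_settings are hypothetical helpers, not library APIs.
REQUIRED_VARS = (
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_SEARCH_INDEX",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` with no arguments reads from the current process environment; passing a plain dict instead makes the check easy to exercise in tests.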
Now call the Unstructured CLI or Python SDK. The source connector can be any of the supported ones. This example uses the local source connector:
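As a rough sketch, the CLI call could be assembled from Python as an argument list. Only the `--endpoint`, `--key`, and `--index` flags are confirmed by the text above. The executable name `unstructured-ingest`, the `local` and `azure-cognitive-search` subcommands, and the `--input-path`, `--partition-by-api`, `--api-key`, and `--partition-endpoint` flags are assumptions here and should be checked against the help output of your installed version. The function name `build_ingest_command` is hypothetical.

```python
import os


def build_ingest_command(input_path, env=None):
    """Build (but do not run) an argument list for the ingest CLI.

    Flags marked "assumed" are not confirmed by this page; verify them
    against your installed CLI before relying on this sketch.
    """
    env = os.environ if env is None else env
    return [
        "unstructured-ingest",            # assumed executable name
        "local",                          # assumed source subcommand
        "--input-path", input_path,       # assumed flag
        "--partition-by-api",             # assumed flag
        "--api-key", env["UNSTRUCTURED_API_KEY"],             # assumed flag
        "--partition-endpoint", env["UNSTRUCTURED_API_URL"],  # assumed flag
        "azure-cognitive-search",         # assumed destination subcommand
        "--endpoint", env["AZURE_SEARCH_ENDPOINT"],  # flag named on this page
        "--key", env["AZURE_SEARCH_API_KEY"],        # flag named on this page
        "--index", env["AZURE_SEARCH_INDEX"],        # flag named on this page
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems when keys or paths contain special characters.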
Sample index schema
To make sure the schema of the index matches the data being written to it, a sample schema JSON can be used:
{
  "@odata.context": "https://utic-test-ingest-fixtures.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8DBB93E09C8F4BD\"",
  "name": "your-index-here",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "element_id",
      "type": "Edm.String"
    },
    {
      "name": "text",
      "type": "Edm.String"
    },
    {
      "name": "embeddings",
      "type": "Collection(Edm.Single)",
      "dimensions": 400,
      "vectorSearchConfiguration": "embeddings-config"
    },
    {
      "name": "type",
      "type": "Edm.String"
    },
    {
      "name": "metadata",
      "type": "Edm.ComplexType",
      "fields": [
        {
          "name": "category_depth",
          "type": "Edm.Int32"
        },
        {
          "name": "parent_id",
          "type": "Edm.String"
        },
        {
          "name": "attached_to_filename",
          "type": "Edm.String"
        },
        {
          "name": "filetype",
          "type": "Edm.String"
        },
        {
          "name": "last_modified",
          "type": "Edm.DateTimeOffset"
        },
        {
          "name": "file_directory",
          "type": "Edm.String"
        },
        {
          "name": "filename",
          "type": "Edm.String"
        },
        {
          "name": "data_source",
          "type": "Edm.ComplexType",
          "fields": [
            {
              "name": "url",
              "type": "Edm.String"
            },
            {
              "name": "version",
              "type": "Edm.String"
            },
            {
              "name": "date_created",
              "type": "Edm.DateTimeOffset"
            },
            {
              "name": "date_modified",
              "type": "Edm.DateTimeOffset"
            },
            {
              "name": "date_processed",
              "type": "Edm.DateTimeOffset"
            },
            {
              "name": "permissions_data",
              "type": "Edm.String"
            },
            {
              "name": "record_locator",
              "type": "Edm.String"
            }
          ]
        },
        {
          "name": "coordinates",
          "type": "Edm.ComplexType",
          "fields": [
            {
              "name": "system",
              "type": "Edm.String"
            },
            {
              "name": "layout_width",
              "type": "Edm.Double"
            },
            {
              "name": "layout_height",
              "type": "Edm.Double"
            },
            {
              "name": "points",
              "type": "Edm.String"
            }
          ]
        },
        {
          "name": "page_number",
          "type": "Edm.String"
        },
        {
          "name": "links",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "url",
          "type": "Edm.String"
        },
        {
          "name": "link_urls",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "link_texts",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "sent_from",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "sent_to",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "subject",
          "type": "Edm.String"
        },
        {
          "name": "section",
          "type": "Edm.String"
        },
        {
          "name": "header_footer_type",
          "type": "Edm.String"
        },
        {
          "name": "emphasized_text_contents",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "emphasized_text_tags",
          "type": "Collection(Edm.String)"
        },
        {
          "name": "text_as_html",
          "type": "Edm.String"
        },
        {
          "name": "regex_metadata",
          "type": "Edm.String"
        },
        {
          "name": "detection_class_prob",
          "type": "Edm.Double"
        }
      ]
    }
  ],
  "vectorSearch": {
    "algorithmConfigurations": [
      {
        "name": "embeddings-config",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        }
      }
    ]
  }
}
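Before adapting the sample schema, a few self-consistency checks can catch common mistakes, such as a vector field that points at a configuration name that is not defined, or an embedding model whose output length differs from the field's `dimensions` value. The sketch below uses only the standard library and relies on the property names that appear in the sample schema above. The function names `check_index_schema` and `embedding_length_matches` are hypothetical, and the record layout (a dict with an `embeddings` list) is an assumption based on the field names in the schema.

```python
import json


def check_index_schema(schema):
    """Return a list of problems found in an index schema dict (empty if none)."""
    problems = []
    fields = schema.get("fields", [])

    # An index needs exactly one top-level field marked as the document key.
    keys = [f["name"] for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {keys}")

    # Collect the configuration names defined under vectorSearch.
    vector_search = schema.get("vectorSearch", {})
    configs = {c["name"] for c in vector_search.get("algorithmConfigurations", [])}

    # Each vector field must reference a defined configuration
    # and declare an integer number of dimensions.
    for f in fields:
        ref = f.get("vectorSearchConfiguration")
        if ref is None:
            continue
        if ref not in configs:
            problems.append(f"field {f['name']!r} references unknown config {ref!r}")
        if not isinstance(f.get("dimensions"), int):
            problems.append(f"field {f['name']!r} has no integer 'dimensions'")
    return problems


def embedding_length_matches(schema, record, field="embeddings"):
    """Check that a record's embedding vector has the length the schema declares."""
    vector = record.get(field)
    if vector is None:
        return True  # nothing to compare for records without embeddings
    for f in schema.get("fields", []):
        if f["name"] == field:
            return len(vector) == f.get("dimensions")
    return False  # the record has a vector but the schema has no such field


# Usage with a trimmed-down version of the sample schema.
sample = json.loads("""
{
  "name": "your-index-here",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true},
    {"name": "embeddings", "type": "Collection(Edm.Single)",
     "dimensions": 400, "vectorSearchConfiguration": "embeddings-config"}
  ],
  "vectorSearch": {
    "algorithmConfigurations": [{"name": "embeddings-config", "kind": "hnsw"}]
  }
}
""")
print(check_index_schema(sample))
```

With the sample schema, a record whose `embeddings` list does not contain exactly 400 values would fail `embedding_length_matches`, so either the embedding model or the `dimensions` value needs to change before writing.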