
Batch process all your records to store structured outputs in an Azure AI Search account.

You will need:

The Azure AI Search prerequisites:

The following video shows how to fulfill the minimum set of Azure AI Search prerequisites:

Here are some more details about these prerequisites:

  • The endpoint and API key for Azure AI Search. Create an endpoint and API key.

  • The name of the index in Azure AI Search. Create an index.

    The Azure AI Search index that you use must have an index schema that is compatible with the schema of the documents that Unstructured produces for you. Unstructured cannot provide a schema that is guaranteed to work in all circumstances. This is because these schemas will vary based on your source files’ types; how you want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

    You can adapt the following index schema example for your own needs:

    {
      "@odata.context": "https://ingest-test-azure-ai-search.search.windows.net/$metadata#indexes/$entity",
      "@odata.etag": "\"0x8DCED5D96393CA9\"",
      "name": "<my-index-name>",
      "defaultScoringProfile": null,
      "fields": [
        {
          "name": "id",
          "type": "Edm.String",
          "searchable": true,
          "filterable": true,
          "retrievable": true,
          "stored": true,
          "sortable": true,
          "facetable": true,
          "key": true,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": null,
          "vectorSearchProfile": null,
          "vectorEncoding": null,
          "synonymMaps": []
        },
        {
          "name": "element_id",
          "type": "Edm.String",
          "searchable": true,
          "filterable": true,
          "retrievable": true,
          "stored": true,
          "sortable": true,
          "facetable": true,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": null,
          "vectorSearchProfile": null,
          "vectorEncoding": null,
          "synonymMaps": []
        },
        {
          "name": "text",
          "type": "Edm.String",
          "searchable": true,
          "filterable": true,
          "retrievable": true,
          "stored": true,
          "sortable": true,
          "facetable": true,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": null,
          "vectorSearchProfile": null,
          "vectorEncoding": null,
          "synonymMaps": []
        },
        {
          "name": "embeddings",
          "type": "Collection(Edm.Single)",
          "searchable": true,
          "filterable": false,
          "retrievable": true,
          "stored": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": 3072,
          "vectorSearchProfile": "embeddings-config-profile",
          "vectorEncoding": null,
          "synonymMaps": []
        },
        {
          "name": "type",
          "type": "Edm.String",
          "searchable": true,
          "filterable": true,
          "retrievable": true,
          "stored": true,
          "sortable": true,
          "facetable": true,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": null,
          "vectorSearchProfile": null,
          "vectorEncoding": null,
          "synonymMaps": []
        },
        {
          "name": "metadata",
          "type": "Edm.ComplexType",
          "fields": [
            {
              "name": "category_depth",
              "type": "Edm.Int32",
              "searchable": false,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "parent_id",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "attached_to_filename",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "filetype",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "last_modified",
              "type": "Edm.DateTimeOffset",
              "searchable": false,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "is_continuation",
              "type": "Edm.Boolean",
              "searchable": false,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "file_directory",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "filename",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "data_source",
              "type": "Edm.ComplexType",
              "fields": [
                {
                  "name": "url",
                  "type": "Edm.String",
                  "searchable": true,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "version",
                  "type": "Edm.String",
                  "searchable": true,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "date_created",
                  "type": "Edm.DateTimeOffset",
                  "searchable": false,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "date_modified",
                  "type": "Edm.DateTimeOffset",
                  "searchable": false,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "date_processed",
                  "type": "Edm.DateTimeOffset",
                  "searchable": false,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "permissions_data",
                  "type": "Edm.String",
                  "searchable": true,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "record_locator",
                  "type": "Edm.String",
                  "searchable": true,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                }
              ]
            },
            {
              "name": "coordinates",
              "type": "Edm.ComplexType",
              "fields": [
                {
                  "name": "system",
                  "type": "Edm.String",
                  "searchable": true,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "layout_width",
                  "type": "Edm.Double",
                  "searchable": false,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "layout_height",
                  "type": "Edm.Double",
                  "searchable": false,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                },
                {
                  "name": "points",
                  "type": "Edm.String",
                  "searchable": true,
                  "filterable": true,
                  "retrievable": true,
                  "stored": true,
                  "sortable": true,
                  "facetable": true,
                  "key": false,
                  "indexAnalyzer": null,
                  "searchAnalyzer": null,
                  "analyzer": null,
                  "normalizer": null,
                  "dimensions": null,
                  "vectorSearchProfile": null,
                  "vectorEncoding": null,
                  "synonymMaps": []
                }
              ]
            },
            {
              "name": "languages",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "page_number",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "orig_elements",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "links",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "page_name",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "url",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "link_urls",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "link_texts",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "sent_from",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "sent_to",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "subject",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "section",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "header_footer_type",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "emphasized_text_contents",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "emphasized_text_tags",
              "type": "Collection(Edm.String)",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": false,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "text_as_html",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "regex_metadata",
              "type": "Edm.String",
              "searchable": true,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            },
            {
              "name": "detection_class_prob",
              "type": "Edm.Double",
              "searchable": false,
              "filterable": true,
              "retrievable": true,
              "stored": true,
              "sortable": true,
              "facetable": true,
              "key": false,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "analyzer": null,
              "normalizer": null,
              "dimensions": null,
              "vectorSearchProfile": null,
              "vectorEncoding": null,
              "synonymMaps": []
            }
          ]
        }
      ],
      "scoringProfiles": [],
      "corsOptions": null,
      "suggesters": [],
      "analyzers": [],
      "normalizers": [],
      "tokenizers": [],
      "tokenFilters": [],
      "charFilters": [],
      "encryptionKey": null,
      "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "k1": null,
        "b": null
      },
      "semantic": null,
      "vectorSearch": {
        "algorithms": [
          {
            "name": "embeddings-config",
            "kind": "hnsw",
            "hnswParameters": {
              "metric": "cosine",
              "m": 4,
              "efConstruction": 400,
              "efSearch": 500
            },
            "exhaustiveKnnParameters": null
          }
        ],
        "profiles": [
          {
            "name": "embeddings-config-profile",
            "algorithm": "embeddings-config",
            "vectorizer": null,
            "compression": null
          }
        ],
        "vectorizers": [],
        "compressions": []
      }
    }
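
One way to check compatibility before a run is to compare the keys of a sample output document against the fields the index defines. The following is a minimal sketch, not part of Unstructured or the Azure SDK: the helper names (`index_field_paths`, `document_key_paths`, `unknown_fields`) and the sample document are hypothetical, and it assumes the index schema is available as a Python dictionary shaped like the JSON example above. It checks only field names, not types.

```python
from typing import Any


def index_field_paths(fields: list[dict[str, Any]], prefix: str = "") -> set[str]:
    """Collect dotted paths for every field in an index schema."""
    paths: set[str] = set()
    for field in fields:
        path = f"{prefix}{field['name']}"
        paths.add(path)
        # Edm.ComplexType fields carry their own nested "fields" list.
        paths |= index_field_paths(field.get("fields") or [], prefix=f"{path}.")
    return paths


def document_key_paths(document: dict[str, Any], prefix: str = "") -> set[str]:
    """Collect dotted paths for every key in a document, descending into nested objects."""
    paths: set[str] = set()
    for key, value in document.items():
        path = f"{prefix}{key}"
        paths.add(path)
        if isinstance(value, dict):
            paths |= document_key_paths(value, prefix=f"{path}.")
    return paths


def unknown_fields(index_schema: dict[str, Any], document: dict[str, Any]) -> list[str]:
    """Return the document keys that the index schema does not define."""
    defined = index_field_paths(index_schema["fields"])
    return sorted(document_key_paths(document) - defined)


# Hypothetical, heavily trimmed schema and document for illustration only.
example_schema = {
    "fields": [
        {"name": "id", "type": "Edm.String"},
        {"name": "text", "type": "Edm.String"},
        {
            "name": "metadata",
            "type": "Edm.ComplexType",
            "fields": [{"name": "filename", "type": "Edm.String"}],
        },
    ]
}
example_document = {
    "id": "1",
    "text": "hello",
    "metadata": {"filename": "a.pdf", "custom_tag": "x"},
}

print(unknown_fields(example_schema, example_document))  # ['metadata.custom_tag']
```

Any path that this check reports is a candidate for either adding to the index schema or removing from the documents before upload.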
    


The Azure AI Search connector dependencies:

For both the CLI and Python:
pip install "unstructured-ingest[azure-ai-search]"

You might also need to install additional dependencies, depending on your needs. Learn more.

These environment variables:

  • AZURE_SEARCH_ENDPOINT - The endpoint URL for Azure AI Search, represented by --endpoint (CLI) or endpoint (Python).
  • AZURE_SEARCH_API_KEY - The API key for Azure AI Search, represented by --key (CLI) or key (Python).
  • AZURE_SEARCH_INDEX - The name of the index for Azure AI Search, represented by --index (CLI) or index (Python).
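
Before starting a run, it can help to confirm that all three variables are set. The following is a minimal sketch using only the Python standard library: the function name `load_azure_search_settings` is hypothetical, and the returned keys simply mirror the Python parameter names listed above (`endpoint`, `key`, `index`). The example values are placeholders, not real credentials.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_API_KEY", "AZURE_SEARCH_INDEX")


def load_azure_search_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the Azure AI Search settings from the environment, failing early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Keys mirror the Python parameter names: endpoint, key, index.
    return {
        "endpoint": env["AZURE_SEARCH_ENDPOINT"],
        "key": env["AZURE_SEARCH_API_KEY"],
        "index": env["AZURE_SEARCH_INDEX"],
    }


# Placeholder values for illustration; in practice, call the function with no
# arguments so that it reads from os.environ.
example_env = {
    "AZURE_SEARCH_ENDPOINT": "https://<my-service>.search.windows.net",
    "AZURE_SEARCH_API_KEY": "<my-api-key>",
    "AZURE_SEARCH_INDEX": "<my-index-name>",
}
print(load_azure_search_settings(example_env)["index"])  # <my-index-name>
```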

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector; this example uses the local source connector.

This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

  • To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

    Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
  • To send files to Unstructured API services for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

    Unstructured API services also require an Unstructured API key and API URL, which you provide by adding the following:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.

    Get an API key and API URL.
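
The choice between the two processing modes can be sketched as a small helper that assembles only the CLI options named above. This is an illustration under stated assumptions, not the Unstructured Ingest implementation: the function name `partition_cli_args` is hypothetical, it covers only `--partition-by-api`, `--api-key`, and `--partition-endpoint`, and the rest of the command (the source and destination connector options) is left out.

```python
import os
from collections.abc import Mapping


def partition_cli_args(partition_by_api: bool, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the CLI options that select where files are processed."""
    if not partition_by_api:
        # Local processing: omit --partition-by-api, the API key, and the API URL.
        return []
    missing = [
        name
        for name in ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")
        if not env.get(name)
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Processing by Unstructured API services needs the flag plus the key and URL.
    return [
        "--partition-by-api",
        "--api-key", env["UNSTRUCTURED_API_KEY"],
        "--partition-endpoint", env["UNSTRUCTURED_API_URL"],
    ]


# Placeholder values for illustration only.
example_env = {"UNSTRUCTURED_API_KEY": "<my-api-key>", "UNSTRUCTURED_API_URL": "<my-api-url>"}
print(partition_cli_args(False, example_env))  # []
print(partition_cli_args(True, example_env))
```

Returning an empty list for local processing reflects the guidance above: in that mode the flag, the API key, and the API URL are all omitted rather than set to empty values.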