# OpenSearch

Batch process all your records to store structured outputs in OpenSearch.

The requirements are as follows.

## Set up OpenSearch

Supported OpenSearch installations vary by product:

* **[Unstructured Pipelines](/pipelines/overview) and [Unstructured API](/api-reference/overview):** Non-local OpenSearch only. Local OpenSearch instances are not supported.
* **[Unstructured Ingest](/open-source/ingestion/overview):** Both local and non-local OpenSearch instances are supported.

### Set up an AWS OpenSearch Service domain

To set up an [AWS OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html) domain, complete steps similar to the following:

1. Sign in to your AWS account, and then open your AWS Management Console.

2. Open your Amazon OpenSearch Service console.

3. On the sidebar, expand **Managed clusters**, and then click **Dashboard**.

4. Click **Create domain**.

5. In the **Name** tile, for **Domain name**, enter some unique domain name for your new OpenSearch domain.

6. In the **Domain creation method** tile, select **Easy create**. This option provides faster setup using default configurations and enables fine-grained access control (FGAC) by default. (With **Standard create**, you must enable FGAC manually.) To learn more about that method, see the [Standard create method documentation](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html).

7. In the **Engine options** tile, for **Version**, AWS recommends that you select the latest version.

8. In the **Network** tile, for **Network**, select a network access method.
   For faster setup, this example uses the **Public access** method.
   To use VPC access instead, see the [VPC access method documentation](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/vpc.html#prerequisites-vpc-endpoints).

9. For **IP address type**, select **Dual-stack mode**.

10. In the **Fine-grained access control** (FGAC) tile, do one of the following:

    * If you want to use an existing AWS IAM user in the AWS account as the domain's master user, then for **Master user**, select **Set IAM ARN as master user**. Then enter the IAM ARN for the master user in the **IAM ARN** box.
    * If you want to create a master user and password as the domain's master user instead, then for **Master user**, select **Create master user**. Then specify some username and password for this
      new master user by filling in the **Master username**, **Master password**, and **Confirm master password** fields. Make
      sure to save the master user's password in a secure location.

11. Click **Create**.

12. After the domain is created, you must allow Unstructured to access the domain, as follows:

    a. If the new domain's settings page is not already showing, open it as follows:
    in your Amazon OpenSearch Service console, on the sidebar, expand **Managed clusters**, and then click **Domains**. Then,
    in the list of available domains, click the name of the newly created domain.<br />
    b. On the **Security configuration** tab, click **Edit**.<br />
    c. In the **Access policy** tile, for **Domain access policy**, select **Only use fine-grained access control**.<br />
    d. Click **Clear policy**. This removes any existing resource-based access policy from the domain. With no domain access policy, access control relies entirely on fine-grained access control (FGAC).<br />
    e. Click **Save changes**.

### Set up an Amazon OpenSearch Serverless collection

To set up an [Amazon OpenSearch Serverless collection](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-create-console.html), complete steps similar to the following:

1. Sign in to your AWS account, and then open your AWS Management Console.

2. Open your Amazon OpenSearch Service console.

3. On the sidebar, expand **Serverless**, and then click **Dashboard**.

4. Click **Create collection**.

5. In the **Collection details** tile, for **Collection name**, enter some unique name for your new OpenSearch Serverless collection.
   Optionally, for **Description**, enter some meaningful description for your new collection.

6. For **Collection type**, select **Search**.

   <Note>
     Unstructured does not support the **Vector search** collection type. If you need vector search support, you can either continue
     with these steps to use the **Search** collection type, or you can follow the preceding steps to set up an Amazon OpenSearch Service managed cluster instead.
     Note, however, that the Amazon OpenSearch Serverless **Search** collection type is less well suited to vector search than the **Vector search** collection type.
   </Note>

7. In the **Collection creation method** tile, select **Standard create**.

8. For **Encryption**, choose an AWS KMS key type.

9. For **Network access settings**, choose an **Access type**.

10. For **Resource type**, select both **Enable access to OpenSearch endpoint** and **Enable access to OpenSearch Dashboards**.

11. Click **Next**.

12. In the **Definition method** tile, select **JSON**.

13. In the **JSON editor** box, enter the following JSON, replacing the following placeholders:

    * Replace `<collection-name>` with the name of the new OpenSearch Serverless collection.
    * Replace `<account-id>` with the target AWS account ID.
    * Replace `<user-id>` with the ID of the target AWS IAM user.

    ```json theme={null}
    [
        {
            "Rules": [
                {
                    "Resource": ["collection/<collection-name>"],
                    "Permission": [
                        "aoss:CreateCollectionItems",
                        "aoss:UpdateCollectionItems",
                        "aoss:DescribeCollectionItems"
                    ],
                    "ResourceType": "collection"
                },
                {
                    "Resource": ["index/<collection-name>/*"],
                    "Permission": [
                        "aoss:CreateIndex",
                        "aoss:DescribeIndex",
                        "aoss:ReadDocument",
                        "aoss:WriteDocument",
                        "aoss:UpdateIndex",
                        "aoss:DeleteIndex"
                    ],
                    "ResourceType": "index"
                },
                {
                    "Resource": ["model/<collection-name>/*"],
                    "Permission": [
                        "aoss:DescribeMLResource",
                        "aoss:CreateMLResource",
                        "aoss:UpdateMLResource",
                        "aoss:DeleteMLResource",
                        "aoss:ExecuteMLResource"
                    ],
                    "ResourceType": "model"
                }
            ],
            "Principal": ["arn:aws:iam::<account-id>:user/<user-id>"]
        }
    ]
    ```

14. Click **Next**.

15. For **Data access policy settings**, select **Create as a new data access policy**.

16. In the **Name and description** tile, enter some unique name and an optional description for the new data access policy.

17. Click **Next**.

18. Enter any desired index details, and click **Next** again. For example:

    a. For **Index name**, enter the name of the new index in the collection.<br />
    b. For **Automatic Semantic Enrichment fields**, click **Add**, enter `embeddings` for **Automatic Semantic Enrichment field name**, click **Add**, and click **Confirm**.<br />
    c. For **Lexical search fields**, click **Add**, enter `text` for **Field name** and select **Text** for **Data type**, click **Add**, and click **Confirm**.<br />

19. Click **Submit**.
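
If you prefer to generate the data access policy from step 13 rather than edit it by hand, the placeholders can be filled in programmatically. The following is a minimal sketch that uses only the Python standard library; the function name `build_data_access_policy` and the sample values are hypothetical, and the permissions are copied from the JSON in step 13.

```python
import json


def build_data_access_policy(collection_name: str, account_id: str, user_id: str) -> str:
    """Return the data access policy from step 13 with its placeholders filled in."""
    policy = [
        {
            "Rules": [
                {
                    "Resource": [f"collection/{collection_name}"],
                    "Permission": [
                        "aoss:CreateCollectionItems",
                        "aoss:UpdateCollectionItems",
                        "aoss:DescribeCollectionItems",
                    ],
                    "ResourceType": "collection",
                },
                {
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": [
                        "aoss:CreateIndex",
                        "aoss:DescribeIndex",
                        "aoss:ReadDocument",
                        "aoss:WriteDocument",
                        "aoss:UpdateIndex",
                        "aoss:DeleteIndex",
                    ],
                    "ResourceType": "index",
                },
                {
                    "Resource": [f"model/{collection_name}/*"],
                    "Permission": [
                        "aoss:DescribeMLResource",
                        "aoss:CreateMLResource",
                        "aoss:UpdateMLResource",
                        "aoss:DeleteMLResource",
                        "aoss:ExecuteMLResource",
                    ],
                    "ResourceType": "model",
                },
            ],
            "Principal": [f"arn:aws:iam::{account_id}:user/{user_id}"],
        }
    ]
    return json.dumps(policy, indent=4)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_data_access_policy("my-collection", "123456789012", "my-user"))
```

You can paste the printed JSON into the **JSON editor** box.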

### Set up a local OpenSearch instance

The following video shows how to set up a [local OpenSearch](https://opensearch.org/downloads.html) instance.

<iframe width="560" height="315" src="https://www.youtube.com/embed/Rew3_pNnYIs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

### Get the host URL

Find your host URL by following the steps for your type of OpenSearch installation.

#### Amazon OpenSearch Service domain

1. Sign in to your AWS account, and then open your AWS Management Console.
2. Open your Amazon OpenSearch Service console.
3. On the sidebar, expand **Managed clusters**, and then click **Dashboard**.
4. In the list of available domains, click the name of your domain.
5. In the **General information** tile, copy the value of **Domain endpoint v2 (dual stack)**.

#### Amazon OpenSearch Serverless collection

1. Sign in to your AWS account, and then open your AWS Management Console.
2. Open your Amazon OpenSearch Service console.
3. On the sidebar, expand **Serverless**, and then click **Dashboard**.
4. In the list of available collections, click the name of your collection.
5. On the **Overview** tab, in the **Endpoint** tile, copy the value of **OpenSearch endpoint**.

#### Local OpenSearch instance

Your local instance URL depends on how OpenSearch is installed and configured. For guidance, see [Communicate with OpenSearch](https://opensearch.org/docs/latest/getting-started/communicate/) in the OpenSearch documentation.

### Create a search index

The name of the search index on the instance is required.

For the destination connector, if you need to create an index and the domain's master user is a username and password, you can use a `curl` command such as the following. Replace the following placeholders:

* Replace `<host>` with the instance's host URL.
* Replace `<port>` with the instance's port number, which is typically `443` for encrypted connections, and less commonly `9200` for unencrypted connections.
* Replace `<master-username>` with the master user's name, and replace `<master-password>` with the master user's password.
* Replace `<index-name>` with the name of the new search index on the instance.
* Replace `<index-schema>` with the schema for the new search index on the instance. A schema is optional; see the explanation
  following this `curl` command for more information.

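```bash theme={null}
# To create the index without a schema, omit the --header and --data lines
# (and the trailing backslash on the --user line).
curl --request PUT "<host>:<port>/<index-name>" \
--user "<master-username>:<master-password>" \
--header "Content-Type: application/json" \
--data '<index-schema>'
```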

If you're using an existing AWS IAM user as the domain's master user, use the AWS Command Line Interface (CLI) to create the index instead of the preceding `curl` command. To learn how, see [create-index](https://docs.aws.amazon.com/cli/latest/reference/opensearch/create-index.html) in the AWS CLI Command Reference.
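
For master username and password authentication, the same index-creation request can be expressed in Python. The following is a minimal sketch that uses only the standard library; the function name `build_create_index_request` and the sample values are hypothetical. It builds the request without sending it, so you can inspect it first.

```python
import base64
import json
import urllib.request
from typing import Optional


def build_create_index_request(
    host: str,
    port: int,
    index_name: str,
    username: str,
    password: str,
    schema: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates a search index."""
    # HTTP basic authentication: base64-encode "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}
    body = None
    if schema is not None:
        # The schema is optional; send it as JSON only when one is provided.
        headers["Content-Type"] = "application/json"
        body = json.dumps(schema).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}:{port}/{index_name}", data=body, headers=headers, method="PUT"
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    request = build_create_index_request(
        "https://search.example.com", 443, "my-index", "admin", "example-password"
    )
    print(request.get_method(), request.full_url)
    # To send the request, call: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easier to check the URL and headers before anything reaches the instance.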

For the destination connector, the index does not need to contain a schema beforehand. If Unstructured encounters an index without a schema,
Unstructured will automatically create a compatible schema for you before inserting items into the index. Nonetheless,
to reduce possible schema compatibility issues, Unstructured recommends that you create a schema that is compatible with Unstructured's schema.
Unstructured cannot provide a schema that is guaranteed to work in all
circumstances. This is because these schemas will vary based on your source files' types; how you
want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

For objects in the `metadata` field that Unstructured produces and that you want to store in an OpenSearch index, you must create fields in your index's schema that
follow Unstructured's `metadata` field naming convention. For example, if Unstructured produces a `metadata` field with the following
child objects:

```json theme={null}
"metadata": {
  "is_extracted": "true",
  "coordinates": {
    "points": [
      [
        134.20055555555555,
        241.36027777777795
      ],
      [
        134.20055555555555,
        420.0269444444447
      ],
      [
        529.7005555555555,
        420.0269444444447
      ],
      [
        529.7005555555555,
        241.36027777777795
      ]
    ],
    "system": "PixelSpace",
    "layout_width": 1654,
    "layout_height": 2339
  },
  "filetype": "application/pdf",
  "languages": [
    "eng"
  ],
  "page_number": 1,
  "image_mime_type": "image/jpeg",
  "filename": "realestate.pdf",
  "data_source": {
    "url": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf",
    "record_locator": {
      "protocol": "file",
      "remote_file_path": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf"
    }
  },
  "entities": {
    "items": [
      {
        "entity": "HOME FOR FUTURE",
        "type": "ORGANIZATION"
      },
      {
        "entity": "221 Queen Street, Melbourne VIC 3000",
        "type": "LOCATION"
      }
    ],
    "relationships": [
      {
        "from": "HOME FOR FUTURE",
        "relationship": "based_in",
        "to": "221 Queen Street, Melbourne VIC 3000"
      }
    ]
  }
}
```
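
The index schema example on this page names nested `metadata` fields by joining the keys of nested objects with hyphens (for example, `data_source-record_locator-protocol`) and keeping arrays as a single field (for example, `entities-items`). Assuming that this is the naming convention, the following sketch lists the schema field names that a given `metadata` object would need. The function name `flatten_metadata_names` is hypothetical, and the rule is inferred from the example rather than taken from Unstructured's code.

```python
from typing import List


def flatten_metadata_names(metadata: dict, prefix: str = "") -> List[str]:
    """List schema field names for a nested metadata object.

    Keys of nested objects are joined with hyphens; any other value,
    including a list, is kept as a single field.
    """
    names = []
    for key, value in metadata.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            names.extend(flatten_metadata_names(value, name))
        else:
            names.append(name)
    return names


if __name__ == "__main__":
    # A shortened version of the metadata shown above.
    sample = {
        "filetype": "application/pdf",
        "coordinates": {"system": "PixelSpace", "layout_width": 1654},
        "data_source": {"record_locator": {"protocol": "file"}},
        "entities": {"items": [{"entity": "HOME FOR FUTURE", "type": "ORGANIZATION"}]},
    }
    for name in flatten_metadata_names(sample):
        print(name)
```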

You can adapt the following index schema example for your own needs. Note that outside of `metadata`, Unstructured expects the following fields
whenever you create your own index schema. All of them are required unless noted otherwise:

* `element_id`
* `record_id`, which Unstructured uses for intelligent record updates.
* `type`, which is optional but highly recommended.
* `text`
* `embeddings`, if embeddings are generated. Make sure to set `dimension` to the number of dimensions that the embedding model generates.

```json theme={null}
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "element_id": {
        "type": "keyword"
      },
      "record_id": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "embeddings": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "metadata": {
        "properties": {
          "is_extracted": {
            "type": "boolean"
          },
          "coordinates-points": {
            "type": "float"
          },
          "coordinates-system": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "coordinates-layout_width": {
            "type": "long"
          },
          "coordinates-layout_height": {
            "type": "long"
          },
          "filetype": {
            "type": "keyword"
          },
          "languages": {
            "type": "keyword"
          },
          "page_number": {
            "type": "integer"
          },
          "image_mime_type": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword"
          },
          "data_source-url": {
            "type": "keyword"
          },
          "data_source-record_locator-protocol": {
            "type": "keyword"
          },
          "data_source-record_locator-remote_file_path": {
            "type": "keyword"
          },
          "entities-items": {
            "properties": {
              "entity": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "type": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "entities-relationships": {
            "properties": {
              "from": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "relationship": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "to": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
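
Before you send data to an index, it can help to check a sample element against your schema. The following sketch checks for the fields that this page lists as required and compares the length of an element's `embeddings` with the schema's `dimension`. The helper `check_element` is hypothetical and is not part of any Unstructured library.

```python
from typing import List

# Fields that this page lists as required outside of `metadata`.
REQUIRED_FIELDS = ("element_id", "record_id", "text")


def check_element(element: dict, schema: dict) -> List[str]:
    """Return a list of problems found when comparing an element to an index schema."""
    problems = []
    properties = schema.get("mappings", {}).get("properties", {})
    for field in REQUIRED_FIELDS:
        if field not in properties:
            problems.append(f"schema is missing required field '{field}'")
        if field not in element:
            problems.append(f"element is missing required field '{field}'")
    embeddings = element.get("embeddings")
    if embeddings is not None:
        # The schema's dimension must match the embedding model's output size.
        dimension = properties.get("embeddings", {}).get("dimension")
        if dimension != len(embeddings):
            problems.append(
                f"embeddings has {len(embeddings)} dimensions, "
                f"but the schema declares {dimension}"
            )
    return problems


if __name__ == "__main__":
    # A deliberately tiny schema and element, for illustration only.
    schema = {
        "mappings": {
            "properties": {
                "element_id": {"type": "keyword"},
                "record_id": {"type": "text"},
                "text": {"type": "text"},
                "embeddings": {"type": "knn_vector", "dimension": 3},
            }
        }
    }
    element = {
        "element_id": "abc123",
        "record_id": "doc-1",
        "text": "Hello",
        "embeddings": [0.1, 0.2],
    }
    for problem in check_element(element, schema):
        print(problem)
```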

See also:

* [Create an index](https://opensearch.org/docs/latest/api-reference/index-apis/create-index/)
* [Mappings and field types](https://opensearch.org/docs/latest/field-types/)
* [Explicit mapping](https://opensearch.org/docs/latest/field-types/#explicit-mapping)
* [Dynamic mapping](https://opensearch.org/docs/latest/field-types/#dynamic-mapping)
* [Unstructured document elements and metadata](/concepts/document-elements)

### Set up master user authentication

<Note>
  If you are using Enterprise Connect on a [dedicated instance](/business/dedicated-instances/overview), you do not need **master user credentials**. Skip to the **Enterprise Connect** section on this page.
</Note>

For non-local OpenSearch instances, or if you're using basic authentication to a local OpenSearch instance, you will need the master user's name and password.

For local OpenSearch instances, if you're using certificates for authentication instead of basic authentication, you will need:

* The path to the Certificate Authority (CA) bundle, if you use intermediate CAs with your root CA.
* Either the path to the combined private key and certificate file, or the paths to the separate private key and certificate files.

To learn more, see [Authentication backends](https://opensearch.org/docs/latest/security/authentication-backends/authc-index/), [HTTP basic authentication](https://opensearch.org/docs/latest/security/authentication-backends/basic-authc/), and [Client certificate authentication](https://opensearch.org/docs/latest/security/authentication-backends/client-auth/).

Install the OpenSearch connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[opensearch]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

Set the following environment variables:

* `OPENSEARCH_HOST_URL` - The OpenSearch instance's host URL, which typically takes the form of `https://<host>:<port>`, and is represented by `--hosts` (CLI) or `hosts` (Python).
* `OPENSEARCH_INDEX_NAME` - The name of the search index, represented by `--index-name` (CLI) or `index_name` (Python).

If you're using basic authentication to the instance:

* `OPENSEARCH_USERNAME` - The user's name, represented by `--username` (CLI) or `username` (Python).
* `OPENSEARCH_PASSWORD` - The user's password, represented by `--password` (CLI) or `password` (Python).

If you're using certificates for authentication instead:

* `OPENSEARCH_CA_CERTS` - The path to the Certificate Authority (CA) bundle, if you use intermediate CAs with your root CA. This is represented by `--ca-certs` (CLI) or `ca_certs` (Python).
* `OPENSEARCH_CLIENT_CERT` - The path to the combined private key and certificate file, or the path to just the certificate file. This is represented by `--client-cert` (CLI) or `client_cert` (Python).
* `OPENSEARCH_CLIENT_KEY` - The path to the private key file, if `OPENSEARCH_CLIENT_CERT` refers to just the certificate file. This is represented by `--client-key` (CLI) or `client_key` (Python).

Additional related settings include:

* `--use-ssl` (CLI) or `use_ssl=True` (Python) to use SSL for the connection.
* `--verify-certs` (CLI) or `verify_certs=True` (Python) to verify SSL certificates.
* `--ssl-show-warn` (CLI) or `ssl_show_warn=True` (Python) to show a warning when verifying SSL certificates is disabled.
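
The mapping from environment variables to Python parameter names described above can be sketched as follows. The function name `opensearch_settings_from_env` is hypothetical, and the result is a plain dictionary rather than a connector configuration object. For basic authentication, the Python example later on this page shows which configuration class receives each parameter.

```python
import os
from typing import Mapping


def opensearch_settings_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Collect OpenSearch settings, keyed by the Python parameter names on this page."""
    settings = {
        "hosts": [env["OPENSEARCH_HOST_URL"]],
        "index_name": env["OPENSEARCH_INDEX_NAME"],
    }
    if "OPENSEARCH_USERNAME" in env:
        # Basic authentication.
        settings["username"] = env["OPENSEARCH_USERNAME"]
        settings["password"] = env["OPENSEARCH_PASSWORD"]
    else:
        # Certificate authentication: include only the paths that are set.
        for variable, parameter in (
            ("OPENSEARCH_CA_CERTS", "ca_certs"),
            ("OPENSEARCH_CLIENT_CERT", "client_cert"),
            ("OPENSEARCH_CLIENT_KEY", "client_key"),
        ):
            if variable in env:
                settings[parameter] = env[variable]
    return settings


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sample_env = {
        "OPENSEARCH_HOST_URL": "https://search.example.com:443",
        "OPENSEARCH_INDEX_NAME": "my-index",
        "OPENSEARCH_USERNAME": "admin",
        "OPENSEARCH_PASSWORD": "example-password",
    }
    print(opensearch_settings_from_env(sample_env))
```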

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector; this example uses the local source connector.

By default, this example sends files to Unstructured for processing. To process files locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --strategy hi_res \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    opensearch \
      --hosts $OPENSEARCH_HOST_URL \
      --username $OPENSEARCH_USERNAME \
      --password $OPENSEARCH_PASSWORD \
      --index-name $OPENSEARCH_INDEX_NAME
  ```

  ```python Python Ingest theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  from unstructured_ingest.processes.connectors.elasticsearch.opensearch import (
      OpenSearchConnectionConfig,
      OpenSearchAccessConfig,
      OpensearchUploadStagerConfig,
      OpensearchUploaderConfig
  )

  from unstructured_ingest.processes.connectors.local import (
      LocalIndexerConfig,
      LocalDownloaderConfig,
      LocalConnectionConfig
  )
  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  # Chunking and embedding are optional.

  if __name__ == "__main__":
      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
          downloader_config=LocalDownloaderConfig(),
          source_connection_config=LocalConnectionConfig(),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              strategy="hi_res",
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          destination_connection_config=OpenSearchConnectionConfig(
              access_config=OpenSearchAccessConfig(
                  password=os.getenv("OPENSEARCH_PASSWORD"),
                  use_ssl=True
              ),
              hosts=[os.getenv("OPENSEARCH_HOST_URL")],
              username=os.getenv("OPENSEARCH_USERNAME")
          ),
          stager_config=OpensearchUploadStagerConfig(
              index_name=os.getenv("OPENSEARCH_INDEX_NAME")
          ),
          uploader_config=OpensearchUploaderConfig(
              index_name=os.getenv("OPENSEARCH_INDEX_NAME")
          )
      ).run()
  ```
</CodeGroup>

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  You must also provide an Unstructured API key and API URL by adding the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>
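
The choice between local processing and processing by API can be captured in a small helper. The following is a sketch only: the function name `partitioner_kwargs` is hypothetical, and it returns a plain dictionary whose keys are the parameter names described above. It includes the API URL only when `UNSTRUCTURED_API_URL` is set.

```python
import os
from typing import Mapping


def partitioner_kwargs(process_locally: bool, env: Mapping[str, str] = os.environ) -> dict:
    """Return partitioner keyword arguments for local or API-based processing."""
    if process_locally:
        # Local processing needs neither an API key nor an API URL.
        return {"partition_by_api": False}
    kwargs = {"partition_by_api": True, "api_key": env["UNSTRUCTURED_API_KEY"]}
    if "UNSTRUCTURED_API_URL" in env:
        kwargs["partition_endpoint"] = env["UNSTRUCTURED_API_URL"]
    return kwargs


if __name__ == "__main__":
    # Hypothetical value, for illustration only.
    print(partitioner_kwargs(False))
    print(partitioner_kwargs(True, {"UNSTRUCTURED_API_KEY": "example-key"}))
```

In the Python example on this page, the result could be unpacked into the partitioner configuration, for example `PartitionerConfig(strategy="hi_res", **partitioner_kwargs(process_locally=True))`.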
