# Weaviate

Batch process all your records to store structured outputs in a Weaviate database.

The requirements are as follows.

* For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview): only [Weaviate Cloud](https://weaviate.io/developers/wcs) clusters are supported.
* For [Unstructured Ingest](/open-source/ingestion/overview): Weaviate Cloud clusters,
  [Weaviate installed locally](https://weaviate.io/developers/weaviate/quickstart/local),
  and [Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded) are supported.
* For Weaviate installed locally, you will need the name of the target collection on the local instance.
* For Embedded Weaviate, you will need the instance's connection URL and the name of the target collection on the instance.
* For Weaviate Cloud, you will need:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/FvhJDUZM078" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

  * A Weaviate database instance. The following information assumes that you have a Weaviate Cloud (WCD) account with a Weaviate database cluster in that account.
    [Create a WCD account](https://weaviate.io/developers/wcs/quickstart#create-a-wcd-account). [Create a database cluster](https://weaviate.io/developers/wcs/quickstart#create-a-weaviate-cluster). For other database options, [learn more](https://weaviate.io/developers/weaviate/installation).
  * The URL and API key for the database cluster. [Get the URL and API key](https://weaviate.io/developers/wcs/quickstart#explore-the-details-panel).
  * The name of the target collection in the database. [Create a collection](https://weaviate.io/developers/wcs/tools/collections-tool).

    An existing collection is not required. At runtime, the collection behavior is as follows:

    For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview):

    * If an existing collection name is specified, and Unstructured generates embeddings,
      but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
      You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
    * If a collection name is not specified, Unstructured creates a new collection in your Weaviate cluster. If Unstructured generates embeddings,
      the new collection's name will be `U<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`.
      If Unstructured does not generate embeddings, the new collection's name will be `U<short-workflow-id>`.

    For [Unstructured Ingest](/open-source/ingestion/overview):

    * If an existing collection name is specified, and Unstructured generates embeddings,
      but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
      You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
    * If a collection name is not specified, Unstructured creates a new collection in your Weaviate cluster. The new collection's name will be `Unstructuredautocreated`.

    If Unstructured creates a new collection and generates embeddings, you will not see an `embeddings` property in tools such as the Weaviate Cloud
    **Collections** user interface. To view the generated embeddings, you can run a Weaviate GraphQL query such as the following. In this query, replace `<collection-name>` with
    the name of the new collection, and replace `<property-name>` with the name of each additional available property that
    you want to return results for, such as `text`, `type`, `element_id`, `record_id`, and so on. The embeddings will be
    returned in the `vector` property.

    ```text  theme={null}
    {
      Get {
        <collection-name> {
          _additional {
            vector
          }
          <property-name>
          <property-name>
        }
      }
    }
    ```
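
If you need to generate this query for several collections or property sets, a small helper can assemble the query text. The following is a minimal sketch only: `build_vector_query` is a hypothetical helper name, `MyCollection` is a placeholder collection name, and the sketch only builds the query string. Sending the query to your Weaviate cluster is left to whichever client or console you already use.

```python
def build_vector_query(collection_name: str, property_names: list[str]) -> str:
    """Build a GraphQL Get query that returns each object's vector
    plus the requested properties (hypothetical helper)."""
    lines = [
        "{",
        "  Get {",
        f"    {collection_name} {{",
        "      _additional {",
        "        vector",
        "      }",
    ]
    # One line per additional property to return, such as text or type.
    lines.extend(f"      {name}" for name in property_names)
    lines.extend(["    }", "  }", "}"])
    return "\n".join(lines)


# Placeholder collection name; replace it with the name of your collection.
query = build_vector_query(
    "MyCollection", ["text", "type", "element_id", "record_id"]
)
print(query)
```

The printed text has the same shape as the query shown above, with the placeholders filled in.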

If [auto-schema](https://docs.weaviate.io/weaviate/config-refs/collections#auto-schema) is enabled in Weaviate (which it is by default),
Weaviate can infer missing properties and add them to the collection definition at run time. However, it is a Weaviate best practice to manually define as much
of the data schema in advance as possible, since manual definition gives you the most control.

The minimum viable schema for Unstructured includes only the `element_id` and `record_id` properties. The `text` and `type` properties should also be included, but they are technically optional.
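If you are using Unstructured to generate embeddings, you must also configure the collection so that Weaviate does not generate its own embeddings, for example by setting `vectorizer_config=None` as in the code example that follows.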

The following code example shows how to use the [weaviate-client](https://pypi.org/project/weaviate-client/) Python package to create a
collection in a Weaviate Cloud database cluster with this minimum viable schema, and to specify that Unstructured will generate the embeddings for this collection.
To connect to a locally hosted Weaviate instance instead, call [weaviate.connect\_to\_local](https://docs.weaviate.io/weaviate/connections/connect-local).
To connect to Embedded Weaviate instead, call [weaviate.connect\_to\_embedded](https://docs.weaviate.io/weaviate/connections/connect-embedded).

```python  theme={null}
import os
import weaviate
from weaviate.classes.init import Auth
import weaviate.classes.config as wvc

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(api_key=os.getenv("WEAVIATE_API_KEY")),
)

collection = client.collections.create(
    name="MyCollection",
    properties=[
        wvc.Property(name="element_id", data_type=wvc.DataType.UUID),
        wvc.Property(name="record_id", data_type=wvc.DataType.TEXT),
        wvc.Property(name="text", data_type=wvc.DataType.TEXT),
        wvc.Property(name="type", data_type=wvc.DataType.TEXT),
    ],
    vectorizer_config=None, # Unstructured will generate the embeddings instead of Weaviate.
)

client.close()
```
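
Before you create a collection, it can help to confirm that the property names you plan to define cover the minimum viable schema described earlier. The following is a minimal sketch that uses only the Python standard library. `check_planned_properties` is a hypothetical helper, not part of Unstructured or the `weaviate-client` package, and it checks property names only, not data types.

```python
# Property names taken from the minimum viable schema described on this page.
REQUIRED_PROPERTIES = {"element_id", "record_id"}
RECOMMENDED_PROPERTIES = {"text", "type"}


def check_planned_properties(property_names: list[str]) -> dict[str, list[str]]:
    """Report which required and recommended property names are missing
    from a planned collection definition (hypothetical helper)."""
    names = set(property_names)
    return {
        "missing_required": sorted(REQUIRED_PROPERTIES - names),
        "missing_recommended": sorted(RECOMMENDED_PROPERTIES - names),
    }


report = check_planned_properties(["element_id", "text"])
print(report)
# {'missing_required': ['record_id'], 'missing_recommended': ['type']}
```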

For objects in the `metadata` field that Unstructured produces and that you want to store in a Weaviate collection, be sure to follow
Unstructured's `metadata` field naming convention. For example, if Unstructured produces a `metadata` field with the following
child objects:

```json  theme={null}
"metadata": {
  "is_extracted": "true",
  "coordinates": {
    "points": [
      [
        134.20055555555555,
        241.36027777777795
      ],
      [
        134.20055555555555,
        420.0269444444447
      ],
      [
        529.7005555555555,
        420.0269444444447
      ],
      [
        529.7005555555555,
        241.36027777777795
      ]
    ],
    "system": "PixelSpace",
    "layout_width": 1654,
    "layout_height": 2339
  },
  "filetype": "application/pdf",
  "languages": [
    "eng"
  ],
  "page_number": 1,
  "image_mime_type": "image/jpeg",
  "filename": "realestate.pdf",
  "data_source": {
    "url": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf",
    "record_locator": {
      "protocol": "file",
      "remote_file_path": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf"
    }
  },
  "entities": {
    "items": [
      {
        "entity": "HOME FOR FUTURE",
        "type": "ORGANIZATION"
      },
      {
        "entity": "221 Queen Street, Melbourne VIC 3000",
        "type": "LOCATION"
      }
    ],
    "relationships": [
      {
        "from": "HOME FOR FUTURE",
        "relationship": "based_in",
        "to": "221 Queen Street, Melbourne VIC 3000"
      }
    ]
  }
}
```

You could create corresponding properties in your collection's schema by using the following property names and data types:

```python  theme={null}
import os
import weaviate
from weaviate.classes.init import Auth
import weaviate.classes.config as wvc

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(api_key=os.getenv("WEAVIATE_API_KEY")),
)

collection = client.collections.create(
    name="MyCollection",
    properties=[
        wvc.Property(name="element_id", data_type=wvc.DataType.UUID),
        wvc.Property(name="record_id", data_type=wvc.DataType.TEXT),
        wvc.Property(name="text", data_type=wvc.DataType.TEXT),
        wvc.Property(name="type", data_type=wvc.DataType.TEXT),
        wvc.Property(
            name="metadata",
            data_type=wvc.DataType.OBJECT,
            nested_properties=[
                wvc.Property(name="is_extracted", data_type=wvc.DataType.TEXT),
                wvc.Property(
                    name="coordinates",
                    data_type=wvc.DataType.OBJECT,
                    nested_properties=[
                        wvc.Property(name="points", data_type=wvc.DataType.TEXT),
                        wvc.Property(name="system", data_type=wvc.DataType.TEXT),
                        wvc.Property(name="layout_width", data_type=wvc.DataType.NUMBER),
                        wvc.Property(name="layout_height", data_type=wvc.DataType.NUMBER),
                    ],
                ),
                wvc.Property(name="filetype", data_type=wvc.DataType.TEXT),
                wvc.Property(name="languages", data_type=wvc.DataType.TEXT_ARRAY),
                wvc.Property(name="page_number", data_type=wvc.DataType.TEXT),
                wvc.Property(name="image_mime_type", data_type=wvc.DataType.TEXT),
                wvc.Property(name="filename", data_type=wvc.DataType.TEXT),
                wvc.Property(
                    name="data_source",
                    data_type=wvc.DataType.OBJECT,
                    nested_properties=[
                        wvc.Property(name="url", data_type=wvc.DataType.TEXT),
                        wvc.Property(name="record_locator", data_type=wvc.DataType.TEXT),
                    ],
                ),
                wvc.Property(
                    name="entities", 
                    data_type=wvc.DataType.OBJECT,
                    nested_properties=[
                        wvc.Property(
                            name="items", 
                            data_type=wvc.DataType.OBJECT_ARRAY,
                            nested_properties=[
                                wvc.Property(name="entity", data_type=wvc.DataType.TEXT),
                                wvc.Property(name="type", data_type=wvc.DataType.TEXT),
                            ],
                        ),
                        wvc.Property(
                            name="relationships", 
                            data_type=wvc.DataType.OBJECT_ARRAY,
                            nested_properties=[
                                wvc.Property(name="to", data_type=wvc.DataType.TEXT),
                                wvc.Property(name="from", data_type=wvc.DataType.TEXT),
                                wvc.Property(name="relationship", data_type=wvc.DataType.TEXT),
                            ],
                        ),
                    ],
                ),
            ],
        ),
    ],
    vectorizer_config=None, # Unstructured will generate the embeddings instead of Weaviate.
)

client.close()
```
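
Note that the schema above declares `points`, `record_locator`, and `page_number` as `TEXT`, although the sample JSON shows them as a nested array, an object, and a number. If you run custom post-processing code that writes to such a collection, the values need to match those declared types. The following is a minimal sketch of one way to do that with the standard library. `conform_metadata` is a hypothetical helper, and serializing with `json.dumps` is an assumption made for illustration; this page does not state how the connector itself converts these values.

```python
import json


def conform_metadata(metadata: dict) -> dict:
    """Return a copy of a metadata dict with selected values converted to
    strings so they fit properties declared as TEXT (hypothetical helper)."""
    conformed = dict(metadata)  # Shallow copy; the input is not modified.

    coordinates = conformed.get("coordinates")
    if isinstance(coordinates, dict) and "points" in coordinates:
        # Assumption: nested arrays are stored as a JSON string.
        conformed["coordinates"] = {
            **coordinates,
            "points": json.dumps(coordinates["points"]),
        }

    data_source = conformed.get("data_source")
    if isinstance(data_source, dict) and "record_locator" in data_source:
        # Assumption: the record locator object is stored as a JSON string.
        conformed["data_source"] = {
            **data_source,
            "record_locator": json.dumps(data_source["record_locator"]),
        }

    if "page_number" in conformed:
        conformed["page_number"] = str(conformed["page_number"])

    return conformed


sample = {
    "page_number": 1,
    "coordinates": {"points": [[134.2, 241.3], [134.2, 420.0]], "system": "PixelSpace"},
    "data_source": {"url": "file:///tmp/example.pdf", "record_locator": {"protocol": "file"}},
}
print(conform_metadata(sample))
```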

Unstructured cannot provide a schema that is guaranteed to work in all
circumstances. This is because these schemas will vary based on your source files' types; how you
want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

See also:

* [Collection schema](https://weaviate.io/developers/weaviate/config-refs/schema)
* [Unstructured document elements and metadata](/api-reference/legacy-api/partition/document-elements)

Install the Weaviate connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[weaviate]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

Set the following environment variables:

* For Weaviate installed locally, `WEAVIATE_COLLECTION` - The name of the target collection in the instance, represented by `--collection` (CLI) or `collection` (Python).

* For Embedded Weaviate:

  * `WEAVIATE_HOST` - The connection URL to the instance, represented by `--hostname` (CLI) or `hostname` (Python).
  * `WEAVIATE_COLLECTION` - The name of the target collection in the instance, represented by `--collection` (CLI) or `collection` (Python).
    If no value is provided, see the beginning of this article for the behavior at run time.

* For Weaviate Cloud:

  * `WEAVIATE_CLUSTER_URL` - The REST endpoint for the Weaviate database cluster, represented by `--cluster-url` (CLI) or `cluster_url` (Python).

  * `WEAVIATE_API_KEY` - The API key for the database cluster, represented by `--api-key` (CLI) or `api_key` (Python).

    <Note>For the CLI, the `--api-key` option here is part of the `weaviate-cloud` command. For Python, the `api_key` parameter here is part of the `CloudWeaviateAccessConfig` object.</Note>

  * `WEAVIATE_COLLECTION` - The name of the target collection in the database, represented by `--collection` (CLI) or `collection` (Python).
    If no value is provided, see the beginning of this article for the behavior at run time.
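
Before you run a pipeline, you might want to see which of these variables are unset in your shell. The following is a minimal sketch that uses only the standard library. The `local`, `embedded`, and `cloud` labels and the `unset_variables` helper are hypothetical names chosen for this illustration. Because `WEAVIATE_COLLECTION` can be left empty in some cases, the helper only reports unset variables and does not treat them as errors.

```python
import os

# Environment variables listed on this page, grouped by deployment type.
# The group labels are hypothetical and used only in this sketch.
DOCUMENTED_VARIABLES = {
    "local": ["WEAVIATE_COLLECTION"],
    "embedded": ["WEAVIATE_HOST", "WEAVIATE_COLLECTION"],
    "cloud": ["WEAVIATE_CLUSTER_URL", "WEAVIATE_API_KEY", "WEAVIATE_COLLECTION"],
}


def unset_variables(deployment: str, env=None) -> list[str]:
    """Return the documented variables that are missing or empty for the
    given deployment type (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in DOCUMENTED_VARIABLES[deployment] if not env.get(name)]


for name in unset_variables("cloud"):
    print(f"{name} is not set")
```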

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector; this example uses the local source connector.

By default, this example sends files to Unstructured for processing. To process files locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  # For Weaviate installed locally:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --strategy hi_res \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    weaviate-local \
      --collection $WEAVIATE_COLLECTION

  # For Embedded Weaviate:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --strategy hi_res \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    weaviate-embedded \
      --hostname $WEAVIATE_HOST \
      --collection $WEAVIATE_COLLECTION

  # For Weaviate Cloud:
  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --strategy hi_res \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    weaviate-cloud \
      --cluster-url $WEAVIATE_CLUSTER_URL \
      --api-key $WEAVIATE_API_KEY \
      --collection $WEAVIATE_COLLECTION
  ```

  ```python Python Ingest theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  from unstructured_ingest.processes.connectors.local import (
      LocalIndexerConfig,
      LocalDownloaderConfig,
      LocalConnectionConfig
  )

  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  # For Weaviate installed locally:
  # from unstructured_ingest.processes.connectors.weaviate.local import (
  #     LocalWeaviateConnectionConfig,
  #     LocalWeaviateAccessConfig,
  #     LocalWeaviateUploadStagerConfig,
  #     LocalWeaviateUploaderConfig
  # )

  # For Embedded Weaviate:
  # from unstructured_ingest.processes.connectors.weaviate.embedded import (
  #     EmbeddedWeaviateConnectionConfig,
  #     EmbeddedWeaviateAccessConfig,
  #     EmbeddedWeaviateUploadStagerConfig,
  #     EmbeddedWeaviateUploaderConfig
  # )

  # For Weaviate Cloud:
  from unstructured_ingest.processes.connectors.weaviate.cloud import (
      CloudWeaviateConnectionConfig,
      CloudWeaviateAccessConfig,
      CloudWeaviateUploaderConfig,
      CloudWeaviateUploadStagerConfig
  )

  if __name__ == "__main__":
      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
          downloader_config=LocalDownloaderConfig(),
          source_connection_config=LocalConnectionConfig(),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(
              embedding_provider="openai",
              embedding_model_name=os.getenv("EMBEDDING_MODEL_NAME"),
              embedding_api_key=os.getenv("OPENAI_APIKEY")
          ),

          # For Weaviate installed locally:
          # destination_connection_config=LocalWeaviateConnectionConfig(
          #     access_config=LocalWeaviateAccessConfig()
          # ),
          # stager_config=LocalWeaviateUploadStagerConfig(),
          # uploader_config=LocalWeaviateUploaderConfig(
          #     collection=os.getenv("WEAVIATE_COLLECTION")
          # )

          # For Embedded Weaviate:
          # destination_connection_config=EmbeddedWeaviateConnectionConfig(
          #     access_config=EmbeddedWeaviateAccessConfig(),
          #     hostname=os.getenv("WEAVIATE_HOST")
          # ),
          # stager_config=EmbeddedWeaviateUploadStagerConfig(),
          # uploader_config=EmbeddedWeaviateUploaderConfig(
          #     collection=os.getenv("WEAVIATE_COLLECTION")
          # )

          # For Weaviate Cloud:
          destination_connection_config=CloudWeaviateConnectionConfig(
              access_config=CloudWeaviateAccessConfig(
                  api_key=os.getenv("WEAVIATE_API_KEY")
              ),
              cluster_url=os.getenv("WEAVIATE_CLUSTER_URL")
          ),
          stager_config=CloudWeaviateUploadStagerConfig(),
          uploader_config=CloudWeaviateUploaderConfig(
              collection=os.getenv("WEAVIATE_COLLECTION")
          )
      ).run()
  ```
</CodeGroup>

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  You must also provide an Unstructured API key and API URL by adding the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>
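
The choice between local processing and API processing can also be made in one place in your own script. The following is a minimal sketch that builds the keyword arguments for `PartitionerConfig` from the parameters described above. `partitioner_kwargs` is a hypothetical helper, and the sketch assumes that `partition_endpoint` can be omitted when you rely on the default API URL.

```python
import os


def partitioner_kwargs(use_api: bool, env=None) -> dict:
    """Build keyword arguments for PartitionerConfig for either local or
    API-based processing (hypothetical helper)."""
    env = os.environ if env is None else env

    if not use_api:
        # Local processing needs no API key or API URL.
        return {"partition_by_api": False}

    kwargs = {
        "partition_by_api": True,
        "api_key": env.get("UNSTRUCTURED_API_KEY"),
    }
    # Pass the API URL only if one is set; otherwise the default applies.
    api_url = env.get("UNSTRUCTURED_API_URL")
    if api_url:
        kwargs["partition_endpoint"] = api_url
    return kwargs


print(partitioner_kwargs(use_api=False))
# {'partition_by_api': False}
```

You could then pass the result to the pipeline as `PartitionerConfig(**partitioner_kwargs(use_api=True))`.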
