Set embedding behavior
The following information applies only to the Unstructured Ingest CLI and the Unstructured Ingest Python library.
For the Unstructured open-source library, see Embedding instead.
The Unstructured SDKs for Python and JavaScript/TypeScript do not support this functionality.
Concepts
You can use the Unstructured Ingest CLI or the Unstructured Ingest Python library to generate embeddings after the partitioning and chunking steps in an ingest pipeline. The chunking step is particularly important to ensure that the text pieces (also known as the documents or elements) can fit the input limits of an embedding model.
You generate embeddings by specifying an embedding model that is provided or used by an embedding provider. An embedding model creates lists of numbers known as vectors, with each number representing a specific feature, attribute, or relationship of the data. (In this case, this data is the text that is extracted by Unstructured.) These vectors are stored or embedded next to the data itself.
These vector embeddings allow large language models (LLMs) and vector databases to analyze and process the properties of, and relationships between, pieces of data more quickly and efficiently. For example, you can save the extracted text along with its embeddings in a vector store. When a user queries a retrieval-augmented generation (RAG) application, the application can use a vector database to perform a similarity search in that vector store and then return the documents whose embeddings are closest to that user’s query.
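To make the similarity search concrete, here is a minimal sketch in plain Python. The three-dimensional vectors, the sample texts, and the `embeddings` key are made up for illustration. Real embedding models produce vectors with hundreds or thousands of dimensions, and a vector database would normally do this ranking for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest(query: list[float], documents: list[dict], top_k: int = 1) -> list[dict]:
    """Rank documents by similarity to the query vector, most similar first."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query, doc["embeddings"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: each document keeps its text next to its (made-up) embedding.
documents = [
    {"text": "Invoices are due in 30 days.", "embeddings": [0.9, 0.1, 0.0]},
    {"text": "The office is closed on Fridays.", "embeddings": [0.1, 0.8, 0.3]},
    {"text": "Late payments incur a 2% fee.", "embeddings": [0.8, 0.2, 0.1]},
]

# A made-up query embedding that points in roughly the "payments" direction.
query_embedding = [1.0, 0.0, 0.0]
results = closest(query_embedding, documents, top_k=2)
for doc in results:
    print(doc["text"])
```

With this toy data, the two payment-related texts rank above the unrelated one because their vectors point in nearly the same direction as the query vector.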
Learn more about chunking and embedding.
Generate embeddings
To use the Ingest CLI or Ingest Python library to generate embeddings, do the following:
- Choose an embedding provider that you want to use from among the following allowed providers, and note the provider’s ID:
  - langchain-aws-bedrock for Amazon Bedrock. Learn more.
  - langchain-huggingface for Hugging Face. Learn more.
  - langchain-openai for OpenAI. Learn more.
  - langchain-vertexai for Google Vertex AI PaLM. Learn more.
  - langchain-voyageai for Voyage AI. Learn more.
  - octoai for Octo AI. Learn more.
- Run the following command to install the required Python package for the embedding provider:
  - For langchain-aws-bedrock, run pip install "unstructured-ingest[bedrock]".
  - For langchain-huggingface, run pip install "unstructured-ingest[embed-huggingface]".
  - For langchain-openai, run pip install "unstructured-ingest[openai]".
  - For langchain-vertexai, run pip install "unstructured-ingest[embed-vertexai]".
  - For langchain-voyageai, run pip install "unstructured-ingest[embed-voyageai]".
  - For octoai, run pip install "unstructured-ingest[embed-octoai]".
- For the following embedding providers, you can choose the model that you want to use. If you do choose a model, note the model’s name:
  - langchain-huggingface. Choose a model, or use the default model sentence-transformers/all-MiniLM-L6-v2.
  - langchain-openai. Choose a model, or use the default model text-embedding-ada-002.
  - langchain-vertexai. Choose a model, or use the default model textembedding-gecko@001.
  - langchain-voyageai. Choose a model. No default model is provided.
  - octoai. Choose a model, or use the default model thenlper/gte-large.
- Note the special settings to connect to the provider:
  - For langchain-aws-bedrock, you’ll need an AWS access key value, the corresponding AWS secret access key value, and the corresponding AWS Region identifier. Get an AWS access key and secret access key.
  - For langchain-huggingface, if you use a gated model (a model with special conditions that you must accept before you can use it, or a privately published model), you’ll need an HF inference API key value, beginning with hf_. Get an HF inference API key. To learn whether your model requires an HF inference API key, see your model provider’s documentation.
  - For langchain-openai, you’ll need an OpenAI API key value. Get an OpenAI API key.
  - For langchain-vertexai, you’ll need the path to a Google Cloud credentials JSON file. Learn more here and here.
  - For langchain-voyageai, you’ll need a Voyage AI API key value. Get a Voyage AI API key.
  - For octoai, you’ll need an Octo AI API token value. Get an Octo AI API token.
- Now, apply all of this information to your Ingest CLI command or your Ingest Python library code, and then run it.
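The choices from the steps above can be collected in one place before you write the command or code. The sketch below is plain Python and does not call the Ingest library. The provider IDs, install extras, and default models come from this page. The `PROVIDERS` table and the helper names are hypothetical, and the `--embedding-provider`, `--embedding-model-name`, and `--embedding-api-key` flag names are assumptions that this page does not confirm, so check them against the Ingest CLI’s help output for your installed version. Amazon Bedrock and Google Vertex AI use other connection settings (AWS keys and Region, or a credentials file path) that this sketch does not cover.

```python
from typing import Optional

# Provider ID -> (pip extra, default model). Values are taken from the steps
# above. None means no default model is listed for that provider.
PROVIDERS = {
    "langchain-aws-bedrock": ("bedrock", None),
    "langchain-huggingface": ("embed-huggingface", "sentence-transformers/all-MiniLM-L6-v2"),
    "langchain-openai": ("openai", "text-embedding-ada-002"),
    "langchain-vertexai": ("embed-vertexai", "textembedding-gecko@001"),
    "langchain-voyageai": ("embed-voyageai", None),
    "octoai": ("embed-octoai", "thenlper/gte-large"),
}


def pip_install_command(provider: str) -> str:
    """Return the install command listed on this page for a provider ID."""
    extra, _default_model = PROVIDERS[provider]
    return f'pip install "unstructured-ingest[{extra}]"'


def embedding_cli_args(
    provider: str,
    model_name: Optional[str] = None,
    api_key: Optional[str] = None,
) -> list[str]:
    """Build embedding-related CLI arguments.

    The flag names are ASSUMED, not confirmed by this page.
    """
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown embedding provider: {provider}")
    if provider == "langchain-voyageai" and not model_name:
        # This page states that Voyage AI has no default model.
        raise ValueError("langchain-voyageai has no default model; choose one")

    args = ["--embedding-provider", provider]
    if model_name:
        # When omitted, the provider's default model (if any) applies.
        args += ["--embedding-model-name", model_name]
    if api_key:
        args += ["--embedding-api-key", api_key]
    return args


print(pip_install_command("langchain-openai"))
print(embedding_cli_args("langchain-openai", api_key="<your-openai-api-key>"))
```

Keeping the table separate from the argument builder means a new provider or a changed default only needs a one-line update. In real use, read the API key from an environment variable rather than writing it into a script.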