After partitioning, chunking, and summarizing, the embedding step creates arrays of numbers, known as vectors, that represent the text extracted by Unstructured.
These vectors are stored, or embedded, next to the text itself. The vector embeddings are generated by an embedding model that is provided by an embedding provider.
You typically save these embeddings in a vector store.
When a user queries a retrieval-augmented generation (RAG) application, the application can use a vector database to perform a similarity search in that vector store and then return the items whose embeddings are closest to the embedding of that user’s query.
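To make the similarity search concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the distance measure and uses an in-memory list in place of a vector store; the function names `cosine_similarity` and `top_k` and the three-dimensional toy vectors are made up for illustration and are not part of Unstructured or any vector database.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_embedding, items, k=2):
    """Rank stored items by similarity to the query and keep the best k.

    `items` is a list of (text, embedding) pairs standing in for a vector store.
    """
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors, invented for illustration. Real embeddings have
# hundreds or thousands of dimensions.
store = [
    ("We the People of the United States", [0.9, 0.1, 0.0]),
    ("Article I: Legislative powers", [0.1, 0.9, 0.1]),
    ("Recipe for apple pie", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0]

for score, text in top_k(query, store):
    print(f"{score:.3f}  {text}")
```

A production vector database typically replaces the linear scan with an approximate nearest-neighbor index, but the ranking idea is the same.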
Here is an example of a document element generated by Unstructured, along with its vector embeddings generated by the embedding model sentence-transformers/all-MiniLM-L6-v2 on Hugging Face:
{
    "type": "Title",
    "element_id": "fdbf5369-4485-453b-9701-1bb42c83b00b",
    "text": "THE CONSTITUTION of the United States",
    "metadata": {
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "filename": "constitution.pdf",
        "data_source": {
            "record_locator": {
                "path": "/input/constitution.pdf"
            },
            "date_created": "1723069423.0536132",
            "date_modified": "1723069423.055078",
            "date_processed": "1725666244.571788",
            "permissions_data": [
                {
                    "mode": 33188
                }
            ]
        }
    },
    "embeddings": [
        -0.06138836592435837,
        0.08634615689516068,
        -0.019471267238259315,
        "<full-results-omitted-for-brevity>",
        0.0895417109131813,
        0.05604064092040062,
        0.01376157347112894
    ]
}
Learn more.
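As a rough sketch of how an application might consume such an element before writing it to a vector store, the following Python reads the JSON and pulls out the fields a store usually needs. The `to_record` helper and the shape of the record it returns are hypothetical, and the three-value `embeddings` list is a shortened stand-in for the full vector.

```python
import json

# A shortened copy of the element shown above. The "embeddings" list is cut to
# three values for illustration; a real element carries the full vector.
element_json = """
{
  "type": "Title",
  "element_id": "fdbf5369-4485-453b-9701-1bb42c83b00b",
  "text": "THE CONSTITUTION of the United States",
  "metadata": {"filename": "constitution.pdf", "page_number": 1},
  "embeddings": [-0.06138836592435837, 0.08634615689516068, -0.019471267238259315]
}
"""


def to_record(element):
    """Build a hypothetical vector-store record from one element dictionary."""
    vector = element["embeddings"]
    # Reject placeholders or other non-numeric entries before storing the vector.
    if not all(isinstance(value, (int, float)) for value in vector):
        raise ValueError("embeddings must contain only numbers")
    return {
        "id": element["element_id"],
        "text": element["text"],
        "vector": [float(value) for value in vector],
        "source": element["metadata"].get("filename"),
    }


record = to_record(json.loads(element_json))
print(record["id"], record["source"], len(record["vector"]))
```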
Generate embeddings
To generate embeddings, choose an embedding provider and model in the Select Embedding Model section of an Embedder node in a workflow. You can change a workflow’s preconfigured provider only through Custom workflow settings. The following providers and models are available:
- Azure OpenAI: Use Azure OpenAI to generate embeddings with one of the following models:
  - text-embedding-3-small, with 1536 dimensions.
  - text-embedding-3-large, with 3072 dimensions.
  - Ada 002 (Text) (text-embedding-ada-002), with 1536 dimensions.
  Learn more.
- Amazon Bedrock: Use Amazon Bedrock to generate embeddings with one of the following models:
  - Titan Text Embeddings V2, with 1024 dimensions. Learn more.
  - Titan Embeddings G1 - Text, with 1536 dimensions. Learn more.
  - Titan Multimodal Embeddings G1, with 1024 dimensions. Learn more.
  - Cohere Embed English, with 1024 dimensions. Learn more.
  - Cohere Embed Multilingual, with 1024 dimensions. Learn more.
- TogetherAI: Use TogetherAI to generate embeddings with one of the following models:
  - M2-BERT-80M-32K-Retrieval, with 768 dimensions.
  Learn more.
- Voyage AI: Use Voyage AI to generate embeddings with one of the following models:
  - voyage-code-2, with 1536 dimensions.
  - voyage-3, with 1024 dimensions.
  - voyage-3-large, with 1024 dimensions.
  - voyage-3-lite, with 512 dimensions.
  - voyage-code-3, with 1024 dimensions.
  - voyage-finance-2, with 1024 dimensions.
  - voyage-law-2, with 1024 dimensions.
  - voyage-multimodal-3, with 1024 dimensions.
  Learn more.
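The dimension counts matter downstream because a vector store index is generally created for one fixed vector length. As an illustrative sketch, the list above can be captured as a lookup table and used to sanity-check an embedding before it is written. The dictionary layout and the helpers `expected_dimensions` and `has_expected_length` are assumptions made for this example, not part of Unstructured.

```python
# Dimensions are copied from the provider list above. The dictionary layout
# and both helper functions are illustrative, not part of Unstructured.
EMBEDDING_DIMENSIONS = {
    "Azure OpenAI": {
        "text-embedding-3-small": 1536,
        "text-embedding-3-large": 3072,
        "text-embedding-ada-002": 1536,
    },
    "Amazon Bedrock": {
        "Titan Text Embeddings V2": 1024,
        "Titan Embeddings G1 - Text": 1536,
        "Titan Multimodal Embeddings G1": 1024,
        "Cohere Embed English": 1024,
        "Cohere Embed Multilingual": 1024,
    },
    "TogetherAI": {
        "M2-BERT-80M-32K-Retrieval": 768,
    },
    "Voyage AI": {
        "voyage-code-2": 1536,
        "voyage-3": 1024,
        "voyage-3-large": 1024,
        "voyage-3-lite": 512,
        "voyage-code-3": 1024,
        "voyage-finance-2": 1024,
        "voyage-law-2": 1024,
        "voyage-multimodal-3": 1024,
    },
}


def expected_dimensions(provider, model):
    """Look up the vector length listed for a provider and model."""
    try:
        return EMBEDDING_DIMENSIONS[provider][model]
    except KeyError:
        raise ValueError(f"unknown provider/model: {provider!r} / {model!r}") from None


def has_expected_length(provider, model, embedding):
    """Return True if the embedding has the length listed for that model."""
    return len(embedding) == expected_dimensions(provider, model)


print(expected_dimensions("Voyage AI", "voyage-3-lite"))
print(has_expected_length("TogetherAI", "M2-BERT-80M-32K-Retrieval", [0.0] * 768))
```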