Embedding
After partitioning, chunking, and summarizing, the embedding step generates a vector (an array of numbers) for each piece of text that Unstructured extracts. Each vector is stored alongside the text that it represents. These vectors, also called vector embeddings, are generated by an embedding model, which is made available by an embedding provider.
You typically save these embeddings in a vector store. When a user queries a retrieval-augmented generation (RAG) application, the application can run a similarity search against that vector store and then return the items whose embeddings are closest to the embedding of that user’s query.
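As a rough illustration of the similarity search described above, the following sketch ranks stored items by cosine similarity to a query vector. It is a minimal pure-Python sketch, not how any particular vector store works: the function names `cosine_similarity` and `top_k`, the tiny 3-dimensional vectors, and the "Article I" and "Section 1" entries are all made up for this example. Real embeddings have hundreds or thousands of dimensions, and the query vector would come from the same embedding model that produced the stored vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_embedding, store, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, item["embeddings"]), item["text"])
        for item in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy store: texts paired with made-up 3-dimensional vectors.
store = [
    {"text": "THE CONSTITUTION of the United States", "embeddings": [0.9, 0.1, 0.0]},
    {"text": "Article I", "embeddings": [0.1, 0.9, 0.0]},
    {"text": "Section 1", "embeddings": [0.0, 0.2, 0.9]},
]

# Made-up embedding of a user's query.
query_embedding = [0.8, 0.2, 0.1]

results = top_k(query_embedding, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A production vector store typically replaces this linear scan with an approximate nearest-neighbor index, but the ranking idea is the same.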
Here is an example of a document element generated by Unstructured, along with its vector embeddings generated by the embedding model sentence-transformers/all-MiniLM-L6-v2 on Hugging Face:
{
  "type": "Title",
  "element_id": "fdbf5369-4485-453b-9701-1bb42c83b00b",
  "text": "THE CONSTITUTION of the United States",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "filename": "constitution.pdf",
    "data_source": {
      "record_locator": {
        "path": "/input/constitution.pdf"
      },
      "date_created": "1723069423.0536132",
      "date_modified": "1723069423.055078",
      "date_processed": "1725666244.571788",
      "permissions_data": [
        {
          "mode": 33188
        }
      ]
    }
  },
  "embeddings": [
    -0.06138836592435837,
    0.08634615689516068,
    -0.019471267238259315,
    "<full-results-omitted-for-brevity>",
    0.0895417109131813,
    0.05604064092040062,
    0.01376157347112894
  ]
}
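To show how a downstream program might read such an element, here is a small sketch that parses the JSON and pulls out the vector. It assumes a cut-down copy of the element above that keeps only the six values shown (the omitted values are not reconstructed), and `read_embedding` is a hypothetical helper name, not part of Unstructured. Note that the `<full-results-omitted-for-brevity>` string in the example above is a documentation placeholder; the helper rejects any non-numeric value like it.

```python
import json

# Cut-down copy of the element above: only the six embedding values
# that the example actually shows are included here.
element_json = """
{
  "type": "Title",
  "element_id": "fdbf5369-4485-453b-9701-1bb42c83b00b",
  "text": "THE CONSTITUTION of the United States",
  "embeddings": [
    -0.06138836592435837, 0.08634615689516068, -0.019471267238259315,
    0.0895417109131813, 0.05604064092040062, 0.01376157347112894
  ]
}
"""

def read_embedding(element):
    """Return the element's vector as a list of floats, or raise if malformed."""
    vector = element.get("embeddings")
    if not isinstance(vector, list) or not vector:
        raise ValueError("element has no embeddings")
    for value in vector:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric embedding value: {value!r}")
    return [float(value) for value in vector]

element = json.loads(element_json)
vector = read_embedding(element)
print(element["type"], repr(element["text"]), "values read:", len(vector))
```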
Generate embeddings
To generate embeddings, choose one of the following embedding providers in the Vendor drop-down list in the Embed section of a workflow:
- OpenAI: Use OpenAI to generate embeddings. Also choose one of the following embedding models:
  - text-embedding-3-small (1536 dimensions)
  - text-embedding-3-large (3072 dimensions)
  - Ada 002 (Text) (1536 dimensions)
- Anthropic: Use Anthropic to generate embeddings. Also choose one of the following embedding models:
  - voyage-2 (1024 dimensions)
  - voyage-large-2 (1536 dimensions)
  - voyage-code-2 (1536 dimensions)
  - voyage-lite-02-instruct (1024 dimensions)
- Hugging Face: Use Hugging Face to generate embeddings. Also choose one of the following embedding models:
  - nvidia/NV-Embed-v1 (4096 dimensions)
  - voyage-large-2-instruct (1024 dimensions)
  - stella_en_400M_v5 (1024 dimensions)
  - stella_en_1.5B_v5 (1024 dimensions)
  - Alibaba-NLP/gte-Qwen2-7B-instruct (3584 dimensions)
- OctoAI: Use OctoAI to generate embeddings. Also choose the following embedding model:
  - GTE Large (1024 dimensions)
- Vertex AI: Use Vertex AI to generate embeddings. Also choose one of the following embedding models:
  - textembedding-gecko@003 (768 dimensions)
  - text-embedding-004 (768 dimensions)
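The dimension counts listed above matter in practice because a vector store index is generally created for one fixed vector length, so vectors from a model with a different dimension count will not fit it. The following sketch records the dimensions from the list above in a lookup table and checks a vector against the model that supposedly produced it. The names `EMBEDDING_DIMENSIONS` and `check_dimensions` are hypothetical, and the keys are simply the model labels as they appear in the list, not identifiers accepted by any provider's API.

```python
# Dimension counts copied from the provider list above.
# Keys are the labels shown in the list, used here only for lookup.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "Ada 002 (Text)": 1536,
    "voyage-2": 1024,
    "voyage-large-2": 1536,
    "voyage-code-2": 1536,
    "voyage-lite-02-instruct": 1024,
    "nvidia/NV-Embed-v1": 4096,
    "voyage-large-2-instruct": 1024,
    "stella_en_400M_v5": 1024,
    "stella_en_1.5B_v5": 1024,
    "Alibaba-NLP/gte-Qwen2-7B-instruct": 3584,
    "GTE Large": 1024,
    "textembedding-gecko@003": 768,
    "text-embedding-004": 768,
}

def check_dimensions(model, vector):
    """Return True if the vector's length matches the model's listed dimensions."""
    if model not in EMBEDDING_DIMENSIONS:
        raise KeyError(f"unknown embedding model: {model}")
    return len(vector) == EMBEDDING_DIMENSIONS[model]

# Placeholder vectors of zeros, used only to exercise the length check.
print(check_dimensions("text-embedding-3-small", [0.0] * 1536))
print(check_dimensions("text-embedding-004", [0.0] * 1536))
```

A check like this is most useful right before writing to a vector store, where a length mismatch would otherwise surface as an insert error.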