Using Unstructured open source
Embedding
The Unstructured open-source library does not offer built-in support for calling embedding providers to obtain embeddings for pieces of text.
However, the Unstructured Ingest CLI and the Unstructured Ingest Python library do offer built-in support for calling embedding providers as part of an ingest pipeline; see the Unstructured Ingest documentation to learn how.
Also, you can use common third-party tools and libraries to get embeddings for document elements’ text within JSON files that are produced by calling the Unstructured open-source library. For example, the following sample Python script:
- Takes an Unstructured open-source library-generated JSON file as input.
- Reads in the JSON file’s contents as a JSON object.
- Uses the `sentence-transformers/all-MiniLM-L6-v2` model on Hugging Face to generate embeddings for each `text` field of each document element in the JSON file.
- Adds the generated embeddings next to each corresponding `text` field in the original JSON.
- Saves the results back to the original JSON file.
Python
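A minimal sketch of a script that follows the steps listed above. It assumes that the input JSON file is a list of element objects, each with an optional `text` field, and that the `sentence-transformers` package is installed. The function names (`add_embeddings`, `embed_file`, `load_minilm_encoder`) and the `embeddings` key are illustrative choices, not part of the Unstructured open-source library.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence

# Any callable that maps a list of strings to one vector per string.
Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def add_embeddings(elements: list, embed_texts: Encoder, key: str = "embeddings") -> list:
    """Attach one embedding to each element that has a non-empty text field.

    The key name "embeddings" is an assumption made for this sketch.
    """
    # Remember which elements carry text so vectors can be matched back to them.
    indexed = [(i, el["text"]) for i, el in enumerate(elements) if el.get("text")]
    if not indexed:
        return elements
    # Embed all texts in one batch call.
    vectors = embed_texts([text for _, text in indexed])
    for (i, _), vector in zip(indexed, vectors):
        # Convert to plain floats so the result is JSON-serializable.
        elements[i][key] = [float(x) for x in vector]
    return elements


def embed_file(path: str, embed_texts: Encoder) -> None:
    """Read the JSON file, add embeddings, and write the result back in place."""
    file_path = Path(path)
    elements = json.loads(file_path.read_text(encoding="utf-8"))
    add_embeddings(elements, embed_texts)
    file_path.write_text(json.dumps(elements, indent=2), encoding="utf-8")


def load_minilm_encoder() -> Encoder:
    """Build an encoder backed by sentence-transformers/all-MiniLM-L6-v2."""
    # Imported here so the helpers above work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # encode() returns a NumPy array by default; tolist() yields nested lists.
    return lambda texts: model.encode(texts).tolist()
```

To process a file, call `embed_file("output.json", load_minilm_encoder())`, where `output.json` is a hypothetical path to a JSON file generated by the Unstructured open-source library. Passing the encoder in as a callable keeps the file handling independent of the embedding provider, so a different model or provider can be swapped in without changing the other functions.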
Additional resources
For information about how to use Python scripts to call various embedding providers, see for example: