Embedding

The Unstructured open-source library does not offer built-in support for calling embedding providers to obtain embeddings for pieces of text. However, the Unstructured Ingest CLI and the Unstructured Ingest Python library do offer built-in support for calling embedding providers as part of an ingest pipeline. Learn how. You can also use common third-party tools and libraries to generate embeddings for the text of document elements in JSON files produced by the Unstructured open-source library. For example, the following sample Python script:

Takes an Unstructured open-source library-generated JSON file as input.
Reads in the JSON file’s contents as a JSON object.
Uses the sentence-transformers/all-MiniLM-L6-v2 model on Hugging Face to generate embeddings for the text field of each document element in the JSON file.
Adds the generated embeddings next to each corresponding text field in the original JSON.
Saves the results back to the original JSON file.

Python

# Filename: embeddings.py
# pip install langchain-huggingface sentence-transformers

import sys
import json
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

if __name__ == "__main__":
    # Get the JSON file's path first, so that a missing argument fails
    # before the model is loaded.
    if len(sys.argv) < 2:
        print("Error: Specify the path to the input JSON file.")
        print("For example, 'python embeddings.py myfile.json'")
        sys.exit(1)

    file_path = sys.argv[1]

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    try:
        # Get the JSON file's contents.
        with open(file_path, 'r', encoding='utf-8') as file:
            file_elements = json.load(file)

        # Process each element in the JSON file.
        for element in file_elements:
            # Get the element's "text" field, skipping elements without text.
            text = element.get("text")
            if not text:
                continue
            # Generate the embeddings for that "text" field.
            query_result = embeddings.embed_query(text)
            # Add the embeddings to that element as an "embeddings" field.
            element["embeddings"] = query_result

        # Save the updated JSON back into the original file.
        with open(file_path, 'w', encoding='utf-8') as file:
            json.dump(file_elements, file, indent=2)

        print(f"Done! Updated JSON saved to '{file_path}'.")

    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        sys.exit(1)
    except json.JSONDecodeError:
        print(f"Error: File '{file_path}' does not contain valid JSON.")
        sys.exit(1)
    except IOError:
        print(f"Error: Unable to access file '{file_path}'.")
        sys.exit(1)
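The script above embeds one element at a time. For larger files, it may be worth sending texts to the model in batches instead. The following is a minimal sketch of that idea, not part of the Unstructured library: the names add_embeddings_in_batches and fake_embed_documents are hypothetical, and the embedder is passed in as a plain function that takes a list of strings and returns one vector per string. The stand-in embedder here exists only so that the sketch runs without downloading a model.

```python
# Sketch only: add_embeddings_in_batches and fake_embed_documents are
# hypothetical names, not part of Unstructured or LangChain.
from typing import Any, Callable, Dict, List

# Assumed embedder shape: a list of texts in, one vector per text out.
EmbedFn = Callable[[List[str]], List[List[float]]]


def add_embeddings_in_batches(
    elements: List[Dict[str, Any]],
    embed_documents: EmbedFn,
    batch_size: int = 32,
) -> int:
    """Add an "embeddings" field to every element that has non-empty text.

    Returns the number of elements that were embedded.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Keep only the elements that actually carry text.
    with_text = [e for e in elements if e.get("text")]

    for start in range(0, len(with_text), batch_size):
        batch = with_text[start:start + batch_size]
        vectors = embed_documents([e["text"] for e in batch])
        # Attach each vector to the element it was generated from.
        for element, vector in zip(batch, vectors):
            element["embeddings"] = vector

    return len(with_text)


def fake_embed_documents(texts: List[str]) -> List[List[float]]:
    # Stand-in embedder for illustration: one 2-number vector per text.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


if __name__ == "__main__":
    sample = [
        {"type": "Title", "text": "Hello world"},
        {"type": "PageBreak", "text": ""},
        {"type": "NarrativeText", "text": "Embeddings map text to vectors."},
    ]
    count = add_embeddings_in_batches(sample, fake_embed_documents, batch_size=2)
    print(count, sample[0]["embeddings"])
```

To try this with the real model, you could pass the embed_documents method of the HuggingFaceEmbeddings object from the script above in place of fake_embed_documents. LangChain embedding classes expose that method with the same list-in, list-out shape. Elements without text are left unchanged, and the batch size of 32 is an arbitrary starting point rather than a recommendation.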

Additional resources

For information about how to use Python scripts to call various embedding providers, see for example:
