Extract images and tables from documents

Task

You want to get, decode, and show elements, such as images and tables, that are embedded in a PDF document.

Approach

Extract the Base64-encoded representation of specific elements, such as images and tables, from the document. Then decode each representation back into its original visual form and show it.
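To make the decoding step concrete, here is a minimal sketch of a Base64 round trip that uses only the Python standard library. The byte string is a hypothetical stand-in for an extracted image, not real output from Unstructured.

```python
import base64

# Hypothetical stand-in for the raw bytes of an extracted image.
original_bytes = b"\x89PNG\r\n\x1a\n example image bytes"

# Encoding produces ASCII text that is safe to embed in a JSON file.
encoded_text = base64.b64encode(original_bytes).decode("ascii")

# Decoding recovers exactly the bytes that were encoded.
decoded_bytes = base64.b64decode(encoded_text)

print(decoded_bytes == original_bytes)
```

The same base64.b64decode call is what turns an element's image_base64 value back into image bytes in the code below.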

To run this example

You need a document of a type that the extract_image_block_types argument supports. See the extract_image_block_types entry in API Parameters. This example uses a PDF file with embedded images and tables.

Code

If you use the Unstructured Ingest Python library, you can use the standard Python json.load function to load the contents of the JSON file that the library writes after processing is complete. The loaded result is a list of elements, each represented as a Python dictionary.

Python

import base64
import io
import json

from PIL import Image

def get_image_block_types(input_json_file_path: str):
    with open(input_json_file_path, 'r') as file:
        file_elements = json.load(file)

    for element in file_elements:
        # Skip elements that have no Base64-encoded image in their metadata.
        if "image_base64" in element.get("metadata", {}):
            # Decode the Base64-encoded representation of the
            # processed "Image" or "Table" element into its original
            # visual representation, and then show it.
            image_data = base64.b64decode(element["metadata"]["image_base64"])
            image = Image.open(io.BytesIO(image_data))
            image.show()

if __name__ == "__main__":
    # Source: https://github.com/Unstructured-IO/unstructured-ingest/blob/main/example-docs/pdf/embedded-images-tables.pdf

    # Specify where to get the local file. A relative path is resolved
    # against the current working directory.
    get_image_block_types(
        input_json_file_path="local-ingest-output/embedded-images-tables.json"
    )
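
In environments without a display, image.show() may not be usable. As an alternative, the following sketch writes each decoded element to a file instead. It is an illustration under stated assumptions, not part of the original example: the function name save_image_blocks, the output file naming scheme, and the EXTENSIONS mapping are hypothetical, and it assumes that each element may carry "type" and "image_mime_type" fields, falling back to generic values when they are absent.

```python
import base64
import json
from pathlib import Path

# Assumed mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png"}


def save_image_blocks(input_json_file_path: str, output_dir: str) -> list:
    """Write each Base64-encoded element to its own file; return the paths."""
    with open(input_json_file_path, "r") as file:
        file_elements = json.load(file)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    saved_paths = []
    for index, element in enumerate(file_elements):
        metadata = element.get("metadata", {})
        if "image_base64" not in metadata:
            continue
        # "image_mime_type" is an assumed field; use ".bin" if it is
        # missing or not in the mapping above.
        extension = EXTENSIONS.get(metadata.get("image_mime_type"), ".bin")
        # "type" is assumed to hold the element type, such as "Image".
        element_type = element.get("type", "element").lower()
        path = out / f"{index:04d}-{element_type}{extension}"
        path.write_bytes(base64.b64decode(metadata["image_base64"]))
        saved_paths.append(path)
    return saved_paths
```

For example, save_image_blocks("local-ingest-output/embedded-images-tables.json", "extracted-images") would write one file per matching element into a hypothetical extracted-images directory and return the list of paths it wrote.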
