Extract images and tables from documents

Task

You want to get, decode, and show elements, such as images and tables, that are embedded in a PDF document.

Approach

Extract the Base64-encoded representation of specific elements, such as images and tables, in the document. For each extracted element, decode the Base64-encoded representation back into its original visual form and then show it.
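
The decoding step can be sketched on its own, without calling the API. This is a minimal illustration, assuming each element is a dictionary whose "metadata" object may contain an "image_base64" string, as in the code later on this page. The helper name decode_element_image and the sample elements are hypothetical; the sample payload is a PNG signature plus filler bytes, not a real image.

```python
import base64
from typing import Optional


def decode_element_image(element: dict) -> Optional[bytes]:
    """Return the decoded image bytes for an element, or None if it has none."""
    encoded = element.get("metadata", {}).get("image_base64")
    if encoded is None:
        return None
    return base64.b64decode(encoded)


# Synthetic stand-ins for API output (hypothetical data): the payload is the
# 8-byte PNG signature plus filler, not a real image.
fake_png = b"\x89PNG\r\n\x1a\n" + b"filler"
elements = [
    {"type": "Title", "element_id": "a1", "text": "Heading", "metadata": {}},
    {
        "type": "Image",
        "element_id": "b2",
        "text": "",
        "metadata": {"image_base64": base64.b64encode(fake_png).decode("ascii")},
    },
]

# Elements without an "image_base64" entry yield None; the others yield bytes.
decoded = [decode_element_image(e) for e in elements]
```

With real output, the returned bytes can be passed to an image library, as the full example does with PIL.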

To run this example

You will need a document of a type supported by the extract_image_block_types argument. See the extract_image_block_types entry in API Parameters. This example uses a PDF file with embedded images and tables.

Code

For the Unstructured Python SDK, you’ll need these environment variables:

UNSTRUCTURED_API_KEY - Your Unstructured API key value.
UNSTRUCTURED_API_URL - Your Unstructured API URL.
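
Because os.getenv returns None for a missing variable, a misconfigured environment may only surface later as an authentication or connection error. One way to fail early is sketched below. The helper require_env is hypothetical and is not part of the Unstructured SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Example usage (commented out so that the sketch runs without credentials):
# api_key = require_env("UNSTRUCTURED_API_KEY")
# api_url = require_env("UNSTRUCTURED_API_URL")
```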

Python SDK

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured.staging.base import elements_from_dicts, elements_to_json

import os
import base64
from PIL import Image
import io

if __name__ == "__main__":
    client = UnstructuredClient(
        api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
        server_url=os.getenv("UNSTRUCTURED_API_URL")
    )

    # Source: https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/embedded-images-tables.pdf
    
    # Where to get the input file and store the processed data, relative to this .py file.
    local_input_filepath = "local-ingest-input-pdf/embedded-images-tables.pdf"
    local_output_filepath = "local-ingest-output/embedded-images-tables.json"

    with open(local_input_filepath, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=local_input_filepath
        )

    request = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=files,
            split_pdf_page=True,
            split_pdf_allow_failed=True,
            split_pdf_concurrency_level=15,
            # Extract the Base64-encoded representation of each
            # processed "Image" and "Table" element. Extract each into
            # an "image_base64" object, as a child of the
            # "metadata" object, for that element in the result.
            # Element type names, such as "Image" and "Table" here,
            # are case-insensitive.
            # Any available Unstructured element type is allowed.
            extract_image_block_types=["Image", "Table"]
        )
    )

    try:
        result = client.general.partition(
            request=request
        )

        for element in result.elements:
            if "image_base64" in element["metadata"]:
                # Decode the Base64-encoded representation of the 
                # processed "Image" or "Table" element into its original
                # visual representation, and then show it.
                image_data = base64.b64decode(element["metadata"]["image_base64"])
                image = Image.open(io.BytesIO(image_data))
                image.show()
        
        # Optionally, prepare to print or save the elements as JSON.
        dict_elements = elements_from_dicts(
            element_dicts=result.elements
        )

        # Print the elements as JSON...
        json_elements = elements_to_json(
            elements=dict_elements,
            indent=2
        )

        print(json_elements)

        # ...or save as JSON.
        elements_to_json(
            elements=dict_elements,
            indent=2,
            filename=local_output_filepath
        )
    except Exception as e:
        print(e)
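
Calling image.show() opens a viewer window for every matching element, which may be impractical for large documents or on a machine without a display. As an alternative, the decoded bytes can be written to files. The following is a sketch under stated assumptions: the function names, the file naming scheme, and the sample elements are all hypothetical, and the file extension is guessed from the leading bytes of the decoded data rather than taken from the API response.

```python
import base64
import tempfile
from pathlib import Path

# File signatures ("magic bytes") for two common image formats.
SIGNATURES = {b"\x89PNG\r\n\x1a\n": ".png", b"\xff\xd8\xff": ".jpg"}


def guess_extension(data: bytes) -> str:
    """Guess a file extension from the leading bytes; fall back to .bin."""
    for signature, extension in SIGNATURES.items():
        if data.startswith(signature):
            return extension
    return ".bin"


def save_element_images(elements: list, output_dir: Path) -> list:
    """Write each element's decoded image to output_dir; return the paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, element in enumerate(elements):
        encoded = element.get("metadata", {}).get("image_base64")
        if encoded is None:
            continue  # Not an extracted image or table.
        data = base64.b64decode(encoded)
        kind = element.get("type", "element").lower()
        path = output_dir / f"{index:04d}-{kind}{guess_extension(data)}"
        path.write_bytes(data)
        saved.append(path)
    return saved


# Synthetic stand-ins for API output (hypothetical data): a JPEG signature
# plus filler bytes, not a real image.
fake_jpeg = b"\xff\xd8\xff" + b"filler"
sample_elements = [
    {"type": "NarrativeText", "metadata": {}},
    {
        "type": "Table",
        "metadata": {"image_base64": base64.b64encode(fake_jpeg).decode("ascii")},
    },
]
saved_paths = save_element_images(sample_elements, Path(tempfile.mkdtemp()))
```

In the example above, result.elements could be passed in place of sample_elements, assuming the elements are dictionaries as the loop over result.elements suggests.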
