Extract tables as HTML

Task

You want to get, save, or show the contents of elements that are represented as HTML, such as tables that are embedded in a PDF document.

Approach

Extract the contents of an element’s text_as_html JSON object, which is nested inside of its parent metadata object.

To run this example

You will need a document that is one of the document types that can output the text_as_html JSON object. For the list of applicable document types, see the entries in the table at the beginning of Partitioning where “Table Support” is “Yes.”

This example uses a PDF file with an embedded table.

Code

For the Unstructured Ingest Python library, you can use the standard Python json.load function to load into a Python dictionary the contents of a JSON file that the Ingest Python library outputs after the processing is complete.

Python

import json, os, webbrowser

def get_tables_as_html(
        input_json_file_path: str,
        output_html_dir_path: str
    ):
    with open(input_json_file_path, 'r') as file:
        file_elements = json.load(file)

    # Provide some minimal CSS for better table readability.
    table_css = "<head><style>table, th, td { border: 1px solid; }</style></head>"

    for element in file_elements:
        if "text_as_html" in element["metadata"]:
            # Surround the element's HTML with basic <html> and <body> tags, and add the minimal CSS.
            html_string = f"<!DOCTYPE html><html>{table_css}<body>{element["metadata"]["text_as_html"]}</body></html>"

            # Save the element's HTML to a local file.
            save_path = f"{output_html_dir_path}/{element["element_id"]}.html"
            file = open(save_path, 'w')
            file.write(html_string)
            file.close()

            # View the locally saved file in the local default web browser.
            webbrowser.open_new(f"file:///{os.getcwd()}/{save_path}")

if __name__ == "__main__":
    # Source: https://github.com/Unstructured-IO/unstructured-ingest/blob/main/example-docs/pdf/embedded-images-tables.pdf

    # Specify where to get the local file, relative to this .py file, and
    # where to store the retrieved HTML, relative to this .py file.
    get_tables_as_html(
        input_json_file_path="local-ingest-output/embedded-images-tables.json",
        output_html_dir_path="local-ingest-output/html/"
    )

Unstructured open source

Getting started with open source

Using Unstructured open source

Ingestion

How to

Best practices

Concepts

Integrations

Task

Approach

To run this example

Code

Unstructured open source

Getting started with open source

Using Unstructured open source

Ingestion

How to

Best practices

Concepts

Integrations

​Task

​Approach

​To run this example

​Code

Task

Approach

To run this example

Code