Extract tables as HTML
Task
You want to get, save, or show the contents of elements that are represented as HTML, such as tables that are embedded in a PDF document.
Approach
Extract the contents of an element’s text_as_html JSON object, which is nested inside of its parent metadata object.
To run this example
You will need a document that is one of the document types that can output the text_as_html JSON object. For the list of applicable document types, see the entries in the table at the beginning of Partitioning where “Table Support” is “Yes.”
This example uses a PDF file with an embedded table.
Code
The Unstructured Ingest Python library writes its results to a JSON file after processing is complete. You can use the standard Python json.load function to load the contents of that file into Python data structures.
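A minimal sketch of this approach follows. It assumes the output file holds a JSON array of element objects, each of which may carry a metadata object containing text_as_html. The function names and file paths are hypothetical and are not part of the Unstructured library.

```python
import json


def load_elements(json_path):
    """Load the elements from a JSON output file (assumed to be a JSON array)."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def extract_tables_as_html(elements):
    """Return the text_as_html string of each element that has one."""
    html_tables = []
    for element in elements:
        # text_as_html is nested inside the element's metadata object.
        # Elements without metadata or without text_as_html are skipped.
        html = element.get("metadata", {}).get("text_as_html")
        if html:
            html_tables.append(html)
    return html_tables


def save_html(html_tables, html_path):
    """Write the extracted HTML fragments to a single file, one after another."""
    with open(html_path, "w", encoding="utf-8") as f:
        f.write("\n".join(html_tables))


# Example usage (both paths are placeholders, adjust them to your setup):
# elements = load_elements("output/example.pdf.json")
# tables = extract_tables_as_html(elements)
# save_html(tables, "tables.html")
```

Using dict.get with a default means that elements lacking a metadata object or a text_as_html entry are skipped instead of raising a KeyError.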