Table extraction from PDF

This sample code uses the Unstructured open source library and also shows an alternative method that uses the Unstructured Partition Endpoint.

Method 1: Using partition_pdf

To extract tables from PDF files with partition_pdf, set the skip_infer_table_types parameter to False and the strategy parameter to hi_res.

Usage

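from unstructured.partition.pdf import partition_pdf

fname = "example-docs/pdf/layout-parser-paper.pdf"

# Partition the PDF with table inference enabled and the hi_res strategy.
elements = partition_pdf(
    filename=fname,
    skip_infer_table_types=False,
    strategy="hi_res",
)

# Keep only the elements classified as tables.
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)                   # plain-text content of the first table
print(tables[0].metadata.text_as_html)  # the same table as an HTML string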
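
The text_as_html value is an HTML string, so it can be turned into rows of cells for further processing. The following is a minimal sketch that uses only Python's standard-library html.parser. The TableRowParser class, the html_table_to_rows helper, and the sample HTML are illustrative assumptions, not part of the Unstructured library, and the sketch does not handle nested tables or rows with omitted closing tags.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Hypothetical helper: parse an HTML table string into a list of rows."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up HTML standing in for tables[0].metadata.text_as_html.
sample_html = (
    "<table><tr><th>Dataset</th><th>Model</th></tr>"
    "<tr><td>PubLayNet</td><td>F / M</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

In practice, the sample_html string would be replaced by tables[0].metadata.text_as_html from the snippet above.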

Method 2: Using Auto Partition or Unstructured API

By default, table extraction is enabled for all file types. To extract tables from PDFs and images using Auto Partition or the Unstructured API, set the strategy parameter to hi_res.

Usage: Auto Partition

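from unstructured.partition.auto import partition

filename = "example-docs/pdf/layout-parser-paper.pdf"

# partition detects the file type automatically; hi_res is set explicitly.
elements = partition(
    filename=filename,
    strategy="hi_res",
)

# Keep only the elements classified as tables.
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)                   # plain-text content of the first table
print(tables[0].metadata.text_as_html)  # the same table as an HTML string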
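
Indexing tables[0] raises an IndexError when no table is detected in the document. A small guard, sketched here, avoids that. The first_table function is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real elements so that the snippet runs without the library installed.

```python
from types import SimpleNamespace


def first_table(elements):
    """Return the first element whose category is "Table", or None."""
    return next((el for el in elements if el.category == "Table"), None)


# Stand-in elements for illustration only; real ones come from partition().
elements = [
    SimpleNamespace(category="Title", text="LayoutParser"),
    SimpleNamespace(category="Table", text="Dataset Model"),
]

table = first_table(elements)
if table is None:
    print("No tables found")
else:
    print(table.text)
```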

Usage: API Parameters

curl -X 'POST' \
    'https://api.unstructured.io' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
    -F 'strategy=hi_res' \
    | jq -C . | less -R
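
The jq pipeline in the request only pretty-prints the response. To pick out tables programmatically, the JSON can be filtered by element type. This sketch assumes the response body is a JSON list of element objects with type, text, and metadata.text_as_html fields, mirroring the Python attributes used in the earlier examples. The extract_tables helper and the sample payload are made up for illustration.

```python
import json


def extract_tables(response_text):
    """Hypothetical helper: return text and HTML for each Table element."""
    elements = json.loads(response_text)
    return [
        {
            "text": el.get("text", ""),
            # text_as_html may be absent, so fall back to None.
            "html": el.get("metadata", {}).get("text_as_html"),
        }
        for el in elements
        if el.get("type") == "Table"
    ]


# Made-up payload in the assumed response shape.
sample_response = json.dumps([
    {"type": "Title", "text": "LayoutParser", "metadata": {}},
    {
        "type": "Table",
        "text": "Dataset Model",
        "metadata": {"text_as_html": "<table><tr><td>Dataset</td></tr></table>"},
    },
])

tables = extract_tables(sample_response)
for table in tables:
    print(table["text"])
    print(table["html"])
```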
