Task

You want to get, and print or save, the contents of elements that have been chunked.

Approach

Chunked elements are typically represented in this format:

{
  "type": "CompositeElement",
  "element_id": "3800f7ff047e97cbf0a5f7df7ad52c80",
  "text": "4,5 Exception noted in \u201cSpecial Considerations for Promotional Labeling and Advertising Material\u201d\n\n3",
  "metadata": {
    "filetype": "application/pdf",
    "languages": ["eng"],
    "page_number": 6,
    "orig_elements": "eJy ... Full content omitted for brevity ... x8=",
    "filename": "Portable-Document-Format-Specifications.pdf"
  }
}
{
  "type": "Table",
  "element_id": "758127b42c51b93b59abf08640d1ccab",
  "text": "Gels and karyotypes High pressure liquid chromatography 600 dpi (8 bit grayscale depth) 300 dpi",
  "metadata": {
    "text_as_html": "<table><tbody><tr><td>Gels and karyotypes</td><td>600 dpi (8 bit grayscale depth)</td></tr><tr><td>High pressure liquid chromatography</td><td>300</td></tr></tbody></table>",
    "filetype": "application/pdf",
    "languages": ["eng"],
    "page_number": 8,
    "orig_elements": "eJy ... Full content omitted for brevity ... MnD",
    "filename": "Portable-Document-Format-Specifications.pdf"
}

To get the element’s chunked content, extract the contents of the element’s text field.

The chunked content might not contain all of its associated content. To get all of the elements that were used to derive this chunked content, extract the contents of the element’s orig_elements field, which is nested inside of its parent metadata field.

The contents of the orig_elements field is in compressed Base64 gzipped format. To get the content, Base64-decode the bytes, decompress them, and then decode them using UTF-8.

To run this example

You will need to chunk a document during processing. This example uses a PDF file chunked into 200- to 300-character elements.

Code

See also