You want to get, and print or save, the contents of elements that have been chunked.
Chunked elements are typically represented in this format:
To get the element’s chunked content, extract the contents of the element’s text field.
The chunked content might not contain all of the content associated with it. To get all of the elements that were used to derive this chunked content, extract the contents of the element’s orig_elements field, which is nested inside its parent metadata field.
The contents of the orig_elements field are gzip-compressed and then Base64-encoded. To get the content, Base64-decode the value into bytes, decompress the bytes, and then decode the result as UTF-8.
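The decoding steps can be sketched in Python with only the standard library. This is a minimal illustration, not the library's own code: the helper names get_encoded_orig_elements and decode_orig_elements are hypothetical, and the sketch assumes the compressed payload carries either a zlib or a gzip header, which zlib can auto-detect.

```python
import base64
import zlib


def get_encoded_orig_elements(element: dict):
    """Return the encoded orig_elements string from a chunk, or None if absent."""
    return element.get("metadata", {}).get("orig_elements")


def decode_orig_elements(encoded: str) -> str:
    """Base64-decode, decompress, and UTF-8-decode an orig_elements value."""
    compressed = base64.b64decode(encoded)
    # Assumption: the payload has a zlib or gzip header. Adding 32 to
    # MAX_WBITS tells zlib to auto-detect either one.
    raw = zlib.decompress(compressed, zlib.MAX_WBITS | 32)
    return raw.decode("utf-8")
```

The function returns the decoded text as a string. If that text turns out to be JSON, which is an assumption rather than something stated above, it could be passed to json.loads to get Python objects.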
You will need to chunk a document during processing. This example uses a PDF file chunked into 200- to 300-character elements.
With the Unstructured Ingest Python library, you can use Python’s standard json.load function to read a JSON file that the library outputs after processing is complete and load its contents into a Python dictionary.
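As a rough sketch of that loading step, the following reads an output file with json.load and collects each element’s text field. The helper names load_elements and chunk_texts and the file path in the usage comment are hypothetical. The sketch also assumes the file may hold either a single element object or an array of them, so it normalizes both cases to a list.

```python
import json


def load_elements(path: str) -> list:
    """Load an output JSON file and return its elements as a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the file holds one element object or an array of them.
    return [data] if isinstance(data, dict) else list(data)


def chunk_texts(elements: list) -> list:
    """Return the chunked content (the text field) of each element."""
    return [element.get("text", "") for element in elements]


# Usage (hypothetical path):
# for text in chunk_texts(load_elements("output/example.pdf.json")):
#     print(text)
```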