Document elements and metadata

unstructured simplifies and streamline the preprocessing of structured and unstructured documents for downstream tasks. What that means is no matter where your data is and no matter what format that data is in, Unstructured’s toolkit will transform and preprocess that data into an easily digestible and usable format that is uniform across data formats. When you partition a document with Unstructured, the result is a list of document Element objects. These element objects represent different components of the source document.

Element example

Here’s an example of what an element might look like:

{'type': 'NarrativeText',
 'element_id': '5ef1d1117721f0472c1ad825991d7d37',
 'text': 'The Unstructured API documentation covers the following API services:',
 'metadata': {'last_modified': '2024-05-01T14:15:22',
  'page_number': 1,
  'languages': ['eng'],
  'parent_id': '56f24319ae258b735cac3ec2a271b1d9',
  'file_directory': '/content',
  'filename': 'Unstructured - Unstructured.html',
  'filetype': 'text/html'}}

As you can see, every element will have a type, an element_id, the extracted text, and some metadata which may vary depending on the element type, document structure, and some additional parameters used during partitioning and/or chunking. Let’s explore some of these document element components in more detail.

Element type

Instead of treating all documents like a wall of plain text, Unstructured preserves the semantic structure of the documents. This gives you more control and flexibility over how you further use the processed documents and allows you to take their structure into consideration. At the same time, normalizing data from various file formats to the Unstructured element type scheme lets you treat all documents the same in your downstream processing, regardless of source format. For example, if you plan to summarize a document, you may only be interested in the narrative of the document, and not care about footers and headers. You can easily filter out the elements you don’t need using their type. Here are some examples of the element types your document may contain:

Element type	Description
`Formula`	An element containing formulas in a document.
`FigureCaption`	An element for capturing text associated with figure captions.
`NarrativeText`	NarrativeText is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions.
`ListItem`	ListItem is a NarrativeText element that is part of a list.
`Title`	A text element for capturing titles.
`Address`	A text element for capturing physical addresses.
`EmailAddress`	A text element for capturing email addresses.
`Image`	A text element for capturing image metadata.
`PageBreak`	An element for capturing page breaks.
`Table`	An element for capturing tables.
`Header`	An element for capturing document headers.
`Footer`	An element for capturing document footers.
`CodeSnippet`	An element for capturing code snippets.
`PageNumber`	An element for capturing page numbers.
`UncategorizedText`	Base element for capturing free text from within document.

If you apply chunking during partitioning of a document or later, you will also see the CompositeElement type. CompositeElement is a chunk formed from text (non-Table) elements. It is only produced by chunking. A composite element may be formed by combining one or more sequential elements produced by partitioning. For example, several individual list items may be combined into a single chunk.

Element ID

By default, the element ID is a SHA-256 hash of the element’s text, its position on the page, page number it’s on, and the name of the document file - this is to ensure that the ID is deterministic and unique at the document level. To obtain globally unique IDs in the output (UUIDs), you can pass unique_element_ids=True into any of the partition functions. This can be helpful if you’d like to use the IDs as a primary key in a database, for example.

Metadata

Unstructured tracks a variety of metadata about the elements extracted from documents. Here are a couple of examples of what element metadata enables you to do:

filter document elements based on an element metadata value. For instance, you may want to limit your scope to elements from a certain page, or you may want to use only elements that have an email matching a regular expression in their metadata.
map an element to the document page where it occurred so that original page can be retrieved when that element matches search criteria.

Metadata is tracked at the element level. You can access the metadata for a given document element with element.metadata. For a dictionary representation, use element.metadata.to_dict().

Common metadata fields

All document types return the following metadata fields when the information is available from the source file:

Metadata field name	Description
`filename`	Filename
`file_directory`	File directory
`last_modified`	Last modified Date
`filetype`	File type
`coordinates`	XY Bounding Box Coordinates. See notes below for further details about the bounding box.
`parent_id`	Element Hierarchy. `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a “sub-title”), which in turn may have another Title element as its parent (a “title”).
`category_depth`	Element depth relative to other elements of the same category. Category depth is the depth of an element relative to other elements of the same category. It’s set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting <H1>, <H2>, or <H3> tags within an HTML document or the indentation level of a bulleted list item in a Word document.
`text_as_html`	HTML representation of extracted tables. Only applicable to table elements.
`languages`	Document Languages. At document level or element level. List is ordered by probability of being the primary language of the text.
`emphasized_text_contents`	Emphasized text (bold or italic) in the original document.
`emphasized_text_tags`	Tags on text that is emphasized in the original document.
`is_continuation`	True if element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to max_characters.
`detection_class_prob`	Detection model class probabilities. From unstructured-inference, hi-res strategy.

Notes on common metadata fields:

Metadata for document hierarchy

parent_id and category_depth enhance hierarchy detection to identify the document structure in various file formats by measuring relative depth of an element within its category. This is especially useful in documents with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure.

Element’s coordinates

Some document types support location data for the elements, usually in the form of bounding boxes. If it exists, an element’s location data is available with element.metadata.coordinates. The coordinates property of an ElementMetadata stores:

points : These specify the corners of the bounding box starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left and the y coordinate increases in the downward direction.
system: The points have an associated coordinate system. A typical example of a coordinate system is PixelSpace, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.

The Unstructured Open Source library offers a way to change the coordinates of an element to a new coordinate system by using the Element.convert_coordinates_to_new_system method. If the in_place flag is True, the coordinate system and points of the element are updated in place and the new coordinates are returned. If the in_place flag is False, only the altered coordinates are returned.

from unstructured.documents.elements import Element
from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem

coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
coordinate_system = PixelSpace(width=850, height=1100)

element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)
# Should now be in terms of new coordinate system
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

Additional metadata fields by document type

Field name	Applicable file types	Description
`attached_to_filename`	MSG	The name of the file that the attached file is attached to.
`bcc_recipient`	EML	The related email BCC recipient.
`cc_recipient`	EML	The related email CC recipient.
`email_message_id`	EML	The related email message ID.
`header_footer_type`	Word Doc	The pages that a header or footer applies to in a Word document: `primary`, `even_only`, and `first_page`.
`image_path`	PDF	The path to the image. This is useful when you want to extract the image and save it in a specified path instead of serializing the image within the processed data.
`image_mime_type`	PDF	The MIME type of the image.
`image_url`	HTML	The URL to the image.
`link_start_indexes`	HTML, PDF	A list of the index locations within the extracted content where the `links` can be found.
`link_texts`	HTML	A list of text strings that are associated with the `link_urls`.
`link_urls`	HTML	A list of URLs within the extracted content.
`links`	PDF	A list of links within the extracted content.
`page_name`	XLSX	The related sheet’s name in an Excel file.
`page_number`	DOCX, PDF, PPT, XLSX	The related file’s page number.
`section`	EPUB	The book section title corresponding to a table of contents.
`sent_from`	EML	The related email sender.
`sent_to`	EML	The related email recipient.
`signature`	EML	The related email signature.
`subject`	EML	The related email subject.

Notes on additional metadata by document type:

Email

For emails, metadata will contain the following fields, where available:

bcc_recipient
cc_recipient
email_message_id
sent_from
sent_to
signature
subject

sent_from is a list of strings because the RFC 822 spec for emails allows for multiple sent from email addresses.

Microsoft Excel documents

For Excel documents, ElementMetadata will contain a page_name element, which corresponds to the sheet name in the Excel document.

Microsoft Word documents

Headers and footers in Word documents include a header_footer_type indicating which page a header or footer applies to. Valid values are "primary", "even_only", and "first_page".

Table-specific metadata

For Table elements, the raw text of the table will be stored in the text attribute for the Element, and HTML representation of the table will be available in the element metadata under element.metadata.text_as_html. By default, Unstructured will automatically extract all tables for all doc types unless you set skip_infer_table_types parameter. Here’s an example of a table element. The text of the element will look like this:

Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents

And the text_as_html metadata for the same element will look like this:

<table><thead><th>Dataset</th><th>| Base Model’</th><th>| Notes</th></thead><tr><td>PubLayNet</td><td>[38] F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset [31]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>

Data connector metadata fields

Documents processed through source connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.

Common data connector metadata fields

Data Source metadata (on json output):
- url
- version
- date created
- date modified
- date processed
- record locator
Record locator is specific to each connector

Additional metadata fields by connector type (via record locator)

Source connector	Additional metadata
airtable	base id, table id, view id
azure (from fsspec)	protocol, remote file path
box (from fsspec)	protocol, remote file path
confluence	url, page id
discord	channel
dropbox (from fsspec)	protocol, remote file path
elasticsearch	url, index name, document id
fsspec	protocol, remote file path
google drive	drive id, file id
gcs (from fsspec)	protocol, remote file path
jira	base url, issue key
onedrive	user pname, server relative path
outlook	message id, user email
s3 (from fsspec)	protocol, remote file path
sharepoint	server path, site url
wikipedia	page title, age url

Unstructured API

Workflow Endpoint

Partition Endpoint

Legacy APIs

Troubleshooting

Document elements and metadata

Element example

Element type

Element ID

Metadata

Common metadata fields

Metadata for document hierarchy

Element’s coordinates

Additional metadata fields by document type

Email

Microsoft Excel documents

Microsoft Word documents

Table-specific metadata

Data connector metadata fields

Common data connector metadata fields

Additional metadata fields by connector type (via record locator)

Unstructured API

Workflow Endpoint

Partition Endpoint

Legacy APIs

Troubleshooting

​Element example

​Element type

​Element ID

​Metadata

​Common metadata fields

​Metadata for document hierarchy

​Element’s coordinates

​Additional metadata fields by document type

​Email

​Microsoft Excel documents

​Microsoft Word documents

​Table-specific metadata

​Data connector metadata fields

​Common data connector metadata fields

​Additional metadata fields by connector type (via record locator)

Element example

Element type

Element ID

Metadata

Common metadata fields

Metadata for document hierarchy

Element’s coordinates

Additional metadata fields by document type

Email

Microsoft Excel documents

Microsoft Word documents

Table-specific metadata

Data connector metadata fields

Common data connector metadata fields

Additional metadata fields by connector type (via record locator)