unstructured simplifies and streamline the preprocessing of structured and unstructured documents for downstream tasks. What that means is no matter where your data is and no matter what format that data is in, Unstructured’s toolkit will transform and preprocess that data into an easily digestible and usable format that is uniform across data formats.

When you partition a document with Unstructured, the result is a list of document Element objects. These element objects represent different components of the source document.

Element example

Here’s an example of what an element might look like:

{'type': 'NarrativeText',
 'element_id': '5ef1d1117721f0472c1ad825991d7d37',
 'text': 'The Unstructured API documentation covers the following API services:',
 'metadata': {'last_modified': '2024-05-01T14:15:22',
  'page_number': 1,
  'languages': ['eng'],
  'parent_id': '56f24319ae258b735cac3ec2a271b1d9',
  'file_directory': '/content',
  'filename': 'Unstructured API Services - Unstructured.html',
  'filetype': 'text/html'}}

As you can see, every element will have a type, an element_id, the extracted text, and some metadata which may vary depending on the element type, document structure, and some additional parameters used during partitioning and/or chunking.

Let’s explore some of these document element components in more detail.

Element type

Instead of treating all documents like a wall of plain text, Unstructured preserves the semantic structure of the documents. This gives you more control and flexibility over how you further use the processed documents and allows you to take their structure into consideration. At the same time, normalizing data from various file formats to the Unstructured element type scheme lets you treat all documents the same in your downstream processing, regardless of source format. For example, if you plan to summarize a document, you may only be interested in the narrative of the document, and not care about footers and headers. You can easily filter out the elements you don’t need using their type.

Here are some examples of the element types your document may contain:

Element typeDescription
FormulaAn element containing formulas in a document.
FigureCaptionAn element for capturing text associated with figure captions.
NarrativeTextNarrativeText is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions.
ListItemListItem is a NarrativeText element that is part of a list.
TitleA text element for capturing titles.
AddressA text element for capturing physical addresses.
EmailAddressA text element for capturing email addresses.
ImageA text element for capturing image metadata.
PageBreakAn element for capturing page breaks.
TableAn element for capturing tables.
HeaderAn element for capturing document headers.
FooterAn element for capturing document footers.
CodeSnippetAn element for capturing code snippets.
PageNumberAn element for capturing page numbers.
UncategorizedTextBase element for capturing free text from within document.

If you apply chunking during partitioning of a document or later, you will also see the CompositeElement type. CompositeElement is a chunk formed from text (non-Table) elements. It is only produced by chunking. A composite element may be formed by combining one or more sequential elements produced by partitioning. For example, several individual list items may be combined into a single chunk.

Element ID

By default, the element ID is a SHA-256 hash of the element’s text, its position on the page, page number it’s on, and the name of the document file - this is to ensure that the ID is deterministic and unique at the document level. To obtain globally unique IDs in the output (UUIDs), you can pass unique_element_ids=True into any of the partition functions. This can be helpful if you’d like to use the IDs as a primary key in a database, for example.

Metadata

Unstructured tracks a variety of metadata about the elements extracted from documents. Here are a couple of examples of what element metadata enables you to do:

  • filter document elements based on an element metadata value. For instance, you may want to limit your scope to elements from a certain page, or you may want to use only elements that have an email matching a regular expression in their metadata.
  • map an element to the document page where it occurred so that original page can be retrieved when that element matches search criteria.

Metadata is tracked at the element level. You can access the metadata for a given document element with element.metadata. For a dictionary representation, use element.metadata.to_dict().

Common metadata fields

All document types return the following metadata fields when the information is available from the source file:

Metadata field nameDescription
filenameFilename
file_directoryFile directory
last_modifiedLast modified Date
filetypeFile type
coordinatesXY Bounding Box Coordinates. See notes below for further details about the bounding box.
parent_idElement Hierarchy. parent_id may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a “sub-title”), which in turn may have another Title element as its parent (a “title”).
category_depthElement depth relative to other elements of the same category. Category depth is the depth of an element relative to other elements of the same category. It’s set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting <H1>, <H2>, or <H3> tags within an HTML document or the indentation level of a bulleted list item in a Word document.
text_as_htmlHTML representation of extracted tables. Only applicable to table elements.
languagesDocument Languages. At document level or element level. List is ordered by probability of being the primary language of the text.
emphasized_text_contentsEmphasized text (bold or italic) in the original document.
emphasized_text_tagsTags on text that is emphasized in the original document.
is_continuationTrue if element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to max_characters.
detection_class_probDetection model class probabilities. From unstructured-inference, hi-res strategy.

Notes on common metadata fields:

Metadata for document hierarchy

parent_id and category_depth enhance hierarchy detection to identify the document structure in various file formats by measuring relative depth of an element within its category. This is especially useful in documents with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure.

Element’s coordinates

Some document types support location data for the elements, usually in the form of bounding boxes. If it exists, an element’s location data is available with element.metadata.coordinates. The coordinates property of an ElementMetadata stores:

  • points : These specify the corners of the bounding box starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left and the y coordinate increases in the downward direction.
  • system: The points have an associated coordinate system. A typical example of a coordinate system is PixelSpace, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.

The Unstructured Open Source library offers a way to change the coordinates of an element to a new coordinate system by using the Element.convert_coordinates_to_new_system method. If the in_place flag is True, the coordinate system and points of the element are updated in place and the new coordinates are returned. If the in_place flag is False, only the altered coordinates are returned.

from unstructured.documents.elements import Element
from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem

coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
coordinate_system = PixelSpace(width=850, height=1100)

element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)
# Should now be in terms of new coordinate system
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

Additional metadata fields by document type

Field NameApplicable Doc TypesDescription
page_numberDOCX, PDF, PPT, XLSXPage mumber
page_nameXLSXSheet name in an Excel document
sent_fromEMLEmail sender
sent_toEMLEmail recipient
subjectEMLEmail subject
attached_to_filenameMSGfilename that attachment file is attached to
header_footer_typeWord DocPages a header or footer applies to: “primary”, “even_only”, and “first_page”
link_urlsHTMLThe url associated with a link in a document.
link_textsHTMLThe text associated with a link in a document.
sectionEPUBBook section title corresponding to table of contents

Notes on additional metadata by document type:

Email

Emails will include sent_from, sent_to, and subject metadata. sent_from is a list of strings because the RFC 822 spec for emails allows for multiple sent from email addresses.

Microsoft Excel documents

For Excel documents, ElementMetadata will contain a page_name element, which corresponds to the sheet name in the Excel document.

Microsoft Word documents

Headers and footers in Word documents include a header_footer_type indicating which page a header or footer applies to. Valid values are "primary", "even_only", and "first_page".

Table-specific metadata

For Table elements, the raw text of the table will be stored in the text attribute for the Element, and HTML representation of the table will be available in the element metadata under element.metadata.text_as_html. By default, Unstructured will automatically extract all tables for all doc types unless you set skip_infer_table_types parameter.

Here’s an example of a table element. The text of the element will look like this:

Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents

And the text_as_html metadata for the same element will look like this:

<table><thead><th>Dataset</th><th>| Base Model’</th><th>| Notes</th></thead><tr><td>PubLayNet</td><td>[38] F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset [31]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>

Data connector metadata fields

Documents processed through source connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.

Common data connector metadata fields

  • Data Source metadata (on json output):
    • url
    • version
    • date created
    • date modified
    • date processed
    • record locator
  • Record locator is specific to each connector

Additional metadata fields by connector type (via record locator)

Source connectorAdditional metadata
airtablebase id, table id, view id
azure (from fsspec)protocol, remote file path
box (from fsspec)protocol, remote file path
confluenceurl, page id
discordchannel
dropbox (from fsspec)protocol, remote file path
elasticsearchurl, index name, document id
fsspecprotocol, remote file path
google drivedrive id, file id
gcs (from fsspec)protocol, remote file path
jirabase url, issue key
onedriveuser pname, server relative path
outlookmessage id, user email
s3 (from fsspec)protocol, remote file path
sharepointserver path, site url
wikipediapage title, age url