> ## Documentation Index
> Fetch the complete documentation index at: https://docs.unstructured.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Document elements and metadata

`unstructured` simplifies and streamline the preprocessing of structured and unstructured documents for downstream tasks.
What that means is no matter where your data is and no matter what format that data is in, Unstructured's toolkit will
transform and preprocess that data into an easily digestible and usable format that is uniform across data formats.

When you partition a document with Unstructured, the result is a list of document `Element` objects.
These element objects represent different components of the source document.

## Element example

Here's an example of what an element might look like:

```
{'type': 'NarrativeText',
 'element_id': '5ef1d1117721f0472c1ad825991d7d37',
 'text': 'The Unstructured API documentation covers the following API services:',
 'metadata': {'last_modified': '2024-05-01T14:15:22',
  'page_number': 1,
  'languages': ['eng'],
  'parent_id': '56f24319ae258b735cac3ec2a271b1d9',
  'file_directory': '/content',
  'filename': 'Unstructured - Unstructured.html',
  'filetype': 'text/html'}}
```

As you can see, every element will have a `type`, an `element_id`, the extracted `text`, and some `metadata` which may
vary depending on the element type, document structure, and some additional parameters used during partitioning and/or chunking.

Let's explore some of these document element components in more detail.

## Element type

Instead of treating all documents like a wall of plain text, Unstructured preserves the semantic structure of the documents.
This gives you more control and flexibility over how you further use the processed documents and allows you to take their
structure into consideration. At the same time, normalizing data from various file formats to the Unstructured element
type scheme lets you treat all documents the same in your downstream processing, regardless of source format.
For example, if you plan to summarize a document, you may only be interested in the narrative
of the document, and not care about footers and headers. You can easily filter out the elements you don't need using their type.

Here are some examples of the element types your document may contain:

| Element type        | Description                                                                                                                                        |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Formula`           | An element containing formulas in a document.                                                                                                      |
| `FigureCaption`     | An element for capturing text associated with figure captions.                                                                                     |
| `NarrativeText`     | NarrativeText is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions. |
| `ListItem`          | ListItem is a NarrativeText element that is part of a list.                                                                                        |
| `Title`             | A text element for capturing titles.                                                                                                               |
| `Address`           | A text element for capturing physical addresses.                                                                                                   |
| `EmailAddress`      | A text element for capturing email addresses.                                                                                                      |
| `Image`             | A text element for capturing image metadata.                                                                                                       |
| `PageBreak`         | An element for capturing page breaks.                                                                                                              |
| `Table`             | An element for capturing tables.                                                                                                                   |
| `Header`            | An element for capturing document headers.                                                                                                         |
| `Footer`            | An element for capturing document footers.                                                                                                         |
| `CodeSnippet`       | An element for capturing code snippets.                                                                                                            |
| `PageNumber`        | An element for capturing page numbers.                                                                                                             |
| `UncategorizedText` | Base element for capturing free text from within document.                                                                                         |

If you apply chunking during partitioning of a document or later, you will also see the `CompositeElement` type.
`CompositeElement` is a chunk formed from text (non-Table) elements. It is only produced by chunking.
A composite element may be formed by combining one or more sequential elements produced by partitioning. For example,
several individual list items may be combined into a single chunk.

## Element ID

By default, the element ID is a SHA-256 hash of the element's text, its position on the page, page number it's on,
and the name of the document file - this is to ensure that the ID is deterministic and unique at the document level.
To obtain globally unique IDs in the output (UUIDs), you can pass `unique_element_ids=True` into any of the partition
functions. This can be helpful if you'd like to use the IDs as a primary key in a database, for example.

## Metadata

Unstructured tracks a variety of metadata about the elements extracted from documents.
Here are a couple of examples of what element metadata enables you to do:

* filter document elements based on an element metadata value. For instance, you may want to limit your scope to elements from a certain page, or you may want to use only elements that have an email matching a regular expression in their metadata.
* map an element to the document page where it occurred so that original page can be retrieved when that element matches search criteria.

Metadata is tracked at the element level. You can access the metadata for a given document element with
`element.metadata`. For a dictionary representation, use `element.metadata.to_dict()`.

### Common metadata fields

All document types return the following metadata fields when the information is available from the source file:

| Metadata field name        | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `filename`                 | Filename                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| `file_directory`           | File directory                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `last_modified`            | Last modified Date                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| `filetype`                 | File type                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| `coordinates`              | XY Bounding Box Coordinates. See notes below for further details about the bounding box.                                                                                                                                                                                                                                                                                                                                                                                           |
| `parent_id`                | Element Hierarchy. `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a “sub-title”), which in turn may have another Title element as its parent (a "title").                                                                                                                                                                                       |
| `category_depth`           | Element depth relative to other elements of the same category.  Category depth is the depth of an element relative to other elements of the same category. It’s set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting \<H1>, \<H2>, or \<H3> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
| `image_base64`             | A Base64 representation of the detected image or table. Only applicable to image and table elements when High Res partitioning is used. After chunking, `image_base64` is not preserved in the output.                                                                                                                                                                                                                                                                             |
| `image_mime_type`          | MIME type of the image. Only applicable to image elements.                                                                                                                                                                                                                                                                                                                                                                                                                         |
| `text_as_html`             | HTML representation of extracted tables. Only applicable to table elements.                                                                                                                                                                                                                                                                                                                                                                                                        |
| `languages`                | Document Languages.  At document level or element level. List is ordered by probability of being the primary language of the text.                                                                                                                                                                                                                                                                                                                                                 |
| `emphasized_text_contents` | Emphasized text (bold or italic) in the original document.                                                                                                                                                                                                                                                                                                                                                                                                                         |
| `emphasized_text_tags`     | Tags on text that is emphasized in the original document.                                                                                                                                                                                                                                                                                                                                                                                                                          |
| `orig_elements`            | For chunked elements, a list of the original elements that were used to create the current chunked element.                                                                                                                                                                                                                                                                                                                                                                        |
| `is_continuation`          | True if element is a continuation of a previous element.  Only relevant for chunking, if an element was divided into two due to max\_characters.                                                                                                                                                                                                                                                                                                                                   |
| `detection_class_prob`     | Detection model class probabilities.  From unstructured-inference, hi-res strategy.                                                                                                                                                                                                                                                                                                                                                                                                |

Notes on common metadata fields:

#### Metadata for document hierarchy

`parent_id` and `category_depth` enhance hierarchy detection to identify the document
structure in various file formats by measuring relative depth of an element within its category. This is especially
useful in documents with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure.

#### Element's coordinates

Some document types support location data for the elements, usually in the form of bounding boxes.
If it exists, an element's location data is available with `element.metadata.coordinates`. The `coordinates` property of an `ElementMetadata` stores:

* `points` : These specify the corners of the bounding box starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left and the `y` coordinate increases in the downward direction.
* `system`: The points have an associated coordinate system. A typical example of a coordinate system is `PixelSpace`, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.

The Unstructured Open Source library offers a way to change the coordinates of an element to a new coordinate system by
using the `Element.convert_coordinates_to_new_system` method. If the `in_place` flag is `True`, the coordinate system
and points of the element are updated in place and the new coordinates are returned. If the `in_place` flag is `False`,
only the altered coordinates are returned.

```python  theme={null}
from unstructured.documents.elements import Element
from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem

coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
coordinate_system = PixelSpace(width=850, height=1100)

element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)
# Should now be in terms of new coordinate system
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)

```

### Additional metadata fields by document type

| Field name             | Applicable file types | Description                                                                                                                                                          |
| ---------------------- | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `attached_to_filename` | MSG                   | The name of the file that the attached file is attached to.                                                                                                          |
| `bcc_recipient`        | EML                   | The related [email](#email) BCC recipient.                                                                                                                           |
| `cc_recipient`         | EML                   | The related [email](#email) CC recipient.                                                                                                                            |
| `email_message_id`     | EML                   | The related [email](#email) message ID.                                                                                                                              |
| `header_footer_type`   | Word Doc              | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`.                                  |
| `image_path`           | PDF                   | The path to the image. This is useful when you want to extract the image and save it in a specified path instead of serializing the image within the processed data. |
| `image_mime_type`      | PDF                   | The MIME type of the image.                                                                                                                                          |
| `image_url`            | HTML                  | The URL to the image.                                                                                                                                                |
| `link_start_indexes`   | HTML, PDF             | A list of the index locations within the extracted content where the `links` can be found.                                                                           |
| `link_texts`           | HTML                  | A list of text strings that are associated with the `link_urls`.                                                                                                     |
| `link_urls`            | HTML                  | A list of URLs within the extracted content.                                                                                                                         |
| `links`                | PDF                   | A list of links within the extracted content.                                                                                                                        |
| `page_name`            | XLSX                  | The related sheet's name in an [Excel file](#microsoft-excel-files).                                                                                                 |
| `page_number`          | DOCX, PDF, PPT, XLSX  | The related file's page number.                                                                                                                                      |
| `section`              | EPUB                  | The book section title corresponding to a table of contents.                                                                                                         |
| `sent_from`            | EML                   | The related [email](#email) sender.                                                                                                                                  |
| `sent_to`              | EML                   | The related [email](#email) recipient.                                                                                                                               |
| `signature`            | EML                   | The related [email](#email) signature.                                                                                                                               |
| `subject`              | EML                   | The related [email](#email) subject.                                                                                                                                 |

Notes on additional metadata by document type:

#### Email

For emails, metadata will contain the following fields, where available:

* `bcc_recipient`
* `cc_recipient`
* `email_message_id`
* `sent_from`
* `sent_to`
* `signature`
* `subject`

`sent_from` is a list of strings because the [RFC 822](https://www.rfc-editor.org/rfc/rfc822) spec for emails allows for multiple sent from email addresses.

#### Microsoft Excel documents

For Excel documents, `ElementMetadata` will contain a `page_name` element, which corresponds to the sheet name in the Excel
document.

#### Microsoft Word documents

Headers and footers in Word documents include a `header_footer_type` indicating which page a header or footer applies to.
Valid values are `"primary"`, `"even_only"`, and `"first_page"`.

### Table-specific metadata

For `Table` elements, the raw text of the table will be stored in the `text` attribute for the Element, and HTML representation
of the table will be available in the element metadata under `element.metadata.text_as_html`. By default,
Unstructured will automatically extract all tables for all doc types unless you set `skip_infer_table_types` parameter.

Here's an example of a table element. The `text` of the element will look like this:

```
Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents
```

And the `text_as_html` metadata for the same element will look like this:

```py  theme={null}
<table><thead><th>Dataset</th><th>| Base Model’</th><th>| Notes</th></thead><tr><td>PubLayNet</td><td>[38] F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset [31]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>
```

### Data connector metadata fields

Documents processed through source connectors include additional document metadata. These additional fields only ever
appear if the source document was processed by a connector.

#### Common data connector metadata fields

* Data Source metadata (on json output):
  * url
  * version
  * date created
  * date modified
  * date processed
  * record locator
* Record locator is specific to each connector

#### Additional metadata fields by connector type (via record locator)

| Source connector      | Additional metadata              |
| --------------------- | -------------------------------- |
| airtable              | base id, table id, view id       |
| azure (from fsspec)   | protocol, remote file path       |
| box (from fsspec)     | protocol, remote file path       |
| confluence            | url, page id                     |
| discord               | channel                          |
| dropbox (from fsspec) | protocol, remote file path       |
| elasticsearch         | url, index name, document id     |
| fsspec                | protocol, remote file path       |
| google drive          | drive id, file id                |
| gcs (from fsspec)     | protocol, remote file path       |
| jira                  | base url, issue key              |
| onedrive              | user pname, server relative path |
| outlook               | message id, user email           |
| s3 (from fsspec)      | protocol, remote file path       |
| sharepoint            | server path, site url            |
| wikipedia             | page title, age url              |
