Dataset | Base Model | Notes
PubLayNet [38] | F/M | Layouts of modern scientific documents
PRImA [3] | M | Layouts of scanned modern magazines and scientific reports
Newspaper | F | Layouts of scanned US newspapers from the 20th century
TableBank | F | Table regions in modern scientific and business documents
HJDataset [31] | F/M | Layouts of historical Japanese documents
```
### Data connector metadata fields
Documents processed through source connectors include additional document metadata. These fields appear only if the
source document was processed by a connector.
#### Common data connector metadata fields
* Data Source metadata (on JSON output):
  * url
  * version
  * date created
  * date modified
  * date processed
  * record locator
    * Record locator is specific to each connector
#### Additional metadata fields by connector type (via record locator)
| Source connector | Additional metadata |
| --------------------- | -------------------------------- |
| airtable | base id, table id, view id |
| azure (from fsspec) | protocol, remote file path |
| box (from fsspec) | protocol, remote file path |
| confluence | url, page id |
| discord | channel |
| dropbox (from fsspec) | protocol, remote file path |
| elasticsearch | url, index name, document id |
| fsspec | protocol, remote file path |
| google drive | drive id, file id |
| gcs (from fsspec) | protocol, remote file path |
| jira | base url, issue key |
| onedrive | user pname, server relative path |
| outlook | message id, user email |
| s3 (from fsspec) | protocol, remote file path |
| sharepoint | server path, site url |
| wikipedia | page title, page url |
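As a rough illustration of how these fields might be read from JSON output, the sketch below pulls the connector metadata out of each element. It assumes the fields sit under `metadata.data_source` with snake_case key names (for example `date_processed`, `record_locator`); the helper `data_source_metadata` and the sample elements are hypothetical and exist only for this example.

```python
import json

# Assumed snake_case spellings of the common fields listed above.
COMMON_FIELDS = (
    "url",
    "version",
    "date_created",
    "date_modified",
    "date_processed",
    "record_locator",
)


def data_source_metadata(element: dict) -> dict:
    """Return the connector metadata present on one element, or {} if there is none."""
    data_source = element.get("metadata", {}).get("data_source") or {}
    return {key: data_source[key] for key in COMMON_FIELDS if key in data_source}


# Hypothetical output: one element that came through an s3 connector, one that did not.
sample_json = """
[
  {
    "type": "NarrativeText",
    "text": "Hello",
    "metadata": {
      "filename": "report.pdf",
      "data_source": {
        "url": "s3://example-bucket/report.pdf",
        "date_processed": "2024-01-01T00:00:00",
        "record_locator": {
          "protocol": "s3",
          "remote_file_path": "example-bucket/report.pdf"
        }
      }
    }
  },
  {"type": "Title", "text": "Local file", "metadata": {"filename": "local.txt"}}
]
"""

elements = json.loads(sample_json)
for element in elements:
    info = data_source_metadata(element)
    # The record locator contents differ per connector, so read them defensively.
    locator = info.get("record_locator", {})
    print(element["type"], "->", locator.get("protocol", "no connector metadata"))
```

Because the record locator differs by connector, code that consumes it is safer using `.get()` lookups than assuming any particular key is present.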
# Examples
This page provides some examples of accessing the Unstructured API via different methods.
For each of these examples, you'll need:
These environment variables:
* `UNSTRUCTURED_API_KEY` - Your Unstructured API key value.
* `UNSTRUCTURED_API_URL` - Your Unstructured API URL.
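One way to check these variables before running an example is sketched below. The helper `load_api_settings` is hypothetical (it is not part of any Unstructured library); it reads the two variables named above and fails early with a clear message if either is missing.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")


def load_api_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API key and URL from the environment, or raise if either is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["UNSTRUCTURED_API_KEY"],
        "api_url": env["UNSTRUCTURED_API_URL"],
    }


# Demonstration with placeholder values; in real use, call load_api_settings()
# with no arguments so it reads from os.environ.
settings = load_api_settings(
    {
        "UNSTRUCTURED_API_KEY": "placeholder-key",
        "UNSTRUCTURED_API_URL": "https://example.invalid/api",
    }
)
print(sorted(settings))
```

Accepting the environment as a parameter keeps the check easy to test without touching the real process environment.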