POST /general/v0/general

Authorizations

unstructured-api-key
string
header
required

Headers

unstructured-api-key
string | null
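As a rough sketch of how the key might be attached, the snippet below builds (but does not send) a POST request carrying the `unstructured-api-key` header, using only Python's standard library. The host `https://api.example.com` and the `build_request` helper are placeholders that do not come from this reference; substitute the host of your own deployment.

```python
import urllib.request

# Hypothetical host: this reference does not state one.
BASE_URL = "https://api.example.com"
ENDPOINT = "/general/v0/general"  # path documented at the top of this page


def build_request(api_key: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Build (without sending) a POST request that carries the API key header."""
    return urllib.request.Request(
        url=BASE_URL + ENDPOINT,
        data=body,
        method="POST",
        headers={
            "unstructured-api-key": api_key,
            "Content-Type": content_type,
        },
    )


req = build_request("YOUR_API_KEY", b"", "multipart/form-data; boundary=example")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.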

Body

multipart/form-data
files
file
required

The file to extract

coordinates
boolean
default: false

If True, return coordinates for each element extracted via OCR. Default: False

content_type
string | null

A hint about the content type to use (such as text/markdown), when there are problems processing a specific file. This value is a MIME type in the format type/subtype.

encoding
string | null

The encoding method used to decode the text input. Default: utf-8

extract_image_block_types
string[]

The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields.

gz_uncompressed_content_type
string | null

If the file is gzipped, use this content type after unzipping.

hi_res_model_name
string | null

The name of the inference model used when strategy is hi_res

include_page_breaks
boolean
default: false

If true, the output will include page breaks if the filetype supports it. Default: false

languages
string[]

The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages.

ocr_languages
string[]

Deprecated! Use languages instead. The languages present in the document, for use in partitioning and/or OCR.

output_format
enum<string>
default: application/json

The format of the response. Supported formats are application/json and text/csv. Default: application/json.

Available options:
application/json,
text/csv

pdf_infer_table_structure
boolean
default: true

Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.

skip_infer_table_types
string[]

The document types for which table extraction should be skipped. Default: []

starting_page_number
integer | null

When a PDF is split into pages before being sent to the API, providing this value allows page numbers to be assigned correctly. Introduced in 1.0.27.
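To illustrate the arithmetic, the sketch below computes a `starting_page_number` for each batch when a PDF is split client-side into fixed-size page ranges. The helper name `batch_starting_pages` is hypothetical, and 1-indexed page numbering is an assumption; this reference does not state the indexing convention.

```python
def batch_starting_pages(total_pages: int, pages_per_batch: int) -> list[int]:
    """Return the first original page number of each batch (1-indexed, assumed)."""
    if total_pages < 1 or pages_per_batch < 1:
        raise ValueError("total_pages and pages_per_batch must be positive")
    return list(range(1, total_pages + 1, pages_per_batch))


# A 12-page PDF split into batches of 5 pages: pages 1-5, 6-10, 11-12.
starts = batch_starting_pages(12, 5)

# One set of form fields per batch; each batch is uploaded as its own request.
form_fields = [{"starting_page_number": s} for s in starts]
print(starts)  # [1, 6, 11]
```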

strategy
enum<string>
default: hi_res

The strategy to use for partitioning PDF/image files. Options are fast, hi_res, auto, and ocr_only. Default: hi_res

Available options:
fast,
hi_res,
auto,
ocr_only

unique_element_ids
boolean
default: false

When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise, a SHA-256 hash of the element text is used. Default: False
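The difference between the two ID schemes can be sketched as follows. This is an illustration only: the `element_id` helper is hypothetical, and the service's exact hash input, truncation, and UUID version are not stated in this reference.

```python
import hashlib
import uuid


def element_id(text: str, unique: bool = False) -> str:
    """Illustrative ID scheme; the service's exact format may differ."""
    if unique:
        # Random UUID: distinct even when two elements share the same text.
        return str(uuid.uuid4())
    # Deterministic hash: identical text always yields the identical ID.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


hash_a = element_id("Total revenue")
hash_b = element_id("Total revenue")
uuid_a = element_id("Total revenue", unique=True)
uuid_b = element_id("Total revenue", unique=True)

print(hash_a == hash_b)  # True: hash-based IDs collide for repeated text
print(uuid_a == uuid_b)  # False: UUIDs stay distinct
```

Hash-based IDs are reproducible across runs, which can help with deduplication, while UUIDs are the safer choice when repeated text (for example, recurring headers) must map to distinct database rows.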

xml_keep_tags
boolean
default: false

If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents.

chunking_strategy
enum<string> | null

Use one of the supported strategies to chunk the returned elements after partitioning. When 'chunking_strategy' is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: 'basic', 'by_page', 'by_similarity', or 'by_title'

Available options:
basic,
by_page,
by_similarity,
by_title

combine_under_n_chars
integer | null

If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500

include_orig_elements
boolean | null

When a chunking strategy is specified, each returned chunk will include the elements consolidated to form that chunk as .metadata.orig_elements. Default: true.

max_characters
integer | null

If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 500

multipage_sections
boolean
default: true

If chunking strategy is set, determines if sections can span multiple pages. Default: true

new_after_n_chars
integer | null

If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500

overlap
integer
default: 0

Specifies the length of a string ('tail') to be drawn from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this only applies to split-chunks where an oversized element is divided into multiple chunks by text-splitting. Default: 0

overlap_all
boolean
default: false

When True, apply overlap between 'normal' chunks formed from whole elements and not subject to text-splitting. Use this with caution as it entails a certain level of 'pollution' of otherwise clean semantic chunk boundaries. Default: False
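The overlap idea can be sketched for the text-splitting case, where one oversized element is divided into several chunks. This is a simplified, character-based illustration and not the service's splitter: the `split_with_overlap` helper is hypothetical, a real splitter would likely prefer word boundaries, and whether the tail counts toward `max_characters` is an assumption here.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into chunks of at most max_characters, where each chunk after
    the first starts with the last `overlap` characters of the previous chunk."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so the tail is repeated
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", max_characters=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With `overlap=0` the chunks are disjoint, which matches the documented default.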

similarity_threshold
number | null

A value between 0.0 and 1.0 describing the minimum similarity two elements must have to be included in the same chunk. Note that similar elements may be separated to meet chunk-size criteria; this value can only guarantee that two elements with similarity below the threshold will appear in separate chunks.

include_slide_notes
boolean
default: true

When True, slide notes from .ppt and .pptx files will be included in the response. Default: True
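The body fields above can be assembled into a multipart/form-data payload along the following lines. This is a minimal standard-library sketch, not an official client: the `encode_multipart` helper is hypothetical, and two encoding choices are assumptions not stated in this reference, namely that booleans are sent as `true`/`false` and that string[] parameters are sent as repeated fields.

```python
import uuid


def encode_multipart(fields: dict, file_name: str, file_bytes: bytes,
                     file_content_type: str = "application/octet-stream"):
    """Encode form fields plus one file part named 'files' as multipart/form-data.

    Assumptions: booleans become 'true'/'false'; lists become repeated fields.
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    lines = []

    def add_field(name: str, value) -> None:
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"--{boundary}".encode())
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode())
        lines.append(b"")
        lines.append(str(value).encode("utf-8"))

    for name, value in fields.items():
        if value is None:
            continue  # omit unset optional parameters
        if isinstance(value, (list, tuple)):
            for item in value:
                add_field(name, item)
        else:
            add_field(name, value)

    # The file part uses the 'files' field name documented above.
    lines.append(f"--{boundary}".encode())
    lines.append(
        f'Content-Disposition: form-data; name="files"; filename="{file_name}"'.encode()
    )
    lines.append(f"Content-Type: {file_content_type}".encode())
    lines.append(b"")
    lines.append(file_bytes)
    lines.append(f"--{boundary}--".encode())
    lines.append(b"")

    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"strategy": "hi_res", "coordinates": True, "languages": ["eng", "deu"],
     "chunking_strategy": None},
    file_name="report.pdf",
    file_bytes=b"%PDF-1.4 placeholder bytes",  # stand-in for real file content
    file_content_type="application/pdf",
)
print(content_type)
```

The returned `content_type` value belongs in the request's `Content-Type` header so that the server can locate the boundary.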

Response

200 - application/json

A list of element dictionaries extracted from the file
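As a sketch of consuming the response, the snippet below parses a JSON list of element dictionaries and groups the text by element type. The sample payload is made up for illustration, and the keys `type`, `text`, and `metadata` are assumptions; this reference only states that the response is a list of element dictionaries.

```python
import json

# Made-up sample payload for illustration; key names are assumed.
sample_response = """
[
  {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
  {"type": "NarrativeText", "text": "Revenue grew.", "metadata": {"page_number": 1}},
  {"type": "NarrativeText", "text": "Costs fell.", "metadata": {"page_number": 2}}
]
"""

elements = json.loads(sample_response)

# Group element text by type, tolerating elements that lack either key.
texts_by_type: dict[str, list[str]] = {}
for element in elements:
    element_type = element.get("type", "Unknown")
    texts_by_type.setdefault(element_type, []).append(element.get("text", ""))

for element_type, texts in texts_by_type.items():
    print(element_type, len(texts))
```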