POST /general/v0/general

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'Content-Type: multipart/form-data' \
  --header 'unstructured-api-key: <api-key>' \
  --form coordinates=false \
  --form 'content_type=<string>' \
  --form 'encoding=<string>' \
  --form 'gz_uncompressed_content_type=<string>' \
  --form 'hi_res_model_name=<string>' \
  --form include_page_breaks=false \
  --form output_format=application/json \
  --form pdf_infer_table_structure=true \
  --form starting_page_number=123 \
  --form strategy=auto \
  --form vlm_model_provider=openai \
  --form vlm_model=gpt-4o \
  --form 'table_ocr_agent=<string>' \
  --form unique_element_ids=false \
  --form xml_keep_tags=false \
  --form chunking_strategy=by_title \
  --form combine_under_n_chars=123 \
  --form include_orig_elements=true \
  --form tracking_enabled=true \
  --form max_characters=123 \
  --form multipage_sections=true \
  --form new_after_n_chars=123 \
  --form overlap=0 \
  --form overlap_all=false \
  --form similarity_threshold=123 \
  --form include_slide_notes=true \
  --form pdfminer_line_overlap=123 \
  --form pdfminer_char_margin=123 \
  --form pdfminer_line_margin=123 \
  --form pdfminer_word_margin=123 \
  --form files=@example-file
[
  {
    "type": "Title",
    "element_id": "6aa0ff22f91bbe7e26e8e25ca8052acd",
    "text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis",
    "metadata": {
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "layout-parser-paper.pdf",
      "filetype": "application/pdf"
    }
  }
]

Authorizations

unstructured-api-key
string
header
required

Headers

unstructured-api-key
string | null

Body

multipart/form-data
files
file
required

The file to extract
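
A minimal request supplies only the API key header and the file; curl sets the multipart/form-data Content-Type automatically when --form is used, so the examples below omit that header:

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file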

coordinates
boolean
default:false

If True, return coordinates for each element extracted via OCR. Default: False

content_type
string | null

A hint about the content type to use (such as text/markdown), when there are problems processing a specific file. This value is a MIME type in the format type/subtype.

encoding
string | null

The encoding method used to decode the text input. Default: utf-8

extract_image_block_types
string[]

The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields.
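
For example, to store Image and Table blocks as base64 in element metadata. This is a sketch: it assumes the element type names Image and Table, that list parameters can be passed as repeated form fields, and that image block extraction is paired with the hi_res strategy.

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form strategy=hi_res \
  --form 'extract_image_block_types=Image' \
  --form 'extract_image_block_types=Table'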

gz_uncompressed_content_type
string | null

If the file is gzipped, use this content type after unzipping.

hi_res_model_name
string | null

The name of the inference model used when strategy is hi_res

include_page_breaks
boolean
default:false

If true, the output will include page breaks if the filetype supports it. Default: false

languages
string[]

The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages.
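
For example, for a document containing English and Swedish text. This sketch assumes Tesseract language codes (eng, swe) and that list values are passed as repeated form fields:

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form 'languages=eng' \
  --form 'languages=swe'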

ocr_languages
string[]

Deprecated! The languages present in the document, for use in partitioning and/or OCR

output_format
enum<string>
default:application/json

The format of the response. Supported formats are application/json and text/csv. Default: application/json.

Available options:
application/json,
text/csv
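
For example, to receive the elements as CSV instead of JSON:

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form output_format=text/csv
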
pdf_infer_table_structure
boolean
default:true

Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.

skip_infer_table_types
string[]

The document types for which you want to skip table extraction. Default: []

starting_page_number
integer | null

When a PDF is split into pages before being sent to the API, providing this information allows page numbers to be assigned correctly. Introduced in 1.0.27.

strategy
enum<string>
default:hi_res

The strategy to use for partitioning PDF/image files. Options are fast, hi_res, auto, ocr_only, od_only, and vlm. Default: hi_res

Available options:
fast,
hi_res,
auto,
ocr_only,
od_only,
vlm
Examples:

"auto"

"hi_res"

"vlm"

vlm_model_provider
enum<string>

The VLM Model provider to use.

Available options:
openai,
anthropic,
bedrock,
anthropic_bedrock,
vertexai,
google,
azure_openai
Examples:

"openai"

vlm_model
enum<string>

The VLM Model to use.

Available options:
claude-3-5-sonnet-20241022,
claude-3-7-sonnet-20250219,
gpt-4o,
gemini-1.5-pro,
us.amazon.nova-pro-v1:0,
us.amazon.nova-lite-v1:0,
us.anthropic.claude-3-7-sonnet-20250219-v1:0,
us.anthropic.claude-3-5-sonnet-20241022-v2:0,
us.anthropic.claude-3-opus-20240229-v1:0,
us.anthropic.claude-3-haiku-20240307-v1:0,
us.anthropic.claude-3-sonnet-20240229-v1:0,
us.meta.llama3-2-90b-instruct-v1:0,
us.meta.llama3-2-11b-instruct-v1:0,
gemini-2.0-flash-001
Examples:

"gpt-4o"

table_ocr_agent
string | null
default:unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract

The OCR agent to use for table OCR inference.

unique_element_ids
boolean
default:false

When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 hash of the element text is used. Default: False

xml_keep_tags
boolean
default:false

If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents.

chunking_strategy
string | null

Use one of the supported strategies to chunk the returned elements after partitioning. When 'chunking_strategy' is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: 'basic', 'by_page', 'by_similarity', or 'by_title'

Examples:

"by_title"

"basic"

combine_under_n_chars
integer | null

If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500

include_orig_elements
boolean | null

When a chunking strategy is specified, each returned chunk will include the elements consolidated to form that chunk as .metadata.orig_elements. Default: true.

tracking_enabled
boolean | null

Tracking enabled

max_characters
integer | null

If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 500

multipage_sections
boolean
default:true

If chunking strategy is set, determines if sections can span multiple pages. Default: true

new_after_n_chars
integer | null

If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500
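
These three limits work together. For example, the following by_title request combines short sections up to 200 characters, starts a new chunk after roughly 800 characters, and never exceeds 1000 characters per chunk (illustrative values only):

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form chunking_strategy=by_title \
  --form combine_under_n_chars=200 \
  --form new_after_n_chars=800 \
  --form max_characters=1000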

overlap
integer
default:0

Specifies the length of a string ('tail') to be drawn from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this only applies to split-chunks where an oversized element is divided into multiple chunks by text-splitting. Default: 0

overlap_all
boolean
default:false

When True, apply overlap between 'normal' chunks formed from whole elements and not subject to text-splitting. Use this with caution as it entails a certain level of 'pollution' of otherwise clean semantic chunk boundaries. Default: False
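
For example, to carry a 100-character tail from every chunk into the next one, including chunks that were not produced by text-splitting (illustrative values):

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form chunking_strategy=basic \
  --form overlap=100 \
  --form overlap_all=true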

similarity_threshold
number | null

A value between 0.0 and 1.0 describing the minimum similarity two elements must have to be included in the same chunk. Note that similar elements may be separated to meet chunk-size criteria; this value only guarantees that two elements with similarity below the threshold will appear in separate chunks.
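
For example, with the by_similarity strategy, requiring at least 0.5 similarity before two elements can share a chunk (illustrative threshold):

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form chunking_strategy=by_similarity \
  --form similarity_threshold=0.5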

include_slide_notes
boolean
default:true

When True, slide notes from .ppt and .pptx files will be included in the response. Default: True

pdfminer_line_overlap
number | null

If two characters have more overlap than this they are considered to be on the same line. The overlap is specified relative to the minimum height of both characters.

pdfminer_char_margin
number | null

If two characters are closer together than this margin they are considered part of the same line. The margin is specified relative to the width of the character.

pdfminer_line_margin
number | null

If two lines are close together they are considered to be part of the same paragraph. The margin is specified relative to the height of a line.

pdfminer_word_margin
number | null
default:0.185

If two characters on the same line are further apart than this margin then they are considered to be two separate words, and an intermediate space will be added for readability. The margin is specified relative to the width of the character.
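
These four values appear to map to pdfminer layout-analysis parameters and apply to PDF text extraction. For example, to raise the word margin so that more widely spaced characters on a line are still joined into one word (0.3 is an illustrative value; the default is 0.185):

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  --form pdfminer_word_margin=0.3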

Response

Successful Response

A list of element dictionaries extracted from the file
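
Since the response is a JSON array of element objects, it can be post-processed directly on the command line; for example, pulling out just the text of each element with jq:

curl --request POST \
  --url https://api.unstructuredapp.io/general/v0/general \
  --header 'unstructured-api-key: <api-key>' \
  --form files=@example-file \
  | jq -r '.[].text'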