files (shared.Files) | files (File, Blob, shared.Files) | The file to process. |
chunking_strategy (str) | chunkingStrategy (string) | Use one of the supported strategies to chunk the returned elements after partitioning. When no chunking strategy is specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: basic, by_title, by_page, and by_similarity. Learn more. |
content_type (str) | contentType (string) | A hint to Unstructured about the content type to use (such as text/markdown), when there are problems processing a specific file. This value is a MIME type in the format type/subtype. For available MIME types, see model.py. |
coordinates (bool) | coordinates (boolean) | True to return bounding box coordinates for each element extracted with OCR. Default: false. Learn more. |
encoding (str) | encoding (string) | The encoding method used to decode the text input. Default: utf-8. |
extract_image_block_types (List[str]) | extractImageBlockTypes (string[]) | The types of elements to extract, for use in extracting image blocks as Base64 encoded data stored in element metadata fields, for example: ["Image","Table"]. Supported filetypes are image and PDF. Learn more. |
gz_uncompressed_content_type (str) | gzUncompressedContentType (string) | If the file is gzipped, use this content type after unzipping. Example: application/pdf. |
hi_res_model_name (str) | hiResModelName (string) | The name of the inference model used when strategy is hi_res. Options are layout_v1.1.0 and yolox. Default: layout_v1.1.0. Learn more. |
include_page_breaks (bool) | includePageBreaks (boolean) | True for the output to include page breaks if the filetype supports it. Default: false. |
languages (List[str]) | languages (string[]) | The languages present in the document, for use in partitioning and OCR. View the list of available languages. Learn more. |
output_format (str) | outputFormat (string) | The format of the response. Supported formats are application/json and text/csv. Default: application/json. |
pdf_infer_table_structure (bool) | pdfInferTableStructure (boolean) | Deprecated! Use skip_infer_table_types instead. If true and strategy is hi_res, any Table elements extracted from a PDF will include an additional metadata field, text_as_html, whose value (string) is a transformation of the table data into an HTML table. |
skip_infer_table_types (List[str]) | skipInferTableTypes (string[]) | The document types that you want to skip table extraction for. Default: []. |
starting_page_number (int) | startingPageNumber (number) | The page number to be assigned to the first page in the document. This information will be included in elements’ metadata and can be especially useful when partitioning a document that is part of a larger document. |
strategy (str) | strategy (string) | The strategy to use for partitioning PDF and image files. Options are auto, vlm, hi_res, fast, and ocr_only. Default: auto. Learn more. |
unique_element_ids (bool) | uniqueElementIds (boolean) | True to assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 hash of the element’s text is used. Default: false. |
vlm_model (str) | (Not yet available) | Applies only when strategy is vlm. The name of the vision language model (VLM) to use for partitioning. vlm_model_provider must also be specified. For a list of allowed values, see the end of this article. |
vlm_model_provider (str) | (Not yet available) | Applies only when strategy is vlm. The name of the vision language model (VLM) provider to use for partitioning. vlm_model must also be specified. For a list of allowed values, see the end of this article. |
xml_keep_tags (bool) | xmlKeepTags (boolean) | True to retain the XML tags in the output. Otherwise it will just extract the text from within the tags. Only applies to XML documents. |
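
The constraints in the table above (allowed option values, defaults, and the pairing of vlm_model with vlm_model_provider) could be checked on the client side before sending a request. The following is a minimal sketch under stated assumptions: `build_partition_params` is a hypothetical helper and not part of any Unstructured SDK, and the option sets and defaults are copied from the table rather than fetched from the service.

```python
from typing import Any, Dict

# Allowed values, copied from the parameter table above.
PARTITION_STRATEGIES = {"auto", "vlm", "hi_res", "fast", "ocr_only"}
CHUNKING_STRATEGIES = {"basic", "by_title", "by_page", "by_similarity"}
OUTPUT_FORMATS = {"application/json", "text/csv"}


def build_partition_params(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: apply overrides to the documented defaults, then validate."""
    # Defaults as stated in the table (Python-style parameter names).
    params: Dict[str, Any] = {
        "strategy": "auto",
        "coordinates": False,
        "encoding": "utf-8",
        "include_page_breaks": False,
        "output_format": "application/json",
        "skip_infer_table_types": [],
        "unique_element_ids": False,
    }
    params.update(overrides)

    if params["strategy"] not in PARTITION_STRATEGIES:
        raise ValueError(f"unsupported strategy: {params['strategy']!r}")
    if params["output_format"] not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {params['output_format']!r}")

    # chunking_strategy is optional; when absent, no chunking is performed.
    chunking = params.get("chunking_strategy")
    if chunking is not None and chunking not in CHUNKING_STRATEGIES:
        raise ValueError(f"unsupported chunking_strategy: {chunking!r}")

    # vlm_model and vlm_model_provider must be given together,
    # and they apply only when strategy is "vlm".
    has_model = "vlm_model" in params
    has_provider = "vlm_model_provider" in params
    if has_model != has_provider:
        raise ValueError("vlm_model and vlm_model_provider must be specified together")
    if has_model and params["strategy"] != "vlm":
        raise ValueError("vlm_model applies only when strategy is 'vlm'")

    return params


# Example: high-resolution partitioning with title-based chunking.
params = build_partition_params(strategy="hi_res", chunking_strategy="by_title")
print(params["strategy"], params["chunking_strategy"], params["encoding"])
```

The resulting dictionary uses the Python parameter names from the first column; how it is passed to a client (for example, as keyword arguments or form fields) depends on the SDK or HTTP library in use.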