A standard partition configuration is a set of parameters that control document partitioning, whether it runs through the Unstructured API or locally through the unstructured library. The parameters fall into two groups: those passed to the partition method for the initial segmentation of documents, and those that control how data is handled after processing, including the dynamic metadata associated with each element.

Configs for Partitioning

  •   additional_partition_args:

      A JSON string representation of any values to pass through to the partition function.

  •   encoding:

      The encoding method used to decode the text input. By default, UTF-8 will be used.

  •   ocr_languages:

      The languages present in the document, for use in partitioning, OCR, or both. Multiple languages indicate that the text could be in any of the specified languages.

  •   pdf_infer_table_structure:

      Deprecated. Use skip_infer_table_types to opt out of table extraction for any file type. If set to False with strategy=hi_res, no Table elements are extracted from PDF files, regardless of the contents of skip_infer_table_types.

  •   skip_infer_table_types:

      List of document types for which to skip table extraction.

  •   strategy:

      Default: auto. The strategy to use for partitioning PDF and image files. Uses a layout detection model if set to hi_res. Otherwise, partitioning simply extracts the text from the document and processes it.
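
As a rough illustration of how the partitioning options above might be combined into keyword arguments for a partition call, here is a minimal sketch. The PartitionSettings dataclass and its to_partition_kwargs method are hypothetical names invented for this example, not the library's own classes. Letting the pass-through values override the named settings is also an assumption, and include_page_breaks is used only as a sample pass-through value.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PartitionSettings:
    """Hypothetical container mirroring the partitioning options listed above."""

    strategy: str = "auto"
    encoding: Optional[str] = None  # None: leave the default (UTF-8) in place
    ocr_languages: Optional[List[str]] = None
    skip_infer_table_types: Optional[List[str]] = None
    additional_partition_args: Optional[str] = None  # JSON string

    def to_partition_kwargs(self) -> Dict[str, Any]:
        """Build keyword arguments, omitting options that were not set."""
        kwargs: Dict[str, Any] = {"strategy": self.strategy}
        if self.encoding is not None:
            kwargs["encoding"] = self.encoding
        if self.ocr_languages is not None:
            kwargs["ocr_languages"] = self.ocr_languages
        if self.skip_infer_table_types is not None:
            kwargs["skip_infer_table_types"] = self.skip_infer_table_types
        if self.additional_partition_args:
            extra = json.loads(self.additional_partition_args)
            if not isinstance(extra, dict):
                raise ValueError(
                    "additional_partition_args must decode to a JSON object"
                )
            # Assumption: pass-through values win over the named settings.
            kwargs.update(extra)
        return kwargs


settings = PartitionSettings(
    strategy="hi_res",
    ocr_languages=["eng", "deu"],
    skip_infer_table_types=["pdf"],
    additional_partition_args='{"include_page_breaks": true}',
)
print(settings.to_partition_kwargs())
```

Keeping additional_partition_args as a JSON string, as the option is described above, means the value can be supplied unchanged from a command line or a config file and decoded only at the point of use.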

Configs for the Process

  •   api_key:

      If partition_by_api is set to True, requests sent to the Unstructured API are authenticated with this API key.

  •   fields_include:

      Fields to include in the output JSON. By default, the following fields are included: element_id, text, type, metadata, and embeddings.

  •   flatten_metadata:

      Default: False. If set to True, the hierarchical metadata structure is flattened to have all values exist at the top level.

  •   hi_res_model_name:

      The model to use when strategy is set to hi_res. Available values are layout_v1.0.0 (the default) and yolox.

  •   metadata_exclude:

      Values from the metadata field to exclude from the output.

  •   metadata_include:

      If provided, only the specified fields are preserved in the metadata output.

  •   partition_by_api:

      Default: False. If set to True, uses Unstructured API services to run partitioning. If set to False, runs partitioning locally.

  •   partition_endpoint:

      If partition_by_api is set to True, partitioning requests are sent to this Unstructured API URL.
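
To make the effect of the output-shaping options above more concrete, here is a minimal sketch of how fields_include, flatten_metadata, metadata_include, and metadata_exclude could be applied to a single element represented as a dict. The postprocess and flatten helpers are hypothetical and written only for this example. The order of the steps, the underscore-joined names for nested keys, and the rejection of metadata_include combined with metadata_exclude are all assumptions of the sketch rather than documented behavior.

```python
from typing import Any, Dict, Iterable, Optional

DEFAULT_FIELDS = ("element_id", "text", "type", "metadata", "embeddings")


def flatten(nested: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def postprocess(
    element: Dict[str, Any],
    fields_include: Iterable[str] = DEFAULT_FIELDS,
    flatten_metadata: bool = False,
    metadata_include: Optional[Iterable[str]] = None,
    metadata_exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Return a filtered copy of `element`; the input dict is not modified."""
    if metadata_include and metadata_exclude:
        # Assumption: the two metadata filters are treated as mutually exclusive.
        raise ValueError("use metadata_include or metadata_exclude, not both")

    metadata = dict(element.get("metadata", {}))
    if metadata_include is not None:
        keep = set(metadata_include)
        metadata = {k: v for k, v in metadata.items() if k in keep}
    elif metadata_exclude is not None:
        drop = set(metadata_exclude)
        metadata = {k: v for k, v in metadata.items() if k not in drop}

    out = dict(element)
    out["metadata"] = metadata

    # Keep only the requested top-level fields.
    wanted = set(fields_include)
    out = {k: v for k, v in out.items() if k in wanted}

    # Optionally promote metadata values to the top level.
    if flatten_metadata and "metadata" in out:
        out.update(flatten(out.pop("metadata")))
    return out


element = {
    "element_id": "abc123",
    "text": "Quarterly report",
    "type": "Title",
    "metadata": {
        "filename": "report.pdf",
        "page_number": 1,
        "coordinates": {"system": "PixelSpace"},
    },
}
print(postprocess(element, metadata_exclude=["coordinates"], flatten_metadata=True))
```

In this sketch the metadata filters run before flattening so that they can refer to the original metadata key names.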