chunk_api_key
: If chunk_by_api
is set to True
, requests that are sent to the Unstructured API will use this Unstructured API key to make authenticated calls.
chunk_by_api
: Default: False
. If set to True
, uses Unstructured to run chunking. If set to False
, runs chunking locally.
chunk_combine_text_under_n_chars
: Combine consecutive chunks when the first does not exceed length n
and the second will fit without exceeding the hard-maximum length. Only operative for the by_title
chunking strategy.
chunk_include_orig_elements
: True
when chunking to add the original elements consolidated to form each chunk to .metadata.orig_elements
on that chunk.
chunk_max_characters
: Default: 500
. The hard-maximum chunk length. No chunk will exceed this length. An oversized element will be divided by text-splitting to fit this window.
chunk_multipage_selections
: True
to ignore page boundaries when chunking such that elements from two different pages can appear in the same chunk. Only operative for the by_title
chunking strategy.
chunk_new_after_n_chars
: The soft-maximum chunk length. Another element will not be added to a chunk of n
length even when it would fit without exceeding the hard-maximum length.
chunk_overlap
: Default: 0
. Prefix each chunk’s text with the last overlap of n
characters from the prior chunk. Only applies to oversized chunks divided by text-splitting. To apply overlap to non-oversized chunks, use chunk_overlap_all
.
chunk_overlap_all
: Applies overlap to chunks formed from whole elements as well as those formed by text-splitting oversized elements. The overlap length is taken from the chunk_overlap
value.
chunking_endpoint
: If chunk_by_api
is set to True
, chunking requests are sent to this Unstructured API URL. By default, this URL is the Unstructured Partition Endpoint URL: https://api.unstructuredapp.io/general/v0/general
. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
chunking_strategy
: One of basic
or by_title
. When omitted, no chunking is performed. The basic
strategy maximally fills each chunk with whole elements, up the specified size limits as specified by max_characters
and new_after_n_chars
. A single element that exceeds this length is divided into two or more chunks using text-splitting. A Table
element is never combined with any other element and appears as a chunk of its own or as a sequence of TableChunk
elements splitting is required. The by_title
behaviors are the same except that section and optionally page boundaries are respected such that two consecutive elements from different sections appear in separate chunks.
partition_by_api
to False
and chunking_strategy
to by_page
or by_similarity
. However, the rest of your data processing pipeline should be unaffected by this setting.