Chunking configuration
The chunking configuration controls how text is segmented into chunks, whether at the document, paragraph, or sentence level, before embeddings are generated and written to a vector database. Its parameters set the size and structure of each chunk so that the chunks meet the requirements of downstream tasks such as embedding generation and vector database population. Tuning these parameters adjusts the granularity of the data segments, which affects how much context each chunk carries and how relevant the results retrieved from the vector database are.
Configs
- chunk_api_key: If chunk_by_api is set to True, requests sent to the Unstructured API use this Unstructured API key to make authenticated calls.
- chunk_by_api: Default: False. If set to True, uses Unstructured API services to run chunking. If set to False, runs chunking locally.
- chunk_combine_text_under_n_chars: Combine consecutive chunks when the first does not exceed length n and the second will fit without exceeding the hard-maximum length. Only operative for the by_title chunking strategy.
- chunk_elements: Deprecated. Default: False, which does not run chunking as part of the ingest process. This option is deprecated in favor of the chunking_strategy option. Setting this to True has the same effect as chunking_strategy=by_title.
- chunk_include_orig_elements: Set to True when chunking to add the original elements that were consolidated to form each chunk to .metadata.orig_elements on that chunk.
- chunk_max_characters: Default: 500. The hard-maximum chunk length. No chunk will exceed this length. An oversized element will be divided by text-splitting to fit this window.
- chunk_multipage_sections: Set to True to ignore page boundaries when chunking, so that elements from two different pages can appear in the same chunk. Only operative for the by_title chunking strategy.
- chunk_new_after_n_chars: The soft-maximum chunk length. Another element will not be added to a chunk that has already reached length n, even when it would fit without exceeding the hard-maximum length.
- chunk_overlap: Default: 0. Prefix each chunk’s text with the last n characters of the prior chunk, where n is the overlap value. Only applies to oversized chunks divided by text-splitting. To apply overlap to non-oversized chunks, use chunk_overlap_all.
- chunk_overlap_all: Applies overlap to chunks formed from whole elements as well as those formed by text-splitting oversized elements. The overlap length is taken from the chunk_overlap value.
- chunking_endpoint: If chunk_by_api is set to True, chunking requests are sent to this Unstructured API URL. By default, this URL is the Unstructured Serverless API URL: https://api.unstructuredapp.io/general/v0/general
- chunking_strategy: One of basic or by_title. When omitted, no chunking is performed. The basic strategy maximally fills each chunk with whole elements, up to the size limits specified by max_characters and new_after_n_chars. A single element that exceeds this length is divided into two or more chunks using text-splitting. A Table element is never combined with any other element and appears as a chunk of its own, or as a sequence of TableChunk elements when splitting is required. The by_title behaviors are the same, except that section boundaries and, optionally, page boundaries are respected, so that two consecutive elements from different sections appear in separate chunks.
- combine_text_under_n_chars: Default: max_characters. Combines small elements (for example, a series of Title elements) until a section reaches a length of n characters. Only operative for the by_title chunking strategy. The default of max_characters combines chunks whenever space allows. Specifying 0 for this argument suppresses combining of small chunks.
- include_orig_elements: Default: True. Adds the document elements consolidated to form each chunk to the orig_elements field in metadata. Setting this to False allows for somewhat smaller payloads when you don’t need that metadata.
- max_characters: Default: 500. Combine elements into chunks no larger than n characters. This is a hard maximum: no chunk with text longer than this value will appear in the output stream.
- multipage_sections: Default: True. When False, in addition to section boundaries, page boundaries are also respected. Only operative for the by_title chunking strategy.
- new_after_n_chars: Default: max_characters (off). Cuts off new chunks once they reach a length of n characters. This is a soft maximum. Defaults to max_characters when not specified, which effectively disables any soft window. Specifying 0 for this argument causes each element to appear in a chunk by itself (although an element with text longer than max_characters will still be divided into two or more chunks using text-splitting).
- overlap: Default: 0. Prefix each chunk’s text with the last n characters of the prior chunk, where n is the overlap value. Only applies to oversized chunks divided by text-splitting. To apply overlap to non-oversized chunks, use overlap_all.
- overlap_all: Applies overlap to chunks formed from whole elements as well as those formed by text-splitting oversized elements. The overlap length is taken from the overlap value.
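
As a quick reference, the documented defaults and fallbacks for the unprefixed options can be collected into a small Python sketch. The ChunkingOptions class and its two helper methods are hypothetical names used only for illustration and are not part of the Unstructured library. The field names, defaults, and fallback rules are taken from the list above; the False default for overlap_all is an assumption, since the list does not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkingOptions:
    """Hypothetical container mirroring the options documented above."""

    chunking_strategy: Optional[str] = None  # None: no chunking is performed
    max_characters: int = 500                # hard maximum
    new_after_n_chars: Optional[int] = None  # soft maximum, see fallback below
    combine_text_under_n_chars: Optional[int] = None  # by_title only
    overlap: int = 0
    overlap_all: bool = False                # assumed default, not documented above
    multipage_sections: bool = True          # by_title only
    include_orig_elements: bool = True

    def effective_new_after_n_chars(self) -> int:
        # Documented fallback: max_characters when not specified,
        # which effectively disables the soft window.
        if self.new_after_n_chars is None:
            return self.max_characters
        return self.new_after_n_chars

    def effective_combine_text_under_n_chars(self) -> int:
        # Documented fallback: max_characters, so small chunks are
        # combined whenever space allows.
        if self.combine_text_under_n_chars is None:
            return self.max_characters
        return self.combine_text_under_n_chars


opts = ChunkingOptions(chunking_strategy="by_title", max_characters=1000)
print(opts.effective_new_after_n_chars())           # 1000
print(opts.effective_combine_text_under_n_chars())  # 1000
```

Note that an explicit 0 is kept as 0 rather than replaced by the fallback, because 0 has its own documented meaning for both options.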
partition_by_api to False and chunking_strategy to by_page or by_similarity. However, the rest of your data processing pipeline should be unaffected by this setting.
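
To make the basic strategy described under chunking_strategy more concrete, here is a simplified, self-contained Python sketch of how the hard maximum, soft maximum, and overlap interact. It is an illustration under stated assumptions, not the library's implementation: chunk_basic and split_text are hypothetical names, elements are plain strings, elements are joined with a blank line (an assumed separator), oversized text is split at fixed character offsets rather than at word boundaries, and Table handling, overlap_all, and the by_title rules are left out.

```python
from typing import List, Optional

SEPARATOR = "\n\n"  # assumed separator between elements in one chunk


def split_text(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Divide oversized text into windows of at most max_characters.

    Every window after the first begins with the last `overlap`
    characters of the previous window.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    step = max_characters - overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
        start += step
    return pieces


def chunk_basic(
    elements: List[str],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
    overlap: int = 0,
) -> List[str]:
    """Fill chunks with whole elements; text-split only oversized elements."""
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks: List[str] = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # An oversized element never shares a chunk: flush, then split it.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_text(text, max_characters, overlap))
            continue
        candidate = text if not current else current + SEPARATOR + text
        reached_soft_max = len(current) >= soft_max
        exceeds_hard_max = len(candidate) > max_characters
        if current and (reached_soft_max or exceeds_hard_max):
            chunks.append(current)  # close the chunk and start a new one
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_basic(["aaaa", "bbbb", "cccc"], max_characters=10))
# ['aaaa\n\nbbbb', 'cccc']
print(chunk_basic(["aaaa", "bbbb", "cccc"], max_characters=10, new_after_n_chars=0))
# ['aaaa', 'bbbb', 'cccc']
print(chunk_basic(["abcdefghij"], max_characters=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The second call shows the documented effect of setting the soft maximum to 0: every element lands in its own chunk. The third call shows overlap applied only to the pieces of a text-split oversized element.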