max_characters will be combined. After chunking, you will only have elements of the following types:
CompositeElement: Any text element becomes a CompositeElement after chunking. A composite element can be a combination of two or more original text elements that together fit within the maximum chunk size. It can also be a single element that leaves no room in the chunk for any others but fits by itself. Or it can be a fragment of an original text element that was too big to fit in one chunk and had to be split.
Table: A table element is never combined with other elements; if it fits within max_characters, it remains as is.
TableChunk: A large table that exceeds the max_characters chunk size is split into special TableChunk elements.
The basic strategy respects the max_characters (hard-max) and new_after_n_chars (soft-max) option values.
A Table element is always isolated and never combined with another element. A Table can be oversized, like any other text element, and in that case is divided into two or more TableChunk elements using text-splitting.
The overlap option is applied between chunks formed by splitting oversized elements and, when overlap_all is True, between other chunks as well.
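The hard-max and soft-max behavior described above can be sketched in plain Python. This is a toy model under stated assumptions, not the library's implementation: `basic_chunks` and `split_oversized` are hypothetical names, elements are plain strings, oversized text is cut at fixed offsets rather than at word boundaries, and overlap is not modeled.

```python
from typing import List


def split_oversized(text: str, max_characters: int) -> List[str]:
    """Split one oversized text into pieces no longer than max_characters.

    Simplification: cuts at fixed character offsets, not at word boundaries.
    """
    return [text[i:i + max_characters] for i in range(0, len(text), max_characters)]


def basic_chunks(texts: List[str], max_characters: int = 500,
                 new_after_n_chars: int = 500, sep: str = "\n\n") -> List[str]:
    """Greedily combine sequential texts into chunks (hypothetical helper)."""
    chunks: List[str] = []
    current = ""

    for text in texts:
        if len(text) > max_characters:
            # An oversized element is isolated, then split on its own.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_oversized(text, max_characters))
            continue

        candidate = current + sep + text if current else text
        if len(candidate) > max_characters:
            # Hard max: the element does not fit, so close the open chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate

        if len(current) >= new_after_n_chars:
            # Soft max (assumed semantics): the chunk is full enough, close it early.
            chunks.append(current)
            current = ""

    if current:
        chunks.append(current)
    return chunks


print(basic_chunks(["aaaa", "bbbb", "cccc", "d" * 25],
                   max_characters=10, new_after_n_chars=10, sep=" "))
# ['aaaa bbbb', 'cccc', 'dddddddddd', 'dddddddddd', 'ddddd']
```

Lowering `new_after_n_chars` below `max_characters` makes chunks close earlier without changing the hard limit.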
The by_title chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
In addition to the behaviors of the basic strategy above, the by_title strategy has the following behaviors:
A Title element is considered to start a new section. When a Title element is encountered, the prior chunk is closed and a new chunk is started, even if the Title element would fit in the prior chunk.
Page boundaries can optionally be respected with the multipage_sections argument. This defaults to True, meaning that a page break does not start a new chunk. Setting it to False separates elements that occur on different pages into distinct chunks.
Partitioning may sometimes identify an element as a Title element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the combine_text_under_n_chars argument, which defaults to the same value as max_characters so that sequential small sections are combined to maximally fill the chunking window. Setting it to 0 disables section combining.
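The three by_title behaviors above can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the `Element` dataclass, `split_into_sections`, and `combine_small_sections` are hypothetical stand-ins, not the library's classes or functions, and sections larger than max_characters are not split in this toy version.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str          # e.g. "Title" or "NarrativeText"
    text: str
    page_number: int = 1


def split_into_sections(elements: List[Element],
                        multipage_sections: bool = True) -> List[List[Element]]:
    """Start a new section at each Title, and optionally at each page break."""
    sections: List[List[Element]] = []
    for el in elements:
        starts_section = el.category == "Title"
        page_break = (
            not multipage_sections
            and bool(sections)
            and sections[-1][-1].page_number != el.page_number
        )
        if not sections or starts_section or page_break:
            sections.append([])
        sections[-1].append(el)
    return sections


def combine_small_sections(section_texts: List[str], max_characters: int = 500,
                           combine_text_under_n_chars: int = 500,
                           sep: str = "\n\n") -> List[str]:
    """Merge a section into the previous chunk while that chunk is still small."""
    chunks: List[str] = []
    for text in section_texts:
        if (chunks
                and len(chunks[-1]) < combine_text_under_n_chars
                and len(chunks[-1]) + len(sep) + len(text) <= max_characters):
            chunks[-1] = chunks[-1] + sep + text
        else:
            chunks.append(text)
    return chunks


elements = [
    Element("Title", "Intro", 1),
    Element("NarrativeText", "Hello.", 1),
    Element("Title", "Usage", 1),
    Element("NarrativeText", "Run it.", 2),
]
print(len(split_into_sections(elements)))                            # 2
print(len(split_into_sections(elements, multipage_sections=False)))  # 3
```

With `combine_text_under_n_chars=0` the size check in `combine_small_sections` can never pass, so every section stays in its own chunk, which mirrors the "disable section combining" setting described above.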
The by_page chunking strategy ensures that content from different pages does not end up in the same chunk. When a new page is detected, the existing chunk is completed and a new one is started, even if the next element would fit in the prior chunk.
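As a rough illustration of the page rule alone, the sketch below groups consecutive elements by page number. It assumes each element is a hypothetical `(page_number, text)` tuple and deliberately ignores max_characters, so it shows only where page breaks close a chunk.

```python
from itertools import groupby
from typing import List, Tuple


def chunk_by_page(elements: List[Tuple[int, str]], sep: str = "\n\n") -> List[str]:
    """Join consecutive elements that share a page number (toy model)."""
    chunks: List[str] = []
    for _page, group in groupby(elements, key=lambda el: el[0]):
        chunks.append(sep.join(text for _, text in group))
    return chunks


pages = [(1, "First paragraph."), (1, "Second paragraph."), (2, "Next page.")]
print(chunk_by_page(pages))
# ['First paragraph.\n\nSecond paragraph.', 'Next page.']
```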
The by_similarity chunking strategy employs the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks. As with other strategies, chunks will never exceed the hard-maximum chunk size set by max_characters. For this reason, not all elements that share a topic will necessarily appear in the same chunk. However, this strategy does guarantee that two elements with low similarity will not be combined in a single chunk.
You can control the level of topic similarity required by setting the similarity_threshold parameter. similarity_threshold expects a value between 0.0 and 1.0 specifying the minimum similarity that text in consecutive elements must have to be included in the same chunk. The default is 0.5.
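The role of similarity_threshold can be sketched with hand-made vectors in place of real embeddings. This is a minimal illustration under several assumptions: cosine similarity is used as the score (the source does not state how the named model's scores are mapped to the 0.0 to 1.0 range), `chunk_by_similarity` is a hypothetical function, and a single text longer than max_characters is not split here.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_similarity(items: List[Tuple[str, Sequence[float]]],
                        similarity_threshold: float = 0.5,
                        max_characters: int = 500,
                        sep: str = "\n\n") -> List[str]:
    """Combine consecutive texts whose vectors are similar enough (toy model)."""
    chunks: List[str] = []
    current = ""
    prev_vec: Optional[Sequence[float]] = None

    for text, vec in items:
        similar = (prev_vec is not None
                   and cosine_similarity(prev_vec, vec) >= similarity_threshold)
        fits = len(current) + len(sep) + len(text) <= max_characters
        if current and similar and fits:
            current += sep + text
        else:
            # Low similarity or the hard max closes the open chunk.
            if current:
                chunks.append(current)
            current = text
        prev_vec = vec

    if current:
        chunks.append(current)
    return chunks


items = [
    ("cats purr", (1.0, 0.0)),
    ("cats nap", (0.9, 0.1)),
    ("stocks fell", (0.0, 1.0)),
]
print(chunk_by_similarity(items))
# ['cats purr\n\ncats nap', 'stocks fell']
```

Raising the threshold splits more aggressively, while a small `max_characters` can separate even highly similar elements, which is the trade-off the text describes.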