The following information applies to the legacy Unstructured Partition Endpoint. Unstructured recommends that you use the
on-demand jobs functionality in the
Unstructured API instead. Unstructured’s on-demand jobs provide
many benefits over the legacy Unstructured Partition Endpoint, including support for:
- Production-level usage.
- Multiple local input files in batches.
- The latest and highest-performing models.
- Post-transform enrichments.
- All of Unstructured’s chunking strategies.
- The generation of vector embeddings.
max_characters will be combined. After chunking, you will only have elements of the
following types:
- CompositeElement: Any text element will become a CompositeElement after chunking. A composite element can be a combination of two or more original text elements that together fit within the maximum chunk size. It can also be a single element that doesn’t leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original text element that was too big to fit in one chunk and required splitting.
- Table: A table element is not combined with other elements, and if it fits within max_characters it will remain as is.
- TableChunk: Large tables that exceed the max_characters chunk size are split into special TableChunk elements.
"basic" chunking strategy
- The basic strategy combines sequential elements to maximally fill each chunk while respecting both the specified max_characters (hard-max) and new_after_n_chars (soft-max) option values.
- A single element that by itself exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text-splitting.
- A Table element is always isolated and never combined with another element. A Table can be oversized, like any other text element, and in that case is divided into two or more TableChunk elements using text-splitting.
- If specified, overlap is applied between chunks formed by splitting oversized elements and is also applied between other chunks when overlap_all is True.
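The combining and splitting rules above can be illustrated with a small, self-contained sketch. This is not Unstructured's implementation: the function name basic_chunks is hypothetical, elements are reduced to plain strings, the soft-max is assumed to mean "stop adding to a chunk once it has reached this length", and oversized text is split at fixed character offsets rather than at word boundaries. Table handling and overlap are left out.

```python
def basic_chunks(texts, max_characters=500, new_after_n_chars=None, sep=" "):
    """Toy illustration of the "basic" strategy (hypothetical helper)."""
    # Assumption: the soft-max can never be larger than the hard-max.
    if new_after_n_chars is None:
        soft_max = max_characters
    else:
        soft_max = min(new_after_n_chars, max_characters)

    chunks, current = [], ""
    for text in texts:
        if len(text) > max_characters:
            # An oversized element is isolated: close the open chunk first,
            # then split the element into hard-max sized pieces.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue

        candidate = current + sep + text if current else text
        if current and (len(candidate) > max_characters or len(current) >= soft_max):
            # The element does not fit, or the open chunk already reached
            # the soft-max: close the chunk and start a new one.
            chunks.append(current)
            current = text
        else:
            current = candidate

    if current:
        chunks.append(current)
    return chunks


texts = ["aaaa", "bbbb", "cccc", "x" * 25]
print(basic_chunks(texts, max_characters=10))
# ['aaaa bbbb', 'cccc', 'xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxx']
```

Lowering new_after_n_chars in this sketch closes chunks earlier without changing the hard limit, which is the practical difference between the two options.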
"by_title" chunking strategy
The by_title chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.
In addition to the behaviors of the basic strategy above, the by_title strategy has the following behaviors:
- Detect section headings. A Title element is considered to start a new section. When a Title element is encountered, the prior chunk is closed and a new chunk started, even if the Title element would fit in the prior chunk.
- Respect page boundaries. Page boundaries can optionally also be respected using the multipage_sections argument. This defaults to True, meaning that a page break does not start a new chunk. Setting this to False will separate elements that occur on different pages into distinct chunks.
- Combine small sections. In certain documents, partitioning may identify a list-item or other short paragraph as a Title element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the combine_text_under_n_chars argument. This defaults to the same value as max_characters such that sequential small sections are combined to maximally fill the chunking window. Setting this to 0 will disable section combining.
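To make the section and page boundary rules concrete, here is a minimal sketch under stated assumptions. The function by_title_chunks and the (type, page, text) tuple layout are hypothetical and chosen only for illustration; the sketch does not split oversized elements and does not model combine_text_under_n_chars.

```python
def by_title_chunks(elements, max_characters=500, multipage_sections=True, sep=" "):
    """Toy illustration of "by_title" boundaries (hypothetical helper).

    `elements` is an iterable of (element_type, page_number, text) tuples.
    """
    chunks, current, last_page = [], "", None
    for kind, page, text in elements:
        candidate = current + sep + text if current else text
        # A Title always starts a new section. A page change only counts
        # as a boundary when multipage_sections is False.
        boundary = kind == "Title" or (not multipage_sections and page != last_page)
        if current and (boundary or len(candidate) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = candidate
        last_page = page

    if current:
        chunks.append(current)
    return chunks


elements = [
    ("Title", 1, "Intro"),
    ("NarrativeText", 1, "Hello."),
    ("NarrativeText", 2, "More."),
    ("Title", 2, "Methods"),
    ("NarrativeText", 2, "Steps."),
]
print(by_title_chunks(elements))
# ['Intro Hello. More.', 'Methods Steps.']
print(by_title_chunks(elements, multipage_sections=False))
# ['Intro Hello.', 'More.', 'Methods Steps.']
```

In the first call the section spans the page break; in the second, the page break closes the chunk even though the text would still fit.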
"by_page" chunking strategy
Only available in the Unstructured UI and API.
The by_page chunking strategy ensures that content from different pages does not end up in the same chunk.
When a new page is detected, the existing chunk is completed and a new one is started, even if the next element would fit in the
prior chunk.
"by_similarity" chunking strategy
Only available in the Unstructured API and Platform.
The by_similarity chunking strategy employs the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to
identify topically similar sequential elements and combine them into chunks.
As with other strategies, chunks will never exceed the hard-maximum chunk size set by max_characters. For this reason,
not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can
guarantee that two elements with low similarity will not be combined in a single chunk.
You can control how topically similar elements must be to share a chunk by setting the similarity_threshold parameter.
similarity_threshold expects a value between 0.0 and 1.0 specifying the minimum similarity that the text in consecutive elements
must have to be included in the same chunk. The default is 0.5.
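The effect of similarity_threshold can be sketched with toy vectors in place of real embeddings. Everything here is an assumption made for illustration: the function names are hypothetical, the two-dimensional vectors are hand-written rather than produced by the embedding model, cosine similarity stands in for whatever scoring the service uses, and only consecutive elements are compared.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def by_similarity_chunks(items, similarity_threshold=0.5, max_characters=500, sep=" "):
    """Toy illustration of "by_similarity" (hypothetical helper).

    `items` is an iterable of (text, vector) pairs, where the vector is a
    stand-in for the element's embedding.
    """
    chunks, current, prev_vec = [], "", None
    for text, vec in items:
        candidate = current + sep + text if current else text
        similar = prev_vec is not None and cosine(prev_vec, vec) >= similarity_threshold
        if current and (not similar or len(candidate) > max_characters):
            # Either the topic changed or the hard-max would be exceeded.
            chunks.append(current)
            current = text
        else:
            current = candidate
        prev_vec = vec

    if current:
        chunks.append(current)
    return chunks


items = [
    ("cats purr", (1.0, 0.0)),
    ("cats meow", (0.9, 0.1)),
    ("stocks fell", (0.0, 1.0)),
]
print(by_similarity_chunks(items))
# ['cats purr cats meow', 'stocks fell']
```

Raising the threshold in this sketch produces more, smaller chunks; lowering it toward 0.0 lets the hard-max become the only limit.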

