# Chunking

<iframe width="560" height="315" src="https://www.youtube.com/embed/ZWpm_TzwN-w" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

After partitioning, *chunking* rearranges the resulting document elements into manageable "chunks" that stay within
the limits of an embedding model and improve retrieval precision. The goal is to retrieve only those parts of documents
that contain information relevant to a user's query. You can specify if and how Unstructured chunks
those elements, based on your intended end use.

During chunking, Unstructured uses a [basic](#basic-chunking-strategy) chunking strategy that attempts to combine two or more consecutive text elements
into each chunk, provided that they fit together within the [max characters](#max-characters-setting) setting. To determine the best max characters setting, see the documentation
for the embedding model that you want to use.

You can further control this behavior with [by title](#chunk-by-title-strategy), [by page](#chunk-by-page-strategy), and [by similarity](#chunk-by-similarity-strategy) chunking strategies.
In all cases, Unstructured will only split individual elements if they exceed the specified max characters length.
After chunking, you will have document elements of only the following types:

* `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a
  combination of two or more original text elements that together fit within the max characters setting. It can also be a single
  element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original
  text element that was too big to fit in one chunk and required splitting.
* `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
* `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.

Here are a few examples:

```json  theme={null}
{
    "type": "CompositeElement",
    "element_id": "3e891e70121efac5f8374e4f1d41f25c",
    "text": "PORTABLE DOCUMENT FORMAT (PDF) SPECIFICATIONS...<full-content-redacted-for-brevity>...",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "orig_elements": "eJy...<full-content-redacted-for-brevity>...tY0=",
      "filename": "Portable-Document-Format-Specifications.pdf"
    }
}
```

```json  theme={null}
{
    "type": "Table",
    "element_id": "2abca97d9f7dde6557cda4731f5e1d01",
    "text": "Date 2005-04-08 2008-06-04 2011-12-20 2012-1-13 2014-9-26 2016-9-14...<full-content-redacted-for-brevity>...",
    "metadata": {
      "text_as_html": "<table><thead><tr><th>Date</th><th>Version</th><th>Summary of Revisions</th></tr></thead><tbody>...<full-content-redacted-for-brevity>...",
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 3,
      "orig_elements": "eJz...<full-content-redacted-for-brevity>...r8=",
      "filename": "Portable-Document-Format-Specifications.pdf"
    }
}
```

```json  theme={null}
{
    "type": "TableChunk",
    "element_id": "57602d200f123a317637eb57596ab488",
    "text": "Term Definition Team A team is defined as a social",
    "metadata": {
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "orig_elements": "eJy...<full-content-redacted-for-brevity>...Ys=",
        "page_number": 5
    }
},
{
    "type": "TableChunk",
    "element_id": "4fe0a26a9f38c16b1ac2eedb4460db61",
    "text": "unit of more than two people with specified roles",
    "metadata": {
        "filetype": "application/pdf",
        "is_continuation": true,
        "languages": [
            "eng"
        ],
        "orig_elements": "eJy...<full-content-redacted-for-brevity>...w==",
        "page_number": 5
    }
}
```
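The `is_continuation` flag in the second `TableChunk` above marks it as a continuation of the preceding fragment of the same table. As a minimal sketch, the fragments of a split table could be stitched back together as follows. The function name `merge_table_chunks` is hypothetical, the chunks are assumed to be plain dictionaries loaded from the JSON output, and joining fragments with a single space is an assumption made for illustration.

```python
def merge_table_chunks(chunks):
    """Rebuild the text of split tables from a list of chunk dictionaries.

    Hypothetical helper: joining fragments with a single space is an assumption.
    """
    tables = []
    for chunk in chunks:
        if chunk.get("type") != "TableChunk":
            continue  # Only split-table fragments are of interest here.
        is_continuation = chunk.get("metadata", {}).get("is_continuation", False)
        if is_continuation and tables:
            # Append this fragment to the table started by an earlier chunk.
            tables[-1] += " " + chunk["text"]
        else:
            # A TableChunk without the flag starts a new table.
            tables.append(chunk["text"])
    return tables
```

Applied to the two `TableChunk` elements shown above, this returns a single string that contains the text of both fragments.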

If the option to include original elements is specified, during chunking the `orig_elements` field is added to the `metadata` field of each chunked element.
The `orig_elements` field is a list of the original elements that were used to create the current chunked element. This list is output in
gzip compressed, Base64-encoded format. To get back to the original content for this list, Base64-decode the field's value, gzip-decompress the resulting bytes, and then decode the result as UTF-8.
[Learn how](/api-reference/legacy-api/partition/get-chunked-elements).
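The decoding steps above can be sketched with the Python standard library. This is a minimal sketch, not the official helper: the function name `decode_orig_elements` is hypothetical, and it assumes that the decompressed UTF-8 text is a JSON list. The sample values earlier on this page begin with `eJ`, which is how a zlib header looks after Base64 encoding, so the sketch passes `wbits=47` to let `zlib` auto-detect either a zlib or a gzip header.

```python
import base64
import json
import zlib


def decode_orig_elements(encoded: str) -> list:
    """Turn an `orig_elements` string back into a list of element dictionaries."""
    raw = base64.b64decode(encoded)
    # wbits=47 (32 + 15) makes zlib accept either a zlib or a gzip header.
    text = zlib.decompress(raw, wbits=47).decode("utf-8")
    # Assumption: the decompressed text is a JSON array of elements.
    return json.loads(text)


# Round trip with made-up data, to show the expected shape of the input.
sample = [{"type": "Title", "text": "Hello"}]
encoded = base64.b64encode(zlib.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")
decoded = decode_orig_elements(encoded)
```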

After chunking, `Image` elements are not preserved in the output. However,
if High Res partitioning is used and the option to include original elements is also specified, the `orig_elements` field of each chunked element will contain
an `image_base64` field for each detected image and table associated with the original elements listed within `orig_elements`. To get back to the
original content for an `image_base64` field, Base64-decode the field's bytes.
[Learn how](/api-reference/legacy-api/partition/extract-image-block-types).
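As a rough illustration, once `orig_elements` has been decoded into a list of dictionaries, the embedded images could be collected as follows. The helper name `extract_images` is hypothetical, and the sketch assumes that `image_base64` sits inside each original element's `metadata` field.

```python
import base64


def extract_images(orig_elements):
    """Return (element_id, image_bytes) pairs for elements that carry an image.

    Assumption: `image_base64` is stored under each element's `metadata`.
    """
    images = []
    for element in orig_elements:
        encoded = element.get("metadata", {}).get("image_base64")
        if encoded:
            images.append((element.get("element_id"), base64.b64decode(encoded)))
    return images
```

The returned bytes can be written to a file opened in binary mode. The image format is not stated here, so choose the file extension based on your own inspection of the data.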

The following sections provide information about the available chunking strategies and their settings.

<Note>You can change a workflow's preconfigured strategy only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.</Note>

## Basic chunking strategy

The basic chunking strategy uses only the [max characters](#max-characters-setting) setting (an absolute or "hard" limit) and
the [new after n characters](#new-after-n-characters-setting) setting (an approximate or "soft" limit) to combine sequential elements so that
each chunk is filled as fully as possible.

This strategy adds elements to a chunk until the new after n characters limit is reached. A new chunk is then started.
No chunk will exceed the max characters limit. For elements larger than the "max characters" limit, the text is split into
multiple chunks at spaces or new lines to avoid cutting words.

Table elements are always treated as standalone chunks. If a table is too large, the table is chunked by rows.

This strategy does not use section boundaries, page boundaries, or content similarities to determine
the chunks' contents.

The following diagram illustrates conceptually how a candidate element is chunked to fit within the max characters (hard) and
new after n characters (soft) limits.

* In scenario 1, the candidate element exceeds the hard limit, and so the candidate element will become the first element in the next chunk.
* In scenario 2, the first candidate element exceeds the soft limit but remains within the hard limit. Because the second candidate element begins
  after the soft limit has been reached, the second candidate element will become the first element in the next chunk.
* In scenario 3, the first two candidate elements exceed the soft limit but remain within the hard limit. Even though the third candidate element
  remains within the hard limit, because it begins after the soft limit has been reached, the third candidate element will become the first element in the next chunk.

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=6bf5aaac710854a0c4443983fb02c13f" alt="Chunking with hard and soft limits" data-og-width="1015" width="1015" data-og-height="400" height="400" data-path="img/chunking/Chunking_Soft_Hard_Limits.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=5f521b2a4c51061851affd3695c66824 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f7e9506b2a96715ae3e7877190f11e01 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=3cf6212447072e46d4f6fb7b16d20da9 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=95668a7cf9eef6cfdd4bbd6f86dd68d1 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=40cbcddaf9194d8c30739732c482f131 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Soft_Hard_Limits.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=e5b7e8a9a9f4d802d88a0e4cec9c4fbf 2500w" />
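The interplay of the soft and hard limits can be approximated in a few lines of Python. This is a simplified model for illustration only, not Unstructured's implementation: the function names, the `"\n\n"` separator, and the choice to emit every fragment of an oversized element as its own chunk are all assumptions.

```python
def split_oversized(text, max_characters):
    """Split one oversized text at spaces so that no piece exceeds the hard limit."""
    pieces = []
    while len(text) > max_characters:
        cut = text.rfind(" ", 0, max_characters + 1)
        if cut <= 0:
            cut = max_characters  # No usable space: fall back to a hard cut.
        pieces.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        pieces.append(text)
    return pieces


def chunk_basic(elements, max_characters=500, new_after_n_chars=None, sep="\n\n"):
    """Greedily pack text elements under a hard limit and a soft limit."""
    soft = max_characters if new_after_n_chars is None else min(new_after_n_chars, max_characters)
    chunks, current = [], ""
    for text in elements:
        candidate = current + sep + text if current else text
        if len(candidate) > max_characters:
            # Hard limit: the candidate element must start a new chunk.
            if current:
                chunks.append(current)
            if len(text) > max_characters:
                # The element cannot fit in any chunk, so it is split.
                chunks.extend(split_oversized(text, max_characters))
                current = ""
                continue
            candidate = text
        current = candidate
        if len(current) >= soft:
            # Soft limit reached: close the chunk so the next element starts a new one.
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

For example, with `max_characters=100` and `new_after_n_chars=50`, three 30-character elements produce two chunks: the first two elements together pass the soft limit, so the third element starts a new chunk.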

The following two conceptual diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.

In this first conceptual diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=c86e20de3b8c124bffb6888f159b9307" alt="Basic chunking of text with a 200-character hard limit" data-og-width="2976" width="2976" data-og-height="1516" height="1516" data-path="img/chunking/Chunk-By-Character-200-Paragraph.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=8a1f905f15c60a11e4ee11c4e6d0749d 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=99d9bfcd036b20c8b3c9fb3c8ec84227 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=2742a6f9027e9f34f10a13198562daf7 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=1167b416d8c6b066fd798110888e1318 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f276e777e5c0e20501b1747bdafe36e9 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Paragraph.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=13afa66b908768d8d3f7d3e34314dea9 2500w" />

In this second conceptual diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
row endings are also considered in determining chunk boundaries. For this table, the first chunk is close to the 200-character hard limit and also a row ending.
The second chunk is well short of the 200-character hard limit because of a row (and, in this case, also the table) ending:

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=de46e1814fa36f293908f66ba9c78236" alt="Basic chunking of a table with a 200-character hard limit" data-og-width="2978" width="2978" data-og-height="1290" height="1290" data-path="img/chunking/Chunk-By-Character-200-Table.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f36ffb1282d42d8ae2c0669e464945f3 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=636499b2f178b5755bb63e7bad1c67a2 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=103556fc89229edfd6f0eb969e1ea70a 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=309fff72d26123566d6cefd8dec091ad 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=a6ad9bc57bfc825e65a6e3f6509fd934 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Table.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=6929a480471f32be89e7aa86965f14d7 2500w" />

Context between chunks can be maintained by using the [overlap](#overlap-setting) and [overlap all](#overlap-all-setting) settings.
The overlap setting repeats the specified number of characters from the end of the previous chunk at the beginning of the next chunk.
By default, overlap is applied only between chunks that result from splitting an oversized element. If overlap all is set to true, the overlap is applied between all chunks.

The overlap setting is based on the number of characters, so words might be split.
The overlap setting's character count is included in the chunk size; nonetheless, the chunk's total size must not exceed the max characters setting.
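To make the character arithmetic concrete, the following sketch splits a single oversized text into chunks that each begin with the tail of the previous chunk. It is a simplified illustration under stated assumptions: the function name `split_with_overlap` is hypothetical, and the sketch cuts purely by character position rather than at word boundaries.

```python
def split_with_overlap(text, max_characters, overlap):
    """Split text into chunks of at most max_characters with a repeated tail.

    The overlap counts toward each chunk's size, so every chunk contributes
    only (max_characters - overlap) new characters after the first one.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    chunks = []
    start = 0
    step = max_characters - overlap
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # The last chunk reached the end of the text.
        start += step
    return chunks
```

With a 50-character text, `max_characters=20`, and `overlap=5`, this yields three chunks that start at offsets 0, 15, and 30. Each later chunk begins with the last 5 characters of the chunk before it.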

The following conceptual diagram illustrates how chunks are calculated with overlap all set to true or false. In this diagram, setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.

By default (or by setting overlap all to false), only a portion at the end of Element 6 Part 1 in Chunk 2 is copied over
to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting.

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=ad0960179cfd3503198ae1079ab0c29f" alt="Chunking with overlap all set to true or false" data-og-width="941" width="941" data-og-height="997" height="997" data-path="img/chunking/Chunking_Overlap_All.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=d17629db6e329fded5282b20a2657dc0 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=3c589d2bdf6cd5e838f4ebaad8476ca9 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=59351d062a9515f2f8bc8aa8479bb807 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f6b868fd789b55dcb6daa5e4f9b71013 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=160acd2fa84d7994b9a5ea5c295d9dd6 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Overlap_All.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=3e775286f751ce927922cf006d8a4e33 2500w" />

The following diagram shows how a basic chunking strategy with a max characters setting of 200, an overlap of 25 characters, and
overlap all set to true would chunk the following text. Note that some of the text is split in the middle of a word:

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=def0492ac6d03d3cf0de2ddec2024b9b" alt="Basic chunking of text with a 200-character hard limit, an overlap of 25 characters, and overlap all set to true" data-og-width="3030" width="3030" data-og-height="1048" height="1048" data-path="img/chunking/Chunk-By-Character-200-Overlap-25.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=d3e8ed6485f67f9722432006ffa11abf 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=9301999cbee5e6e57317e82248f01d49 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=b19abdffc8c722c8205a5c5e1dce474a 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=63713d29fa30666d187c0de56fe0916a 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=4442207add34950e96a7cd6e2b7de883 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Character-200-Overlap-25.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=8c57587072f083e9f9a5430004252adb 2500w" />

To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow.

## Chunk by title strategy

The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents, primarily when
a **Title** element is encountered. The title is used as the section header for the chunk. The max characters and new after n
characters settings are still respected.

The following conceptual diagram illustrates how elements are chunked when **Title** elements are encountered (see
Chunks 1, 4, and 6), while still respecting the max characters and new after n characters settings (see Chunks 2 and 3):

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=4053e90b610fa532f6c44230dc88567c" alt="Chunking by title" data-og-width="1662" width="1662" data-og-height="1108" height="1108" data-path="img/chunking/Chunking_By_Title.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=2c71c0d174e18549d21de850a412694e 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=d601d071b76f872ace8d426985416894 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=cede068fd52be1334cbaa2edd8ad31b4 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=2a462fc3063fc054f64408fdfd2f1a83 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=c7eb1875b46a7e543bbc99d19fad274c 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=7147652e9fffbbbc6151a5fc9a81d582 2500w" />

A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.

The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks.

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=42ef590b0d153e130473a3fb70e80c72" alt="Many titles can lead to many chunks by title" data-og-width="1606" width="1606" data-og-height="700" height="700" data-path="img/chunking/Chunking_By_Title_Segmentation.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=4ca814f0a263d4270cc7ee83215d894c 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f65fcaa1b7547a11901de45324844e50 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=188c3f6d46c3a9c2d23ad4f37e4d4b39 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=bb9a4cabbcd102b5e45918424f38904e 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=cf24466a34b5f301266c06f4d8d04ab7 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_By_Title_Segmentation.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f958bdb51ce57992d08c4eae30662ad1 2500w" />
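The section-boundary rule can be sketched as follows. This is a simplified illustration, not Unstructured's implementation: the function name `chunk_by_title` is hypothetical, elements are assumed to be `(type, text)` pairs, and the soft limit and the splitting of oversized elements are left out to keep the focus on **Title** handling.

```python
def chunk_by_title(elements, max_characters=500, sep="\n\n"):
    """Group (type, text) pairs into chunks, starting a new chunk at each Title."""
    chunks, current = [], ""
    for kind, text in elements:
        candidate = current + sep + text if current else text
        starts_section = kind == "Title"
        if current and (starts_section or len(candidate) > max_characters):
            # Close the chunk at a section boundary, or when the hard limit
            # would be exceeded, even if more text would otherwise fit.
            chunks.append(current)
            candidate = text
        current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because every **Title** closes the open chunk, a document with many short sections produces one small chunk per section.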

To reduce the number of chunks, you can use the [combine text under n characters](#combine-text-under-n-characters-setting) setting. This
setting attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
following conceptual diagram. In this case, multiple **Title** elements are combined into a single chunk. However, when the
combine text under n characters limit is reached, the chunk is closed and a new one is started. In any case, the new chunk must start with a **Title** element.
For instance, if Element 3 exceeded the combine text under n characters limit, the chunk would be closed and a new one would be started, beginning
with Title 2, followed by Element 3.

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=eae3ad5ba0817e4a9158e60cafa5a140" alt="Chunking with combine text under n characters" data-og-width="1620" width="1620" data-og-height="420" height="420" data-path="img/chunking/Chunking_Combine_Text.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=1ebbfe90d346e1af15610adbe20a7b0c 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=a76938334e6563ea20154e8b71cd7bcb 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=163433b97f0118fcfac4c5ebea3ea516 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=49336d2e2487d8125e630237c0722fd7 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=abe2b5ae12df41e210b05a60fd8c29b4 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunking_Combine_Text.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=a37c0ddb92caeb30ad73a00eded3034e 2500w" />
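One way to picture the combining step is as a pass over already-formed sections that merges a section into the previous chunk while that chunk is still under the limit. This is a sketch under stated assumptions: the function name `combine_small_sections` is hypothetical, each section is assumed to be a string that already begins with its title, and the exact comparison that Unstructured uses may differ.

```python
def combine_small_sections(sections, combine_text_under_n_chars, max_characters, sep="\n\n"):
    """Merge consecutive sections while the open chunk is under the combine limit."""
    combined = []
    for section in sections:
        if combined:
            merged = combined[-1] + sep + section
            still_small = len(combined[-1]) < combine_text_under_n_chars
            if still_small and len(merged) <= max_characters:
                combined[-1] = merged  # Keep filling the open chunk.
                continue
        # Limit reached (or first section): start a new chunk with this section's title.
        combined.append(section)
    return combined
```

Setting the combine limit to `0` in this sketch disables merging, so each section stays in its own chunk.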

The following diagram shows how a chunk by title strategy with a max characters setting of 200 would chunk the following text.
Although the first chunk is close to the 200-character hard limit, the second chunk is well short of this limit due to encountering the
title immediately after it, which starts a new chunk:

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=170b9603ab2893253a5450bfe8e64339" alt="Chunking by title with a 200-character hard limit" data-og-width="3006" width="3006" data-og-height="616" height="616" data-path="img/chunking/Chunk-By-Title-200-Paragraph.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=7cb23fa3dd587461d96be5dd054db178 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=163d2f84973b3f8399f5b9e390b556f6 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=75fda1f367d51adaf1add59e920e150e 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=4f106bcc95032ebe303c5e2e4aad9ab5 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=7c53659835dd757185bb56907f18f10b 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Title-200-Paragraph.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=d19b0920814861f3587b0042de77bac0 2500w" />

To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow.

## Chunk by page strategy

The by-page chunking strategy attempts to preserve page boundaries when determining the chunks' contents.
A single chunk should not contain text that occurred on two different pages. When a new page starts, the existing
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
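The page-boundary rule can be sketched in the same spirit. This is an illustrative simplification: the function name `chunk_by_page` is hypothetical, elements are assumed to be `(page_number, text)` pairs, and the soft limit and the splitting of oversized elements are omitted.

```python
def chunk_by_page(elements, max_characters=500, sep="\n\n"):
    """Group (page_number, text) pairs into chunks that never span two pages."""
    chunks, current, current_page = [], "", None
    for page_number, text in elements:
        candidate = current + sep + text if current else text
        new_page = page_number != current_page
        if current and (new_page or len(candidate) > max_characters):
            # Close the chunk on a page change or when the hard limit would be exceeded.
            chunks.append((current_page, current))
            candidate = text
        current, current_page = candidate, page_number
    if current:
        chunks.append((current_page, current))
    return chunks
```

Each result pairs a page number with the chunk text, which makes it easy to confirm that no chunk mixes pages.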

The following diagram shows how a chunk by page strategy with a max characters setting of 200 would chunk the following text.
Notice that due to the page break, the second chunk is very small, as it could not fit into the first chunk's hard character limit.\
Nonetheless, the second chunk is still part of the same page as the first chunk:

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=36e3dd6db65759a5f076f5cf1faa947b" alt="Chunking by page with a 200-character hard limit" data-og-width="2952" width="2952" data-og-height="1190" height="1190" data-path="img/chunking/Chunk-By-Page-200.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=381c79adc37452123342755e0b0082e1 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=fdf1bc04317da12ce7ed9a33f808c7a2 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=3103919613ce7c1d51694588f89745b4 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=ea887e67d07cb6cb8ef2e6f558db19a7 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f3a3cdaeda235748f9175a55c9ca8ca4 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Page-200.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=17e27a00de2fd0ab1aa49966c7a7881d 2500w" />

To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow.

## Chunk by similarity strategy

The by-similarity chunking strategy uses the
[sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model
to identify topically similar sequential elements and combines them into chunks.

As with the other chunking strategies, chunks will never exceed the absolute maximum chunk size set by **Max characters**. For this reason,
not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can
guarantee that two elements with low similarity will not be combined in a single chunk.

To use this chunking strategy, choose **Chunk by similarity** in the **Chunkers** section of a **Chunker** node in a workflow.

You can control how similar elements must be to share a chunk by setting the [similarity threshold](#similarity-threshold-setting).
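The effect of the threshold can be illustrated with toy vectors in place of real embeddings. This is a sketch under stated assumptions, not Unstructured's implementation: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and each element is compared only with the element immediately before it.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_by_similarity(items, similarity_threshold=0.5, max_characters=500, sep="\n\n"):
    """Combine consecutive (text, vector) items while they stay similar enough."""
    chunks, current, previous = [], "", None
    for text, vector in items:
        candidate = current + sep + text if current else text
        similar = previous is not None and cosine(previous, vector) >= similarity_threshold
        if current and (not similar or len(candidate) > max_characters):
            # A topic change or the hard limit closes the current chunk.
            chunks.append(current)
            candidate = text
        current, previous = candidate, vector
    if current:
        chunks.append(current)
    return chunks
```

Raising the threshold makes the sketch split more eagerly, while lowering it lets loosely related elements share a chunk, subject to the max characters limit.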

The following diagram shows how a chunk by similarity strategy with a max characters setting of 1000 and similarity threshold of 0.5 would chunk the following text.
Notice that the two chunks are well short of the 1000-character hard limit, as the paragraph break provides a convenient lexical boundary for
helping determine the similarities of sentences to each other:

<img src="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=e1d572d59d31f719f57e22b9875658f4" alt="Chunking by similarity with a 1000-character hard limit and 0.5 similarity threshold" data-og-width="3014" width="3014" data-og-height="1162" height="1162" data-path="img/chunking/Chunk-By-Similarity-1000-50.png" data-optimize="true" data-opv="3" srcset="https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?w=280&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=f03c94c2680b22f101d30865ac0850c2 280w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?w=560&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=d7a89b0448e9e4a59738d0f2e91fc91d 560w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?w=840&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=fe9902efdf29e091157ad315f4d5aaf0 840w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?w=1100&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=4ac929789c02ad21023de4fa41cc471b 1100w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?w=1650&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=7d94d0ae14e6b0fc95cb79637a2c4ad1 1650w, https://mintcdn.com/unstructured-53/vKFDfUfAWhz_siB3/img/chunking/Chunk-By-Similarity-1000-50.png?w=2500&fit=max&auto=format&n=vKFDfUfAWhz_siB3&q=85&s=ece67a75ce7db2cf26f8f572e4fbea02 2500w" />

## Max characters setting

Specifies the absolute maximum number of characters in a chunk.

To specify this setting, enter a number into the **Max characters** field.

This setting applies to all of the chunking strategies.

## Combine text under n characters setting

Combines elements from a section into a chunk until a section reaches a length of this many characters.

To specify this setting, enter a number into the **Combine text under n chars** field.

This setting applies only to the chunking strategy **Chunk by title**.

## Include original elements setting

If the **Include original elements** box is checked, the elements that were used to form a chunk appear in the `metadata` field's `orig_elements` field for that chunk.

This setting applies to all of the chunking strategies.

## Multipage sections setting

If the **Multipage sections** box is checked, sections are allowed to span multiple pages.

This setting applies only to the chunking strategy **Chunk by title**.

## New after n characters setting

Closes the current chunk once it reaches a length of this many characters, so that the next element starts a new chunk. This is an approximate ("soft") limit.

To specify this setting, enter a number into the **New after n characters** field.

This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**.
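
To illustrate how an approximate limit can work alongside the absolute **Max characters** limit, here is a simplified sketch in Python. It is not Unstructured's implementation. The function name `pack_elements`, its parameter names, and the blank-line separator are assumptions made for this example, and it assumes that no single element is longer than the hard limit.

```python
def pack_elements(texts, max_characters=500, new_after_n_chars=400, separator="\n\n"):
    """Greedily pack consecutive element texts into chunks.

    A chunk is closed as soon as it reaches new_after_n_chars (soft limit).
    An element is never added if the result would exceed max_characters
    (hard limit). Assumes no single element exceeds max_characters.
    """
    chunks = []
    current = ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # Adding this element would break the hard limit: start a new chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
        if len(current) >= new_after_n_chars:
            # Soft limit reached: close the chunk even though more might fit.
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    elements = ["a" * 30, "b" * 30, "c" * 30, "d" * 30]
    print([len(c) for c in pack_elements(elements, max_characters=100, new_after_n_chars=50)])
    print([len(c) for c in pack_elements(elements, max_characters=100, new_after_n_chars=100)])
```

In this sketch, lowering the soft limit yields smaller, more evenly sized chunks, while the hard limit is still never exceeded.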

## Overlap setting

When an oversized element is split into multiple chunks by text-splitting, prefixes the second and each later chunk with this many characters from the end of the preceding chunk.

To specify this setting, enter a number into the **Overlap** field.

This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**.
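
The following Python sketch shows the general idea of overlap when one oversized piece of text is split. It is a simplified illustration, not Unstructured's implementation: it splits at fixed character offsets rather than at word boundaries, and the function name `split_with_overlap` and its parameters are assumptions made for this example.

```python
def split_with_overlap(text, max_characters, overlap):
    """Split text into chunks of at most max_characters.

    Every chunk after the first begins with the last `overlap` characters
    of the chunk before it.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be at least 0 and less than max_characters")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so the next chunk repeats the tail of this one.
        start = end - overlap
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", max_characters=4, overlap=1):
        print(chunk)
```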

## Overlap all setting

If the **Overlap all** box is checked, applies overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units.

This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**.

## Similarity threshold setting

Specifies the minimum similarity that text in consecutive elements must have to be included in the same chunk.
This must be a value between `0.0` and `1.0`, exclusive (`0.01` to `0.99`). The default is `0.5` if not otherwise specified.

To specify this setting, enter a number into the **Similarity threshold** field.

This setting applies only to the chunking strategy **Chunk by similarity**.
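
As a rough illustration of how a similarity threshold decides chunk boundaries, the sketch below groups consecutive texts whose similarity meets the threshold. It is not Unstructured's implementation. In particular, it uses a simple bag-of-words cosine similarity as a stand-in for an embedding model, and the function names and the blank-line separator are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def group_by_similarity(texts, threshold=0.5, max_characters=1000):
    """Join consecutive texts while they are similar enough and still fit."""
    chunks = []
    current = []
    for text in texts:
        if current:
            similar = cosine_similarity(current[-1], text) >= threshold
            fits = len("\n\n".join(current + [text])) <= max_characters
            if similar and fits:
                current.append(text)
                continue
            # Not similar enough, or too long: close the current chunk.
            chunks.append("\n\n".join(current))
        current = [text]
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    elements = [
        "the cat sat on the mat",
        "the cat sat on the rug",
        "stock prices fell sharply today",
    ]
    for chunk in group_by_similarity(elements, threshold=0.5):
        print(repr(chunk))
```

In this sketch, a higher threshold produces more, smaller chunks because fewer neighboring elements qualify as similar.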

## Contextual chunking

A technique known as *contextual chunking* prepends chunk-specific explanatory context to each chunk.
Contextual chunking has been shown to improve retrieval accuracy in traditional RAG solutions,
which in turn improves performance in downstream tasks.
[Learn more](https://www.anthropic.com/news/contextual-retrieval).

To apply contextual chunking, switch on the **Contextual chunking** toggle in the settings for any chunking strategy.

This chunk-specific explanatory context information is typically a couple of sentences in length.
Contextual chunking happens before any embeddings are generated.

When contextual chunking is applied, the contextual information in each chunk begins with `Prefix:` and ends with a semicolon (`;`).
The chunk's original content begins with `Original:`.

For example, without contextual chunking applied, elements similar to the following would be generated.
Line breaks have been inserted here for readability. The output will not contain these line breaks:

```json  theme={null}
{
    "type": "CompositeElement",
    "element_id": "aa482034de5ade41b7223bb3beeb6a22",
    "text": "THE\n\nCONSTITUTION of the United States\n\nG\n\nNATIONAL 
      CONSTITUTION CENTER\n\nWe the People of the United States, in 
      Order to form a more perfect Union, establish Justice, insure 
      ...<full-content-redacted-for-brevity>...",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1
    }
},
{
    "type": "CompositeElement",
    "element_id": "59fbfcfb51e52c426df4c48a620c6031",
    "text": "SECTION. 2\n\nThe House of Representatives shall be 
        composed of Mem- bers chosen every second Year by the People 
        of the several States, and the Electors in each State shall 
        ...<full-content-redacted-for-brevity>...",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 2
    }
}
```

Applying contextual chunking to those same elements would result in the following output.
Line breaks and blank lines have been inserted here for readability. The output will not contain these line breaks and blank lines:

```json  theme={null}
{
    "type": "CompositeElement",
    "element_id": "063ed41d2a989191f2281b2d35c4b4ae",
    "text": "Prefix: This is the opening preamble and first section of 
        Article I of the U.S. Constitution, establishing the fundamental 
        purpose of the document and the basic structure of legislative 
        power in Congress. It appears at the very beginning of the main 
        constitutional text, before all other articles and amendments.;
        
        Original: THE\n\nCONSTITUTION of the United States\n\nG\n\nNATIONAL 
        CONSTITUTION CENTER\n\nWe the People of the United States, in 
        Order to form a more perfect Union, establish Justice, insure 
        ...<full-content-redacted-for-brevity>...",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1
    }
},
{
    "type": "CompositeElement",
    "element_id": "2270f6b8c8b4afc668f6277789370ffd",
    "text": "Prefix: This chunk appears in Article I, Section 2 of the 
        U.S. Constitution, which establishes the structure, composition, 
        and powers of the House of Representatives as one of the two 
        chambers of Congress. It follows Section 1's establishment of 
        Congress and precedes Section 3's establishment of the Senate.;
        
        Original: SECTION. 2\n\nThe House of Representatives shall be 
        composed of Mem- bers chosen every second Year by the People 
        of the several States, and the Electors in each State shall
        ...<full-content-redacted-for-brevity>...",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 2
    }
}
```
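
If you need the generated context and the original content separately, a small helper can split a chunk's `text` value on the `Prefix:` and `Original:` markers described above. The following Python sketch is one possible approach. The function name `split_contextual_text` is hypothetical, and the sketch assumes that the first occurrence of `Original:` marks the start of the original content.

```python
def split_contextual_text(text):
    """Return (context, original) for a chunk's text.

    If the text does not follow the `Prefix: ...; Original: ...` layout,
    the context is None and the text is returned unchanged.
    """
    marker = "Original:"
    if not text.startswith("Prefix:") or marker not in text:
        return None, text
    # Assumption: the first "Original:" marks the start of the original content.
    prefix_part, original = text.split(marker, 1)
    context = prefix_part[len("Prefix:"):].strip()
    if context.endswith(";"):
        # Drop the semicolon that terminates the contextual information.
        context = context[:-1].rstrip()
    return context, original.lstrip()


if __name__ == "__main__":
    sample = (
        "Prefix: This chunk appears in Article I, Section 2 of the U.S. Constitution.; "
        "Original: SECTION. 2\n\nThe House of Representatives shall be composed of Members"
    )
    context, original = split_contextual_text(sample)
    print(context)
    print(original)
```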

## Learn more

* <Icon icon="blog" />  "Breaking It Down: Chunking Strategies" in [Level Up Your GenAI Apps: Essential Data Preprocessing for Any RAG System](https://unstructured.io/blog/level-up-your-genai-apps-essential-data-preprocessing-for-any-rag-system)
* <Icon icon="blog" />  [Contextual Chunking in Unstructured Platform: Boost Your RAG Retrieval Accuracy](https://unstructured.io/blog/contextual-chunking-in-unstructured-platform-boost-your-rag-retrieval-accuracy)
* <Icon icon="blog" />  [Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices)
