Chunk Document

Controlled node

Overview

The Chunk Document node splits a large text document into smaller, manageable pieces (chunks) based on configurable strategies. This is essential for processing long documents that exceed AI model context windows or for creating semantic chunks for vector databases.

The node supports two primary chunking strategies:

  • Count: Divides text into chunks of a specified size (e.g., 500 words per chunk)
  • Divide: Splits text into a specified number of equal parts (e.g., divide into 10 chunks)

You can chunk by various units including words, sentences, paragraphs, pages, or custom separators. The node also supports overlap between chunks to maintain context continuity.

Chunk Options

Configure chunking behavior through the properties panel:

Strategy options:

  • Count: Creates chunks containing a specific number of units (words, sentences, etc.)
  • Divide: Divides the entire text into a specified number of equal parts

Unit options:

  • Word: Chunks by word count
  • Sentence: Chunks by sentence boundaries
  • Paragraph: Chunks by paragraph breaks (\n\n)
  • Page: Chunks by page breaks
  • Custom: Chunks using a custom separator string

Overlap

Use the Overlap setting to include units from the previous chunk at the start of the next chunk. This helps maintain context between chunks, which is especially useful when each chunk is processed independently by an AI model.
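The effect of overlap can be illustrated with a short Python sketch. This is illustrative only; `chunk_words` is a hypothetical helper, not the node's actual implementation. It mimics the count strategy with the word unit:

```python
def chunk_words(text, size, overlap=0):
    """Split text into word chunks of `size` words, repeating the last
    `overlap` words of each chunk at the start of the next one."""
    words = text.split()
    chunks, start = [], 0
    step = max(size - overlap, 1)  # guard against overlap >= size
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += step
    return chunks

chunk_words("a b c d e f g h", size=4, overlap=1)
# → ["a b c d", "d e f g", "g h"]
```

Note how the last word of each chunk ("d", "g") reappears at the start of the next chunk, carrying context across the boundary.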

Inputs

  • Run (Event): Triggers the chunking operation. Default: none
  • Text (Text): The document content to be chunked. Default: none
  • Options (Data): Chunking configuration object, set via the properties panel. Default: see Default Options below

Outputs

  • Done (Event): Fires when chunking is complete
  • Chunks (Data): Array of chunk metadata objects containing start, end, and index positions
  • Texts (Data): Array of the actual text strings for each chunk

Runtime Behavior and Defaults

When the Run event fires, the node processes the input text according to the configured options:

  • If no text is provided, the node outputs empty arrays for both chunks and texts
  • The node extracts substring ranges from the original text based on the chunk boundaries calculated by the chunking algorithm
  • Both chunks (metadata) and texts (content) are output simultaneously when processing completes
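The output contract described above can be sketched in Python. The helper name `emit_outputs` is hypothetical, and the boundary calculation itself is omitted; the sketch only shows how the two outputs relate to the original text:

```python
def emit_outputs(text, boundaries):
    """Given (start, end) character offsets, produce the Chunks
    metadata array and the Texts array in one pass."""
    if not text:
        return [], []  # no input text → empty arrays on both outputs
    chunks = [{"start": s, "end": e, "index": i}
              for i, (s, e) in enumerate(boundaries)]
    texts = [text[s:e] for s, e in boundaries]
    return chunks, texts

chunks, texts = emit_outputs("Hello world. Goodbye.", [(0, 12), (13, 21)])
# texts → ["Hello world.", "Goodbye."]
```

Each entry in `chunks` pairs with the entry at the same index in `texts`.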

Default Options

{
  "strategy": "count",
  "unit": "word",
  "size": 700,
  "parts": 3,
  "separator": "\n\n",
  "overlap": 0
}

  • Strategy: count (creates chunks of specified size)
  • Unit: word (chunks by word count)
  • Size: 700 (units per chunk when using count strategy)
  • Parts: 3 (number of chunks when using divide strategy)
  • Separator: \n\n (used when unit is set to custom)
  • Overlap: 0 (no overlap between chunks)

Example

Basic Document Chunking

Scenario: Split a long article into 500-word chunks for processing by an AI model with a limited context window.

  1. Connect a Text node or document reader node to the Text input
  2. Set the properties panel options:
    • Strategy: count
    • Unit: word
    • Size: 500
    • Overlap: 50 (to maintain context between chunks)
  3. Connect the Done event to an AI Write node
  4. Use the Texts output to feed chunks sequentially into the AI model

Semantic Chunking by Paragraph

Scenario: Split a document by paragraphs to preserve semantic boundaries.

  1. Set Strategy to count
  2. Set Unit to paragraph
  3. Set Size to 1 (one paragraph per chunk)
  4. The Texts output will contain each paragraph as a separate array element
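In plain Python terms, this configuration behaves roughly like splitting on the paragraph separator (a sketch of the behavior, not the node's code):

```python
doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
# unit=paragraph, size=1 → one paragraph per chunk
texts = doc.split("\n\n")
# → ["First paragraph.", "Second paragraph.", "Third paragraph."]
```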

Fixed Number of Chunks

Scenario: Divide a document into exactly 5 equal parts for parallel processing.

  1. Set Strategy to divide
  2. Set Unit to word (or sentence for better semantic splits)
  3. Set Parts to 5
  4. The node will output exactly 5 chunks in the Texts array
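A sketch of the divide strategy, assuming it balances word counts as evenly as possible (the node's exact balancing rule may differ; `divide_words` is a hypothetical helper):

```python
def divide_words(text, parts):
    """Split text into exactly `parts` word chunks of near-equal size."""
    words = text.split()
    base, extra = divmod(len(words), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks

divide_words("one two three four five six seven", parts=3)
# → ["one two three", "four five", "six seven"]
```

The word count (7) does not divide evenly by 3, so the remainder is spread across the first chunks.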

Custom Separator Chunking

Scenario: Split a markdown document by headers.

  1. Set Unit to custom
  2. Set Separator to ## (or whatever delimiter marks your sections)
  3. The node will split the text at each occurrence of the separator
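One caveat when using a bare `##` separator on markdown: it also matches inside `###` headers. A line-anchored split avoids that (illustrative Python, not the node's implementation):

```python
import re

md = "## Intro\nSome text.\n## Usage\nMore text."
# Naive split on the separator string, roughly as described above:
naive = [s for s in md.split("## ") if s]
# Line-anchored split keeps "##" from matching inside "###" headers:
sections = [s for s in re.split(r"(?m)^## ", md) if s]
# → ["Intro\nSome text.\n", "Usage\nMore text."]
```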

Chunk Metadata

The Chunks output provides metadata including start and end character indices, allowing you to map processed results back to the original document positions.
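For example, a downstream step can use a chunk's metadata to locate its span in the source document (`locate` is a hypothetical helper):

```python
def locate(chunk_meta, original):
    """Return the slice of the original document that a chunk covers."""
    return original[chunk_meta["start"]:chunk_meta["end"]]

original = "Alpha beta gamma."
locate({"start": 6, "end": 10, "index": 1}, original)
# → "beta"
```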