Read Document

Controlled node

Overview

The Read Document node extracts text content from documents stored in the Intellectible library. It supports PDF, DOCX, TXT, MD, and HTML file formats. The node can optionally chunk the extracted text into smaller pieces—such as sentences, paragraphs, or words—for downstream processing in workflows.

Supported File Types

| Format | Description | Notes |
|---|---|---|
| PDF | Portable Document Format | Extracts raw text content from PDF documents |
| DOCX | Word Document | Converts document structure to markdown format |
| TXT | Plain Text | Reads raw text content |
| MD | Markdown | Reads raw text content |
| HTML | HyperText Markup Language | Extracts text content from HTML documents |

Inputs

| Input | Type | Description | Default |
|---|---|---|---|
| Run | Event | Fires when the node starts running | - |
| File | FileSource | The document file to read from the library. Accepts single files via the file picker or file objects from other nodes. | - |
| Chunk | Chunk Options | Optional configuration to split the document into smaller pieces. Configure in the properties panel or pass as data. If not provided, returns the full text as a single string. | - |

Chunk Options

When chunking is enabled, the node supports the following strategies:

  • Count: Groups elements (words, sentences, or paragraphs) into chunks of a specified size
  • Divide: Divides the text into a specified number of equal parts
  • Separator: Splits text by a custom delimiter string
  • Structure: Returns the elements as an array without combining them
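
The four strategies can be sketched in plain Python as follows. This is a minimal illustration, not the node's implementation: every function name here is hypothetical, the regex-based splitter is a crude stand-in for the node's boundary detection, and dividing by element count is an assumption about what "equal parts" means.

```python
import re
from typing import List


def split_elements(text: str, element: str) -> List[str]:
    """Naive splitter; a stand-in (assumption) for the node's boundary detection."""
    if element == "word":
        return text.split()
    if element == "sentence":
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if element == "paragraph":
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"unknown element type: {element}")


def chunk_count(elements: List[str], size: int, joiner: str = " ") -> List[str]:
    """Count: group consecutive elements into chunks of `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [joiner.join(elements[i:i + size]) for i in range(0, len(elements), size)]


def chunk_divide(elements: List[str], parts: int, joiner: str = " ") -> List[str]:
    """Divide: split into `parts` chunks with near-equal element counts (assumed)."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(elements), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are fewer elements than parts
            chunks.append(joiner.join(elements[start:end]))
        start = end
    return chunks


def chunk_separator(text: str, separator: str) -> List[str]:
    """Separator: split on a custom delimiter, dropping empty pieces."""
    if not separator:
        raise ValueError("separator must be a non-empty string")
    return [piece for piece in text.split(separator) if piece]


def chunk_structure(elements: List[str]) -> List[str]:
    """Structure: return the elements as-is, without combining them."""
    return list(elements)


if __name__ == "__main__":
    sample = "One two three. Four five. Six seven eight nine."
    sentences = split_elements(sample, "sentence")
    print(chunk_count(sentences, 2))
    print(chunk_divide(split_elements(sample, "word"), 3))
    print(chunk_structure(sentences))
```

Count and Structure differ only in whether elements are joined: Count with a size of 1 yields the same items as Structure in this sketch.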

Outputs

| Output | Type | Description |
|---|---|---|
| Text | Text / Array | The extracted text content. If chunking is enabled, returns an array of text chunks; otherwise returns a single string. |
| Success | Boolean | Indicates whether the document was successfully parsed (true) or if an error occurred (false). |
| Done | Event | Fires when the node has finished processing the document. |
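
Because Text can be either a single string or an array, code that consumes these outputs may want to normalize them first. The sketch below assumes the outputs arrive as plain Python values, with `None` standing in for an undefined text; `as_chunks` is a hypothetical helper, not part of the node.

```python
from typing import List, Optional, Union


def as_chunks(text: Union[str, List[str], None], success: bool) -> List[str]:
    """Return the Text output as a list of chunks, or an empty list on failure."""
    if not success or text is None:
        return []
    # Unchunked output is a single string; wrap it so callers can always iterate.
    return [text] if isinstance(text, str) else list(text)


if __name__ == "__main__":
    print(as_chunks("full document text", True))
    print(as_chunks(["chunk one", "chunk two"], True))
    print(as_chunks(None, False))
```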

Runtime Behavior and Defaults

  • File Validation: The node checks for a valid project ID and file object at runtime. If the file is missing or invalid, it returns undefined for text and false for success.
  • Extension Detection: The node attempts to determine the file type from the filename extension. If no extension is found, it falls back to the MIME type provided in the file metadata.
  • Default Chunking: If chunking is enabled but no specific strategy is configured, the node defaults to creating chunks of approximately 700 words.
  • Element Types: When chunking by sentence, paragraph, or word, the node uses natural language processing to identify boundaries accurately.
  • Error Handling: Unsupported file types will result in success: false and text: undefined.
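
The validation, type detection, and default chunking rules above can be sketched as follows. This is an illustration under stated assumptions, not the node's code: the file object shape (`name`, `mimeType`), the `extractors` mapping, and all function names are hypothetical, the MIME table is only a plausible subset, and `None` stands in for undefined.

```python
import os
from typing import Any, Callable, Dict, List, Optional

SUPPORTED = {"pdf", "docx", "txt", "md", "html"}

# Assumed MIME-to-extension mapping used only for the fallback path.
MIME_TO_EXT = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "text/plain": "txt",
    "text/markdown": "md",
    "text/html": "html",
}

DEFAULT_CHUNK_WORDS = 700  # default chunk size stated in the documentation


def detect_type(file: Dict[str, Any]) -> Optional[str]:
    """Use the filename extension; fall back to the MIME type only if there is none."""
    name = file.get("name") or ""
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    if not ext:
        ext = MIME_TO_EXT.get((file.get("mimeType") or "").lower(), "")
    return ext if ext in SUPPORTED else None


def default_chunks(text: str) -> List[str]:
    """Group words into chunks of DEFAULT_CHUNK_WORDS (whitespace-split approximation)."""
    words = text.split()
    return [
        " ".join(words[i:i + DEFAULT_CHUNK_WORDS])
        for i in range(0, len(words), DEFAULT_CHUNK_WORDS)
    ]


def read_document(
    project_id: Optional[str],
    file: Any,
    extractors: Dict[str, Callable[[Dict[str, Any]], str]],
) -> Dict[str, Any]:
    """Return {"text", "success"}; text is None and success False on any failure."""
    failure = {"text": None, "success": False}
    if not project_id or not isinstance(file, dict):
        return failure  # missing project ID or invalid file object
    kind = detect_type(file)
    if kind is None or kind not in extractors:
        return failure  # unsupported file type
    return {"text": extractors[kind](file), "success": True}


if __name__ == "__main__":
    extractors = {"txt": lambda f: f["content"]}
    print(read_document("project-1", {"name": "notes.txt", "content": "hello"}, extractors))
    print(read_document("project-1", {"name": "photo.png"}, extractors))
```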

Example Usage

Scenario: Extract text from a PDF contract and split it into paragraphs for clause-by-clause analysis.

  1. Add a Read Document node to your workflow.
  2. Connect a trigger event (like Start) to the Run input.
  3. Select a PDF file from the library using the File input in the properties panel.
  4. Enable Chunk in the properties panel and choose paragraphs as the element type with a strategy that keeps each paragraph separate (for example "Structure"), so the document is split at paragraph breaks.
  5. Connect the Text output to a For Each node to iterate through each paragraph.
  6. Connect the Done event to trigger downstream processing after all chunks are ready.

Tip: For large documents, use the "Count" strategy on sentences with a size of 1 to process the document sentence by sentence, or use "Divide" to split the document into a specific number of chunks for parallel processing.
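
As a rough illustration of the "Divide" tip, the sketch below splits a text into a fixed number of parts and hands each part to a worker in parallel. It is not how the workflow engine schedules nodes: `divide_words` and `process_in_parallel` are hypothetical names, dividing by word count is an assumption, and the worker here is a trivial word counter.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def divide_words(text: str, parts: int) -> List[str]:
    """Split text into `parts` chunks with near-equal word counts (assumed behavior)."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    words = text.split()
    base, extra = divmod(len(words), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when the text has fewer words than parts
            chunks.append(" ".join(words[start:end]))
        start = end
    return chunks


def process_in_parallel(
    chunks: List[str], worker: Callable[[str], T], max_workers: int = 4
) -> List[T]:
    """Run `worker` on every chunk concurrently; results keep the chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))


if __name__ == "__main__":
    document = "a b c d e f g"
    parts = divide_words(document, 3)
    print(parts)
    print(process_in_parallel(parts, lambda chunk: len(chunk.split())))
```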