Crawl Website

Controlled node

Overview

The Crawl Website node recursively crawls a website starting from a given URL, extracting text content from multiple pages. It supports depth-limited crawling, page count limits, and path depth restrictions to control the scope of the crawl. The node caches results daily to improve performance during development.

By default, the node returns a concatenated string of text from all crawled pages. When List Result is enabled, it returns an array of objects containing the URL and text for each page. When URL Only is enabled, it returns only the URLs without the page content.

Maximum Page Limit

The node enforces a hard limit of 100 pages per crawl to prevent excessive resource usage. If your crawl parameters would exceed this limit, the node will automatically cap the crawl at 100 pages.

Crawl Modes

| Mode | Description |
|------|-------------|
| Standard | Full browser-based crawling with JavaScript execution (slower, more accurate) |
| Quick Mode | Fast HTTP-only crawling using Axios (faster, less accurate for JS-heavy sites) |

Inputs

| Input | Type | Description | Default |
|-------|------|-------------|---------|
| Run | Event | Triggers the crawl operation | - |
| URL | Text | The starting URL to crawl from | - |
| Max Pages | Number | Maximum number of pages to crawl (hard limit: 100) | 10 |
| Max Depth | Number | Maximum link depth to follow from the starting URL | 1 |
| Max Path Depth | Number | Maximum path depth (number of / segments) to follow | 2 |

Configuration Options

These options appear in the properties panel but not on the node itself:

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| List Result | Boolean | When enabled, outputs an array of objects with `url` and `text` properties instead of concatenated text | false |
| Quick Mode | Boolean | When enabled, uses fast HTTP crawling instead of full browser rendering | false |
| URL Only | Boolean | When enabled, returns only URLs without page text content | false |

Outputs

| Output | Type | Description |
|--------|------|-------------|
| Done | Event | Fires when the crawl completes successfully |
| Output | Data | Either a concatenated string of page text (default) or an array of page objects (if List Result is enabled) |

Runtime Behavior and Defaults

Caching

The node implements daily caching based on input parameters. If the same URL and parameters are used within the same day, the cached result is returned immediately without re-crawling. The cache key includes:

  • The target URL
  • All crawl parameters (maxPages, maxDepth, maxPathDepth, quickMode, listResult, urlOnly)
  • The current date (timestamp rounded to the day)
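A cache key built this way can be sketched as follows. This is a hypothetical illustration, not the node's actual implementation: the function name `crawl_cache_key` and the use of SHA-256 over a sorted JSON payload are assumptions; what matters is that the key covers the URL, every crawl parameter, and the current day.

```python
import hashlib
import json
from datetime import date

def crawl_cache_key(url, max_pages=10, max_depth=1, max_path_depth=2,
                    quick_mode=False, list_result=False, url_only=False):
    """Build a cache key from the URL, all crawl parameters, and today's date."""
    payload = {
        "url": url,
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "maxPathDepth": max_path_depth,
        "quickMode": quick_mode,
        "listResult": list_result,
        "urlOnly": url_only,
        # Rounding the timestamp to the day means the key changes at midnight,
        # so a cached result is reused for at most one day.
        "day": date.today().isoformat(),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Because every parameter participates in the key, changing any one of them (for example, raising Max Pages) forces a fresh crawl rather than returning the stale cached result.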

Crawl Constraints

  • Maximum Pages: Hard capped at 100 pages per crawl
  • Within Host: The crawler only follows links within the same host as the starting URL
  • Depth Calculation:
    • maxDepth controls how many links deep to follow from the start page
    • maxPathDepth controls how many path segments (e.g., /path/to/page) to traverse
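The same-host rule and the path-depth rule together can be sketched as a single link filter. This is an illustrative sketch, not the node's source: the helper name `within_constraints` is assumed, and path depth is computed here as the number of non-empty `/` segments.

```python
from urllib.parse import urlparse

def within_constraints(link, start_url, max_path_depth=2):
    """Return True if `link` is on the same host as `start_url`
    and its path depth does not exceed `max_path_depth`."""
    link_parts = urlparse(link)
    start_parts = urlparse(start_url)
    # Within Host: only follow links on the same host as the starting URL.
    if link_parts.netloc != start_parts.netloc:
        return False
    # Path depth = number of non-empty / segments, e.g. /path/to/page -> 3.
    path_depth = len([s for s in link_parts.path.split("/") if s])
    return path_depth <= max_path_depth
```

Under this reading, with the default `maxPathDepth` of 2, `https://example.com/blog/post` would be followed but `https://example.com/blog/2024/post` would not.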

Output Formats

Default (Concatenated Text):

Page https://example.com/page1

[Extracted text from page 1]

Page https://example.com/page2

[Extracted text from page 2]

List Result Mode:

[
  {
    "url": "https://example.com/page1",
    "text": "[Extracted text from page 1]"
  },
  {
    "url": "https://example.com/page2",
    "text": "[Extracted text from page 2]"
  }
]

URL Only Mode:

Page https://example.com/page1

Page https://example.com/page2

With both List Result and URL Only enabled:

[
  {"url": "https://example.com/page1"},
  {"url": "https://example.com/page2"}
]

Example Usage

Basic Website Crawling

Connect a Text node containing a URL to the URL input, then trigger the Run event:

  1. Set Max Pages to 5 to limit the crawl
  2. Set Max Depth to 2 to follow links two levels deep
  3. Trigger the Run event
  4. The Output will contain text from up to 5 pages within 2 link depths
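The interplay of Max Pages and Max Depth in the steps above can be sketched as a breadth-first traversal. This is a conceptual model, not the node's implementation: `fetch_links` is a hypothetical stand-in for the node's page fetcher, injected here so the logic is testable without network access.

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=10, max_depth=1):
    """Breadth-first crawl. `fetch_links(url)` returns the links found on a page.
    Stops after `max_pages` pages or `max_depth` links from the start page."""
    HARD_LIMIT = 100  # the node caps every crawl at 100 pages
    max_pages = min(max_pages, HARD_LIMIT)
    visited, order = set(), []
    queue = deque([(start_url, 0)])  # (url, link depth from the start page)
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Only expand links from pages that are still above the depth limit.
        if depth < max_depth:
            for link in fetch_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return order
```

For example, with Max Depth 1 only the start page and the pages it links to directly are visited; raising Max Depth to 2 also visits the pages those link to, until Max Pages is reached.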

Structured Data Extraction

Enable List Result to process each page individually:

  1. Enable List Result in the properties panel
  2. Connect the Output to a For Each node
  3. Inside the loop, access element.url and element.text to process each page separately
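The loop body above is equivalent to iterating over the list-result array, where each element carries a `url` and a `text` property. A minimal sketch, with hypothetical sample data:

```python
# Sample of the List Result output shape: one object per crawled page.
list_result = [
    {"url": "https://example.com/page1", "text": "Welcome to page 1"},
    {"url": "https://example.com/page2", "text": "Welcome to page 2"},
]

# Equivalent of the For Each node: process each page's url and text separately.
summaries = [f"{page['url']}: {len(page['text'])} chars" for page in list_result]
```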

Quick Crawling for Static Sites

For static HTML sites without heavy JavaScript:

  1. Enable Quick Mode for faster crawling
  2. The node will use HTTP requests instead of browser rendering
  3. Note: Quick mode may not capture content loaded dynamically by JavaScript
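The limitation in step 3 follows from how static extraction works: Quick Mode parses the raw HTML response without executing scripts, so anything a script would have inserted is simply absent. The sketch below shows that kind of static extraction using Python's standard-library parser (the class and function names are illustrative, not the node's API):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from raw HTML, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks, self.skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Script/style contents are code, not page text, so drop them.
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Note that any text a script would have rendered (here, `var x=1;` producing content at runtime) never appears in the output, which is why Standard mode is the safer choice for JS-heavy sites.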

URL Discovery

To build a sitemap without extracting text:

  1. Enable URL Only mode
  2. Set Max Pages to the desired limit (up to 100)
  3. The output will contain only the discovered URLs, suitable for feeding into other nodes like Read Webpage for individual processing