Crawl Website

Controlled node

Overview

The Crawl Website node recursively crawls a website starting from a given URL, extracting text content from multiple pages. It supports depth-limited crawling, page count limits, and path depth restrictions to control the scope of the crawl. The node caches results daily to improve performance during development.

By default, the node returns a concatenated string of text from all crawled pages. When List Result is enabled, it returns an array of objects containing the URL and text for each page. When URL Only is enabled, it returns only the URLs without the page content.

Maximum Page Limit

The node enforces a hard limit of 100 pages per crawl to prevent excessive resource usage. If your crawl parameters would exceed this limit, the node will automatically cap the crawl at 100 pages.

Crawl Modes

| Mode | Description |
|------|-------------|
| Standard | Full browser-based crawling with JavaScript execution (slower, more accurate) |
| Quick Mode | Fast HTTP-only crawling using Axios (faster, less accurate for JS-heavy sites) |

Inputs

| Input | Type | Description | Default |
|-------|------|-------------|---------|
| Run | Event | Triggers the crawl operation | - |
| URL | Text | The starting URL to crawl from | - |
| Max Pages | Number | Maximum number of pages to crawl (hard limit: 100) | 10 |
| Max Depth | Number | Maximum link depth to follow from the starting URL | 1 |
| Max Path Depth | Number | Maximum path depth (number of / segments) to follow | 2 |

Configuration Options

These options appear in the properties panel but not on the node itself:

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| List Result | Boolean | When enabled, outputs an array of objects with `url` and `text` properties instead of concatenated text | false |
| Quick Mode | Boolean | When enabled, uses fast HTTP crawling instead of full browser rendering | false |
| URL Only | Boolean | When enabled, returns only URLs without page text content | false |

Outputs

| Output | Type | Description |
|--------|------|-------------|
| Done | Event | Fires when the crawl completes successfully |
| Output | Data | Either a concatenated string of page text (default) or an array of page objects (if List Result is enabled) |

Runtime Behavior and Defaults

Caching

The node implements daily caching based on input parameters. If the same URL and parameters are used within the same day, the cached result is returned immediately without re-crawling. The cache key includes:

  • The target URL
  • All crawl parameters (maxPages, maxDepth, maxPathDepth, quickMode, listResult, urlOnly)
  • The current date (timestamp rounded to the day)
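A cache key built this way can be sketched as follows. This is a hypothetical illustration, not the node's actual implementation: the function name `crawl_cache_key` and the use of SHA-256 over a sorted JSON payload are assumptions; what matters is that the key covers the URL, every crawl parameter, and the current day.

```python
import hashlib
import json
from datetime import date

def crawl_cache_key(url, max_pages=10, max_depth=1, max_path_depth=2,
                    quick_mode=False, list_result=False, url_only=False):
    """Build a cache key from the URL, all crawl parameters, and today's date."""
    payload = {
        "url": url,
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "maxPathDepth": max_path_depth,
        "quickMode": quick_mode,
        "listResult": list_result,
        "urlOnly": url_only,
        # Rounding the timestamp to the day means the key changes at midnight,
        # so a cached result is reused for at most one day.
        "day": date.today().isoformat(),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Because every parameter participates in the key, changing any one of them (for example, raising Max Pages) forces a fresh crawl rather than returning the stale cached result.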

Crawl Constraints

  • Maximum Pages: Hard capped at 100 pages per crawl
  • Within Host: The crawler only follows links within the same host as the starting URL
  • Depth Calculation:
    • maxDepth controls how many links deep to follow from the start page
    • maxPathDepth controls how many path segments (e.g., /path/to/page) to traverse
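The same-host rule and the path-depth rule together can be sketched as a single link filter. This is an illustrative sketch, not the node's source: the helper name `within_constraints` is assumed, and path depth is computed here as the number of non-empty `/` segments.

```python
from urllib.parse import urlparse

def within_constraints(link, start_url, max_path_depth=2):
    """Return True if `link` is on the same host as `start_url`
    and its path depth does not exceed `max_path_depth`."""
    link_parts = urlparse(link)
    start_parts = urlparse(start_url)
    # Within Host: only follow links on the same host as the starting URL.
    if link_parts.netloc != start_parts.netloc:
        return False
    # Path depth = number of non-empty / segments, e.g. /path/to/page -> 3.
    path_depth = len([s for s in link_parts.path.split("/") if s])
    return path_depth <= max_path_depth
```

Under this reading, with the default `maxPathDepth` of 2, `https://example.com/blog/post` would be followed but `https://example.com/blog/2024/post` would not.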

Output Formats

Default (Concatenated Text):

Page https://example.com/page1

[Extracted text from page 1]

Page https://example.com/page2

[Extracted text from page 2]

List Result Mode:

[
  {
    "url": "https://example.com/page1",
    "text": "[Extracted text from page 1]"
  },
  {
    "url": "https://example.com/page2",
    "text": "[Extracted text from page 2]"
  }
]

URL Only Mode:

Page https://example.com/page1

Page https://example.com/page2

With both List Result and URL Only enabled:

[
  {"url": "https://example.com/page1"},
  {"url": "https://example.com/page2"}
]

Example Usage

Basic Website Crawling

Connect a Text node containing a URL to the URL input, then trigger the Run event:

  1. Set Max Pages to 5 to limit the crawl
  2. Set Max Depth to 2 to follow links two levels deep
  3. Trigger the Run event
  4. The Output will contain text from up to 5 pages within 2 link depths
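The interplay of Max Pages and Max Depth in the steps above can be sketched as a breadth-first traversal. This is a conceptual model, not the node's implementation: `fetch_links` is a hypothetical stand-in for the node's page fetcher, injected here so the logic is testable without network access.

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=10, max_depth=1):
    """Breadth-first crawl. `fetch_links(url)` returns the links found on a page.
    Stops after `max_pages` pages or `max_depth` links from the start page."""
    HARD_LIMIT = 100  # the node caps every crawl at 100 pages
    max_pages = min(max_pages, HARD_LIMIT)
    visited, order = set(), []
    queue = deque([(start_url, 0)])  # (url, link depth from the start page)
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Only expand links from pages that are still above the depth limit.
        if depth < max_depth:
            for link in fetch_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return order
```

For example, with Max Depth 1 only the start page and the pages it links to directly are visited; raising Max Depth to 2 also visits the pages those link to, until Max Pages is reached.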

Structured Data Extraction

Enable List Result to process each page individually:

  1. Enable List Result in the properties panel
  2. Connect the Output to a For Each node
  3. Inside the loop, access element.url and element.text to process each page separately
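The loop body above is equivalent to iterating over the list-result array, where each element carries a `url` and a `text` property. A minimal sketch, with hypothetical sample data:

```python
# Sample of the List Result output shape: one object per crawled page.
list_result = [
    {"url": "https://example.com/page1", "text": "Welcome to page 1"},
    {"url": "https://example.com/page2", "text": "Welcome to page 2"},
]

# Equivalent of the For Each node: process each page's url and text separately.
summaries = [f"{page['url']}: {len(page['text'])} chars" for page in list_result]
```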

Quick Crawling for Static Sites

For static HTML sites without heavy JavaScript:

  1. Enable Quick Mode for faster crawling
  2. The node will use HTTP requests instead of browser rendering
  3. Note: Quick mode may not capture content loaded dynamically by JavaScript
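The limitation in step 3 follows from how static extraction works: Quick Mode parses the raw HTML response without executing scripts, so anything a script would have inserted is simply absent. The sketch below shows that kind of static extraction using Python's standard-library parser (the class and function names are illustrative, not the node's API):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from raw HTML, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks, self.skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Script/style contents are code, not page text, so drop them.
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Note that any text a script would have rendered (here, `var x=1;` producing content at runtime) never appears in the output, which is why Standard mode is the safer choice for JS-heavy sites.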

URL Discovery

To build a sitemap without extracting text:

  1. Enable URL Only mode
  2. Set Max Pages to the desired limit (up to 100)
  3. The output will contain only the discovered URLs, suitable for feeding into other nodes like Read Webpage for individual processing