Crawl Website
Controlled node
Overview
The Crawl Website node recursively crawls a website starting from a given URL, extracting text content from multiple pages. It supports depth-limited crawling, page count limits, and path depth restrictions to control the scope of the crawl. The node caches results daily to improve performance during development.
By default, the node returns a concatenated string of text from all crawled pages. When List Result is enabled, it returns an array of objects containing the URL and text for each page. When URL Only is enabled, it returns only the URLs without the page content.
The node enforces a hard limit of 100 pages per crawl to prevent excessive resource usage. If Max Pages is set higher than 100, the crawl is automatically capped at 100 pages.
Crawl Modes
| Mode | Description |
|---|---|
| Standard | Full browser-based crawling with JavaScript execution (slower, more accurate) |
| Quick Mode | Fast HTTP-only crawling using Axios (faster, less accurate for JS-heavy sites) |
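The Quick Mode trade-off can be pictured with a minimal sketch in Python (the node itself uses Axios; the tag-stripping here is a deliberate simplification, and `quick_fetch_text` is a hypothetical helper, not the node's actual implementation):

```python
import re
import urllib.request

def strip_html(html: str) -> str:
    """Crude text extraction: drop script/style bodies, then all tags,
    then collapse whitespace. No JavaScript ever runs."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def quick_fetch_text(url: str, timeout: float = 10.0) -> str:
    """HTTP-only fetch: fast, but misses content that a real browser
    would render dynamically -- the Quick Mode trade-off."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return strip_html(resp.read().decode("utf-8", errors="replace"))
```

Because no browser is involved, any content injected by client-side JavaScript is simply absent from the returned text, which is why Quick Mode is best suited to static HTML sites.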
Inputs
| Input | Type | Description | Default |
|---|---|---|---|
| Run | Event | Triggers the crawl operation | - |
| URL | Text | The starting URL to crawl from | - |
| Max Pages | Number | Maximum number of pages to crawl (hard limit: 100) | 10 |
| Max Depth | Number | Maximum link depth to follow from the starting URL | 1 |
| Max Path Depth | Number | Maximum path depth (number of / segments) to follow | 2 |
Configuration Options
These options appear in the properties panel but not on the node itself:
| Option | Type | Description | Default |
|---|---|---|---|
| List Result | Boolean | When enabled, outputs an array of objects with url and text properties instead of concatenated text | false |
| Quick Mode | Boolean | When enabled, uses fast HTTP crawling instead of full browser rendering | false |
| URL Only | Boolean | When enabled, returns only URLs without page text content | false |
Outputs
| Output | Type | Description |
|---|---|---|
| Done | Event | Fires when the crawl completes successfully |
| Output | Data | Either a concatenated string of page text (default) or an array of page objects (if List Result is enabled) |
Runtime Behavior and Defaults
Caching
The node implements daily caching based on input parameters. If the same URL and parameters are used within the same day, the cached result is returned immediately without re-crawling. The cache key includes:
- The target URL
- All crawl parameters (maxPages, maxDepth, maxPathDepth, quickMode, listResult, urlOnly)
- The current date (timestamp rounded to the day)
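A cache key built from those three ingredients could look like the sketch below (an illustration only; the node's actual key format is internal, and the SHA-256 digest is an assumption):

```python
import hashlib
import json
import time

def daily_cache_key(url: str, **params) -> str:
    """Combine the URL, all crawl parameters, and the current date
    (timestamp floored to the day) into a stable cache key, so cached
    results are reused within a day and expire afterwards."""
    day = int(time.time()) // 86400  # seconds since epoch, floored to days
    payload = json.dumps({"url": url, "params": params, "day": day},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Changing any parameter (or waiting until the next day) produces a different key, which is what forces a fresh crawl.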
Crawl Constraints
- Maximum Pages: Hard capped at 100 pages per crawl
- Within Host: The crawler only follows links within the same host as the starting URL
- Depth Calculation:
  - `maxDepth` controls how many links deep to follow from the start page
  - `maxPathDepth` controls how many path segments (e.g., `/path/to/page`) to traverse
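Taken together, these constraints amount to a breadth-first crawl like the sketch below (`links_of` is a hypothetical stand-in for fetching a page and extracting its links; the hard cap, same-host check, and both depth limits are applied as described above):

```python
from collections import deque
from urllib.parse import urlparse

HARD_LIMIT = 100  # hard cap on pages per crawl

def plan_crawl(start_url, links_of, max_pages=10, max_depth=1, max_path_depth=2):
    """Breadth-first URL discovery honoring the node's constraints."""
    host = urlparse(start_url).netloc
    limit = min(max_pages, HARD_LIMIT)      # Max Pages can never exceed 100
    seen, order = {start_url}, []
    queue = deque([(start_url, 0)])         # (url, link depth from start)
    while queue and len(order) < limit:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue                        # don't follow links any deeper
        for link in links_of(url):
            parsed = urlparse(link)
            if parsed.netloc != host:
                continue                    # stay within the starting host
            path_depth = len([s for s in parsed.path.split("/") if s])
            if path_depth > max_path_depth or link in seen:
                continue                    # too many / segments, or revisit
            seen.add(link)
            queue.append((link, depth + 1))
    return order
```

Note the two depth notions are independent: `max_depth` counts link hops from the start page, while `max_path_depth` looks only at the URL's path segments.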
Output Formats
Default (Concatenated Text):
```
Page https://example.com/page1
[Extracted text from page 1]
Page https://example.com/page2
[Extracted text from page 2]
```
List Result Mode:
```json
[
  {
    "url": "https://example.com/page1",
    "text": "[Extracted text from page 1]"
  },
  {
    "url": "https://example.com/page2",
    "text": "[Extracted text from page 2]"
  }
]
```
URL Only Mode:
```
Page https://example.com/page1
Page https://example.com/page2
```
Or in list result mode with URL only:
```json
[
  {"url": "https://example.com/page1"},
  {"url": "https://example.com/page2"}
]
```
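The four shapes above can be reproduced with a small formatter (a sketch; `pages` is assumed to be the crawler's internal list of (url, text) results, which is not part of the node's public interface):

```python
def format_output(pages, list_result=False, url_only=False):
    """Produce the node's output shapes from a list of (url, text) pairs:
    concatenated text by default, dicts when list_result is set, and
    URL-only variants of each when url_only is set."""
    if list_result:
        if url_only:
            return [{"url": u} for u, _ in pages]
        return [{"url": u, "text": t} for u, t in pages]
    if url_only:
        return "\n".join(f"Page {u}" for u, _ in pages)
    return "\n".join(f"Page {u}\n{t}" for u, t in pages)
```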
Example Usage
Basic Website Crawling
Connect a Text node containing a URL to the URL input, then trigger the Run event:
- Set Max Pages to `5` to limit the crawl
- Set Max Depth to `2` to follow links two levels deep
- Trigger the Run event
- The Output will contain text from up to 5 pages within 2 link depths
Structured Data Extraction
Enable List Result to process each page individually:
- Enable List Result in the properties panel
- Connect the Output to a For Each node
- Inside the loop, access `element.url` and `element.text` to process each page separately
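If you post-process the List Result array outside the editor instead, the same per-element access looks like ordinary code (a sketch, assuming the array shape shown under Output Formats; the word-count summary is just an example task):

```python
def summarize(pages):
    """Per-page processing equivalent to the For Each loop: each element
    exposes `url` and `text`, summarized here as a word count per URL."""
    return {p["url"]: len(p["text"].split()) for p in pages}
```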
Quick Crawling for Static Sites
For static HTML sites without heavy JavaScript:
- Enable Quick Mode for faster crawling
- The node will use HTTP requests instead of browser rendering
- Note: Quick Mode may not capture content loaded dynamically by JavaScript
URL Discovery
To build a sitemap without extracting text:
- Enable URL Only mode
- Set Max Pages to the desired limit (up to 100)
- The output will contain only the discovered URLs, suitable for feeding into other nodes like Read Webpage for individual processing