Crawl Website
Controlled node
Overview
This node is used to crawl a website, recursively following the links on each webpage and optionally returning the text content of the pages. It is useful for conducting a deep search for specific content within a website or for scraping data from multiple pages. The node has a number of different configuration options to control the crawling process, including the ability to limit the depth of the crawl and the number of pages to crawl.
Depth Inputs
The crawl node has two different depth inputs, Max Depth and Max Path Depth, which can be used to control how deep the node crawls a website.
- Max Depth: controls the maximum depth to which the node will crawl from the starting URL. For example, a depth of 1 means that the node will grab all the links from the starting URL page. A depth of 2 means that the node will grab all the links on the starting page and then grab all the links on each of those pages.
- Max Path Depth: controls the maximum path depth of the URLs to crawl. The crawler will only crawl and return URLs and content from links whose path depth is less than or equal to the specified value. A path depth of 1 corresponds to a URL like https://example.com/page1, while a path depth of 2 corresponds to a URL like https://example.com/page1/page2. This is useful for limiting the crawl to a specific section of a website, such as a blog or a product page.
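Path depth here means the number of path segments in the URL. A minimal sketch of how it could be computed (this is an illustration, not the node's actual implementation):

```python
from urllib.parse import urlparse

def path_depth(url: str) -> int:
    """Count non-empty path segments: https://example.com/page1/page2 -> 2."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])
```

With this definition, `https://example.com/` has a path depth of 0, so a Max Path Depth of 2 would admit `https://example.com/page1/page2` but not `https://example.com/a/b/c`.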
The crawl will only return URLs and content from links within the same domain as the starting URL. For example, it will not crawl social media links or links to other external sites.
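Putting the three constraints together, the crawl behaves roughly like a breadth-first traversal. The sketch below illustrates the limits described above; `get_links` is a hypothetical stand-in for fetching a page and extracting its links, which the real node does itself:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, get_links, max_pages=10, max_depth=1, max_path_depth=2):
    """Breadth-first crawl honoring Max Pages, Max Depth, and Max Path Depth.

    get_links(url) -> list of links on that page (hypothetical stand-in for
    the node's own fetching and parsing).
    """
    domain = urlparse(start_url).netloc

    def path_depth(url):
        # Number of non-empty path segments in the URL.
        return len([seg for seg in urlparse(url).path.split("/") if seg])

    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from the starting URL)
    results = []
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        results.append(url)
        if depth >= max_depth:
            continue  # links beyond Max Depth are not followed
        for link in get_links(url):
            if urlparse(link).netloc != domain:
                continue  # external links are never crawled
            if path_depth(link) > max_path_depth:
                continue  # deeper than Max Path Depth
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results
```

This is only a sketch of the observable behavior under the stated assumptions, not the node's source code.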
Inputs
Input | Type | Description | Default |
---|---|---|---|
URL | Text | The URL from which to start crawling | - |
Max Pages | Number | The maximum number of pages to return | 10 |
Max Depth | Number | The maximum depth to crawl from the starting URL | 1 |
Max Path Depth | Number | The maximum depth of the path to crawl | 2 |
Run | Event | Triggers the node to start crawling when fired | - |
Outputs
Output | Type | Description |
---|---|---|
Output | List or Text | The crawl results, as either a list or text |
Done | Event | Fires when the node finishes running |
Panel Controls
There are some optional control flags in the panel to configure certain aspects of the crawl node.
- ListResults: returns the crawl results as a list rather than as text. By default, the crawl results are returned as text.
- Quick mode: faster but less accurate; it does not parse the webpage as thoroughly as the default mode. It works with simple pages, but complex modern websites may be missing content.
- URLOnly: returns only the URLs of the crawled pages, without any text content. By default, text content is returned.
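The ListResults and URLOnly flags only change how the crawl results are shaped. A minimal sketch of that shaping, assuming each crawled page is a (url, text) pair (an illustration, not the node's actual implementation):

```python
def shape_results(pages, list_results=False, url_only=False):
    """pages: list of (url, text) pairs from the crawl.

    Mirrors the panel flags: URLOnly keeps only the URLs, and ListResults
    returns a list instead of joining everything into one text block.
    """
    items = [url for url, _ in pages] if url_only else [text for _, text in pages]
    return items if list_results else "\n".join(items)
```

With both flags off (the defaults), the page texts are returned joined as a single text value.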