Crawl Website
Controlled node
Overview
This node is used to crawl a website, recursively following the links on each webpage and optionally returning the text content of the pages. It is useful for conducting a deep search for specific content within a website or for scraping data from multiple pages. The node has a number of different configuration options to control the crawling process, including the ability to limit the depth of the crawl and the number of pages to crawl.
Depth Inputs
The crawl node has two different depth inputs, Max Depth and Max Path Depth, which can be used to control how deep the node crawls a website.
- Max Depth: controls the maximum depth to which the node will crawl from the starting URL. For example, a depth of 1 means that the node will grab all the links from the starting URL page. A depth of 2 means that the node will grab all the links on the starting page and then grab all the links on each of those pages.
- Max Path Depth: controls the maximum path depth of the URLs to crawl. The crawler will only crawl and return URLs and content from links whose path depth is less than or equal to the specified value. A path depth of 1 corresponds to a URL like https://example.com/page1, while a path depth of 2 corresponds to a URL like https://example.com/page1/page2. This is useful for limiting the crawl to a specific section of a website, such as a blog or a product page.
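Path depth here means the number of path segments in the URL. A minimal sketch of how it could be computed (this is an illustration, not the node's actual implementation):

```python
from urllib.parse import urlparse

def path_depth(url: str) -> int:
    """Count non-empty path segments: https://example.com/page1/page2 -> 2."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])
```

With this definition, `https://example.com/` has a path depth of 0, so a Max Path Depth of 2 would admit `https://example.com/page1/page2` but not `https://example.com/a/b/c`.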
The crawl will only return URLs and content from links within the same domain as the starting URL. For example, it will not crawl social media links or links to other external sites.
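Putting the three constraints together, the crawl behaves roughly like a breadth-first traversal. The sketch below illustrates the limits described above; `get_links` is a hypothetical stand-in for fetching a page and extracting its links, which the real node does itself:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, get_links, max_pages=10, max_depth=1, max_path_depth=2):
    """Breadth-first crawl honoring Max Pages, Max Depth, and Max Path Depth.

    get_links(url) -> list of links on that page (hypothetical stand-in for
    the node's own fetching and parsing).
    """
    domain = urlparse(start_url).netloc

    def path_depth(url):
        # Number of non-empty path segments in the URL.
        return len([seg for seg in urlparse(url).path.split("/") if seg])

    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from the starting URL)
    results = []
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        results.append(url)
        if depth >= max_depth:
            continue  # links beyond Max Depth are not followed
        for link in get_links(url):
            if urlparse(link).netloc != domain:
                continue  # external links are never crawled
            if path_depth(link) > max_path_depth:
                continue  # deeper than Max Path Depth
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results
```

This is only a sketch of the observable behavior under the stated assumptions, not the node's source code.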
Inputs
Input | Type | Description | Default |
---|---|---|---|
URL | Text | The URL from which to start crawling | - |
Max Pages | Number | The maximum number of pages to return | 10 |
Max Depth | Number | The maximum depth to crawl from the starting URL | 1 |
Max Path Depth | Number | The maximum depth of the path to crawl | 2 |
Run | Event | Triggers the node to start crawling when fired | - |
Outputs
Output | Type | Description |
---|---|---|
Output | List or Text | The crawl results, as either a list or text |
Done | Event | Fires when the node finishes running |
Panel Controls
There are some optional control flags in the panel to configure certain aspects of the crawl node.
- ListResults: returns the crawl results as a list rather than as text. By default, the crawl results are returned as text.
- Quick mode: faster but less accurate; it does not parse the webpage as thoroughly as the default mode. It works with simple pages, but complex modern websites may be missing content.
- URLOnly: returns only the URLs of the crawled pages, without any text content. By default, text content is returned.
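The ListResults and URLOnly flags only change how the crawl results are shaped. A minimal sketch of that shaping, assuming each crawled page is a (url, text) pair (an illustration, not the node's actual implementation):

```python
def shape_results(pages, list_results=False, url_only=False):
    """pages: list of (url, text) pairs from the crawl.

    Mirrors the panel flags: URLOnly keeps only the URLs, and ListResults
    returns a list instead of joining everything into one text block.
    """
    items = [url for url, _ in pages] if url_only else [text for _, text in pages]
    return items if list_results else "\n".join(items)
```

With both flags off (the defaults), the page texts are returned joined as a single text value.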