Crawl Website

Controlled node

Overview

This node is used to crawl a website, recursively grabbing the links within each webpage and optionally returning the text content of the pages. It is useful for conducting a deep search for specific content within a website or for scraping data from multiple pages. The node has a number of different configuration options to control the crawling process, including the ability to limit the depth of the crawl and the number of pages to crawl.

Depth Inputs

The crawl node has two depth inputs, Max Depth and Max Path Depth, which control how deep the node crawls a website.

  • Max Depth: controls the maximum depth to which the node will crawl from the starting URL. For example, a depth of 1 means the node will grab all the links from the starting URL page, while a depth of 2 means it will grab all the links on the starting page and then all the links on each of those pages.
  • Max Path Depth: controls the maximum depth of the URL path to crawl. The crawler will only crawl and return URLs and content from links whose path depth is less than or equal to the specified value. A path depth of 1 looks like https://example.com/page1, while a path depth of 2 looks like https://example.com/page1/page2. This is useful for limiting the crawl to a specific section of a website, such as a blog or a product page (see the sketch after this list).
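
The two limits are easy to confuse: Max Depth counts link-following hops from the starting URL, while Max Path Depth looks only at the shape of each URL. The Python sketch below illustrates the path-depth check, assuming path depth simply means the number of path segments in the URL; the helper names are illustrative, not part of the node.

```python
from urllib.parse import urlparse

def path_depth(url: str) -> int:
    """Number of path segments, e.g. https://example.com/page1/page2 has depth 2."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])

def within_path_depth(url: str, max_path_depth: int) -> bool:
    """Max Path Depth keeps only URLs whose path is this shallow or shallower."""
    return path_depth(url) <= max_path_depth

print(within_path_depth("https://example.com/page1", 1))        # True
print(within_path_depth("https://example.com/page1/page2", 1))  # False
```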

Domain Restriction

The crawl will only return URLs and content from links within the same domain as the starting URL. For example, it will not crawl social media links or links to other external sites.
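
A minimal sketch of what such a same-domain check could look like, assuming the comparison is made on the URL's hostname (the docs do not specify how subdomains are handled, so treat this as illustrative):

```python
from urllib.parse import urlparse

def same_domain(start_url: str, link: str) -> bool:
    """Keep only links whose hostname matches the starting URL's hostname."""
    return urlparse(start_url).netloc == urlparse(link).netloc

print(same_domain("https://example.com/blog", "https://example.com/about"))    # True
print(same_domain("https://example.com/blog", "https://twitter.com/example"))  # False
```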

Inputs

| Input | Type | Description | Default |
| --- | --- | --- | --- |
| URL | Text | The URL from which to start crawling | - |
| Max Pages | Number | The maximum number of pages to return | 10 |
| Max Depth | Number | The maximum depth to crawl from the starting URL | 1 |
| Max Path Depth | Number | The maximum depth of the path to crawl | 2 |
| Run | Event | Fires when the node starts running | - |
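
Taken together, the inputs bound the crawl from different directions: Max Pages caps the size of the result, Max Depth caps link-following hops, and Max Path Depth plus the domain restriction filter which links are followed at all. The breadth-first sketch below is purely illustrative of how these limits could interact, not the node's actual implementation; extract_links and the fake link map are hypothetical stand-ins.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for real link extraction, so the sketch runs on its own.
FAKE_LINKS = {
    "https://example.com": ["https://example.com/blog", "https://twitter.com/example"],
    "https://example.com/blog": ["https://example.com/blog/post-1/comments"],
}

def extract_links(url):
    return FAKE_LINKS.get(url, [])

def path_depth(url):
    return len([s for s in urlparse(url).path.split("/") if s])

def same_domain(a, b):
    return urlparse(a).netloc == urlparse(b).netloc

def crawl(start_url, max_pages=10, max_depth=1, max_path_depth=2):
    seen, results = {start_url}, []
    queue = deque([(start_url, 0)])               # (url, hops from the start URL)
    while queue and len(results) < max_pages:     # Max Pages caps the result list
        url, depth = queue.popleft()
        results.append(url)
        if depth >= max_depth:                    # Max Depth caps link-following hops
            continue
        for link in extract_links(url):
            if (link not in seen
                    and same_domain(start_url, link)            # domain restriction
                    and path_depth(link) <= max_path_depth):    # Max Path Depth filter
                seen.add(link)
                queue.append((link, depth + 1))
    return results

print(crawl("https://example.com", max_depth=2))
```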

Outputs

| Output | Type | Description |
| --- | --- | --- |
| Output | List \| Text | The crawl result, as either a list or a single text value |
| Done | Event | Fires when the node finishes running |

Panel Controls

There are some optional control flags in the panel to configure certain aspects of the crawl node.

  • ListResults: This will return the crawl results as a list rather than as text. By default, the crawl results are returned as text (see the sketch after this list).

  • Quick mode: This mode is faster but less accurate; it will not parse the webpage as thoroughly as the default mode. It works well with simple pages, but content may be missing from complex, modern websites.

  • URLOnly: This will return only the URLs of the crawled pages, without any text content. By default text content is returned.
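
How a downstream node sees the output depends on these flags. The snippet below only illustrates the three possible shapes, assuming each crawled page yields a URL and its extracted text; the variable names and the joining behavior are assumptions, not part of the node.

```python
# Hypothetical crawl result, one entry per crawled page.
pages = [
    {"url": "https://example.com", "text": "Welcome to Example..."},
    {"url": "https://example.com/blog", "text": "Latest posts..."},
]

# Default: a single text output with the page content (joining is an assumption).
as_text = "\n\n".join(page["text"] for page in pages)

# ListResults enabled: one list item per crawled page.
as_list = [page["text"] for page in pages]

# URLOnly enabled: just the crawled URLs, no page text.
urls_only = [page["url"] for page in pages]
```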