Read Webpage
Controlled node
Overview
This node parses the text content of a webpage so that you can manipulate and process it within your workflow. A common use is to include the parsed web content in a prompt for an AI generation task. The web parser makes a best effort to extract the main content of the page and remove non-essential elements such as ads, navigation bars, and other distractions. The content is returned as markdown text, which can be used in other nodes or saved to a file or the library.
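To illustrate what "best effort" extraction can mean, here is a rough Python sketch that drops elements which are usually not main content and emits simple markdown. It is not the node's actual parser, which is not documented here. The names `MainTextExtractor`, `html_to_markdown`, and the `SKIP_TAGS` list are assumptions made for this example, and it only handles headings and paragraphs.

```python
from html.parser import HTMLParser

# Assumed list of tags that rarely hold main content (illustrative only).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class MainTextExtractor(HTMLParser):
    """Collects heading and paragraph text, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.blocks = []      # finished markdown blocks
        self.current = []     # text fragments of the block being built
        self.prefix = ""      # markdown prefix for the current block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in HEADING_PREFIX:
            self._flush()
            self.prefix = HEADING_PREFIX[tag]
        elif self.skip_depth == 0 and tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADING_PREFIX or tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""


def html_to_markdown(html: str) -> str:
    """Return a markdown-like rendering of the main text in `html`."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text that was not closed by a tag
    return "\n\n".join(parser.blocks)
```

A tag list like this is a crude heuristic. Pages that build their content with scripts or mark navigation with generic `div` elements would need a more thorough approach.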
Inputs
Input | Type | Description | Default |
---|---|---|---|
URL | Text | The URL of the webpage | - |
Run | Event | Fires when the node starts running | - |
Outputs
Output | Type | Description |
---|---|---|
Done | Event | Fires when the node finishes running |
Text | Text or List | The parsed text content of the webpage as markdown. If Chunk is selected, a list is returned. |
Panel Controls
The panel contains optional control flags that configure certain aspects of the web parser.
- Quick mode: This mode is faster but less accurate, because it does not parse the webpage as thoroughly as the default mode. It works with simple pages, but content may be missing from complex modern websites.
- Chunk: This works in a similar way to the chunking in the Read Document node. If the webpage content is large, you can chunk it into a list of smaller pieces and process each one individually.
- Raw output: This returns the raw HTML content of the webpage instead of the parsed text.
- Deep clean: This makes a best effort to remove content from the output that does not contribute to the main content of the page. For example, it attempts to remove ads, navigation bars, and other non-essential elements.
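
To make the Chunk option more concrete, the following Python sketch splits markdown text into a list of pieces. The node's actual chunking strategy and chunk size are not documented here. The function name `chunk_markdown`, the `max_chars` parameter, and the choice to break on blank lines are assumptions made for illustration.

```python
def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together where possible;
    a single paragraph longer than max_chars is cut at the character limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split a paragraph that cannot fit in one chunk on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each element of the returned list can then be processed individually, for example by looping over the list and sending every chunk to a separate AI generation step.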