Automated Web Content Extraction and URL Discovery
Seamlessly extract text content and discover all internal/external URLs from any website. This powerful tool automates the process of gathering comprehensive web data for analysis and integration.
About This Workflow
This n8n workflow exposes two LangChain tools that operate on a given website: full text retrieval and URL discovery. The 'Text Retrieval Tool' works as a web scraper: it fetches the content of a specified URL and strips the HTML markup, returning clean, usable text. The 'URL Retrieval Tool' scans a page and extracts the URLs it links to. Both tools prepare web data for downstream processing, analysis, or knowledge base integration. The workflow also normalizes URLs: it adds a missing protocol and filters out invalid or duplicate links.
Key Features
- Comprehensive Text Extraction: Retrieves all textual content from a website, cleaning it of HTML markup.
- Complete URL Discovery: Identifies and extracts all internal and external links present on a webpage.
- Intelligent URL Formatting: Automatically adds missing protocols (http/https) and resolves relative URLs to absolute paths.
- Duplicate and Invalid URL Filtering: Ensures data integrity by removing redundant links and filtering out malformed URLs.
- Markdown Conversion: Converts extracted HTML content into a readable Markdown format.
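The text-extraction feature can be approximated outside n8n to see what kind of output to expect. The following is a minimal sketch using only the Python standard library, not the workflow's actual node logic. It assumes that "excluding" `<a>` tags means dropping the links together with their text, and it produces plain text rather than Markdown. The names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: elements listed here are dropped together with their content.
# <img> needs no entry because it is a void element with no text of its own.
SKIPPED_TAGS = {"a", "script", "style"}
# Tags that should start a new line in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring markup and skipped elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text` on a fetched page body returns one line of text per block element, which is usually enough to judge whether a site's content survives the cleanup before wiring it into the workflow.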
How To Use
- Trigger the Workflow: Initiate either the 'text_retrieval_tool' or the 'url_retrieval_tool' by providing a full website URL as the query.
- Text Retrieval: If using the text retrieval tool, the workflow will automatically fetch the website's content, convert it to Markdown (excluding <a> and <img> tags), and prepare it for output.
- URL Retrieval: If using the URL retrieval tool, the workflow will:
  - Extract the href attribute of every <a> tag.
  - Add missing protocols and resolve relative links to absolute URLs.
  - Filter out any invalid URLs.
  - Remove duplicate URLs.
  - Aggregate the cleaned and validated URLs with their titles.
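The URL retrieval steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not the workflow's node configuration: it uses only the Python standard library, treats the link text as the "title", defaults a protocol-less base URL to https, and counts only http/https links with a host as valid. The names `LinkCollector` and `discover_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects [href, link text] pairs from <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._current = None  # the [href, text] pair being built, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            self._current = [href, ""] if href else None

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def discover_links(html: str, base_url: str) -> list:
    # Assumption: a base URL given without a protocol defaults to https.
    if "://" not in base_url:
        base_url = "https://" + base_url

    collector = LinkCollector()
    collector.feed(html)
    collector.close()

    seen = set()
    results = []
    for href, text in collector.links:
        try:
            # Resolve relative links against the page URL.
            absolute = urljoin(base_url, href.strip())
            parts = urlparse(absolute)
        except ValueError:
            continue  # malformed URL
        # Assumption: only http(s) URLs with a host count as valid.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if absolute in seen:
            continue  # duplicate
        seen.add(absolute)
        results.append({"url": absolute, "title": " ".join(text.split())})
    return results
```

The first occurrence of each URL wins, so the title kept for a duplicated link is the one that appears earliest on the page. Links such as `mailto:` or `javascript:` are discarded by the scheme check.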
Workflow JSON
{
"id": "8edef7ed-89f8-4bdb-ab19-9b3242b1cdb4",
"name": "Automated Web Content Extraction and URL Discovery",
"nodes": 25,
"category": "Operations",
"status": "active",
"version": "1.0.0"
}
Note: This is a sample preview. The full workflow JSON contains node configurations, credentials placeholders, and execution logic.
About the Author
Crypto_Watcher
Web3 Developer
Automated trading bots and blockchain monitoring workflows.