Effortlessly Extract Website Text and Links with n8n Automation
Automate the extraction of all text content and all internal and external links from a given website using this n8n workflow. Both capabilities are packaged as Langchain tools, so they can be called from other workflows.
About This Workflow
This n8n workflow extracts both the text and the links of a web page. It comprises two Langchain tools: text_retrieval_tool and url_retrieval_tool. The text_retrieval_tool fetches the content of a specified URL and converts it from HTML to Markdown for cleaner processing, ignoring anchor tags and images. The url_retrieval_tool extracts all hyperlinks, cleans them, removes duplicates, and ensures each one carries the correct protocol and domain. Together, the two tools capture both the narrative content and the navigational structure of a web page.
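To make the text-extraction step concrete, here is a minimal Python sketch of the idea, not the workflow's actual node configuration. It assumes that "ignoring" anchor tags means dropping the element together with its contents, and it only emits a small Markdown-like subset (headings and list items). The names MarkdownishText and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class MarkdownishText(HTMLParser):
    """Collects page text, skipping <a>, <script>, and <style> subtrees."""

    SKIPPED = {"a", "script", "style"}  # assumption: dropped with their contents
    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}
    BLOCKS = {"p", "div", "br", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in self.BLOCKS:
            # Start block elements on a new line, with a Markdown prefix if any.
            self.parts.append("\n" + self.PREFIXES.get(tag, ""))
        # <img> carries no text, so it is ignored simply by not being handled.

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in self.BLOCKS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = MarkdownishText()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).split("\n")
    lines = (" ".join(line.split()) for line in raw_lines)  # collapse whitespace
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = '<h1>Title</h1><p>Hello <a href="/x">link text</a> world</p>'
    print(html_to_text(page))
```

In this sketch the anchor text "link text" does not appear in the output, which mirrors the described behavior of excluding anchors from the extracted text.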
Key Features
- Comprehensive Text Extraction: Retrieve all textual content from a website, converted into a clean Markdown format.
- Intelligent Link Discovery: Automatically identify, extract, and normalize all URLs from a given webpage.
- Duplicate URL Removal: Ensures a unique list of extracted links, preventing redundant data.
- Smart URL Formatting: Automatically prepends the correct protocol and domain to relative links.
- Langchain Integration: Seamlessly integrates with Langchain for advanced AI-powered workflows.
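The link-related features above (normalization, duplicate removal, and prepending the protocol and domain to relative links) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow's own code: the function name normalize_links is hypothetical, and treating non-HTTP schemes such as mailto: and bare fragments as invalid is an assumption about what "invalid" means here.

```python
from urllib.parse import urljoin, urlsplit


def normalize_links(hrefs, base_url):
    """Resolve hrefs against base_url, drop invalid ones, keep first occurrences."""
    seen = set()
    result = []
    for href in hrefs:
        href = (href or "").strip()
        if not href or href.startswith("#"):
            continue  # empty values and in-page fragments carry no new URL
        absolute = urljoin(base_url, href)  # prepends protocol and domain if missing
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # assumption: only http(s) links count as valid
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    raw = ["/about", "https://example.com/about", "", "contact", "/about"]
    print(normalize_links(raw, "https://example.com/blog/"))
```

Keeping a list alongside the set preserves the order in which links appear on the page, which a plain set would not.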
How To Use
- Trigger the Workflow: Manually trigger the Execute workflow node within either the text_retrieval_tool or url_retrieval_tool.
- Provide Website URL: In the text_retrieval_tool, the Query parameter expects the full website URL. For the url_retrieval_tool, the Query parameter will also accept the full website URL.
- Process Text: The text_retrieval_tool will automatically fetch the website, convert its HTML content to Markdown (ignoring a and img tags), and output the result.
- Process URLs: The url_retrieval_tool will extract all <a> tag href attributes, clean and normalize them, filter out invalid or empty URLs, and aggregate the results.
- Utilize Outputs: The extracted text or a list of URLs will be available as output from the respective tools for further processing in your n8n workflow.
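The first part of the "Process URLs" step, collecting the href attribute of every <a> tag, might look like the following Python sketch. It works on an HTML string that has already been fetched, and it returns the raw attribute values without cleaning them. HrefCollector and extract_hrefs are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Records the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the raw href values found in an HTML string."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    page = '<a href="/docs">Docs</a> <a name="top">no link</a> <img src="logo.png">'
    print(extract_hrefs(page))
```

Anchors without an href and non-anchor elements such as <img> are skipped, so only actual link targets reach the later cleaning and filtering stage.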
Workflow JSON
{
"id": "a04e255c-00d6-46b7-9b4d-4b74fb05b1a6",
"name": "Effortlessly Extract Website Text and Links with n8n Automation",
"nodes": 10,
"category": "DevOps",
"status": "active",
"version": "1.0.0"
}
Note: This is a sample preview. The full workflow JSON contains node configurations, credentials placeholders, and execution logic.
About the Author
DevOps_Master_X
Infrastructure Expert
Specializing in CI/CD pipelines, Docker, and Kubernetes automations.