Smart Multipage Web Scraper: Sitemap to Google Drive
This workflow efficiently scrapes multiple web pages from any website by utilizing its sitemap. It intelligently filters URLs based on keywords and fetches clean, structured content via Jina.ai, saving it directly to your Google Drive. Best of all, no API key is needed for Jina.ai.
About This Workflow
This n8n workflow automates the laborious task of scraping entire websites. It begins by fetching the target site's sitemap.xml to identify all available URLs. You can then apply keyword filters to home in on specific topics or page types (e.g., 'agent' or 'tool' related content). It uses Jina.ai's web scraping service, which needs no complex configuration or API key, to deliver clean, Markdown-formatted content. Each scraped page is processed to extract both title and content, then saved as an individual file in your Google Drive. A wait step between requests rate-limits the scraping to keep usage responsible.
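The first two stages described above (reading the sitemap, then filtering by keyword) can be sketched outside n8n as follows. This is a minimal illustration, not the workflow's actual node logic: the function names, the sample sitemap, and the choice to keep a URL when it matches any keyword (case-insensitively) are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Namespace defined by the sitemaps.org protocol for <urlset>/<url>/<loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> value found in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def filter_urls(urls: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Keep URLs that contain at least one keyword (assumed 'any' match)."""
    lowered = [k.lower() for k in keywords]
    return [u for u in urls if any(k in u.lower() for k in lowered)]


# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/agents</loc></url>
  <url><loc>https://example.com/docs/tools/http</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>
""".strip()

if __name__ == "__main__":
    all_urls = extract_urls(SAMPLE_SITEMAP)
    print(filter_urls(all_urls, ["agent", "tool"]))
```

In the workflow itself, the equivalent of `keywords` is the set of conditions configured in the 'Filter By Topics or Pages' node.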
Key Features
- Sitemap-Driven Scraping: Automatically discovers and processes all URLs listed in a website's sitemap.xml.
- Smart Content Filtering: Precisely target specific pages or topics using keyword-based filtering (e.g., 'agent', 'tool').
- Jina.ai Integration (No API Key!): Utilizes Jina.ai's powerful web scraper for clean, structured content extraction without the need for credentials.
- Markdown Content & Title Extraction: Automatically parses scraped data to extract page titles and full markdown content.
- Google Drive Archiving: Organizes and saves each scraped webpage as a distinct file in your Google Drive, named by URL and title.
- Responsible Scraping: Includes a built-in Wait node to prevent overwhelming target servers.
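To make the Jina.ai and title-extraction features more concrete, here is a rough sketch of how a scraped page might be turned into a title, a Markdown body, and a file name. It assumes the Jina.ai Reader is called by prefixing the page URL with https://r.jina.ai/ and that the plain-text response contains a "Title:" line and a "Markdown Content:" marker. The helper names and the file-name pattern are hypothetical; the workflow's own parsing step may differ.

```python
from typing import Tuple

# Assumed Jina.ai Reader prefix; no API key is sent.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the Reader URL for a page (assumed prefix convention)."""
    return READER_PREFIX + page_url


def parse_reader_text(text: str) -> Tuple[str, str]:
    """Split a Reader response into (title, markdown content)."""
    title = ""
    for line in text.splitlines():
        if line.startswith("Title:"):
            title = line[len("Title:"):].strip()
            break
    # Everything after the marker is treated as the page body.
    _, found, rest = text.partition("Markdown Content:")
    content = rest.strip() if found else text.strip()
    return title, content


def drive_file_name(page_url: str, title: str) -> str:
    """Hypothetical naming pattern combining URL and title."""
    return f"{page_url} - {title}"


# Hypothetical response body used only for illustration.
SAMPLE_RESPONSE = (
    "Title: Building Agents\n\n"
    "URL Source: https://example.com/docs/agents\n\n"
    "Markdown Content:\n"
    "# Building Agents\n\nAgents combine tools and memory.\n"
)

if __name__ == "__main__":
    page = "https://example.com/docs/agents"
    title, content = parse_reader_text(SAMPLE_RESPONSE)
    print(reader_url(page))
    print(drive_file_name(page, title))
    print(content)
```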
How To Use
- Configure Set Website URL: In the 'Set Website URL' node, update the sitemap_url value to the sitemap.xml address of the website you wish to scrape.
- Adjust Filter By Topics or Pages: Modify the conditions in the 'Filter By Topics or Pages' node. Add or change keywords (e.g., 'agent', 'tool') to precisely target the pages relevant to your needs. You can add more contains or equals conditions.
- Set Limit (Optional): In the 'Limit' node, adjust the Max Items parameter if you want to control the maximum number of URLs to process. Set to a high number or remove for full website scraping.
- Connect Google Drive: In the 'Save Webpage Contents to Google Drive' node, select or create your Google Drive credential. Choose the specific folder where you want to save the scraped content.
- Test and Activate: Click 'Test workflow' to run a test with a few pages. Once satisfied, activate the workflow to begin automated content extraction.
- Review Wait Node (Optional): If you encounter rate-limiting issues, adjust the Wait node's duration to increase the delay between scraping requests.
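The interplay of the Limit and Wait settings described in the steps above can be sketched as a simple loop. This is an illustrative approximation only: `scrape_with_delay`, its parameters, and the injected `fetch` and `sleep` callables are assumptions made so the example runs without network access, and they do not mirror n8n's internal behavior.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def scrape_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_items: Optional[int] = None,
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch up to max_items URLs, pausing between consecutive requests."""
    selected = list(urls)
    if max_items is not None:
        # Plays the role of the 'Limit' node's Max Items parameter.
        selected = selected[:max_items]

    results: Dict[str, str] = {}
    for index, url in enumerate(selected):
        if index > 0:
            # Plays the role of the Wait node: delay before every request
            # except the first.
            sleep(wait_seconds)
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    # Stand-in fetcher; a real one would request the page content.
    def fake_fetch(url: str) -> str:
        return f"content of {url}"

    pages = ["https://example.com/a", "https://example.com/b"]
    print(scrape_with_delay(pages, fake_fetch, max_items=2, wait_seconds=0.1))
```

Raising `wait_seconds` corresponds to increasing the Wait node's duration when the target site starts rate-limiting requests.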
Workflow JSON
{
"id": "b8bc8684-13dd-437d-8fda-1e9a23959797",
"name": "Smart Multipage Web Scraper: Sitemap to Google Drive",
"nodes": 6,
"category": "Operations",
"status": "active",
"version": "1.0.0"
}
Note: This is a sample preview. The full workflow JSON contains node configurations, credentials placeholders, and execution logic.
About the Author
SaaS_Connector
Integration Guru
Connecting CRM, Notion, and Slack to automate your life.
Related Workflows
Discover more workflows you might like
Universal CSV to JSON API Converter
Effortlessly transform CSV data into structured JSON with this versatile n8n workflow. Integrate it into any application as a custom API endpoint, supporting various input methods including file uploads and raw text.
Instant WooCommerce Order Notifications via Telegram
When a new order is placed on your WooCommerce store, instantly receive detailed notifications directly to your Telegram chat. Stay on top of your e-commerce operations with real-time alerts, including order specifics and a direct link to view the order.
On-Demand Microsoft SQL Query Execution
This workflow allows you to manually trigger and execute any SQL query against your Microsoft SQL Server database. Perfect for ad-hoc data lookups, administrative tasks, or quick tests, giving you direct control over your database operations.