Web Page to Markdown & Links
Scrapes web pages, converts HTML to Markdown, and extracts links using Firecrawl.dev API.
About This Workflow
This workflow leverages the Firecrawl.dev API to efficiently scrape web pages, transforming their HTML content into a clean Markdown format. It also extracts all hyperlinks present on the page, providing structured data suitable for analysis by AI models or for building link databases. The workflow is designed to handle API rate limits and process URLs in batches.
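For orientation, the HTTP call at the heart of the workflow looks roughly like the sketch below. The endpoint, payload fields, and response shape are assumed from Firecrawl.dev's public scrape API; confirm them against the current Firecrawl documentation before relying on them.

```typescript
// Minimal sketch of a Firecrawl.dev scrape call (endpoint and payload assumed
// from Firecrawl's public docs; verify against the current API reference).
const FIRECRAWL_API_KEY = "YOUR_API_KEY"; // replace with your Firecrawl.dev key

async function scrapePage(url: string): Promise<{ markdown: string; links: string[] }> {
  const response = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FIRECRAWL_API_KEY}`, // same header the HTTP Request node sets
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      formats: ["markdown", "links"], // ask for Markdown content plus extracted hyperlinks
    }),
  });

  if (!response.ok) {
    throw new Error(`Firecrawl request failed: ${response.status}`);
  }

  // Response paths (data.markdown, data.links) are assumptions to verify.
  const { data } = await response.json();
  return { markdown: data.markdown, links: data.links };
}
```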
Key Features
- HTML to Markdown Conversion: Converts raw HTML content into a human-readable and AI-friendly Markdown format.
- Link Extraction: Identifies and extracts all URLs (hyperlinks) from the scraped web pages.
- API Rate Limiting: Implements delays and batching to respect Firecrawl.dev API rate limits (10 requests per minute) and server memory constraints (batches of 40 or fewer); see the batching sketch after this list.
- Batch Processing: Processes multiple URLs efficiently in defined batches.
- Configurable Input: Allows users to define URLs from a data source (e.g., database) or an example array within the workflow.
- Structured Output: Organizes scraped data into title, description, content (Markdown), and links.
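To make the rate-limit handling concrete, the sketch below shows one way the batching described above could be expressed in code. It is illustrative only; in the workflow itself the Limit, SplitInBatches, and Wait nodes perform these steps, and the 40-item cap, 10-item batches, and 60-second pause simply mirror the values noted above.

```typescript
// Illustrative sketch of the batching strategy: cap a run at 40 URLs for
// memory, then process 10 URLs per batch with a pause between batches to stay
// within the 10-requests-per-minute Firecrawl rate limit.
const MAX_ITEMS_PER_RUN = 40;
const BATCH_SIZE = 10;
const WAIT_BETWEEN_BATCHES_MS = 60_000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processUrls(urls: string[], scrape: (url: string) => Promise<unknown>) {
  const limited = urls.slice(0, MAX_ITEMS_PER_RUN); // "40 items at a time"

  for (let i = 0; i < limited.length; i += BATCH_SIZE) {
    const batch = limited.slice(i, i + BATCH_SIZE); // "10 at a time"
    await Promise.all(batch.map(scrape));

    if (i + BATCH_SIZE < limited.length) {
      await sleep(WAIT_BETWEEN_BATCHES_MS); // "Wait" node: respect the rate limit
    }
  }
}
```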
How To Use
Setup:
- Firecrawl.dev API Key: Obtain an API key from Firecrawl.dev.
- Configure HTTP Request Node (`Retrieve Page Markdown and Links`):
  - Navigate to the `Credentials` section of the node.
  - Select your `Firecrawl Bearer` credential or create a new one.
  - Ensure the `Authorization` header is correctly set with your API key (e.g., `Bearer YOUR_API_KEY`). The workflow's sticky note `Sticky Note34` provides guidance.
- Define Input URLs:
  - Option A (Example): Edit the `Example fields from data source` node and update the `Page` array with your desired URLs.
  - Option B (Data Source): Connect your own data source to the `Get urls from own data source` node. Ensure the column containing URLs is named `Page` (see the input sketch after these setup steps).
- Configure Output: Connect the `Markdown data and Links` node to your desired data output destination (e.g., Airtable, database).
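Whichever input option you choose, the data reaching `Split out page URLs` should look roughly like the sketch below. The `Page` field name comes from the workflow; the URLs themselves are placeholders.

```typescript
// Illustrative input shape: the "Example fields from data source" node defines
// a "Page" array, and "Split out page URLs" expands it into one item per URL.
// Field name "Page" comes from the workflow; the URLs are placeholders.
const input = {
  Page: [
    "https://example.com/",
    "https://example.com/blog",
  ],
};

// After "Split out page URLs", each URL becomes its own item:
const splitItems = input.Page.map((url) => ({ Page: url }));
```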
Workflow Explanation:
- `When clicking ‘Test workflow’` (Manual Trigger): Initiates the workflow.
- `Get urls from own data source` (NoOp): Placeholder for fetching URLs from an external source.
- `Example fields from data source` (Set Node): Defines an array of URLs to be processed (or receives URLs from the previous node).
- `Sticky Note33`: Instructs on connecting data sources and naming the URL column `Page`.
- `Split out page URLs` (SplitOut Node): Takes the `Page` field from the input and prepares it for individual processing.
- `40 items at a time` (Limit Node): Limits the number of items processed in a batch to 40, as indicated by `Sticky Note36`, to manage server memory.
- `10 at a time` (SplitInBatches Node): Further divides the items into batches of 10 for more granular control and API request management.
- `Wait` (Wait Node): Introduces a delay to respect API rate limits (10 requests per minute), as suggested by `Sticky Note37`.
- `Retrieve Page Markdown and Links` (HTTP Request Node): Makes a POST request to the Firecrawl.dev API to scrape each URL, requesting the `markdown` and `links` formats.
- `Markdown data and Links` (Set Node): Structures the data received from Firecrawl.dev into `title`, `description`, `content`, and `links` fields (a mapping sketch follows this list).
- `Connect to your own data source` (NoOp): Placeholder for sending the processed data to your output destination.
- `Sticky Note35`: Reminds users to configure output to their data source.
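As a sketch of the `Markdown data and Links` mapping: the output field names below come from the workflow, while the Firecrawl response paths (`data.markdown`, `data.links`, `data.metadata.*`) are assumptions to verify against an actual API response.

```typescript
// Sketch of the "Markdown data and Links" mapping. Output field names come
// from the workflow; the Firecrawl response paths (data.metadata.*,
// data.markdown, data.links) are assumptions -- confirm against a real response.
interface ScrapedPage {
  title: string;
  description: string;
  content: string; // page content as Markdown
  links: string[]; // hyperlinks extracted from the page
}

function toScrapedPage(firecrawlResponse: any): ScrapedPage {
  const data = firecrawlResponse?.data ?? {};
  return {
    title: data.metadata?.title ?? "",
    description: data.metadata?.description ?? "",
    content: data.markdown ?? "",
    links: data.links ?? [],
  };
}
```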
Apps Used
- Firecrawl.dev (via the HTTP Request node)
- n8n core nodes: Manual Trigger, Set, Split Out, Limit, Split In Batches, Wait, NoOp
Workflow JSON
{
  "id": "b07d2950-705c-4571-8658-7a8eac8b4bc9",
  "name": "Web Page to Markdown & Links",
  "nodes": 17,
  "category": "Data Extraction",
  "status": "active",
  "version": "1.0.0"
}

Note: This is a sample preview. The full workflow JSON contains node configurations, credential placeholders, and execution logic.
Get This Workflow
ID: b07d2950-705c-4571-8658-7a8eac8b4bc9
About the Author
N8N_Community_Pick
Curator
Hand-picked, high-quality workflows from the global community.
Related Workflows
Discover more workflows you might like
Reddit Post Analysis and Summarization for n8n
Fetches Reddit posts related to n8n, filters them, and uses OpenAI to classify and summarize relevant content.
Google Slides Metadata Extractor
Extract structured metadata from Google Slides, including slide content and thumbnails.
HubSpot CRM Contact Data Extractor with Pagination
Fetches contact data from HubSpot CRM, handling pagination to retrieve all records.