Web Page to Markdown & Links
Scrapes web pages, converts HTML to Markdown, and extracts links using Firecrawl.dev API.
About This Workflow
This workflow leverages the Firecrawl.dev API to efficiently scrape web pages, transforming their HTML content into a clean Markdown format. It also extracts all hyperlinks present on the page, providing structured data suitable for analysis by AI models or for building link databases. The workflow is designed to handle API rate limits and process URLs in batches.
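For orientation, the HTTP call at the heart of the workflow looks roughly like the sketch below. The endpoint, payload fields, and response shape are assumed from Firecrawl.dev's public scrape API; confirm them against the current Firecrawl documentation before relying on them.

```typescript
// Minimal sketch of a Firecrawl.dev scrape call (endpoint and payload assumed
// from Firecrawl's public docs; verify against the current API reference).
const FIRECRAWL_API_KEY = "YOUR_API_KEY"; // replace with your Firecrawl.dev key

async function scrapePage(url: string): Promise<{ markdown: string; links: string[] }> {
  const response = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FIRECRAWL_API_KEY}`, // same header the HTTP Request node sets
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      formats: ["markdown", "links"], // ask for Markdown content plus extracted hyperlinks
    }),
  });

  if (!response.ok) {
    throw new Error(`Firecrawl request failed: ${response.status}`);
  }

  // Response paths (data.markdown, data.links) are assumptions to verify.
  const { data } = await response.json();
  return { markdown: data.markdown, links: data.links };
}
```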
Key Features
- HTML to Markdown Conversion: Converts raw HTML content into a human-readable and AI-friendly Markdown format.
- Link Extraction: Identifies and extracts all URLs (hyperlinks) from the scraped web pages.
- API Rate Limiting: Implements delays and batching to respect Firecrawl.dev API rate limits (10 requests per minute) and server memory constraints (batches of 40 or fewer); see the batching sketch after this list.
- Batch Processing: Processes multiple URLs efficiently in defined batches.
- Configurable Input: Allows users to define URLs from a data source (e.g., database) or an example array within the workflow.
- Structured Output: Organizes scraped data into title, description, content (Markdown), and links.
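To make the rate-limit handling concrete, the sketch below shows one way the batching described above could be expressed in code. It is illustrative only; in the workflow itself the Limit, SplitInBatches, and Wait nodes perform these steps, and the 40-item cap, 10-item batches, and 60-second pause simply mirror the values noted above.

```typescript
// Illustrative sketch of the batching strategy: cap a run at 40 URLs for
// memory, then process 10 URLs per batch with a pause between batches to stay
// within the 10-requests-per-minute Firecrawl rate limit.
const MAX_ITEMS_PER_RUN = 40;
const BATCH_SIZE = 10;
const WAIT_BETWEEN_BATCHES_MS = 60_000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processUrls(urls: string[], scrape: (url: string) => Promise<unknown>) {
  const limited = urls.slice(0, MAX_ITEMS_PER_RUN); // "40 items at a time"

  for (let i = 0; i < limited.length; i += BATCH_SIZE) {
    const batch = limited.slice(i, i + BATCH_SIZE); // "10 at a time"
    await Promise.all(batch.map(scrape));

    if (i + BATCH_SIZE < limited.length) {
      await sleep(WAIT_BETWEEN_BATCHES_MS); // "Wait" node: respect the rate limit
    }
  }
}
```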
How To Use
Setup:
- Firecrawl.dev API Key: Obtain an API key from Firecrawl.dev.
- Configure HTTP Request Node (`Retrieve Page Markdown and Links`):
  - Navigate to the `Credentials` section of the node.
  - Select your `Firecrawl Bearer` credential or create a new one.
  - Ensure the `Authorization` header is correctly set with your API key (e.g., `Bearer YOUR_API_KEY`). The workflow's sticky note `Sticky Note34` provides guidance.
- Define Input URLs:
  - Option A (Example): Edit the `Example fields from data source` node and update the `Page` array with your desired URLs.
  - Option B (Data Source): Connect your own data source to the `Get urls from own data source` node. Ensure the column containing URLs is named `Page` (see the input sketch after these setup steps).
- Configure Output: Connect the `Markdown data and Links` node to your desired data output destination (e.g., Airtable, database).
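Whichever input option you choose, the data reaching `Split out page URLs` should look roughly like the sketch below. The `Page` field name comes from the workflow; the URLs themselves are placeholders.

```typescript
// Illustrative input shape: the "Example fields from data source" node defines
// a "Page" array, and "Split out page URLs" expands it into one item per URL.
// Field name "Page" comes from the workflow; the URLs are placeholders.
const input = {
  Page: [
    "https://example.com/",
    "https://example.com/blog",
  ],
};

// After "Split out page URLs", each URL becomes its own item:
const splitItems = input.Page.map((url) => ({ Page: url }));
```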
Workflow Explanation:
- `When clicking ‘Test workflow’` (Manual Trigger): Initiates the workflow.
- `Get urls from own data source` (NoOp): Placeholder for fetching URLs from an external source.
- `Example fields from data source` (Set Node): Defines an array of URLs to be processed (or receives URLs from the previous node).
- `Sticky Note33`: Instructs on connecting data sources and naming the URL column `Page`.
- `Split out page URLs` (SplitOut Node): Takes the `Page` field from the input and prepares it for individual processing.
- `40 items at a time` (Limit Node): Limits the number of items processed in a batch to 40, as indicated by `Sticky Note36`, to manage server memory.
- `10 at a time` (SplitInBatches Node): Further divides the items into batches of 10 for more granular control and API request management.
- `Wait` (Wait Node): Introduces a delay to respect API rate limits (10 requests per minute), as suggested by `Sticky Note37`.
- `Retrieve Page Markdown and Links` (HTTP Request Node): Makes a POST request to the Firecrawl.dev API to scrape each URL, requesting the `markdown` and `links` formats.
- `Markdown data and Links` (Set Node): Structures the data received from Firecrawl.dev into `title`, `description`, `content`, and `links` fields (a mapping sketch follows this list).
- `Connect to your own data source` (NoOp): Placeholder for sending the processed data to your output destination.
- `Sticky Note35`: Reminds users to configure output to their data source.
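As a sketch of the `Markdown data and Links` mapping: the output field names below come from the workflow, while the Firecrawl response paths (`data.markdown`, `data.links`, `data.metadata.*`) are assumptions to verify against an actual API response.

```typescript
// Sketch of the "Markdown data and Links" mapping. Output field names come
// from the workflow; the Firecrawl response paths (data.metadata.*,
// data.markdown, data.links) are assumptions -- confirm against a real response.
interface ScrapedPage {
  title: string;
  description: string;
  content: string; // page content as Markdown
  links: string[]; // hyperlinks extracted from the page
}

function toScrapedPage(firecrawlResponse: any): ScrapedPage {
  const data = firecrawlResponse?.data ?? {};
  return {
    title: data.metadata?.title ?? "",
    description: data.metadata?.description ?? "",
    content: data.markdown ?? "",
    links: data.links ?? [],
  };
}
```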
Apps Used
- Firecrawl.dev (via the HTTP Request node)
- n8n core nodes: Manual Trigger, Set, Split Out, Limit, Split In Batches, Wait, NoOp
Workflow JSON
{
  "id": "b07d2950-705c-4571-8658-7a8eac8b4bc9",
  "name": "Web Page to Markdown & Links",
  "nodes": 17,
  "category": "Data Extraction",
  "status": "active",
  "version": "1.0.0"
}

Note: This is a sample preview. The full workflow JSON contains node configurations, credential placeholders, and execution logic.
Get This Workflow
ID: b07d2950-705c-4571-8658-7a8eac8b4bc9
About the Author
N8N_Community_Pick
Curator
Hand-picked, high-quality workflows from the global community.
Related Workflows
Discover more workflows you might like
Reddit Post Analysis and Summarization for n8n
Fetches Reddit posts related to n8n, filters them, and uses OpenAI to classify and summarize relevant content.
Google Slides Metadata Extractor
Extract structured metadata from Google Slides, including slide content and thumbnails.
HubSpot CRM Contact Data Extractor with Pagination
Fetches contact data from HubSpot CRM, handling pagination to retrieve all records.