AI-Powered Product Data Extraction from Web Pages
Automate product data collection from web pages. This workflow scrapes each page, cleans the HTML, and uses an AI model to extract structured product details such as name, description, rating, and price, then writes them to Google Sheets.
About This Workflow
This n8n workflow collects product data from web pages for businesses and analysts. It first reads the target URLs from Google Sheets, then uses BrightData's web unlocker to retrieve the raw HTML from pages protected by anti-scraping measures. A custom function cleans the HTML, removing clutter such as scripts and styles, to give the AI a focused input. Finally, an LLM, orchestrated via LangChain and OpenRouter with a structured output parser, extracts the product attributes in a consistent format, and the results are appended to a designated Google Sheet.
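The HTML cleaning step could be sketched as follows. The workflow's actual custom function is not shown in this preview, so this is only a minimal illustration in Python using the standard library. The list of skipped tags, the reduction to plain text, and the `max_chars` limit are all assumptions, not the author's implementation.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of noisy tags."""

    # Assumed list of tags to drop; the real workflow may use a different one.
    SKIPPED_TAGS = {"script", "style", "noscript", "svg", "iframe"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace so the LLM input stays compact.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def clean_html(raw_html: str, max_chars: int = 20000) -> str:
    """Return the visible text of raw_html, truncated to max_chars (hypothetical limit)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()[:max_chars]
```

Truncating the cleaned text is one way to keep the prompt within a model's context window. Whether the original workflow does this is not stated.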
Key Features
- Automated Web Scraping: Fetches URLs from Google Sheets and uses BrightData for resilient, anti-bot-bypassing web content retrieval.
- Intelligent HTML Cleaning: Custom code preprocesses scraped HTML, removing distractions (scripts, styles, ads) and focusing the content for AI analysis.
- AI-Powered Data Extraction: Uses a LangChain LLM node (via OpenRouter, with a configurable model such as GPT-4.1) to identify and extract specific product information.
- Structured Output Guarantee: Employs a robust structured output parser to ensure extracted data (name, description, rating, reviews, price) conforms to a predefined JSON schema.
- Seamless Google Sheets Integration: Reads URLs to scrape from one sheet and writes all extracted, structured product data to another.
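To illustrate what the structured output guarantee amounts to, here is a minimal Python sketch that checks an LLM response against the five fields named above. The field types are assumptions (the listing gives only the field names), and `parse_product` is a hypothetical helper, not part of n8n or LangChain.

```python
import json

# Assumed types for each field; the source lists only the field names.
PRODUCT_FIELDS = {
    "name": str,
    "description": str,
    "rating": (int, float),
    "reviews": int,
    "price": str,
}


def parse_product(llm_output: str) -> dict:
    """Parse the model's JSON reply and reject anything that does not fit the schema."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = [key for key in PRODUCT_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    for key, expected in PRODUCT_FIELDS.items():
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(
                f"field {key!r} has unexpected type {type(value).__name__}"
            )

    # Drop any extra keys so every row written to the sheet has the same columns.
    return {key: data[key] for key in PRODUCT_FIELDS}
```

Rejecting malformed replies before the write step keeps the results sheet consistent, which is the purpose the structured output parser serves in the workflow.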
How To Use
- Set up Google Sheets: Create two Google Sheets. One for input URLs (referenced by WEB_SHEET_ID and TRACK_SHEET_GID in the workflow parameters), and another for output results (referenced by WEB_SHEET_ID and RESULTS_SHEET_GID). Configure your Google Sheets OAuth2 API credentials in n8n.
- Configure BrightData: Sign up for a BrightData account and obtain your API token. Store your BRIGHTDATA_TOKEN as a global credential or environment variable in n8n and ensure you have a web_unlocker1 zone set up in BrightData.
- Adjust LLM Model (Optional): The 'OpenRouter Chat Model' node uses a specific model. You can explore and select other compatible models available through OpenRouter if desired.
- Define Output Schema: Review the 'Structured Output Parser' node's inputSchema to understand the expected JSON format. Modify it if you need to extract additional or different data fields.
- Update Extraction Prompt: In the 'extract data' node, adjust the LLM prompt within the 'messages' field to clearly instruct the AI on what product information to extract based on your desired schema.
- Run the Workflow: Populate your input Google Sheet with target URLs, then trigger the workflow manually using the 'When clicking ‘Test workflow’' node or set up a regular schedule. Extracted and structured product data will be appended to your designated results sheet.
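For readers who want to test their BrightData token and zone outside n8n, the request could look roughly like the Python sketch below. The endpoint URL and the `zone` / `url` / `format` body fields are assumptions based on BrightData's public Web Unlocker API and should be checked against the current BrightData documentation. The workflow's own HTTP node configuration is not shown in this preview.

```python
import json
import urllib.request

# Assumed endpoint; verify against BrightData's current API documentation.
BRIGHTDATA_ENDPOINT = "https://api.brightdata.com/request"


def build_unlocker_request(target_url: str, token: str,
                           zone: str = "web_unlocker1") -> urllib.request.Request:
    """Build (but do not send) a POST request asking the zone to fetch target_url."""
    body = json.dumps({
        "zone": zone,        # zone name from the setup step above
        "url": target_url,   # page to scrape
        "format": "raw",     # assumed option: return the raw page body
    }).encode("utf-8")
    return urllib.request.Request(
        BRIGHTDATA_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_html(target_url: str, token: str,
               zone: str = "web_unlocker1", timeout: int = 60) -> str:
    """Send the request and return the response body as text."""
    request = build_unlocker_request(target_url, token, zone)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_html("https://example.com/product/1", token)` with a valid BRIGHTDATA_TOKEN should return the page HTML if the zone is configured. A 401 or 403 response usually points to a token or zone problem rather than to the target site.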
Workflow JSON
{
"id": "306b9278-2e4e-4b7e-a4c7-90b37bda69d0",
"name": "AI-Powered Product Data Extraction from Web Pages",
"nodes": 5,
"category": "Operations",
"status": "active",
"version": "1.0.0"
}
Note: This is a sample preview. The full workflow JSON contains node configurations, credentials placeholders, and execution logic.
About the Author
SaaS_Connector
Integration Guru
Connecting CRM, Notion, and Slack to automate your life.