# data-scraper
Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.
## When to Use
- Extract text content from web pages (articles, blogs, docs)
- Scrape product prices, reviews, or listings
- Monitor pages for changes (price drops, new content)
- Batch-collect data from multiple URLs
- Convert HTML tables to structured formats (JSON/CSV)
## Quick Start
```bash
# Extract readable text from a URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600
```
## Extraction Modes
### Text Mode (default)
Fetches the page and extracts the readable content, stripping HTML tags, scripts, and styles. Similar to reader mode.
```bash
data-scraper fetch URL
# Output: clean markdown text
```
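For example, text mode pairs with the `md` output format (documented under Output Formats below) to archive an article as a Markdown file; the output path here is arbitrary:

```bash
# Save the readable text of an article as Markdown
data-scraper fetch "https://example.com/article" --format md > article.md
```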
### Selector Mode
Target specific CSS selectors for precise extraction.
```bash
data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data
```
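Selector results can be captured as structured data for later processing. A minimal sketch, assuming `--format json` composes with `extract` the same way it does with `fetch`:

```bash
# Capture matched elements as JSON for downstream tooling
data-scraper extract "https://example.com/shop" \
  --selector ".product-title, .price, .rating" --format json > products.json
```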
### Table Mode
Extract HTML tables into structured formats.
```bash
data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
```
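For spreadsheet work, table mode pairs with the CSV output format. A sketch, assuming `--format csv` applies to the `table` command as well:

```bash
# Export the first HTML table on the page as CSV
data-scraper table "https://example.com/stats" --index 0 --format csv > stats.csv
```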
### Link Mode
Extract all links from a page with optional filtering.
```bash
data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
```
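Link mode composes naturally with batch scraping: collect the matching URLs first, then feed the list back in. A sketch using only the commands documented here:

```bash
# Collect every PDF link, then batch-scrape the list with rate limiting
data-scraper links "https://example.com/reports" --filter "*.pdf" > urls.txt
data-scraper batch urls.txt --delay 2000 --output reports/
```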
## Batch Scraping
```bash
# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/
```
`urls.txt` format:

```
https://site1.com/page
https://site2.com/page
https://site3.com/page
```
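For paginated sources, the URL list can be generated rather than written by hand. A sketch; the `?page=` query pattern is hypothetical and depends on the target site:

```bash
# Build urls.txt for pages 1-10 of a hypothetical paginated listing
for page in $(seq 1 10); do
  echo "https://example.com/listings?page=$page"
done > urls.txt

data-scraper batch urls.txt --delay 2000 --output listings/
```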
## Change Monitoring
```bash
# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL
```
Stores timestamped snapshots in `data-scraper/snapshots/`. Alerts via notification-hub when changes are detected.
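For unattended monitoring, `diff` can also be driven from cron instead of leaving `watch` running in the foreground. A sketch; it assumes `diff` refreshes the stored snapshot on each run, which the behavior above does not spell out, and the schedule and log path are arbitrary:

```bash
# crontab entry: check the product page hourly and log any differences
0 * * * * data-scraper diff "https://example.com/product" >> ~/scraper-diffs.log 2>&1
```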
## Output Formats
| Format | Flag | Use Case |
|--------|------|----------|
| Text | `--format text` | Reading, summarization |
| JSON | `--format json` | Data processing |
| CSV | `--format csv` | Spreadsheets |
| Markdown | `--format md` | Documentation |
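JSON is the easiest format to post-process. For instance, with `jq` (a sketch; the array-of-objects shape with a `text` field is an assumed output schema, not documented above):

```bash
# Extract prices as JSON, then print just the matched text of each element
data-scraper extract "https://example.com/shop" --selector ".price" \
  --format json | jq -r '.[].text'
```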
## Headers & Auth
```bash
# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
```
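To keep credentials out of scripts and shell history, read them from the environment instead of hard-coding them. A sketch; `SCRAPER_TOKEN` is an arbitrary variable name:

```bash
# Token comes from the environment, not a hard-coded literal
export SCRAPER_TOKEN="your-token-here"   # better: set in your shell profile or a secrets manager
data-scraper fetch "https://api.example.com/report" \
  --header "Authorization: Bearer $SCRAPER_TOKEN"
```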
## Rate Limiting & Ethics
- Default: 1 request per second per domain
- Respects robots.txt when the `--polite` flag is set
- Configurable delay between requests (see the sketch after this list)
- Stops on 429 (Too Many Requests) and backs off
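A conservative batch invocation that combines these controls, assuming `--delay` takes milliseconds as the `--delay 2000` example above suggests:

```bash
# Honor robots.txt and wait 5 seconds between requests
data-scraper batch urls.txt --polite --delay 5000 --output results/
```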
## Error Handling
| Error | Behavior |
|-------|----------|
| 404 | Log and skip |
| 403/401 | Warn about auth requirement |
| 429 | Exponential backoff (max 3 retries) |
| Timeout | Retry once with longer timeout |
| SSL error | Warn, option to proceed with `--insecure` |
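The table above covers per-URL behavior inside a run. For scripting around the tool itself, checking the process exit status adds one more safety net. A sketch; it assumes `data-scraper` exits non-zero on unrecoverable failures, which is not stated above:

```bash
# Script-level retry if the whole fetch fails
if ! data-scraper fetch "https://example.com/article" --format md > article.md; then
  echo "fetch failed; retrying once in 30 s" >&2
  sleep 30
  data-scraper fetch "https://example.com/article" --format md > article.md
fi
```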
## Integration
- web-claude: Use as fallback when web_fetch isn't enough
- competitor-watch: Feed scraped data into competitor analysis
- seo-audit: Scrape competitor pages for SEO comparison
- performance-tracker: Collect social metrics from public profiles