cf-crawl
v1.0.0
Crawl websites using the Cloudflare Browser Rendering /crawl API. Async multi-page crawl with markdown/HTML/JSON output, link following, pattern filtering, and AI-powered structured data extraction. Use when crawling entire sites or multiple pages, building knowledge bases, extracting structured data from websites, or when web_fetch is insufficient (JS rendering, multi-page, authenticated crawls).
Cloudflare /crawl
Async site crawler via the CF Browser Rendering API. Start a job → poll for results → get markdown/HTML/JSON per page.
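The start → poll → get lifecycle can be sketched as a generic polling loop. This is a minimal illustration, not the logic of the skill's own scripts: `poll_job` and `get_status` are hypothetical names, the status strings `running`, `completed`, and `failed` are assumptions, and the stub below never contacts Cloudflare.

```bash
#!/usr/bin/env bash
# Sketch of a poll loop with a timeout. get_status is a hypothetical stand-in
# for whatever reports the job state; the status strings are assumptions.

# poll_job JOB_ID [TIMEOUT_SEC] [INTERVAL_SEC]
# Returns 0 when the job completes, 1 if it fails, 2 on timeout.
poll_job() {
  local job_id="$1"
  local timeout="${2:-300}"
  local interval="${3:-5}"
  local waited=0
  local status
  while [ "$waited" -le "$timeout" ]; do
    status=$(get_status "$job_id")
    case "$status" in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 2
}

# Demo stub: reports "running" twice, then "completed".
# A counter file is used because $(...) runs get_status in a subshell.
STATE_FILE=$(mktemp)
echo 0 > "$STATE_FILE"
get_status() {
  local n
  n=$(cat "$STATE_FILE")
  echo $((n + 1)) > "$STATE_FILE"
  if [ "$n" -lt 2 ]; then echo running; else echo completed; fi
}

# Interval 0 keeps the demo instant; a real loop would wait between checks.
poll_job "demo-job" 10 0 && echo "job finished after $(cat "$STATE_FILE") status checks"
```

The 300-second default mirrors the documented `--timeout` default; separating "failed" from "timed out" in the return code lets a caller decide whether to retry polling or give up.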
Quick Start
# Crawl a site (5 pages, markdown, no JS rendering = fast + free)
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com" --limit 5 --format markdown

# With JS rendering (for SPAs, dynamic content)
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com" --render --limit 10

# Start only (get job ID, poll later)
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com" --limit 100 --start-only

# Poll existing job
bash ~/clawd/skills/cf-crawl/scripts/poll.sh
Credentials
Stored at ~/.clawdbot/secrets/cloudflare-crawl.env:
CF_ACCOUNT_ID=
CF_CRAWL_API_TOKEN=<token_with_read_and_edit>
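If the credentials are needed in your own shell session, one way to load and sanity-check the env file is sketched below. `load_crawl_env` is a hypothetical helper, not part of the skill, and whether `crawl.sh` reads the file on its own is not stated here. The demo uses a temporary file with placeholder values rather than real secrets.

```bash
#!/usr/bin/env bash
# load_crawl_env [FILE]: source a KEY=value env file, export its variables,
# and fail if either expected variable is empty. Hypothetical helper.
load_crawl_env() {
  local env_file="${1:-$HOME/.clawdbot/secrets/cloudflare-crawl.env}"
  if [ ! -r "$env_file" ]; then
    echo "missing or unreadable: $env_file" >&2
    return 1
  fi
  set -a              # export every variable assigned while sourcing
  . "$env_file"
  set +a
  if [ -z "${CF_ACCOUNT_ID:-}" ] || [ -z "${CF_CRAWL_API_TOKEN:-}" ]; then
    echo "CF_ACCOUNT_ID or CF_CRAWL_API_TOKEN is empty in $env_file" >&2
    return 1
  fi
}

# Demo with placeholder values in a temp file (not real credentials).
demo_env=$(mktemp)
printf 'CF_ACCOUNT_ID=demo-account\nCF_CRAWL_API_TOKEN=demo-token\n' > "$demo_env"
load_crawl_env "$demo_env" && echo "credentials loaded for account $CF_ACCOUNT_ID"
rm -f "$demo_env"
```

Keeping the file readable only by its owner (for example `chmod 600`) is a common precaution for token files.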
Key Options
Option                        Description
--limit N                     Max pages (default 10)
--depth N                     Max link depth (default 10)
--format markdown|html|json   Output format (default markdown)
--render                      Enable headless browser (default: off = fast fetch, free during beta)
--include PAT                 Wildcard URL pattern to include (repeatable)
--exclude PAT                 Wildcard URL pattern to exclude (repeatable)
--external                    Follow external domain links
--subdomains                  Follow subdomain links
--source all|sitemaps|links   URL discovery method
--json-prompt "..."           AI extraction prompt (with --format json)
--json-schema file.json       JSON schema for structured extraction
--timeout SEC                 Max poll wait (default 300s)
--output FILE                 Write full results to file
--raw                         Output raw API response
--start-only                  Print job ID without polling

Common Patterns
Crawl docs site for knowledge base
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://docs.example.com/" \
  --limit 50 --depth 3 --format markdown --output docs.json
Crawl with URL filtering
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com/" \
  --include "/docs/" --exclude "/docs/archive/" --limit 20
AI-powered structured extraction
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com/products" \
  --format json --render \
  --json-prompt "Extract product name, price, and description" \
  --json-schema schema.json
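For the product example above, a schema file might look like the sketch below. The field names and the plain JSON Schema object shape are assumptions made for illustration; the exact shape the API expects is documented elsewhere and may differ. `SCHEMA_FILE` is a hypothetical override variable used only by this snippet.

```bash
#!/usr/bin/env bash
# Write a hypothetical schema for "product name, price, and description".
# Field names and structure are illustrative assumptions, not the API's contract.
schema_file="${SCHEMA_FILE:-schema.json}"

cat > "$schema_file" <<'EOF'
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "price": { "type": "string" },
    "description": { "type": "string" }
  },
  "required": ["name", "price"]
}
EOF

# Optional sanity check: confirm the file parses as JSON before starting a paid crawl.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool "$schema_file" >/dev/null && echo "$schema_file is valid JSON"
fi
```

Price is typed as a string here because scraped prices often carry currency symbols; a numeric type is equally plausible if the extraction prompt asks for a bare number.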
Long-running crawl (background)
JOB_ID=$(bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://big-site.com" \
  --limit 1000 --start-only)
# Check later:
bash ~/clawd/skills/cf-crawl/scripts/poll.sh "$JOB_ID"
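Because a long crawl can outlive the shell session that started it, it may help to persist the job ID to a file. The helpers below are a hypothetical sketch (`save_job_id`, `load_job_id`, and the default file name `crawl-job.id` are not part of the skill); the demo uses a made-up ID and a temp file.

```bash
#!/usr/bin/env bash
# save_job_id JOB_ID [FILE]: refuse empty IDs, then write the ID to FILE.
save_job_id() {
  local job_id="$1"
  local file="${2:-crawl-job.id}"
  if [ -z "$job_id" ]; then
    echo "empty job ID; did the start step fail?" >&2
    return 1
  fi
  printf '%s\n' "$job_id" > "$file"
}

# load_job_id [FILE]: print the saved ID, or fail if none was saved.
load_job_id() {
  local file="${1:-crawl-job.id}"
  if [ ! -s "$file" ]; then
    echo "no saved job ID in $file" >&2
    return 1
  fi
  head -n 1 "$file"
}

# Demo with a made-up ID and a temp file.
id_file=$(mktemp)
save_job_id "demo-job-123" "$id_file"
echo "saved job: $(load_job_id "$id_file")"
```

In a later session the saved value could then be passed to the poll script in the same way as `"$JOB_ID"` above, for example `poll.sh "$(load_job_id)"`.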
Cost Notes
render: false (default) - fast HTML fetch, free during beta
render: true - uses Browser Rendering minutes (paid)
format json - uses Workers AI tokens for extraction (paid)
Results are cached in R2 with --max-age (default 24hr)

API Details
See references/API-reference.md for full parameter documentation, response schema, and lifecycle details.