cf-crawl

v1.0.0

Crawl websites using the Cloudflare Browser Rendering /crawl API. Async multi-page crawl with markdown/HTML/JSON output, link following, pattern filtering, and AI-powered structured data extraction. Use when crawling entire sites or multiple pages, building knowledge bases, extracting structured data from websites, or when web_fetch is insufficient (JS rendering, multi-page, authenticated crawls).


by @bill492·MIT-0



License

MIT-0

Free to use, modify, and redistribute; no attribution required.


Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install cf-crawl

Mirror: npx clawhub@latest install cf-crawl --registry https://cn.longxiaskill.com


Skill documentation

Cloudflare /crawl

Async site crawler via the CF Browser Rendering API. Start a job → poll for results → get markdown/HTML/JSON per page.

Quick Start

# Crawl a site (5 pages, markdown, no JS rendering = fast + free)
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com" --limit 5 --format markdown

# With JS rendering (for SPAs, dynamic content)
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com" --render --limit 10

# Start only (get job ID, poll later)
bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com" --limit 100 --start-only

# Poll existing job
bash ~/clawd/skills/cf-crawl/scripts/poll.sh

Credentials

Stored at ~/.clawdbot/secrets/cloudflare-crawl.env:

CF_ACCOUNT_ID=
CF_CRAWL_API_TOKEN=<token_with_read_and_edit>
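Before starting a crawl it can help to confirm that both variables are actually set. The following is a minimal sketch, not part of the skill's scripts: `load_cf_credentials` is a hypothetical helper name, and it assumes the env file uses plain `NAME=value` lines as shown above.

```shell
# Hypothetical helper (not shipped with cf-crawl): source the credentials
# file and fail early if either expected variable is empty.
load_cf_credentials() {
  env_file="${1:-$HOME/.clawdbot/secrets/cloudflare-crawl.env}"
  if [ ! -r "$env_file" ]; then
    echo "missing credentials file: $env_file" >&2
    return 1
  fi
  # Export every variable assigned while the file is sourced.
  set -a
  . "$env_file"
  set +a
  for name in CF_ACCOUNT_ID CF_CRAWL_API_TOKEN; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set in $env_file" >&2
      return 1
    fi
  done
  return 0
}
```

Calling `load_cf_credentials` with no argument reads the default path; passing a path makes it easy to point at a different env file.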

Key Options

Option                          Description
--limit N                       Max pages (default 10)
--depth N                       Max link depth (default 10)
--format markdown|html|json     Output format (default markdown)
--render                        Enable headless browser (default: off = fast fetch, free during beta)
--include PAT                   Wildcard URL pattern to include (repeatable)
--exclude PAT                   Wildcard URL pattern to exclude (repeatable)
--external                      Follow external domain links
--subdomains                    Follow subdomain links
--source all|sitemaps|links     URL discovery method
--json-prompt "..."             AI extraction prompt (with --format json)
--json-schema file.json         JSON schema for structured extraction
--timeout SEC                   Max poll wait (default 300s)
--output FILE                   Write full results to file
--raw                           Output raw API response
--start-only                    Print job ID without polling

Common Patterns

Crawl docs site for knowledge base

bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://docs.example.com/" \
  --limit 50 --depth 3 --format markdown --output docs.json

Crawl with URL filtering

bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com/" \
  --include "/docs/" --exclude "/docs/archive/" --limit 20
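To get a feel for how an include/exclude pair narrows a crawl, the sketch below applies the same idea locally. It is an illustration only: `url_allowed` is a hypothetical function, shell glob matching stands in for whatever wildcard syntax the API uses, and it assumes an exclude match wins over an include match, which this document does not confirm.

```shell
# Hypothetical local filter: returns 0 if URL matches INCLUDE and not EXCLUDE.
# Assumption: exclude takes precedence; the real API may behave differently.
url_allowed() {
  url="$1"
  include="$2"
  exclude="$3"
  # An unquoted variable in a case pattern is interpreted as a shell glob,
  # and * in a case pattern also matches "/" characters.
  case "$url" in
    $exclude) return 1 ;;
  esac
  case "$url" in
    $include) return 0 ;;
  esac
  return 1
}

# Example: keep docs pages, drop the archive subtree.
for u in "https://example.com/docs/intro" \
         "https://example.com/docs/archive/old" \
         "https://example.com/blog/post"; do
  if url_allowed "$u" "*/docs/*" "*/docs/archive/*"; then
    echo "keep $u"
  else
    echo "skip $u"
  fi
done
```

Running a candidate URL list through a filter like this before a large crawl is a cheap way to check that the patterns select the pages you expect.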

AI-powered structured extraction

bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://example.com/products" \
  --format json --render \
  --json-prompt "Extract product name, price, and description" \
  --json-schema schema.json
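The contents of schema.json are not shown in this document. As a rough sketch, a plain JSON Schema covering the three fields named in the prompt might look like the file written below; the field names, the choice of required fields, and the assumption that the script accepts a bare JSON Schema object (rather than some wrapper format) are all illustrative, so check references/API-reference.md for the expected shape.

```shell
# Write an illustrative schema for the product fields named in the prompt.
# Field names and structure are assumptions, not taken from the skill itself.
cat > schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "price": { "type": "string" },
    "description": { "type": "string" }
  },
  "required": ["name", "price"]
}
EOF
```

Price is kept as a string here so that currency symbols and formatting on the page survive extraction unchanged.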

Long-running crawl (background)

JOB_ID=$(bash ~/clawd/skills/cf-crawl/scripts/crawl.sh "https://big-site.com" \
  --limit 1000 --start-only)
# Check later:
bash ~/clawd/skills/cf-crawl/scripts/poll.sh "$JOB_ID"
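The start, poll, fetch lifecycle comes down to a loop with a deadline. The sketch below shows that loop in isolation. `poll_until_done` is a hypothetical function, not the logic of poll.sh, and the status strings (`completed`, `errored`, `cancelled`) are assumed values; substitute whatever the API actually reports.

```shell
# poll_until_done STATUS_CMD [TIMEOUT_SEC] [INTERVAL_SEC]
# STATUS_CMD is any command that prints the current job status on stdout.
# Returns 0 on success, 2 on a failed job, 1 on timeout.
# INTERVAL_SEC must be at least 1, otherwise the deadline never advances.
poll_until_done() {
  status_cmd="$1"
  timeout="${2:-300}"   # mirrors the documented --timeout default of 300s
  interval="${3:-5}"
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    job_status="$($status_cmd)"
    case "$job_status" in
      completed) return 0 ;;            # assumed terminal success value
      errored|cancelled) return 2 ;;    # assumed terminal failure values
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1   # deadline passed while the job was still in progress
}
```

Keeping the status command as a parameter means the same loop works whether the status comes from a script, a direct API call, or a stub used for testing.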

Cost Notes

render: false (default): fast HTML fetch, free during beta
render: true: uses Browser Rendering minutes (paid)
format json: uses Workers AI tokens for extraction (paid)
Results are cached in R2 with --max-age (default 24hr)

API Details

See references/API-reference.md for full parameter documentation, response schema, and lifecycle details.

Data source: ClawHub