Crawl

v0.1.0

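Crawl any website and save its pages as local markdown files. Use it when you need to download documentation, knowledge bases, or web content for offline access or analysis. No coding required: just provide a URL.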

by @barneyjm (Barneyjm)·MIT-0

Developer tools · Code generation · Documentation tools · Network tools · Browser automation

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies.

Install command

Official: npx clawhub@latest install crawl

Mirror (faster): npx clawhub@latest install crawl --registry https://cn.longxiaskill.com

Skill documentation

The Crawl skill crawls a website and extracts content from multiple pages. It is well suited to documentation, knowledge bases, and site-wide content extraction.

Prerequisites

A Tavily API key is required. Get your key at https://tavily.com and add it to ~/.claude/settings.json:

    {
      "env": {
        "TAVILY_API_KEY": "tvly-your-api-key-here"
      }
    }

Quick start

Using the script

    ./scripts/crawl.sh '<json>' [output_dir]

Examples:

    # Basic crawl
    ./scripts/crawl.sh '{"url": "https://docs.example.com"}'

    # Deeper crawl with limits
    ./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

    # Save to files
    ./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

    # Focused crawl with path filters
    ./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

    # With semantic instructions (for agents)
    ./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'

When output_dir is provided, each crawled page is saved as a separate markdown file.

Basic crawl

    curl --request POST \
      --url https://api.tavily.com/crawl \
      --header "Authorization: Bearer $TAVILY_API_KEY" \
      --header 'Content-Type: application/json' \
      --data '{
        "url": "https://docs.example.com",
        "max_depth": 1,
        "limit": 20
      }'

Focused crawl with instructions

    curl --request POST \
      --url https://api.tavily.com/crawl \
      --header "Authorization: Bearer $TAVILY_API_KEY" \
      --header 'Content-Type: application/json' \
      --data '{
        "url": "https://docs.example.com",
        "max_depth": 2,
        "instructions": "Find API documentation and code examples",
        "chunks_per_source": 3,
        "select_paths": ["/docs/.*", "/api/.*"]
      }'

API reference

Endpoint

    POST https://api.tavily.com/crawl

Headers

| Header | Value |
| --- | --- |
| Authorization | Bearer <TAVILY_API_KEY> |
| Content-Type | application/json |

Request body

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Root URL to start crawling from |
| max_depth | integer | 1 | Crawl depth (1-5) |
| max_breadth | integer | 20 | Links followed per page |
| limit | integer | 50 | Cap on the total number of pages |
| instructions | string | null | Natural-language instructions that focus the crawl |
| chunks_per_source | integer | 3 | Chunks per page (1-5, requires instructions) |
| extract_depth | string | "basic" | Extraction depth ("basic" or "advanced") |
| format | string | "markdown" | Output format ("markdown" or "text") |
| select_paths | array | null | Regex patterns for paths to include |
| exclude_paths | array | null | Regex patterns for paths to exclude |
| allow_external | boolean | true | Whether to include links to external domains |
| timeout | float | 150 | Maximum wait time (10-150 seconds) |

Response format

    {
      "base_url": "https://docs.example.com",
      "results": [
        {
          "url": "https://docs.example.com/page",
          "raw_content": "# Page Title\n\nContent..."
        }
      ],
      "response_time": 12.5
    }

Depth and performance

| Depth | Typical pages | Time |
| --- | --- | --- |
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with max_depth=1 and increase it only when needed.

Crawling for context vs. data collection

For agents (feeding results into context): always use instructions together with chunks_per_source. This returns the relevant chunks rather than full pages, which keeps the context window from blowing up.

For data collection (saving to files): omit chunks_per_source to get full page content.

Examples

For context: agent research (recommended)

Use this when feeding crawl results into an LLM context:

    curl --request POST \
      --url https://api.tavily.com/crawl \
      --header "Authorization: Bearer $TAVILY_API_KEY" \
      --header 'Content-Type: application/json' \
      --data '{
        "url": "https://docs.example.com",
        "max_depth": 2,
        "instructions": "Find API documentation and authentication guides",
        "chunks_per_source": 3
      }'

Returns the most relevant chunks from each page (at most 500 characters per chunk), which fits in context without overloading it.

For context: targeted technical documentation

    curl --request POST \
      --url https://api.tavily.com/crawl \
      --header "Authorization: Bearer $TAVILY_API_KEY" \
      --header 'Content-Type: application/json' \
      --data '{
        "url": "https://example.com",
        "max_depth": 2,
        "instructions": "Find all documentation about authentication and security",
        "chunks_per_source": 3,
        "select_paths": ["/docs/.*", "/api/.*"]
      }'

For data collection: full page archive

Use this when saving content to files for later processing:

    curl --request POST \
      --url https://api.tavily.com/crawl \
      --header "Authorization: Bearer $TAVILY_API_KEY" \
      --header 'Content-Type: application/json' \
      --data '{
        "url": "https://example.com/blog",
        "max_depth": 2,
        "max_breadth": 50,
        "select_paths": ["/blog/.*"],
        "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
      }'

Returns full page content. Use the script with output_dir to save the pages as markdown files.

Map API (URL discovery)

Use map instead of crawl when you only need the URLs and not the content:

    curl --request POST \
      --url https://api.tavily.com/map \
      --header "Authorization: Bearer $TAVILY_API_KEY" \
      --header 'Content-Type: application/json' \
      --data '{...}'
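The documentation says that each crawled page is saved as a separate markdown file when output_dir is given, but it does not show how. The sketch below is one way to do this by hand for a crawl response that was saved to disk. Both function names (url_to_filename, save_pages) and the file naming scheme are hypothetical and are not taken from crawl.sh. save_pages additionally assumes that jq is installed and that the response has the results[].url and results[].raw_content fields shown in the response format.

```shell
#!/bin/sh
# Hypothetical helper (not part of crawl.sh): map a page URL to a markdown filename.
url_to_filename() {
  # Strip the scheme, collapse each run of unsafe characters into "_",
  # then trim leading and trailing underscores.
  name=$(printf '%s' "$1" | sed -E 's#^[A-Za-z]+://##; s#[^A-Za-z0-9._-]+#_#g; s#^_+##; s#_+$##')
  # Fall back to "index" when nothing is left of the URL.
  printf '%s.md\n' "${name:-index}"
}

# Hypothetical helper: write every entry of a saved crawl response to its own file.
# Assumes jq is available and the JSON matches the documented response format.
save_pages() {
  response_file=$1
  out_dir=$2
  mkdir -p "$out_dir"
  # Emit one compact JSON object per result, then process them line by line.
  jq -c '.results[]' "$response_file" | while IFS= read -r page; do
    url=$(printf '%s' "$page" | jq -r '.url')
    printf '%s' "$page" | jq -r '.raw_content' > "$out_dir/$(url_to_filename "$url")"
  done
}
```

For instance, after writing a curl response to a file such as response.json, calling save_pages response.json ./docs would create one .md file per result. Deriving the filename from the full URL keeps pages from different paths from overwriting each other, though two URLs that differ only in punctuation could still collide under this scheme.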
