Crawlee Web Scraper

v1.0.0

Resilient web scraper with bot-detection evasion, built on the Crawlee library. Use it when web_fetch is blocked by rate limits or bot detection. Supports single URLs, bulk file input, and automatic fallback from requests to Crawlee on 403/429 responses.

by @bryantegomoh (Bryan Tegomoh, MD, MPH) · MIT-0

Network tools · Browser automation · File processing

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official:
npx clawhub@latest install crawlee-web-scraper

Mirror (faster in some regions):
npx clawhub@latest install crawlee-web-scraper --registry https://cn.longxiaskill.com

Skill documentation

crawlee-web-scraper

Drop-in replacement for web_fetch when sites block automated requests. Crawlee handles session management, retry logic, and bot-detection evasion automatically.

Scripts

crawlee_fetch.py: main scraper; accepts a single URL or a file of URLs; returns JSON
crawlee_http.py: library helper; tries requests first, falls back to Crawlee on 403/429/503

Usage

# Single URL, return HTML preview
python3 scripts/crawlee_fetch.py --url "https://example.com"

# Single URL, extract text (strips HTML tags)
python3 scripts/crawlee_fetch.py --url "https://example.com" --extract-text

# Bulk scrape from file
python3 scripts/crawlee_fetch.py --urls-file urls.txt --output results.json
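The text-extraction flag above is described only as stripping HTML tags. As a rough illustration of that kind of step, here is a minimal standard-library sketch. The names `strip_html` and `_TextExtractor` are hypothetical, and this is not the code in crawlee_fetch.py, whose actual approach is not shown here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._parts).split())


def strip_html(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining fragments with a space keeps words from adjacent elements apart, at the cost of occasionally splitting a word that is wrapped in inline tags mid-word.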

Library usage

from crawlee_http import fetch_with_fallback

resp = fetch_with_fallback("https://example.com")
print(resp.status_code, resp.text[:500])
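The helper is described as trying requests first and falling back to Crawlee on 403/429/503. The control flow behind that description might look like the sketch below. Everything in it is an assumption made for illustration: `FetchResult`, `Fetcher`, and `fetch_with_fallback_sketch` are hypothetical names, and the two fetchers are passed in as plain callables so the example runs without network access or either library.

```python
from dataclasses import dataclass
from typing import Callable

# Status codes that trigger the fallback, taken from the script description.
FALLBACK_STATUSES = frozenset({403, 429, 503})


@dataclass
class FetchResult:
    status_code: int
    text: str


# A fetcher is any callable that maps a URL to a FetchResult.
Fetcher = Callable[[str], FetchResult]


def fetch_with_fallback_sketch(url: str, primary: Fetcher, fallback: Fetcher) -> FetchResult:
    """Try the lightweight fetcher first; use the heavier one only if blocked."""
    result = primary(url)
    if result.status_code in FALLBACK_STATUSES:
        return fallback(url)
    return result


if __name__ == "__main__":
    # Stub fetchers stand in for a requests-based and a Crawlee-based fetch.
    blocked = lambda url: FetchResult(403, "")
    crawler = lambda url: FetchResult(200, "<html>ok</html>")
    print(fetch_with_fallback_sketch("https://example.com", blocked, crawler).status_code)
```

Keeping the fallback behind a status check means the slower crawler path is used only for URLs that are actually blocked.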

Output

JSON array with one object per URL:

[
  {
    "url": "https://example.com",
    "status": 200,
    "fetched_at": "2026-01-01T00:00:00Z",
    "length": 12345,
    "text": "Page content..."
  }
]
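For bulk runs it can help to summarize the output before using it. The sketch below assumes only the field names shown in the sample above ("url", "status", "length"); the function name `summarize_results` is hypothetical and is not part of the skill.

```python
import json


def summarize_results(raw: str) -> dict:
    """Split scraped records into successes and failures by HTTP status."""
    records = json.loads(raw)
    ok = [r["url"] for r in records if r["status"] == 200]
    failed = [r["url"] for r in records if r["status"] != 200]
    # "length" is treated as optional here; a missing value counts as 0.
    total_length = sum(r.get("length", 0) for r in records)
    return {"ok": ok, "failed": failed, "total_length": total_length}


if __name__ == "__main__":
    sample = json.dumps([
        {"url": "https://example.com", "status": 200, "length": 12345},
        {"url": "https://example.org", "status": 403, "length": 0},
    ])
    print(summarize_results(sample))
```

To use it on a real run, read the saved output file into a string and pass that string to the function.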

Installation

pip install crawlee requests

When to use

- web_fetch returns 403 / 429 / empty
- Bulk scraping 10+ URLs
- Sites using Cloudflare or similar bot protection
