Phoenix Scraper
v2
Resilient multi-layer web scraper with automatic failover. Use when scraping web content that may be JS-rendered, behind bot protection, or on sites that block standard HTTP requests. Supports a three-tier failover chain: Brave Search API, then Bright Data Web Unlocker (with optional browser render mode), then Playwright headless browser. Use for job boards (LinkedIn, Glassdoor, Reed, Indeed, CWJobs, TotalJobs), social media monitoring (X/Twitter via API v2), news sites, and any page requiring JS rendering. Includes zone routing logic for Bright Data (web_unlocker vs job_search_scraper zones).
Phoenix Scraper
Resilient three-tier failover scraper. If one method fails, the next activates automatically, so an empty result is returned only when all three tiers fail.
Failover Chain

    Tier 1: Brave Search API (fast, free tier, 2k req/month)
        ↓ (on block/empty/timeout)
    Tier 2: Bright Data Web Unlocker (residential proxy, JS-render optional)
        ↓ (on block/429/timeout)
    Tier 3: Playwright headless browser (full JS execution)
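The escalation order above can be sketched as a loop over tier functions. This is only an illustration, not the actual contents of phoenix_scraper.py: the `run_failover` name, the tier callables, and the choice to put the last failure message in the `error` field are all assumptions made for the example.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("phoenix_sketch")  # hypothetical logger name

# A tier takes a URL and returns HTML, or None/"" when blocked or empty.
Tier = Callable[[str], Optional[str]]


def run_failover(url: str, tiers: List[Tuple[str, Tier]]) -> dict:
    """Try each tier in order; escalate on an exception or empty output."""
    last_error = ""
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception as exc:  # e.g. timeout, block, HTTP 429
            last_error = f"{name}: {exc}"
            logger.warning("tier %s failed: %s", name, exc)
            continue
        if html and html.strip():
            return {"success": True, "html": html, "method": name, "error": ""}
        last_error = f"{name}: empty response"
        logger.warning("tier %s returned empty content", name)
    # Every tier failed: return the failure dict instead of raising.
    return {"success": False, "html": "", "method": "all_failed", "error": last_error}
```

Each tier is passed in as a plain function, so the real Brave, Bright Data, and Playwright calls can be swapped for stubs when testing the escalation logic.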
Quick Start

    from scripts.phoenix_scraper import scrape

    # Basic fetch
    result = scrape("https://example.com/page")

    # With JS rendering (for SPA/dynamic sites)
    result = scrape("https://example.com/page", render_js=True)

    # With specific Bright Data zone
    result = scrape("https://linkedin.com/jobs/...", zone="job_search_scraper")
Zone Routing

| Use Case | Zone |
|---|---|
| Job boards (LinkedIn, Glassdoor, Reed, Indeed) | job_search_scraper |
| Social media, news, general web | web_unlocker |
| X.com / Twitter | Use X API v2 (see references/x-api.md) |

Bright Data render_js
Set render_js=True for JS-heavy sites (CWJobs, TotalJobs, ContractorUK). This adds "render": True (boolean) to the payload and uses a 60s timeout.
Critical: Use the boolean True, not the string "html"; Bright Data validation rejects strings.
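To make the zone routing and render_js rules concrete, here is a minimal sketch of how a request payload might be assembled. The helper names, the domain lists, the 30s default timeout, and the `zone`/`url` payload keys are assumptions for illustration; only the zone names, the boolean `"render": True`, and the 60s timeout come from the text above. Check field names against the Bright Data documentation before relying on them.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumed domain lists; the source names only the sites, not their hostnames.
JOB_BOARD_DOMAINS = ("linkedin.com", "glassdoor.com", "glassdoor.co.uk",
                     "reed.co.uk", "indeed.com", "indeed.co.uk")
X_DOMAINS = ("x.com", "twitter.com")


def _matches(host: str, domains: Tuple[str, ...]) -> bool:
    # Match the domain itself or any subdomain (www.linkedin.com, uk.indeed.com).
    return any(host == d or host.endswith("." + d) for d in domains)


def pick_zone(url: str) -> Optional[str]:
    """Return the Bright Data zone for a URL, or None for X/Twitter."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, X_DOMAINS):
        return None  # X.com goes through X API v2, not Bright Data
    if _matches(host, JOB_BOARD_DOMAINS):
        return "job_search_scraper"
    return "web_unlocker"


def build_payload(url: str, render_js: bool = False,
                  zone: Optional[str] = None) -> Tuple[dict, int]:
    """Build a (payload, timeout_seconds) pair for a Bright Data request."""
    chosen = zone or pick_zone(url)
    if chosen is None:
        raise ValueError("X.com / Twitter URLs should use X API v2")
    payload = {"zone": chosen, "url": url}
    timeout = 30  # assumed default; the source only specifies 60s for render mode
    if render_js:
        payload["render"] = True  # boolean True, never the string "html"
        timeout = 60
    return payload, timeout
```

An explicit `zone` argument overrides the routing table, mirroring the per-call override shown in the Quick Start.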
Bright Data Premium Domains (Cost Note)
LinkedIn, Glassdoor, and other heavily protected job boards may be classified as Premium Domains in your Bright Data zone (the list is updated quarterly). The API call syntax is identical, but the cost per request is higher. Check your zone's Premium Domains list if costs spike unexpectedly.
Playwright Stealth (2026 Enhancement)
For Tier 3, consider installing playwright-stealth to patch headless browser fingerprints, which reduces detection on Cloudflare and other advanced bot-protected sites:

    pip install playwright-stealth

    # Optional enhancement in phoenix_scraper.py Tier 3:
    from playwright_stealth import stealth_sync
    stealth_sync(page)

The base Playwright tier works without this, but stealth patching significantly improves success rates on heavily protected sites (Coupang, Naver, etc.) as of 2026.
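Because the stealth patch is optional, one way to wire it in is to guard the import so Tier 3 still runs when the package is missing. The helper below is a hypothetical sketch (`apply_stealth_if_available` is not a name from the source); it also falls back cleanly if the installed playwright-stealth release does not export `stealth_sync`.

```python
def apply_stealth_if_available(page) -> bool:
    """Patch a Playwright page with playwright-stealth when it is installed.

    Returns True if the patch was applied, False if the package (or the
    stealth_sync function) is unavailable and the page was left untouched.
    """
    try:
        from playwright_stealth import stealth_sync
    except ImportError:
        return False
    stealth_sync(page)
    return True
```

In Tier 3 this would be called once, right after the page object is created and before navigating to the target URL.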
URL Formatting
CWJobs/TotalJobs: use hyphen-slugs, e.g. finance-systems-consultant, NOT finance+systems+consultant
Glassdoor: https://www.glassdoor.co.uk/Job/united-kingdom-{slug}-jobs-SRCH_IL.0,14_IN2_KO15,{end}.htm

Environment Variables

    BRIGHT_DATA_API_KEY=                   # Bright Data API key
    BRIGHT_DATA_ZONE=job_search_scraper    # Default zone (override per-call)
    BRAVE_API_KEY=                         # Brave Search API key
    X_BEARER_TOKEN=<token>                 # X API v2 bearer token (for X.com)
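Returning to the URL formatting rules, the slug and Glassdoor template can be sketched as two small helpers. The function names are hypothetical, and the meaning of `{end}` is an assumption: it is treated here as the keyword end offset, i.e. 15 (the length of "united-kingdom-") plus the slug length, which matches the `IL.0,14` and `KO15` offsets in the template.

```python
import re

GLASSDOOR_LOCATION = "united-kingdom"  # taken from the template above


def to_slug(keywords: str) -> str:
    """Turn free-text or plus-separated keywords into a hyphen slug."""
    words = re.findall(r"[a-z0-9]+", keywords.lower())
    return "-".join(words)


def glassdoor_uk_url(keywords: str) -> str:
    """Fill the Glassdoor UK template; {end} is assumed to be start + len(slug)."""
    slug = to_slug(keywords)
    start = len(GLASSDOOR_LOCATION) + 1  # 15: keyword begins after "united-kingdom-"
    end = start + len(slug)
    return (
        "https://www.glassdoor.co.uk/Job/"
        f"{GLASSDOOR_LOCATION}-{slug}-jobs-SRCH_IL.0,14_IN2_KO{start},{end}.htm"
    )
```

For example, `to_slug("finance+systems+consultant")` yields `finance-systems-consultant`, the form CWJobs and TotalJobs expect.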
X.com Monitoring
For X/Twitter, use X API v2 (not scraping). See references/x-api.md for endpoint details and rate limits.
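As a rough illustration of the API route, the sketch below builds (but does not send) a recent-search request using the X_BEARER_TOKEN variable. The helper name is hypothetical, and the choice of the recent-search endpoint is an assumption; references/x-api.md remains the authority on which endpoints and limits the skill actually uses.

```python
import os
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

# X API v2 recent search endpoint (assumed to be the one of interest here).
X_RECENT_SEARCH = "https://api.x.com/2/tweets/search/recent"


def build_recent_search_request(query: str,
                                bearer_token: Optional[str] = None,
                                max_results: int = 10) -> Request:
    """Build an authenticated GET request for X API v2 recent search."""
    token = bearer_token or os.environ.get("X_BEARER_TOKEN", "")
    if not token:
        raise ValueError("X_BEARER_TOKEN is not set")
    params = urlencode({"query": query, "max_results": max_results})
    return Request(
        f"{X_RECENT_SEARCH}?{params}",
        headers={"Authorization": f"Bearer {token}"},
    )
```

The returned object can be passed to `urllib.request.urlopen` with a timeout; keeping construction separate from sending makes the URL and header easy to inspect.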
Error Handling
All tiers log failures before escalating. On total failure, scrape returns {"success": False, "html": "", "method": "all_failed", "error": ""}.
It never raises exceptions; it always returns a result dict.
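One way to guarantee the "always returns a dict" contract is a decorator that converts any uncaught exception into the failure dict. This is a sketch of the idea rather than the skill's actual code: the `never_raises` name and the decision to put the exception text in `error` are assumptions.

```python
import functools
import logging

logger = logging.getLogger("phoenix_sketch")  # hypothetical logger name


def never_raises(func):
    """Wrap a scrape-like function so exceptions become a failure dict."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> dict:
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Log before returning, matching the "log failures" rule above.
            logger.error("%s failed: %s", func.__name__, exc)
            return {"success": False, "html": "", "method": "all_failed",
                    "error": str(exc)}
    return wrapper
```

Callers can then branch on `result["success"]` without wrapping every call in their own try/except.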