Playwright Scraper Skill: Playwright-Based Web Scraping Skill

Name: Playwright Scraper Skill (Playwright-based web scraping skill)
Rating: 3 (52 reviews)
Author: waisimon

v1.2.0

A Playwright-based web scraping skill for OpenClaw with anti-bot protection. Successfully tested on complex sites such as Discuss.com.hk.

Tags: Network tools, Development tools, API tools, Automation

Runtime dependencies

No special dependencies

Install commands

Official: clawhub install playwright-scraper-skill

Mirror (faster): clawhub install playwright-scraper-skill --registry https://www.longxiaskill.com

Skill documentation

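Usage

# Example command; adjust it according to the skill documentation
npx playwright-scraper-skill <target URL>

Notes

Make sure the target website permits scraping.
Update the skill regularly to keep up with changes in anti-bot measures.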

A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.

🎯 Use Case Matrix

Target Website	Anti-Bot Level	Recommended Method	Script
Regular Sites	Low	web_fetch tool	N/A (built-in)
Dynamic Sites	Medium	Playwright Simple	`scripts/playwright-simple.js`
Cloudflare Protected	High	Playwright Stealth ⭐	`scripts/playwright-stealth.js`
YouTube	Special	deep-scraper	Install separately
Reddit	Special	reddit-scraper	Install separately
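The first three rows of the matrix can be read as a simple lookup. The following is a minimal sketch, assuming the level is passed as a string; `pickMethod` and the shape of the returned object are hypothetical names for illustration and are not part of the skill. The "Special" rows are per-site and rely on separately installed skills, so they are left out.

```javascript
// Hypothetical helper (not part of the skill): maps an anti-bot level
// from the matrix to the recommended method and script.
const METHODS = {
  low: { method: 'web_fetch', script: null }, // built-in OpenClaw tool, no script
  medium: { method: 'Playwright Simple', script: 'scripts/playwright-simple.js' },
  high: { method: 'Playwright Stealth', script: 'scripts/playwright-stealth.js' },
};

function pickMethod(level) {
  const entry = METHODS[String(level).toLowerCase()];
  if (!entry) {
    // "Special" sites (YouTube, Reddit) are handled by other skills.
    throw new Error(`Unknown anti-bot level: ${level}`);
  }
  return entry;
}

console.log(pickMethod('High').script);
```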

📦 Installation

cd playwright-scraper-skill
npm install
npx playwright install chromium

🚀 Quick Start

1️⃣ Simple Sites (No Anti-Bot)

Use OpenClaw's built-in web_fetch tool:

# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com

2️⃣ Dynamic Sites (Requires JavaScript)

Use Playwright Simple:

node scripts/playwright-simple.js "https://example.com"

Example output:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}
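For orientation, output of this shape could be produced roughly as follows. This is a sketch under stated assumptions, not the actual source of `scripts/playwright-simple.js`: the function names, the use of `document.body.innerText` as the content, and the 3000 ms default wait are all illustrative choices.

```javascript
// Sketch only: NOT the actual source of scripts/playwright-simple.js.

// Build the result object. elapsedSeconds is a string with two decimals,
// matching the example output above.
function buildResult(url, title, content, startMs, endMs) {
  return {
    url,
    title,
    content,
    elapsedSeconds: ((endMs - startMs) / 1000).toFixed(2),
  };
}

async function scrapeSimple(url, waitMs = 3000) {
  // Loaded lazily so buildResult can be used without Playwright installed.
  const { chromium } = require('playwright');
  const startMs = Date.now();
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Give client-side JavaScript time to render (assumed default).
    await page.waitForTimeout(waitMs);
    const title = await page.title();
    const content = await page.evaluate(() => document.body.innerText);
    return buildResult(url, title, content, startMs, Date.now());
  } finally {
    await browser.close();
  }
}

// Usage (needs `npm install` and `npx playwright install chromium` first):
// scrapeSimple('https://example.com')
//   .then((result) => console.log(JSON.stringify(result, null, 2)));
```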

3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)

Use Playwright Stealth:

node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"

Features:

Hide automation markers (navigator.webdriver = false)
Realistic User-Agent (iPhone, Android)
Random delays to mimic human behavior
Screenshot and HTML saving support
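The features listed above map onto a handful of Playwright calls. The sketch below shows one way they might fit together; it is not the actual source of `scripts/playwright-stealth.js`. The User-Agent string, viewport size, delay range, and the names `randomDelay` and `scrapeStealth` are assumptions made for illustration.

```javascript
// Sketch only: NOT the actual source of scripts/playwright-stealth.js.

// Example mobile User-Agent (an assumption; prefer a current real-device string).
const IPHONE_UA =
  'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) ' +
  'AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1';

// Random integer delay in [minMs, maxMs], to avoid fixed-interval timing.
function randomDelay(minMs, maxMs, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs + 1));
}

async function scrapeStealth(url, { screenshotPath, headless = true } = {}) {
  const { chromium } = require('playwright'); // loaded lazily
  const browser = await chromium.launch({ headless });
  try {
    const context = await browser.newContext({
      userAgent: IPHONE_UA,
      viewport: { width: 390, height: 844 },
    });
    // Init scripts run before any page script, so the override is
    // already in place when the site's detection code executes.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, 'webdriver', { get: () => false });
    });
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.waitForTimeout(randomDelay(2000, 5000));
    if (screenshotPath) {
      await page.screenshot({ path: screenshotPath });
    }
    return { url, title: await page.title(), html: await page.content() };
  } finally {
    await browser.close();
  }
}
```

Passing the random source as a parameter keeps `randomDelay` deterministic under test while defaulting to `Math.random` in normal use.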

4️⃣ YouTube Video Transcripts

Use deep-scraper (install separately):

# Install deep-scraper skill
npx clawhub install deep-scraper

# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"

📖 Script Descriptions

`scripts/playwright-simple.js`

Use Case: Regular dynamic websites
Speed: Fast (3-5 seconds)
Anti-Bot: None
Output: JSON (title, content, URL)

`scripts/playwright-stealth.js` ⭐

Use Case: Sites with Cloudflare or anti-bot protection
Speed: Medium (5-20 seconds)
Anti-Bot: Medium-High (hides automation, realistic UA)
Output: JSON + Screenshot + HTML file
Verified: 100% success on Discuss.com.hk

🎓 Best Practices

1. Try web_fetch First

If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool; it is the fastest option.

2. Need JavaScript? Use Playwright Simple

If you need to wait for JavaScript rendering, use playwright-simple.js.

3. Getting Blocked? Use Stealth

If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.
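The escalation from Simple to Stealth can also be automated. The sketch below is a hypothetical wrapper, not part of the skill: it assumes each script prints a single JSON document to stdout and exits with a non-zero code when it fails, neither of which is confirmed by the documentation. The web_fetch step is left out because it is an OpenClaw tool rather than a Node script.

```javascript
const { execFile } = require('node:child_process');

// Run one of the skill's scripts and resolve with its parsed JSON output.
// Assumption: the script writes one JSON document to stdout and exits
// non-zero on failure.
function runScript(script, url) {
  return new Promise((resolve, reject) => {
    execFile('node', [script, url], { maxBuffer: 16 * 1024 * 1024 }, (err, stdout) => {
      if (err) {
        reject(err);
        return;
      }
      try {
        resolve(JSON.parse(stdout));
      } catch (parseErr) {
        reject(parseErr);
      }
    });
  });
}

// Try the cheaper script first, then fall back to the stealth script.
// `run` is injectable so the control flow can be exercised without a browser.
async function scrapeWithFallback(url, run = runScript) {
  const scripts = ['scripts/playwright-simple.js', 'scripts/playwright-stealth.js'];
  let lastError;
  for (const script of scripts) {
    try {
      const result = await run(script, url);
      return { script, result };
    } catch (err) {
      lastError = err; // blocked or failed: escalate to the next script
    }
  }
  throw lastError;
}
```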

4. Special Sites Need Specialized Skills

YouTube → deep-scraper
Reddit → reddit-scraper
Twitter → bird skill

🔧 Customization

All scripts support environment variables:

# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL

# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL

# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL

# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
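Inside a script, these variables would typically be read from `process.env` and turned into typed settings. The following is a minimal sketch of that step; the variable names are the ones documented above, while `readConfig`, the returned field names, and the default values (for example the 5000 ms wait) are assumptions and may differ from what the skill's scripts actually do.

```javascript
// Hypothetical config reader (not the skill's own code). Variable names
// are those documented above; defaults are assumptions.
function readConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    screenshotPath: env.SCREENSHOT_PATH || null,
    waitTimeMs: Number.isNaN(waitTime) ? 5000 : waitTime, // assumed default
    headless: env.HEADLESS !== 'false', // headless unless explicitly disabled
    saveHtml: env.SAVE_HTML === 'true',
    userAgent: env.USER_AGENT || null,
  };
}

console.log(readConfig({ WAIT_TIME: '10000', HEADLESS: 'false' }));
```

Environment variables are always strings, so the boolean flags are compared against literal `'true'` and `'false'` rather than tested for truthiness.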

📊 Performance Comparison

Method	Speed	Anti-Bot	Success Rate (Discuss.com.hk)
web_fetch	⚡ Fastest	❌ None	0%
Playwright Simple	🚀 Fast	⚠️ Low	20%
Playwright Stealth	⏱️ Medium	✅ Medium	100% ✅
Puppeteer Stealth	⏱️ Medium	✅ Medium-High	~80%
Crawlee (deep-scraper)	🐢 Slow	❌ Detected	0%
Chaser (Rust)	⏱️ Medium	❌ Detected	0%

🛡️ Anti-Bot Techniques Summary

Lessons learned from our testing:

✅ Effective Anti-Bot Measures

Hide navigator.webdriver — Essential
Realistic User-Agent — Use real devices (iPhone, Android)
Mimic Human Behavior — Random delays, scrolling
Avoid Framework Signatures — Crawlee, Selenium are easily detected
Use addInitScript (Playwright) — Inject before page load
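The "Mimic Human Behavior" item in the list above could be approached with randomized scrolling. This is an illustrative sketch only: `planScrolls` and `humanScroll` are hypothetical names, and the pixel and pause ranges are arbitrary assumptions rather than values taken from the skill.

```javascript
// Plan a series of scroll steps with randomized distance and pause.
// The ranges below are arbitrary assumptions.
function planScrolls(steps, rng = Math.random) {
  const plan = [];
  for (let i = 0; i < steps; i += 1) {
    plan.push({
      deltaY: 200 + Math.floor(rng() * 401), // 200 to 600 px per step
      pauseMs: 300 + Math.floor(rng() * 1201), // 300 to 1500 ms between steps
    });
  }
  return plan;
}

// Apply the plan to a Playwright page: scroll, pause, repeat.
async function humanScroll(page, steps = 5) {
  for (const { deltaY, pauseMs } of planScrolls(steps)) {
    await page.mouse.wheel(0, deltaY);
    await page.waitForTimeout(pauseMs);
  }
}

// Usage, given an open Playwright page:
// await humanScroll(page, 4);
```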

❌ Ineffective Anti-Bot Measures

Only changing User-Agent — Not enough
Using high-level frameworks (Crawlee) — More easily detected
Docker isolation — Doesn't help with Cloudflare

🔍 Troubleshooting

Issue: 403 Forbidden

Solution: Use playwright-stealth.js

Issue: Cloudflare Challenge Page

Solution:

Increase wait time (10-15 seconds)
Try headful mode (HEADLESS=false); it sometimes has a higher success rate
Consider using proxy IPs
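The three suggestions above can be combined into a retry loop. The sketch below is hypothetical and not part of the skill: the title check is only a heuristic, assuming the interstitial uses a "Just a moment" page title, and the attempt schedule and function names are illustrative. The `proxyServer` value is a placeholder you would supply yourself.

```javascript
// Heuristic only: assumes the challenge interstitial has a title
// containing "Just a moment". It may miss or misfire.
function looksLikeChallenge(title) {
  return (title || '').toLowerCase().includes('just a moment');
}

// Escalating attempts: longer wait, then headful mode, then a proxy if given.
function buildAttempts(proxyServer) {
  const attempts = [
    { waitMs: 10000, headless: true },
    { waitMs: 15000, headless: true },
    { waitMs: 15000, headless: false },
  ];
  if (proxyServer) {
    attempts.push({ waitMs: 15000, headless: false, proxy: { server: proxyServer } });
  }
  return attempts;
}

async function fetchPastChallenge(url, proxyServer) {
  const { chromium } = require('playwright'); // loaded lazily
  for (const { waitMs, headless, proxy } of buildAttempts(proxyServer)) {
    const browser = await chromium.launch({ headless, proxy });
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      await page.waitForTimeout(waitMs); // let the challenge resolve, if it will
      const title = await page.title();
      if (!looksLikeChallenge(title)) {
        return { title, html: await page.content() };
      }
    } finally {
      await browser.close();
    }
  }
  throw new Error('Still on a challenge page after all attempts');
}
```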

Issue: Blank Page