
Clean Content Fetch — Skill Tool

v1.0.5

Fetches clean, readable main-body content from web pages. Suited to modern web pages, blogs, news, announcements, and WeChat Official Account articles; supports main-content extraction, content cleaning, de-noising, and Markdown output. Useful when plain fetch performs poorly, pages are noisy, or dynamic rendering interferes.

by @jllyzzd (晨冬) · MIT-0
License: MIT-0
Last updated: 2026/4/12
Security scan: VirusTotal — Harmless
OpenClaw — Suspicious (high confidence)
The skill's instructions require running Python scripts (and installing browser automation dependencies) but those script files are not included in the package — the described runtime does not match the delivered files.
Assessment advice
This skill's README-like instructions expect a scripts/ directory and a script named scripts/scrapling_fetch.py, but those scripts are not included in the package. Before installing or running anything: 1) Ask the publisher for the missing script files or an authoritative source (git repo or release) and verify their contents. 2) Never pip-install packages system-wide for unknown code — use an isolated virtual environment or container. 3) Inspect any fetched scripts for network calls, hidden end...
Detailed analysis
Purpose and capabilities
The name/description claim a content-extraction tool that runs a Python pipeline (scrapling + html2text + optional Playwright). That purpose would legitimately need the referenced scripts and possibly those dependencies. However, the package contains only reference docs and no scripts (e.g., scripts/scrapling_fetch.py is referenced in SKILL.md but not present). This mismatch means the skill as delivered cannot perform its stated function without fetching or relying on external code.
Instruction scope
SKILL.md gives concrete runtime instructions (run python3 scripts/scrapling_fetch.py <url> <max_chars>, install packages, optionally use playwright) which are narrowly scoped to fetching and cleaning public webpages. Those instructions do not ask for unrelated system files or credentials. The problem is they direct execution of a script that is not included; if an agent attempted to follow them it would need to obtain or install code from elsewhere, which is not documented here and increases risk.
Installation mechanism
There is no install spec and no binaries packaged. That keeps the skill low-risk from an automatic-install perspective. The SKILL.md recommends pip installs and playwright browser installation — standard for this functionality — but these are manual recommendations, not an automated install step included in the package.
Credential requirements
The skill requests no environment variables, no credentials, and no config-path access. The declared dependencies (scrapling, html2text, curl_cffi, playwright, browserforge) align with web fetching and rendering. Nothing in the description asks for unrelated secrets or system access.
Persistence and permissions
The skill is user-invocable, not always-on, and does not request to modify other skills or persist configuration. Autonomous invocation is allowed by default but is not combined with any other high-risk factor here.
Security comes in layers; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Versions

latest · v1.0.5 · 2026/3/9

Republish full bundle after path cleanup

● Harmless

Install command

Official: npx clawhub@latest install clean-content-fetch
Mirror (CN): npx clawhub@latest install clean-content-fetch --registry https://cn.clawhub-mirror.com

Skill documentation

Prefer this skill when the user wants to fetch web page content, extract the main body, convert a page to Markdown/text, or scrape an article's body.

Default flow

  • Run python3 scripts/scrapling_fetch.py
  • Default main-content selector priority:
    - article
    - main
    - .post-content
    - [class*="body"]
  • On a selector hit, convert to Markdown with html2text
  • If nothing matches, fall back to body
  • Finally, truncate output to max_chars
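Since the packaged script is missing (as the scan report notes), the fallback-and-truncate logic above can only be illustrated. A minimal sketch, assuming a hypothetical `pick_content` helper that receives pre-matched selector text rather than the real scrapling pipeline:

```python
# Hypothetical sketch of the documented fallback: try selectors in priority
# order, take the first non-empty hit, truncate to max_chars. This is NOT
# the shipped scripts/scrapling_fetch.py, which is absent from the package.
SELECTOR_PRIORITY = ["article", "main", ".post-content", '[class*="body"]', "body"]

def pick_content(hits: dict, max_chars: int) -> str:
    """Return the first matched selector's text, truncated to max_chars."""
    for selector in SELECTOR_PRIORITY:
        text = hits.get(selector, "")
        if text.strip():
            return text[:max_chars]
    return ""
```

The priority order mirrors the selector list above; in the real pipeline each key would come from a CSS-selector query against the fetched page.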

Usage

python3 scripts/scrapling_fetch.py <url> 30000
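The documented invocation takes a URL, a character cap, and the --json flag mentioned under output conventions. As a sketch of that CLI shape (the parser is an assumption, not the shipped script):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented CLI: scrapling_fetch.py <url> <max_chars> [--json]
    parser = argparse.ArgumentParser(prog="scrapling_fetch.py")
    parser.add_argument("url", help="page or short-link URL to fetch")
    parser.add_argument("max_chars", type=int, help="truncate output to this length")
    parser.add_argument("--json", action="store_true", help="emit structured JSON")
    return parser
```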

Dependencies

Common dependencies include:
  • scrapling
  • html2text
  • curl_cffi
  • playwright
  • browserforge

Install the dependencies in an isolated environment before running the script. If the host environment restricts system-wide pip installs, use a project-level virtual environment.

Example:

python3 -m venv .venv
. .venv/bin/activate
pip install scrapling html2text curl_cffi playwright browserforge
python -m playwright install chromium
python scripts/scrapling_fetch.py <url> 30000

Output conventions

By default the script prints the extracted Markdown body. Add --json for structured output. To debug which selector matched, check stderr.
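A minimal sketch of that output contract (payload on stdout, matched-selector debug line on stderr, JSON when the flag is set); the function name and JSON field names are assumptions:

```python
import json
import sys

def emit(markdown: str, selector: str, as_json: bool = False) -> str:
    # Per the convention above: which selector matched goes to stderr,
    # while the payload (Markdown or JSON) is what gets printed to stdout.
    print(f"matched selector: {selector}", file=sys.stderr)
    if as_json:
        return json.dumps({"selector": selector, "markdown": markdown})
    return markdown
```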

Additional resources

  • Usage reference: references/usage.md
  • Selector strategy: references/selectors.md
  • Unified entry point: scripts/fetch-web-content

When to use this skill

  • Fetching an article's main body
  • Scraping blog, news, or announcement content
  • Converting a web page to Markdown for later summarization
  • When plain fetch performs poorly and you want more reliable scraping of modern pages
  • Fetching Xiaohongshu share short links or note landing pages

Xiaohongshu fetching

For xhslink.com short links or Xiaohongshu note pages, run directly:
python3 scripts/scrapling_fetch.py 'http://xhslink.com/o/9745hugimlD' 30000

Notes:

  • The script first resolves the short link, then fetches the landing page body
  • Suited to extracting a note's text, title, and main content
  • If the page requires more complex interaction, switch to browser automation
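As a small illustrative check before choosing the short-link path (the helper name is hypothetical; the real script's detection logic is not shipped):

```python
from urllib.parse import urlparse

def is_xhs_short_link(url: str) -> bool:
    # xhslink.com URLs are Xiaohongshu share short links that redirect
    # to the note's landing page, which the script then fetches.
    return urlparse(url).netloc.lower() == "xhslink.com"
```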

Security boundaries

  • Only for fetching the main body and readable text of public web pages
  • Not for pages behind login, private data, restricted resources, or bypassing access controls
  • If the target page requires account login, authorization clicks, scroll interaction, or complex session state, switch to browser automation and run it only with explicit authorization

When not to use

  • Full browser interaction, clicks, login, or pagination needed: use browser automation instead
  • Just fetching API JSON: calling the API directly is a better fit
Data source: ClawHub · Chinese localization: 龙虾技能库