Version
v2: Dimensional scoring (0-10), meta-skill optimization, progressive disclosure, autonomous setup
Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.
Phase 1: Structure Audit (run first, always)
Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:
Structural Checklist:
- Gotchas section — does SKILL.md have a `## Gotchas` section with at least one real failure case? (Highest-signal content per Anthropic)
- Trigger-phrase description — does the YAML `description` field say when to use the skill, not just what it does? Must include "Use when..." or an equivalent trigger condition.
- Progressive disclosure — does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
- Single focus — does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
- No railroading — does the skill give Claude information plus flexibility, rather than over-specifying how it must execute?
Score each: ✅ pass | ❌ fail | ⚠️ partial
For each failure: propose a concrete fix and apply if approved.
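The audit above can be sketched as data, so failing or partial checks are collected for fix proposals. This is a hypothetical illustration; the check names and the `audit_summary` helper are not part of the skill's required interface.

```python
# The five Phase 1 structural checks, in audit order.
CHECKS = [
    "Gotchas section",
    "Trigger-phrase description",
    "Progressive disclosure",
    "Single focus",
    "No railroading",
]

def audit_summary(results: dict[str, str]) -> list[str]:
    """Return the checks scored fail/partial, which each need a proposed fix."""
    assert set(results) == set(CHECKS), "score every check"
    assert all(v in {"pass", "fail", "partial"} for v in results.values())
    return [c for c in CHECKS if results[c] != "pass"]

needs_fix = audit_summary({
    "Gotchas section": "fail",
    "Trigger-phrase description": "pass",
    "Progressive disclosure": "partial",
    "Single focus": "pass",
    "No railroading": "pass",
})
# needs_fix == ["Gotchas section", "Progressive disclosure"]
```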
Quick wins to apply immediately:
- If no Gotchas section → add `## Gotchas\n- [Placeholder: add real failures here as they're discovered]`
- If the description is a summary → rewrite it as a trigger condition
- If all content is inline → propose a `references/` folder structure
Phase 2: Output Quality Loop (autoresearch)
After structure audit, run the iterative improvement loop on the skill's actual outputs.
Setup
- Which skill? — the user specifies, or infer from context.
- Test inputs — get 2-3 representative inputs. If the user doesn't provide them:
- Scoring checklist — build 3-6 scoring items. Start from the examples below, then customize:
Scoring Checklist Examples
See `references/checklist-examples.md` for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).
Scoring Modes
Binary mode (default for simple skills): yes/no per checklist item. Pass rate = total yes / (items × runs).
Dimensional mode (use for complex skills or when binary plateaus): score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do not rewrite everything.
Use dimensional mode when:
- Binary scoring hits 100% but the output still feels mediocre
- The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
- You want to improve from "good" to "excellent" rather than from "broken" to "working"
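Both scoring modes reduce to simple arithmetic. A minimal sketch, with illustrative function names and checklist items (not part of the skill itself):

```python
from statistics import mean

def binary_pass_rate(runs: list[dict[str, bool]]) -> float:
    """Pass rate = total yes / (items x runs)."""
    items = len(runs[0])
    total_yes = sum(v for run in runs for v in run.values())
    return total_yes / (items * len(runs))

def weakest_dimension(runs: list[dict[str, int]]) -> str:
    """Dimension with the lowest average 0-10 score across runs."""
    dims = runs[0].keys()
    return min(dims, key=lambda d: mean(run[d] for run in runs))

rate = binary_pass_rate([
    {"mentions name": True, "has CTA": False},
    {"mentions name": True, "has CTA": True},
])  # 3 yes / (2 items x 2 runs) = 0.75

weak = weakest_dimension([
    {"accuracy": 8, "tone": 5, "brevity": 9},
    {"accuracy": 8, "tone": 7, "brevity": 9},
])  # "tone" averages 6.0, the lowest
```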
Loop
Round N:
- Run skill against each test input
- Score each output (binary: 1 per yes | dimensional: 0-10 per dimension)
- Calculate score:
- Binary: pass rate = (total yes) / (items × runs)
- Dimensional: avg score per dimension across runs
- Identify the weakest item/dimension (most failures or lowest avg score)
- Make ONE targeted change to SKILL.md addressing ONLY that weakness
- Re-run and re-score
- If new score > old score: KEEP. Else: REVERT.
- Log: score before/after, change made, dimension targeted, kept/reverted
Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
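The round structure above, including the keep/revert decision and the consecutive-rounds stop condition, can be sketched as a loop. The `run_and_score` and `propose_change` hooks are hypothetical placeholders, not defined by this skill:

```python
def improvement_loop(skill_md, run_and_score, propose_change,
                     target=0.95, streak_needed=3, max_rounds=20):
    """One targeted change per round; keep only if the score improves."""
    score = run_and_score(skill_md)
    streak = 0
    for _ in range(max_rounds):
        if score >= target:
            streak += 1
            if streak == streak_needed:  # e.g. >=95% for 3 consecutive rounds
                break
        else:
            streak = 0
        candidate = propose_change(skill_md)  # ONE change, weakest item only
        new_score = run_and_score(candidate)
        if new_score > score:                 # KEEP
            skill_md, score = candidate, new_score
        # else REVERT: discard the candidate, keep skill_md unchanged
    return skill_md, score

# Simulated run: scores improve, dip once (reverted), then plateau above target.
scores = iter([0.4, 0.7, 0.6, 0.96, 0.96, 0.96, 0.96])
final, score = improvement_loop("base",
                                run_and_score=lambda s: next(scores),
                                propose_change=lambda s: s + "+")
# final == "base++", score == 0.96 (the 0.6 round was reverted)
```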
Output Files
- `skills/{skill-name}/SKILL-optimized.md` — improved version (original untouched)
- `skills/{skill-name}/optimization-changelog.md` — full round log
Changelog Format
## Structural Audit
- Gotchas section: ❌ → Added placeholder
- Description: ❌ → Rewritten as trigger condition
- Progressive disclosure: ⚠️ → Noted, deferred
Round 1 (binary mode)
- Score: 4/10 (40%)
- Weakest item: "Does it mention business name?"
- Change: Added rule "Always open with [Business Name],"
- New score: 7/10 (70%)
- Decision: KEPT
Round 2 (dimensional mode)
- Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
- Weakest dimension: Tone (5/10)
- Change: Added "Match prospect's industry language, not generic sales speak"
- New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
- Decision: KEPT (Tone +2)
Optimizing Meta-Skills (Process Skills)
Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:
What to score: score the experience of following the process, not a text artifact.
- Did the process produce a clear result?
- Were there moments of confusion where instructions were ambiguous?
- Did any step feel unnecessary or redundant?
- Could someone follow it without prior context?
How to test: run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs are the tasks you're applying the skill to.
Dimensional scoring for process skills:
- Clarity — can I follow each step without re-reading?
- Completeness — does the process cover the full workflow?
- Actionability — do I know exactly what to do at each step, or do I have to infer?
- Efficiency — are there wasted/redundant steps?
- Self-applicability — can the process improve itself? (Meta-test)
Checklist Sweet Spot
- 3-6 questions = optimal
- Too few: not granular enough to guide changes
- Too many: the skill starts gaming the checklist (like a student memorizing answers without understanding)
When to Use
- Before running any skill at scale (cold outreach, content generation, scraping)
- After a new model upgrade — re-verify existing skills
- When a skill has inconsistent output quality
- Monthly maintenance pass on high-use skills
- Immediately after creating a new skill (the structural audit only takes 5 min)
When to Run Which Phase
- Any new skill → structure audit (5 min, catches issues early)
- Before scale use → output loop (verify quality before mass runs)
- After a model upgrade → output loop (re-verify existing skills)
- Inconsistent output → output loop (find the failing item/dimension)
- High-revenue skills → both phases (cold outreach, content gen — quality variance = revenue impact)
Gotchas
- The output loop requires skills that produce scoreable text outputs — scripts/tools that produce side effects need a different verification approach (use the Product Verification skill type instead)
- Don't run the output loop on skills that call expensive APIs without rate-limit awareness — each round runs the skill multiple times
- Phase 1 (structure audit) should always run before Phase 2 — fixing structure first makes the output loop more effective
- 3-6 checklist questions is the sweet spot — more than 6 and the skill starts gaming individual checks rather than improving overall quality