Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.
## Phase 1: Structure Audit (run first, always)
Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:
Structural Checklist:
- Gotchas section — Does SKILL.md have a `## Gotchas` section with at least one real failure case? (Highest-signal content per Anthropic)
- Trigger-phrase description — Does the YAML `description` field say when to use the skill, not just what it does? Must include "Use when..." or an equivalent trigger condition.
- Progressive disclosure — Does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
- Single focus — Does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
- No railroading — Does the skill give Claude information + flexibility, rather than over-specifying how it must execute?
Score each: ✅ pass | ❌ fail | ⚠️ partial
For each failure, propose a concrete fix and apply it if approved.
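As an illustration of the trigger-phrase check, a hypothetical before/after for a made-up cold outreach skill's `description` field (both values invented):

```yaml
# Before: a summary of what the skill does (fails the check)
description: Generates personalized cold outreach emails.

# After: a trigger condition (passes the check)
description: Use when the user asks to draft, personalize, or review
  cold outreach emails for a prospect or prospect list.
```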
Quick wins to apply immediately:
- If no Gotchas section → add `## Gotchas\n- [Placeholder: add real failures here as they're discovered]`
- If description is a summary → rewrite as trigger condition
- If all content is inline → propose a `references/` folder structure (sketched below)
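A hypothetical layout for that structure, using the directories named in the checks above (contents invented for illustration):

```
skills/{skill-name}/
├── SKILL.md                  # core instructions only
├── config.json               # tunable settings
├── references/
│   └── checklist-examples.md # bulky reference material
├── scripts/                  # helper scripts
└── assets/                   # templates, static files
```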
## Phase 2: Output Quality Loop (autoresearch)
After structure audit, run the iterative improvement loop on the skill's actual outputs.
### Setup
- Which skill? — User specifies, or infer from context.
- Test inputs — Get 2-3 representative inputs. If the user doesn't provide them:
  - Check the skill's own docs for example usage
  - Use recent real invocations from memory/session history
  - For extraction skills: use known-good URLs/files. For generation skills: use the skill's own example prompts.
- Scoring checklist — Build 3-6 scoring items. Start from the examples below, then customize:
  - What's the #1 thing that makes this skill's output bad? (That's checklist item 1)
  - What would make a user say "that's exactly what I wanted"? (That's the positive framing)
  - Add 1-2 items from the "Universal structural quality" list below
### Scoring Checklist Examples
See `references/checklist-examples.md` for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).
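For flavor, a hypothetical 4-item binary checklist for a cold outreach skill (illustrative only; the canonical starters live in that file):

- Does it open with the prospect's business name?
- Does it reference something specific to the prospect rather than boilerplate?
- Is it free of generic sales speak?
- Does it end with a single clear ask?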
### Scoring Modes
Binary mode (default for simple skills): Yes/no per checklist item. Pass rate = total yes / (items × runs).
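Worked example (hypothetical numbers): a 5-item checklist scored over 2 runs gives 10 checks; 4 yes answers is a pass rate of 4/10 = 40%.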
Dimensional mode (use for complex skills or when binary plateaus): Score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do NOT rewrite everything.
Use dimensional mode when:
- Binary scoring hits 100% but output still feels mediocre
- The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
- You want to improve from "good" to "excellent" rather than from "broken" to "working"
### The Loop
Round N:
1. Run the skill against each test input
2. Score each output (binary: 1 per yes | dimensional: 0-10 per dimension)
3. Calculate the score:
   - Binary: pass rate = (total yes) / (items × runs)
   - Dimensional: average score per dimension across runs
4. Identify the weakest item/dimension (most failures or lowest average score)
5. Make ONE targeted change to SKILL.md addressing ONLY that weakness
6. Re-run and re-score
7. If new score > old score: KEEP. Else: REVERT.
8. Log: score before/after, change made, dimension targeted, kept/reverted
Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
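For concreteness, a minimal Python sketch of the binary-mode loop. Every callable passed in (run_skill, score_item, propose_fix, apply_change, revert_change) is a hypothetical stand-in for a manual step described above, not a real API; dimensional mode is analogous, with per-dimension averages instead of a pass rate.

```python
def pass_rate(yes_total: int, n_items: int, n_runs: int) -> float:
    return yes_total / (n_items * n_runs)  # total yes / (items x runs)

def score_all(run_skill, score_item, test_inputs, checklist):
    """Run every test input once; return (pass rate, per-item failure counts)."""
    fails = {item: 0 for item in checklist}
    yes_total = 0
    for x in test_inputs:
        output = run_skill(x)
        for item in checklist:
            if score_item(output, item):   # yes/no per checklist item
                yes_total += 1
            else:
                fails[item] += 1
    return pass_rate(yes_total, len(checklist), len(test_inputs)), fails

def optimize(run_skill, score_item, propose_fix, apply_change, revert_change,
             test_inputs, checklist, max_rounds=20, target=0.95):
    score, fails = score_all(run_skill, score_item, test_inputs, checklist)
    streak = 0
    for _ in range(max_rounds):
        weakest = max(fails, key=fails.get)   # item with the most failures
        change = propose_fix(weakest)         # ONE targeted change to SKILL.md
        apply_change(change)
        new_score, new_fails = score_all(run_skill, score_item,
                                         test_inputs, checklist)
        if new_score > score:                 # improved: KEEP
            score, fails = new_score, new_fails
        else:                                 # no gain: REVERT
            revert_change(change)
        streak = streak + 1 if score >= target else 0
        if streak >= 3:                       # >= 95% for 3 consecutive rounds
            return score
    return score
```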
### Output Files
- `skills/{skill-name}/SKILL-optimized.md` — improved version (original untouched)
- `skills/{skill-name}/optimization-changelog.md` — full round log
### Changelog Format
    ## Structural Audit
    - Gotchas section: ❌ → Added placeholder
    - Description: ❌ → Rewritten as trigger condition
    - Progressive disclosure: ⚠️ → Noted, deferred

    ## Round 1 (binary mode)
    - Score: 4/10 (40%)
    - Weakest item: "Does it mention business name?"
    - Change: Added rule "Always open with [Business Name],"
    - New score: 7/10 (70%)
    - Decision: KEPT

    ## Round 2 (dimensional mode)
    - Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
    - Weakest dimension: Tone (5/10)
    - Change: Added "Match prospect's industry language, not generic sales speak"
    - New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
    - Decision: KEPT (Tone +2)
## Optimizing Meta-Skills (Process Skills)
Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:
What to score: Score the experience of following the process, not a text artifact.
- Did the process produce a clear result?
- Were there moments of confusion where the instructions were ambiguous?
- Did any step feel unnecessary or redundant?
- Could someone follow this without prior context?
How to test: Run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs ARE the tasks you're applying the skill to.
Dimensional scoring for process skills:
- Clarity — Can I follow each step without re-reading?
- Completeness — Does the process cover the full workflow?
- Actionability — Do I know exactly what to do at each step, or do I have to infer?
- Efficiency — Are there wasted/redundant steps?
- Self-applicability — Can this process improve itself? (Meta-test)
## Checklist Sweet Spot
- 3-6 questions = optimal
- Too few: not granular enough to guide changes
- Too many: skill starts gaming the checklist (like a student memorizing answers without understanding)
## When to Use
- Before running any skill at scale (cold outreach, content generation, scraping)
- After a new model upgrade — re-validate existing skills
- When a skill has inconsistent output quality
- Monthly maintenance pass on high-use skills
- Immediately after creating a new skill (structural audit only takes 5 min)
## When to Run Which Phase
- Any new skill → Structure audit (5 min, catches issues early)
- Before scale use → Output loop (validate quality before mass runs)
- After model upgrade → Output loop (re-validate existing skills)
- Inconsistent output → Output loop (find the failing item/dimension)
- High-revenue skills → Both phases (cold outreach, content gen — quality variance = revenue impact)
## Gotchas
- Output loop requires skills that produce scoreable text outputs — scripts/tools that produce side effects need a different verification approach (use a Product Verification skill type instead)
- Don't run output loop on skills that call expensive APIs without rate limit awareness — each round runs the skill multiple times
- Phase 1 (structure audit) should always run before Phase 2 — fixing structure first makes the output loop more effective
- 3-6 checklist questions is the sweet spot — more than 6 and the skill starts gaming individual checks rather than improving overall quality