
AutoResearch Skill Optimizer — 自动研究技能优化器

v2.0.0

An auto-research skill optimizer tool.

by @ngmeyer (Neal Meyer)·MIT-0
License
MIT-0
Last updated
2026/4/13
Security scan
VirusTotal
Harmless
OpenClaw
Suspicious
medium confidence
The skill's stated purpose (auto-improving other skills) is plausible, but its runtime instructions let the agent read session history, infer test inputs, and modify other skills' files without explicit guardrails — behavior that can unintentionally leak data or alter unrelated skills.
Assessment Recommendation
This skill can be useful for automating improvement of skills, but it has two practical risks: (1) it tells the agent to pull test inputs from session history/memory and other skills' recent invocations — that can expose private data; (2) it instructs the agent to edit other skills' files (skills/{skill-name}/SKILL-optimized.md and to modify SKILL.md), and the approval mechanism is vague. Before installing/running: (a) require explicit, interactive approval before any edits and prefer a dry-run ...
Detailed Analysis
Purpose & Capabilities
The name/description match the instructions: it is an instruction-only optimizer that runs iterative tests and edits SKILL.md. Producing optimized SKILL-optimized.md and changelogs is coherent with the purpose. However, the workflow explicitly writes edits into skills/{skill-name}/ (modifying other skill files), which is powerful but not clearly restricted to a copy or sandbox — this capability is proportionate only if the agent is intended to edit target-skill files and the user expects that.
Instruction Scope
SKILL.md tells the agent to: (a) infer test inputs from the skill's docs or from "recent real invocations from memory/session history", and (b) make targeted edits to the target SKILL.md and keep/revert them. Asking the agent to access session/memory history and to programmatically modify other skills' files is broad and not explicitly constrained (no explicit 'dry-run' or explicit user-approval step is enforced). These instructions give the agent wide discretion to collect contextual data and to change other skills, which is scope creep from a simple auditor unless the user explicitly consents.
Install Mechanism
No install spec and no code files — instruction-only — so there is nothing downloaded or written at install time. This minimizes supply-chain risk.
Credential Requirements
The skill declares no environment variables or external credentials, but the runtime instructions rely on accessing session/memory history and reading other skills' docs and example prompts. Those are not declared as required sources and could include sensitive information. Also, the instructions assume the ability to write files into a skills/ directory; filesystem access is implied but not documented or constrained.
Persistence & Permissions
always:false (good), but the skill's autonomous-invocation setting is default (allowed). Combined with instructions that modify other skills' files and infer inputs from session history, autonomous runs could repeatedly alter or leak content. There is no explicit in-flow mandatory user confirmation checkpoint for edits (the doc says 'apply if approved' but does not define how approval is requested or enforced).
Security comes in layers; review the code before running.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime Dependencies

No special dependencies

Versions

latest · v2.0.0 · 2026/3/18

v2: Dimensional scoring (0-10), meta-skill optimization, progressive disclosure, autonomous setup

● Harmless

Install Command

Official: npx clawhub@latest install autoresearch-skill-optimizer
Mirror (accelerated): npx clawhub@latest install autoresearch-skill-optimizer --registry https://cn.clawhub-mirror.com

Skill Documentation

Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.


Phase 1: Structure Audit (run first, always)

Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:

Structural Checklist:

  • Gotchas section — Does SKILL.md have a ## Gotchas section with at least one real failure case? (Highest-signal content per Anthropic)
  • Trigger-phrase description — Does the YAML description field say when to use the skill, not just what it does? It must include "Use when..." or an equivalent trigger condition.
  • Progressive disclosure — Does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
  • Single focus — Does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
  • No railroading — Does the skill give Claude information + flexibility, rather than over-specifying how it must execute?

Score each: ✅ pass | ❌ fail | ⚠️ partial

For each failure: propose a concrete fix and apply if approved.

Quick wins to apply immediately:

  • If there is no Gotchas section → add ## Gotchas\n- [Placeholder: add real failures here as they're discovered]
  • If the description is a summary → rewrite it as a trigger condition
  • If all content is inline → propose a references/ folder structure
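The five structural checks can be recorded as simple pass/fail/partial marks. A minimal Python sketch of such a record (the check names, example verdicts, and data layout here are illustrative, not part of the skill itself):

```python
# Recording Phase 1 audit results as pass/fail/partial marks.
# Check names and example verdicts are illustrative only.
PASS, FAIL, PARTIAL = "pass", "fail", "partial"

audit = {
    "gotchas_section": FAIL,        # no ## Gotchas section found
    "trigger_description": FAIL,    # description summarizes, doesn't trigger
    "progressive_disclosure": PARTIAL,
    "single_focus": PASS,
    "no_railroading": PASS,
}

# Anything short of a clean pass gets a concrete proposed fix.
needs_fix = [name for name, mark in audit.items() if mark != PASS]
print(needs_fix)
# ['gotchas_section', 'trigger_description', 'progressive_disclosure']
```

Keeping partials in the fix list matches the workflow above: every non-pass check should come with a concrete proposed change.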

Phase 2: Output Quality Loop (autoresearch)

After structure audit, run the iterative improvement loop on the skill's actual outputs.

Setup

  • Which skill? — The user specifies, or infer it from context.
  • Test inputs — Get 2-3 representative inputs. If the user doesn't provide them:
    - Check the skill's own docs for example usage
    - Use recent real invocations from memory/session history
    - For extraction skills: use known-good URLs/files. For generation skills: use the skill's own example prompts.
  • Scoring checklist — Build 3-6 scoring items. Start from the examples below, then customize:
    - What's the #1 thing that makes the skill's output bad? (That's checklist item 1)
    - What would make the user say "That's exactly what I wanted"? (That's the positive framing)
    - Add 1-2 items from the "Universal structural quality" list below

Scoring Checklist Examples

See references/checklist-examples.md for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).

Scoring Modes

Binary mode (default for simple skills): yes/no per checklist item. Pass rate = total yes / (items × runs).

Dimensional mode (use for complex skills or when binary plateaus): Score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do not rewrite everything.

Use dimensional mode when:

  • Binary scoring hits 100% but the output still feels mediocre
  • The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
  • You want to improve from "good" to "excellent" rather than from "broken" to "working"
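As a worked example of the two scoring formulas (the run data below is made up purely for illustration):

```python
# Binary mode: each run is a list of yes/no answers to the checklist.
binary_runs = [
    [True, True, False, True],   # run 1: 3 of 4 items pass
    [True, False, False, True],  # run 2: 2 of 4 items pass
]
items = len(binary_runs[0])
pass_rate = sum(sum(run) for run in binary_runs) / (items * len(binary_runs))
print(pass_rate)  # 5 yes / (4 items x 2 runs) = 0.625

# Dimensional mode: each run scores named dimensions 0-10; the weakest
# dimension (lowest average across runs) becomes the revision target.
dim_runs = [
    {"accuracy": 8, "tone": 5, "brevity": 9},
    {"accuracy": 8, "tone": 6, "brevity": 9},
]
averages = {d: sum(r[d] for r in dim_runs) / len(dim_runs) for d in dim_runs[0]}
weakest = min(averages, key=averages.get)
print(weakest)  # tone (avg 5.5)
```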

Loop

Round N:
  • Run skill against each test input
  • Score each output (binary: 1 per yes | dimensional: 0-10 per dimension)
  • Calculate score:
    - Binary: pass rate = (total yes) / (items × runs)
    - Dimensional: avg score per dimension across runs
  • Identify the weakest item/dimension (most failures or lowest avg score)
  • Make ONE targeted change to SKILL.md addressing ONLY that weakness
  • Re-run and re-score
  • If new score > old score: KEEP. Else: REVERT.
  • Log: score before/after, change made, dimension targeted, kept/reverted

Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
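The round loop and stop rule above can be sketched in Python. `run_and_score`, `apply_targeted_change`, and `revert` are hypothetical stand-ins for the agent's actual work; the defaults mirror the binary-mode thresholds (≥ 95% for 3 consecutive rounds, max 20 rounds):

```python
def optimize(run_and_score, apply_targeted_change, revert,
             threshold=0.95, streak_needed=3, max_rounds=20):
    """Keep/revert loop: one targeted change per round; stop once the
    score holds at or above the threshold for 3 consecutive rounds."""
    best = run_and_score()            # baseline score in [0, 1]
    streak = 1 if best >= threshold else 0
    for _ in range(max_rounds):
        if streak >= streak_needed:   # e.g. binary >= 95%, 3 rounds running
            break
        apply_targeted_change()       # ONE edit aimed at the weakest item
        new = run_and_score()
        if new > best:                # improvement: KEEP
            best = new
        else:                         # no improvement: REVERT
            revert()
        streak = streak + 1 if best >= threshold else 0
    return best

# Dry run with canned scores (no-op change/revert):
seq = iter([0.4, 0.7, 0.6, 0.96, 0.96, 0.96])
result = optimize(lambda: next(seq), lambda: None, lambda: None)
print(result)  # 0.96
```

Because a failed round reverts, the kept score never decreases, which is what makes the consecutive-round streak meaningful.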

Output Files

  • skills/{skill-name}/SKILL-optimized.md — improved version (original untouched)
  • skills/{skill-name}/optimization-changelog.md — full round log

Changelog Format

## Structural Audit
  • Gotchas section: ❌ → Added placeholder
  • Description: ❌ → Rewritten as trigger condition
  • Progressive disclosure: ⚠️ → Noted, deferred

Round 1 (binary mode)

  • Score: 4/10 (40%)
  • Weakest item: "Does it mention business name?"
  • Change: Added rule "Always open with [Business Name],"
  • New score: 7/10 (70%)
  • Decision: KEPT

Round 2 (dimensional mode)

  • Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
  • Weakest dimension: Tone (5/10)
  • Change: Added "Match prospect's industry language, not generic sales speak"
  • New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
  • Decision: KEPT (Tone +2)

Optimizing Meta-Skills (Process Skills)

Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:

What to score: Score the experience of following the process, not the text artifact.

  • Did the process produce clear results?
  • Were there moments of confusion where the instructions were ambiguous?
  • Did any step feel unnecessary or redundant?
  • Could someone follow it without prior context?

How to test: Run the skill on 2-3 real tasks (not hypothetical ones). Score after each real use. The test inputs are the tasks you're applying the skill to.

Dimensional scoring for process skills:

  • Clarity — Can I follow each step without re-reading?
  • Completeness — Does the process cover the full workflow?
  • Actionability — Do I know exactly what to do at each step, or do I have to infer?
  • Efficiency — Are there wasted/redundant steps?
  • Self-applicability — Can the process improve itself? (Meta-test)

Checklist Sweet Spot

  • 3-6 questions = optimal
  • Too few: not granular enough to guide changes
  • Too many: the skill starts gaming the checklist (like a student memorizing answers without understanding)

When to Use

  • Before running any skill at scale (cold outreach, content generation, scraping)
  • After a new model upgrade — re-validate existing skills
  • When a skill has inconsistent output quality
  • As a monthly maintenance pass on high-use skills
  • Immediately after creating a new skill (the structural audit takes only 5 min)

When to Run Which Phase

  • Any new skill → Structure audit (5 min, catches issues early)
  • Before scale use → Output loop (validate quality before mass runs)
  • After a model upgrade → Output loop (re-validate existing skills)
  • Inconsistent output → Output loop (find the failing item/dimension)
  • High-revenue skills → Both phases (cold outreach, content gen — quality variance = revenue impact)

Gotchas

  • The output loop requires skills that produce scoreable text outputs — scripts/tools that produce side effects need a different verification approach (use the Product Verification skill type instead)
  • Don't run the output loop on skills that call expensive APIs without rate-limit awareness — each round runs the skill multiple times
  • Phase 1 (structure audit) should always run before Phase 2 — fixing structure first makes the output loop more effective
  • 3-6 checklist questions is the sweet spot — more than 6 and the skill starts gaming individual checks rather than improving overall quality
Data source: ClawHub · Chinese localization: Lobster Skill Library