Version
v2: Dimensional scoring (0-10), meta-skill optimization, progressive disclosure, autonomous setup
Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.
Phase 1: Structure Audit (run first, always)
Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:
Structural Checklist:
- Gotchas section — does SKILL.md have a `## Gotchas` section with at least one real failure case? (Highest-signal content per Anthropic)
- Trigger-phrase description — does the YAML `description` field say when to use the skill, not just what it does? Must include "Use when..." or an equivalent trigger condition.
- Progressive disclosure — does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
- Single focus — does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
- No railroading — does the skill give Claude information plus flexibility, rather than over-specifying how it must execute?
Score each: ✅ pass | ❌ fail | ⚠️ partial
For each failure: propose a concrete fix and apply if approved.
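The audit above can be sketched as data, so failing or partial checks are collected for fix proposals. This is a hypothetical illustration; the check names and the `audit_summary` helper are not part of the skill's required interface.

```python
# The five Phase 1 structural checks, in audit order.
CHECKS = [
    "Gotchas section",
    "Trigger-phrase description",
    "Progressive disclosure",
    "Single focus",
    "No railroading",
]

def audit_summary(results: dict[str, str]) -> list[str]:
    """Return the checks scored fail/partial, which each need a proposed fix."""
    assert set(results) == set(CHECKS), "score every check"
    assert all(v in {"pass", "fail", "partial"} for v in results.values())
    return [c for c in CHECKS if results[c] != "pass"]

needs_fix = audit_summary({
    "Gotchas section": "fail",
    "Trigger-phrase description": "pass",
    "Progressive disclosure": "partial",
    "Single focus": "pass",
    "No railroading": "pass",
})
# needs_fix == ["Gotchas section", "Progressive disclosure"]
```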
Quick wins to apply immediately:
- If no Gotchas section → add `## Gotchas\n- [Placeholder: add real failures here as they're discovered]`
- If the description is a summary → rewrite it as a trigger condition
- If all content is inline → propose a `references/` folder structure
Phase 2: Output Quality Loop (autoresearch)
After structure audit, run the iterative improvement loop on the skill's actual outputs.
Setup
- Which skill? — the user specifies, or infer from context.
- Test inputs — get 2-3 representative inputs. If the user doesn't provide them:
- Scoring checklist — build 3-6 scoring items. Start from the examples below, then customize:
Scoring Checklist Examples
See `references/checklist-examples.md` for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).
Scoring Modes
Binary mode (default for simple skills): yes/no per checklist item. Pass rate = total yes / (items × runs).
Dimensional mode (use for complex skills or when binary plateaus): score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do not rewrite everything.
Use dimensional mode when:
- Binary scoring hits 100% but the output still feels mediocre
- The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
- You want to improve from "good" to "excellent" rather than from "broken" to "working"
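Both scoring modes reduce to simple arithmetic. A minimal sketch, with illustrative function names and checklist items (not part of the skill itself):

```python
from statistics import mean

def binary_pass_rate(runs: list[dict[str, bool]]) -> float:
    """Pass rate = total yes / (items x runs)."""
    items = len(runs[0])
    total_yes = sum(v for run in runs for v in run.values())
    return total_yes / (items * len(runs))

def weakest_dimension(runs: list[dict[str, int]]) -> str:
    """Dimension with the lowest average 0-10 score across runs."""
    dims = runs[0].keys()
    return min(dims, key=lambda d: mean(run[d] for run in runs))

rate = binary_pass_rate([
    {"mentions name": True, "has CTA": False},
    {"mentions name": True, "has CTA": True},
])  # 3 yes / (2 items x 2 runs) = 0.75

weak = weakest_dimension([
    {"accuracy": 8, "tone": 5, "brevity": 9},
    {"accuracy": 8, "tone": 7, "brevity": 9},
])  # "tone" averages 6.0, the lowest
```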
Loop
Round N:
- Run skill against each test input
- Score each output (binary: 1 per yes | dimensional: 0-10 per dimension)
- Calculate score:
- Binary: pass rate = (total yes) / (items × runs)
- Dimensional: avg score per dimension across runs
- Identify the weakest item/dimension (most failures or lowest avg score)
- Make ONE targeted change to SKILL.md addressing ONLY that weakness
- Re-run and re-score
- If new score > old score: KEEP. Else: REVERT.
- Log: score before/after, change made, dimension targeted, kept/reverted
Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
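The round structure above, including the keep/revert decision and the consecutive-rounds stop condition, can be sketched as a loop. The `run_and_score` and `propose_change` hooks are hypothetical placeholders, not defined by this skill:

```python
def improvement_loop(skill_md, run_and_score, propose_change,
                     target=0.95, streak_needed=3, max_rounds=20):
    """One targeted change per round; keep only if the score improves."""
    score = run_and_score(skill_md)
    streak = 0
    for _ in range(max_rounds):
        if score >= target:
            streak += 1
            if streak == streak_needed:  # e.g. >=95% for 3 consecutive rounds
                break
        else:
            streak = 0
        candidate = propose_change(skill_md)  # ONE change, weakest item only
        new_score = run_and_score(candidate)
        if new_score > score:                 # KEEP
            skill_md, score = candidate, new_score
        # else REVERT: discard the candidate, keep skill_md unchanged
    return skill_md, score

# Simulated run: scores improve, dip once (reverted), then plateau above target.
scores = iter([0.4, 0.7, 0.6, 0.96, 0.96, 0.96, 0.96])
final, score = improvement_loop("base",
                                run_and_score=lambda s: next(scores),
                                propose_change=lambda s: s + "+")
# final == "base++", score == 0.96 (the 0.6 round was reverted)
```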
Output Files
- `skills/{skill-name}/SKILL-optimized.md` — improved version (original untouched)
- `skills/{skill-name}/optimization-changelog.md` — full round log
Changelog Format
## Structural Audit
- Gotchas section: ❌ → Added placeholder
- Description: ❌ → Rewritten as trigger condition
- Progressive disclosure: ⚠️ → Noted, deferred
Round 1 (binary mode)
- Score: 4/10 (40%)
- Weakest item: "Does it mention business name?"
- Change: Added rule "Always open with [Business Name],"
- New score: 7/10 (70%)
- Decision: KEPT
Round 2 (dimensional mode)
- Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
- Weakest dimension: Tone (5/10)
- Change: Added "Match prospect's industry language, not generic sales speak"
- New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
- Decision: KEPT (Tone +2)
Optimizing Meta-Skills (Process Skills)
Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:
What to score: score the experience of following the process, not a text artifact.
- Did the process produce a clear result?
- Were there moments of confusion where instructions were ambiguous?
- Did any step feel unnecessary or redundant?
- Could someone follow it without prior context?
How to test: run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs are the tasks you're applying the skill to.
Dimensional scoring for process skills:
- Clarity — can I follow each step without re-reading?
- Completeness — does the process cover the full workflow?
- Actionability — do I know exactly what to do at each step, or do I have to infer?
- Efficiency — are there wasted/redundant steps?
- Self-applicability — can the process improve itself? (Meta-test)
Checklist Sweet Spot
- 3-6 questions = optimal
- Too few: not granular enough to guide changes
- Too many: the skill starts gaming the checklist (like a student memorizing answers without understanding)
When to Use
- Before running any skill at scale (cold outreach, content generation, scraping)
- After a new model upgrade — re-verify existing skills
- When a skill has inconsistent output quality
- Monthly maintenance pass on high-use skills
- Immediately after creating a new skill (the structural audit only takes 5 min)
When to Run Which Phase
- Any new skill → structure audit (5 min, catches issues early)
- Before scale use → output loop (verify quality before mass runs)
- After a model upgrade → output loop (re-verify existing skills)
- Inconsistent output → output loop (find the failing item/dimension)
- High-revenue skills → both phases (cold outreach, content gen — quality variance = revenue impact)
Gotchas
- The output loop requires skills that produce scoreable text outputs — scripts/tools that produce side effects need a different verification approach (use the Product Verification skill type instead)
- Don't run the output loop on skills that call expensive APIs without rate-limit awareness — each round runs the skill multiple times
- Phase 1 (structure audit) should always run before Phase 2 — fixing structure first makes the output loop more effective
- 3-6 checklist questions is the sweet spot — more than 6 and the skill starts gaming individual checks rather than improving overall quality