Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.
## Phase 1: Structure Audit (run first, always)
Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:
Structural Checklist:
- Gotchas section — Does SKILL.md have a `## Gotchas` section with at least one real failure case? (Highest-signal content per Anthropic)
- Trigger-phrase description — Does the YAML `description` field say when to use the skill, not just what it does? Must include "Use when..." or an equivalent trigger condition.
- Progressive disclosure — Does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
- Single focus — Does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
- No railroading — Does the skill give Claude information + flexibility, rather than over-specifying how it must execute?
Score each: ✅ pass | ❌ fail | ⚠️ partial
For each failure, propose a concrete fix and apply it if approved.
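As an illustration of the trigger-phrase check, a hypothetical before/after for a made-up cold outreach skill's `description` field (both values invented):

```yaml
# Before: a summary of what the skill does (fails the check)
description: Generates personalized cold outreach emails.

# After: a trigger condition (passes the check)
description: Use when the user asks to draft, personalize, or review
  cold outreach emails for a prospect or prospect list.
```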
Quick wins to apply immediately:
- If no Gotchas section → add `## Gotchas\n- [Placeholder: add real failures here as they're discovered]`
- If description is a summary → rewrite as trigger condition
- If all content is inline → propose a `references/` folder structure (sketched below)
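A hypothetical layout for that structure, using the directories named in the checks above (contents invented for illustration):

```
skills/{skill-name}/
├── SKILL.md                  # core instructions only
├── config.json               # tunable settings
├── references/
│   └── checklist-examples.md # bulky reference material
├── scripts/                  # helper scripts
└── assets/                   # templates, static files
```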
## Phase 2: Output Quality Loop (autoresearch)
After structure audit, run the iterative improvement loop on the skill's actual outputs.
### Setup
- Which skill? — User specifies, or infer from context.
- Test inputs — Get 2-3 representative inputs. If the user doesn't provide them:
  - Check the skill's own docs for example usage
  - Use recent real invocations from memory/session history
  - For extraction skills: use known-good URLs/files. For generation skills: use the skill's own example prompts.
- Scoring checklist — Build 3-6 scoring items. Start from the examples below, then customize:
  - What's the #1 thing that makes this skill's output bad? (That's checklist item 1)
  - What would make a user say "that's exactly what I wanted"? (That's the positive framing)
  - Add 1-2 items from the "Universal structural quality" list below
### Scoring Checklist Examples
See `references/checklist-examples.md` for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).
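For flavor, a hypothetical 4-item binary checklist for a cold outreach skill (illustrative only; the canonical starters live in that file):

- Does it open with the prospect's business name?
- Does it reference something specific to the prospect rather than boilerplate?
- Is it free of generic sales speak?
- Does it end with a single clear ask?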
### Scoring Modes
Binary mode (default for simple skills): Yes/no per checklist item. Pass rate = total yes / (items × runs).
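Worked example (hypothetical numbers): a 5-item checklist scored over 2 runs gives 10 checks; 4 yes answers is a pass rate of 4/10 = 40%.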
Dimensional mode (use for complex skills or when binary plateaus): Score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do NOT rewrite everything.
Use dimensional mode when:
- Binary scoring hits 100% but output still feels mediocre
- The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
- You want to improve from "good" to "excellent" rather than from "broken" to "working"
### The Loop
Round N:
1. Run the skill against each test input
2. Score each output (binary: 1 per yes | dimensional: 0-10 per dimension)
3. Calculate the score:
   - Binary: pass rate = (total yes) / (items × runs)
   - Dimensional: average score per dimension across runs
4. Identify the weakest item/dimension (most failures or lowest average score)
5. Make ONE targeted change to SKILL.md addressing ONLY that weakness
6. Re-run and re-score
7. If new score > old score: KEEP. Else: REVERT.
8. Log: score before/after, change made, dimension targeted, kept/reverted
Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
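For concreteness, a minimal Python sketch of the binary-mode loop. Every callable passed in (run_skill, score_item, propose_fix, apply_change, revert_change) is a hypothetical stand-in for a manual step described above, not a real API; dimensional mode is analogous, with per-dimension averages instead of a pass rate.

```python
def pass_rate(yes_total: int, n_items: int, n_runs: int) -> float:
    return yes_total / (n_items * n_runs)  # total yes / (items x runs)

def score_all(run_skill, score_item, test_inputs, checklist):
    """Run every test input once; return (pass rate, per-item failure counts)."""
    fails = {item: 0 for item in checklist}
    yes_total = 0
    for x in test_inputs:
        output = run_skill(x)
        for item in checklist:
            if score_item(output, item):   # yes/no per checklist item
                yes_total += 1
            else:
                fails[item] += 1
    return pass_rate(yes_total, len(checklist), len(test_inputs)), fails

def optimize(run_skill, score_item, propose_fix, apply_change, revert_change,
             test_inputs, checklist, max_rounds=20, target=0.95):
    score, fails = score_all(run_skill, score_item, test_inputs, checklist)
    streak = 0
    for _ in range(max_rounds):
        weakest = max(fails, key=fails.get)   # item with the most failures
        change = propose_fix(weakest)         # ONE targeted change to SKILL.md
        apply_change(change)
        new_score, new_fails = score_all(run_skill, score_item,
                                         test_inputs, checklist)
        if new_score > score:                 # improved: KEEP
            score, fails = new_score, new_fails
        else:                                 # no gain: REVERT
            revert_change(change)
        streak = streak + 1 if score >= target else 0
        if streak >= 3:                       # >= 95% for 3 consecutive rounds
            return score
    return score
```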
### Output Files
- `skills/{skill-name}/SKILL-optimized.md` — improved version (original untouched)
- `skills/{skill-name}/optimization-changelog.md` — full round log
### Changelog Format
    ## Structural Audit
    - Gotchas section: ❌ → Added placeholder
    - Description: ❌ → Rewritten as trigger condition
    - Progressive disclosure: ⚠️ → Noted, deferred

    ## Round 1 (binary mode)
    - Score: 4/10 (40%)
    - Weakest item: "Does it mention business name?"
    - Change: Added rule "Always open with [Business Name],"
    - New score: 7/10 (70%)
    - Decision: KEPT

    ## Round 2 (dimensional mode)
    - Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
    - Weakest dimension: Tone (5/10)
    - Change: Added "Match prospect's industry language, not generic sales speak"
    - New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
    - Decision: KEPT (Tone +2)
## Optimizing Meta-Skills (Process Skills)
Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:
What to score: Score the experience of following the process, not a text artifact.
- Did the process produce a clear result?
- Were there moments of confusion where the instructions were ambiguous?
- Did any step feel unnecessary or redundant?
- Could someone follow this without prior context?
How to test: Run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs ARE the tasks you're applying the skill to.
Dimensional scoring for process skills:
- Clarity — Can I follow each step without re-reading?
- Completeness — Does the process cover the full workflow?
- Actionability — Do I know exactly what to do at each step, or do I have to infer?
- Efficiency — Are there wasted/redundant steps?
- Self-applicability — Can this process improve itself? (Meta-test)
## Checklist Sweet Spot
- 3-6 questions = optimal
- Too few: not granular enough to guide changes
- Too many: skill starts gaming the checklist (like a student memorizing answers without understanding)
## When to Use
- Before running any skill at scale (cold outreach, content generation, scraping)
- After a new model upgrade — re-validate existing skills
- When a skill has inconsistent output quality
- Monthly maintenance pass on high-use skills
- Immediately after creating a new skill (structural audit only takes 5 min)
## When to Run Which Phase
- Any new skill → Structure audit (5 min, catches issues early)
- Before scale use → Output loop (validate quality before mass runs)
- After model upgrade → Output loop (re-validate existing skills)
- Inconsistent output → Output loop (find the failing item/dimension)
- High-revenue skills → Both phases (cold outreach, content gen — quality variance = revenue impact)
## Gotchas
- Output loop requires skills that produce scoreable text outputs — scripts/tools that produce side effects need a different verification approach (use a Product Verification skill type instead)
- Don't run output loop on skills that call expensive APIs without rate limit awareness — each round runs the skill multiple times
- Phase 1 (structure audit) should always run before Phase 2 — fixing structure first makes the output loop more effective
- 3-6 checklist questions is the sweet spot — more than 6 and the skill starts gaming individual checks rather than improving overall quality