From Claude Code to Kiro CLI: Porting a Multi-Agent Skill Builder

How porting Claude Code’s skill-creator to Kiro CLI exposed fundamental differences between ephemeral and named agent architectures — and what we built to bridge the gap.

The Port That Wasn’t a Port

I thought porting Claude Code’s skill-creator to Kiro CLI would take an afternoon. Read the prompts, translate the Agent() calls to subagent configs, done.

It took a week. What looked like a translation exercise turned into an architecture redesign. The reason: Claude Code and Kiro CLI have fundamentally different models for what an “agent” is, and those differences cascade through every design decision — from how you run tests, to how you isolate a control group, to how the orchestrator decides who does what.

This post walks through the full port: what the original does, why Kiro CLI can’t replicate it directly, and the architecture we built instead. Every config snippet is from the working implementation in this blog’s repository.

The original skill-creator lives in Anthropic’s public skills repo, and the methodology is documented in The Complete Guide to Building Skills for Claude. Our port preserves the evaluation methodology while rebuilding the orchestration layer for Kiro CLI’s named-agent model.

TL;DR — What you’ll learn:

  1. How Claude Code’s skill-creator uses ephemeral Agent() calls for evaluation
  2. Why Kiro CLI’s named-agent model requires a different orchestration strategy
  3. The 8-agent architecture we built (meta-builder, skill-executor, baseline, grader, comparator, analyzer, llm-worker, skill-trigger-tester)
  4. Key design decisions: subagents for judgment, shell for execution, TODO templates for determinism
  5. The idempotency problem: why neither the built-in default agent (as the baseline) nor omitting --agent (for with-skill runs) gives reproducible evals
  6. Real benchmark results from the first evaluation run

How Skill-Creator Works in Claude Code

Claude Code’s skill-creator is elegant in its simplicity. One long-running agent does everything — interviewing the user, drafting skills, running evaluations, grading results, and iterating. When it needs parallel work, it spawns ephemeral workers via the Agent() tool.

graph TD
    A[Claude Code Session] --> B[skill-creator SKILL.md loaded in context]
    B --> C{Parent agent orchestrates everything}
    C --> D["Agent(prompt='Execute with skill...')"]
    C --> E["Agent(prompt='Execute WITHOUT skill...')"]
    C --> F["Agent(prompt='Read grader.md, grade these...')"]
    C --> G["Bash('python run_eval.py ...')"]
    D --> H[Returns transcript text]
    E --> H
    F --> I[Returns grading JSON]
    G --> J[Returns benchmark data]

The key characteristics:

  • One agent does everything — no pre-configuration needed
  • Agent() creates disposable workers with any prompt the parent writes on the fly
  • Workers inherit parent’s tools — full filesystem, bash, everything
  • The agents/ folder contains .md files that are read at runtime and passed as prompts to ephemeral Agent() calls
  • Scripts use claude -p for headless LLM calls (description optimization, trigger testing)

The scripts handle the mechanical parts:

| Script | Purpose | Uses LLM? |
|---|---|---|
| run_eval.py | Tests if a skill’s description triggers correctly | Yes (claude -p subprocess) |
| run_loop.py | Orchestrates eval → improve → re-eval cycles | Yes (via run_eval + improve_description) |
| improve_description.py | Rewrites description based on failures | Yes (claude -p subprocess) |
| aggregate_benchmark.py | Computes pass rates, mean/stddev, deltas | No |
| generate_report.py | Generates HTML report for description optimization results | No |

Here’s how the scripts chain together in the evaluation loop:

graph LR
    subgraph "Description Optimization (run_loop.py)"
        RE[run_eval.py] --> |"pass/fail rates"| ID[improve_description.py]
        ID --> |"rewritten description"| RE
    end
    subgraph "Benchmark Pipeline"
        RE --> |"raw results"| AB[aggregate_benchmark.py]
        AB --> |"benchmark.json"| GR[generate_report.py]
        GR --> |"report.html"| USER[User reviews]
    end
    style RE fill:#f9f,stroke:#333
    style ID fill:#f9f,stroke:#333
    style AB fill:#bbf,stroke:#333
    style GR fill:#bbf,stroke:#333

Pink = uses LLM (claude -p in the original, kiro-cli chat --no-interactive --agent llm-worker in our port), Blue = pure Python (no LLM calls). Our port also adds validate_workspace.py (pure Python) which validates workspace structure before aggregation — catching format errors that would otherwise cascade.
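To make the pure-Python half of the pipeline concrete, here is a minimal sketch of the kind of arithmetic the aggregation step performs (pass rates, mean, standard deviation, delta). The function name `summarize`, the input shape (one pass rate per run), and the sample numbers are assumptions for illustration; they are not the actual interface or data of aggregate_benchmark.py.

```python
from statistics import mean, stdev


def summarize(with_skill: list[float], without_skill: list[float]) -> dict:
    """Summarize per-run pass rates (fractions in 0.0-1.0) for both variants."""

    def stats(rates: list[float]) -> dict:
        return {
            "mean": mean(rates),
            # stdev needs at least two samples; report 0.0 for a single run
            "stddev": stdev(rates) if len(rates) > 1 else 0.0,
        }

    ws, wo = stats(with_skill), stats(without_skill)
    return {
        "with_skill": ws,
        "without_skill": wo,
        # Difference of mean pass rates: with_skill minus without_skill
        "delta": ws["mean"] - wo["mean"],
    }


# Illustrative pass rates only, not results from the post
result = summarize([1.0, 1.0, 1.0], [0.25, 0.0, 0.5])
print(result["delta"])
```

Because the delta is a difference of two rates, it is naturally read in percentage points rather than as a relative percentage.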

And the subagent prompts (read by the parent, passed to Agent()):

| File | Role |
|---|---|
| agents/grader.md | Grade assertions pass/fail with evidence |
| agents/comparator.md | Blind A/B comparison of two outputs |
| agents/analyzer.md | Surface patterns that aggregate stats hide |

This works beautifully in Claude Code because Agent() is infinitely flexible — you write any prompt, get any behavior, no registration required.

Why Kiro CLI Is Different

Kiro CLI’s subagent tool looks similar on the surface but operates on fundamentally different principles:

| Dimension | Claude Code Agent() | Kiro CLI subagent |
|---|---|---|
| Identity | Ephemeral, no name | Named, pre-configured JSON |
| Prompt | Written on the fly by parent | Fixed in config (or file:// reference) |
| Tools | Inherits parent’s tools | Own tools, own MCP servers |
| Routing | Parent decides in-context | Description-driven (parent reads descriptions) |
| Lifecycle | Dies after returning | Persistent config, spawned per-call |
| Parallelism | Unlimited | Parallel (DAG-based task graphs) |

The routing problem

In Claude Code, the parent constructs the perfect prompt for each worker. In Kiro CLI, the parent must choose from a fixed menu of named agents based on their description fields. If descriptions are vague, routing fails silently — the parent picks the wrong agent or skips delegation entirely.

This makes the description field the most critical piece of the architecture. It’s not documentation — it’s the routing table.

What can’t be a subagent

Test execution — running the skill under test — can’t use the subagent tool because:

  1. The skill-executor agent needs its own tools, skills, and MCP servers
  2. The baseline agent must not see any workspace skills (to be a valid control)
  3. Subagents inherit the parent’s MCP paths but load their own agent config — you can’t get a truly clean environment
  4. Both must be idempotent — the same config every time, regardless of user settings

The solution: run both sides as shell subprocesses. The with-skill side is kiro-cli chat --no-interactive --agent skill-executor --trust-all-tools, and the control is the same command with --agent baseline. Separate process, separate config, clean isolation, reproducible results.
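The subprocess approach can be sketched in a few lines of Python. The flags are the ones quoted in this post. The helper names `build_eval_command` and `run_variant` and the 300-second default timeout are assumptions for illustration, not part of the actual implementation.

```python
import subprocess


def build_eval_command(agent: str, query: str) -> list[str]:
    """Build the headless kiro-cli invocation for one side of the A/B pair."""
    return [
        "kiro-cli", "chat",
        "--no-interactive",
        "--agent", agent,
        "--trust-all-tools",
        query,
    ]


def run_variant(agent: str, query: str, cwd: str, timeout: int = 300) -> str:
    """Run one variant in its own process and return stdout plus stderr."""
    result = subprocess.run(
        build_eval_command(agent, query),
        capture_output=True,
        text=True,
        cwd=cwd,
        timeout=timeout,  # hypothetical default; tune for your evals
    )
    return result.stdout + result.stderr
```

Calling `run_variant("skill-executor", query, cwd)` and `run_variant("baseline", query, cwd)` with the same query gives the two transcripts to grade. Each call is a fresh process, so neither run sees the other's state.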

The Kiro CLI Architecture

Here’s what we built — eight agents, each with a specific role (seven for evaluation orchestration, plus a skill-trigger-tester used by the description optimization scripts):

graph TB
    subgraph "meta-builder (orchestrator)"
        MB[meta-builder agent]
        TODO[TODO List - deterministic execution plan]
    end
    subgraph "In-process subagents (judgment)"
        SG[skill-grader]
        SC[skill-comparator]
        SA[skill-analyzer]
    end
    subgraph "Shell subprocesses (execution, idempotent agents)"
        SE["kiro-cli chat --agent skill-executor (with skill)"]
        BA["kiro-cli chat --agent baseline (control)"]
        PY["python scripts (validate, aggregate, report)"]
    end
    MB --> |"subagent call"| SG
    MB --> |"subagent call"| SC
    MB --> |"subagent call"| SA
    MB --> |"shell command"| SE
    MB --> |"shell command"| BA
    MB --> |"shell command"| PY
    SG --> |"writes"| TMP["/tmp/workspace/grading.json"]
    SC --> |"writes"| TMP
    SA --> |"writes"| TMP
    SE --> |"writes"| TMP
    BA --> |"writes"| TMP

The orchestrator: meta-builder.json

The full evaluation flow — showing exactly which component handles each step:

sequenceDiagram
    participant U as User
    participant MB as meta-builder
    participant SH as Shell (kiro-cli)
    participant PY as Python scripts
    participant SG as skill-grader
    participant SC as skill-comparator
    participant SA as skill-analyzer
    U->>MB: "Evaluate my kiro-docs skill"
    MB->>MB: Read skill, draft test prompts
    MB->>U: Confirm test prompts?
    U->>MB: Approved
    par Test Execution (shell subprocesses, idempotent agents)
        MB->>SH: kiro-cli chat --agent skill-executor "test query"
        MB->>SH: kiro-cli chat --agent baseline "test query"
    end
    MB->>SG: Grade with_skill transcript
    SG-->>MB: grading.json (pass/fail + evidence)
    MB->>SG: Grade without_skill transcript
    SG-->>MB: grading.json
    MB->>SC: Blind A/B compare (doesn't know which is which)
    SC-->>MB: comparison.json (winner + reasoning)
    MB->>PY: python -m scripts.validate_workspace
    PY-->>MB: ✓ valid (or fix errors)
    MB->>PY: python -m scripts.aggregate_benchmark
    PY-->>MB: benchmark.json (pass rates, deltas)
    MB->>SA: Analyze patterns in benchmark
    SA-->>MB: analyst_notes.json
    MB->>PY: python generate_review.py --static
    PY-->>MB: report.html
    MB->>U: Review results (link to report)
{
  "name": "meta-builder",
  "description": "Kiro CLI skill and agent builder. Invoke when you need to create new skills, scaffold agent configurations, or design custom agents with proper security and tooling. Guides through structured interviews before generating artifacts.",
  "prompt": "file://../prompts/meta-builder.md",
  "tools": ["read", "write", "shell", "subagent", "todo_list"],
  "allowedTools": ["read", "write", "shell", "todo_list", "subagent"],
  "resources": [
    "skill://.kiro/skills/agent-creator/SKILL.md",
    "skill://.kiro/skills/skill-creator/SKILL.md"
  ],
  "hooks": {
    "agentSpawn": [
      { "command": "ls .kiro/skills/ 2>/dev/null | head -10 || echo 'No skills found'" },
      { "command": "ls .kiro/agents/ 2>/dev/null | head -10 || echo 'No agents found'" }
    ]
  },
  "toolsSettings": {
    "fs_write": {
      "allowedPaths": [".kiro/skills/**", ".kiro/agents/**", ".kiro/prompts/**", "/tmp/**", "/private/tmp/**"]
    },
    "execute_bash": {
      "allowedCommands": [
        "^kiro-cli agent validate.*$",
        "^kiro-cli agent list.*$",
        "^kiro-cli chat --no-interactive.*$",
        "^python -m scripts\\..*$",
        "^python .kiro/skills/skill-creator/.*$",
        "^mkdir -p /tmp/.*$",
        "^cp -r .* /tmp/.*$"
      ],
      "deniedCommands": ["^rm -rf.*$", "^git push.*$"],
      "autoAllowReadonly": true
    },
    "subagent": {
      "availableAgents": ["researcher", "skill-grader", "skill-comparator", "skill-analyzer"],
      "trustedAgents": ["researcher", "skill-grader", "skill-comparator", "skill-analyzer"]
    }
  },
  "welcomeMessage": "What skill or agent would you like to build? I'll guide you through the process."
}

Notice what’s happening: the meta-builder can shell out to kiro-cli chat --no-interactive (for test execution) and delegate to subagents (for judgment). It can’t rm -rf or git push. The trustedAgents list means it won’t prompt for permission before delegating — these are pre-approved. The hooks.agentSpawn lists workspace skills and agents on every invocation for context. And fs_write.allowedPaths scopes write access to .kiro/ config directories and /tmp/ — nothing else.

The execution pair: skill-executor.json and baseline.json

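These two agents form the A/B pair. They have the same tools and the same steering files; the intended difference is skill discovery. (As shown, baseline also loads AGENTS.md and README.md as file resources, which skill-executor does not.)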

{
  "name": "skill-executor",
  "description": "Full-capability agent with all workspace skills loaded. Used as the 'with_skill' executor in skill evaluations. Mirrors kiro_default behavior — same tools, same prompt style, same skill discovery — but is deterministic regardless of user's default agent settings.",
  "prompt": "You are the Kiro CLI agent, bringing the power of AI-assisted development directly to the user's terminal. You help with coding tasks, system operations, AWS management, and development workflows. When a skill's description matches the user's request, read and follow that skill's instructions.",
  "tools": ["*"],
  "allowedTools": ["*"],
  "resources": [
    "skill://.kiro/skills/*/SKILL.md",
    "skill://~/.kiro/skills/*/SKILL.md",
    "file://.kiro/steering/**/*.md"
  ]
}
{
  "name": "baseline",
  "description": "Clean agent with no workspace skills loaded. Used as a control in skill evaluation baselines — mirrors kiro_default (same tools, same steering, same prompt style) but without any skill:// resources. This isolates the skill's contribution from tool capability.",
  "prompt": "You are the default Kiro CLI agent, bringing the power of AI-assisted development directly to the user's terminal. You help with coding tasks, system operations, AWS management, and development workflows.",
  "tools": ["*"],
  "allowedTools": ["*"],
  "resources": [
    "file://AGENTS.md",
    "file://README.md",
    "file://.kiro/steering/**/*.md"
  ]
}

Both are pinned configs — they behave the same regardless of what the user has set as their default agent. This makes evaluations idempotent: run them on any machine, get reproducible A/B comparisons.
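Since the whole comparison rests on the two configs differing only where intended, it can help to diff them mechanically. The following is a minimal sketch: the helper name `resource_diff` is hypothetical, and the two dictionaries are trimmed copies of the configs above (only `tools` and `resources`).

```python
def resource_diff(executor: dict, baseline: dict) -> dict:
    """Report how two agent configs differ in tools and resources."""
    ex, ba = set(executor["resources"]), set(baseline["resources"])
    return {
        "same_tools": executor["tools"] == baseline["tools"],
        "only_in_executor": sorted(ex - ba),
        "only_in_baseline": sorted(ba - ex),
        # A valid control must not load any skill:// resource
        "baseline_has_skills": any(r.startswith("skill://") for r in ba),
    }


executor = {
    "tools": ["*"],
    "resources": [
        "skill://.kiro/skills/*/SKILL.md",
        "skill://~/.kiro/skills/*/SKILL.md",
        "file://.kiro/steering/**/*.md",
    ],
}
baseline = {
    "tools": ["*"],
    "resources": [
        "file://AGENTS.md",
        "file://README.md",
        "file://.kiro/steering/**/*.md",
    ],
}

diff = resource_diff(executor, baseline)
print(diff)
```

Run against these two configs, the diff lists the two skill:// entries on the executor side and the two file:// entries that only the baseline loads, which makes every difference between the pair explicit.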

The judgment agents

All three share the same pattern — minimal tools, write scoped to /tmp/, prompts loaded from the skill-creator’s agents/ directory:

{
  "name": "skill-grader",
  "description": "Skill evaluation grader. Invoke to evaluate assertions/expectations against execution transcripts and output files. Grades each assertion as pass/fail with evidence.",
  "prompt": "file://../skills/skill-creator/agents/grader.md",
  "tools": ["read", "write"],
  "allowedTools": ["read", "write"],
  "toolsSettings": {
    "fs_write": {
      "allowedPaths": ["/tmp/**", "/private/tmp/**"]
    }
  }
}

The comparator adds isolation — it cannot read metadata that would reveal which output came from which variant:

{
  "name": "skill-comparator",
  "description": "Blind output comparator. Invoke to compare two outputs without knowing which skill produced them. Judges quality and task completion without bias.",
  "prompt": "file://../skills/skill-creator/agents/comparator.md",
  "tools": ["read", "write"],
  "allowedTools": ["read", "write"],
  "toolsSettings": {
    "fs_write": {
      "allowedPaths": ["/tmp/**", "/private/tmp/**"]
    },
    "fs_read": {
      "deniedPaths": ["**/eval_metadata.json", "**/skill-snapshot/**"]
    }
  }
}

The headless worker: llm-worker.json

{
  "name": "llm-worker",
  "description": "Minimal LLM agent for single-turn text generation. No tools, no skills, no MCP. Used by scripts that need raw model output without side effects.",
  "prompt": "You are a text generation assistant. Respond only with what is asked. Do not use tools.",
  "tools": [],
  "allowedTools": [],
  "resources": []
}

This replaces claude -p in the original scripts. Zero tools, zero skills — pure text generation for description optimization loops.

Key Design Decisions

Decision 1: Subagents for judgment, shell for execution

Subagents run in-process: they are fast and return structured data directly to the parent’s context. But test execution needs a completely separate agent config with different skills loaded. A shell subprocess is the only way to get a clean execution environment.

The rule: if the task needs the target’s own config, use shell. If it needs the parent’s context, use subagent.

Decision 2: The idempotency problem

This one almost shipped broken — twice.

First realization: the baseline. The default Kiro agent auto-discovers ALL workspace skills via skill://.kiro/skills/*/SKILL.md. You can’t use it as a “without skill” baseline because it sees the skill you’re testing. We needed baseline.json: "tools": ["*"] with zero skill:// resources.

Second realization: the with_skill side. We initially ran with_skill tests by omitting --agent (using whatever default the user has). But users can configure a custom default agent in kiro-cli settings — different prompt, different tools, maybe fewer skills. That makes evals non-reproducible across machines.

The fix: both sides are explicit named agents:

  • skill-executor.json — all tools + skill://.kiro/skills/*/SKILL.md + global skills. Always discovers and loads workspace skills, regardless of user settings.
  • baseline.json — all tools + zero skill:// resources. Never sees skills.
{
  "name": "skill-executor",
  "description": "Full-capability agent with all workspace skills loaded. Used as the 'with_skill' executor in skill evaluations. Mirrors kiro_default behavior — same tools, same prompt style, same skill discovery — but is deterministic regardless of user's default agent settings.",
  "prompt": "You are the Kiro CLI agent...",
  "tools": ["*"],
  "allowedTools": ["*"],
  "resources": [
    "skill://.kiro/skills/*/SKILL.md",
    "skill://~/.kiro/skills/*/SKILL.md",
    "file://.kiro/steering/**/*.md"
  ]
}

Now evaluations are idempotent: same agents, same configs, and comparable results, no matter who runs them or what their personal defaults are.

Decision 3: TODO templates beat implicit planning

Claude Code’s parent agent figures out what to do from the skill instructions — it reads the methodology and improvises an execution plan. This works because Agent() is infinitely flexible.

Kiro’s meta-builder needs explicit step-by-step TODO templates with executor annotations. Each step specifies WHO executes it:

1.  [self]              Read skill, understand purpose
2.  [shell]             Snapshot skill to /tmp/
3.  [self]              Draft test prompts
4.  [user]              Confirm
5.  [shell]             kiro-cli chat --no-interactive --agent skill-executor --trust-all-tools
6.  [shell]             kiro-cli chat --no-interactive --agent baseline --trust-all-tools
7.  [subagent:grader]   Grade outputs → grading.json
8.  [subagent:comparator] Blind A/B → comparison.json
9.  [shell]             python -m scripts.validate_workspace (catch errors early)
10. [shell]             python -m scripts.aggregate_benchmark
11. [subagent:analyzer] Surface patterns → analyst_notes.json
12. [shell]             python generate_review.py --static
13. [user]              Review, provide feedback
14. [goto:5]            Iterate
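The annotated template above is regular enough to treat as data. As a minimal sketch, the following parses one step into its executor kind and target; the regex and the `parse_step` helper are assumptions for illustration and are not how Kiro's todo_list tool works internally.

```python
import re

# "<number>. [<kind>] <action>" or "<number>. [<kind>:<target>] <action>"
STEP = re.compile(r"^\s*(\d+)\.\s+\[([^\]]+)\]\s+(.*)$")


def parse_step(line: str) -> dict:
    """Split one annotated TODO line into step number, executor, and action."""
    match = STEP.match(line)
    if match is None:
        raise ValueError(f"not an annotated step: {line!r}")
    number, executor, action = match.groups()
    # "subagent:grader" -> kind "subagent", target "grader"
    kind, _, target = executor.partition(":")
    return {
        "step": int(number),
        "kind": kind,
        "target": target or None,
        "action": action.strip(),
    }


print(parse_step("7.  [subagent:grader]   Grade outputs → grading.json"))
print(parse_step("14. [goto:5]            Iterate"))
```

Treating the annotation as a structured field is what makes routing checkable: a step tagged `shell` can be verified never to reach a subagent.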

Without these annotations, the meta-builder would misroute — we saw test execution go to the researcher subagent because it was the only one with broad tools. The TODO template makes routing deterministic.

Decision 4: Blind comparator isolation

The skill-comparator does blind A/B comparison — it must not know which output came from which variant. We enforce this with deniedPaths:

"fs_read": {
  "deniedPaths": ["**/eval_metadata.json", "**/skill-snapshot/**"]
}

The metadata file maps “Output A” → “with_skill” and “Output B” → “without_skill”. The comparator physically cannot read it. This is stronger than a prompt instruction — it’s a tool-level constraint.
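To see which workspace files those two patterns would cover, here is a rough approximation using Python's `fnmatch`. This is an assumption-laden sketch: Kiro's real glob matcher may treat `**` differently, and the `is_denied` helper and example paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the skill-comparator config
DENIED = ["**/eval_metadata.json", "**/skill-snapshot/**"]


def is_denied(path: str, patterns: list[str] = DENIED) -> bool:
    """Approximate a deniedPaths check; fnmatch's '*' also matches '/'."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


workspace = "/tmp/youtube-trending-workspace"
print(is_denied(f"{workspace}/iteration-1/trending-ai/eval_metadata.json"))
print(is_denied(f"{workspace}/skill-snapshot/SKILL.md"))
print(is_denied(f"{workspace}/iteration-1/trending-ai/with_skill/transcript.md"))
```

Under this approximation the metadata file and the snapshot are blocked, while the transcripts the comparator needs stay readable.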

Decision 5: Write access scoped to /tmp/

All evaluation work happens in /tmp/<skill-name>-workspace/. Subagents can write results there but cannot touch the actual skill files in .kiro/. The meta-builder writes to .kiro/ only with user confirmation.

/tmp/<skill-name>-workspace/
├── skill-snapshot/          (copy of skill before changes)
├── iteration-1/
│   ├── <eval-name>/         (descriptive: "trending-ai", "pdf-extraction")
│   │   ├── eval_metadata.json   (prompt + assertions)
│   │   ├── with_skill/
│   │   │   ├── transcript.md
│   │   │   ├── outputs/
│   │   │   └── grading.json
│   │   ├── without_skill/
│   │   │   ├── transcript.md
│   │   │   ├── outputs/
│   │   │   └── grading.json
│   │   └── comparison.json  (blind A/B from comparator)
│   ├── benchmark.json       (generated by aggregate_benchmark.py)
│   ├── analyst_notes.json   (generated by skill-analyzer)
│   └── report.html          (generated by generate_review.py)
└── iteration-2/
    └── ...
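A layout like this is easy to check mechanically before aggregation. The sketch below lists expected files that are missing from one iteration directory. It is a simplified stand-in written for this explanation, not the actual validate_workspace.py, and the `missing_files` helper and its file lists are assumptions derived from the tree above.

```python
from pathlib import Path

EVAL_FILES = ["eval_metadata.json", "comparison.json"]
VARIANTS = ["with_skill", "without_skill"]
VARIANT_FILES = ["transcript.md", "grading.json"]


def missing_files(iteration_dir: Path) -> list[str]:
    """Return expected files absent from an iteration dir, as relative paths."""
    missing = []
    for eval_dir in sorted(p for p in iteration_dir.iterdir() if p.is_dir()):
        expected = [eval_dir / name for name in EVAL_FILES]
        expected += [
            eval_dir / variant / name
            for variant in VARIANTS
            for name in VARIANT_FILES
        ]
        for path in expected:
            if not path.is_file():
                missing.append(path.relative_to(iteration_dir).as_posix())
    return missing
```

An empty list means the structure is complete. Anything else names the exact file to regenerate, which is more actionable than a failure later in aggregation.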

Decision 6: Scripts ported from claude -p to kiro-cli chat --no-interactive

The original run_eval.py uses claude -p for headless LLM calls. Our port uses kiro-cli chat --no-interactive --trust-all-tools. The trigger detection creates a temporary skill in .kiro/skills/, runs a query, and checks if the output contains a marker string — proving the skill was auto-discovered and loaded:

import subprocess
import uuid
from pathlib import Path


def run_single_query(query, skill_name, skill_description,
                     timeout, project_root, model=None, agent=None):
    """Return True if the temporary skill's marker string appears in the output."""
    # A unique suffix keeps repeated runs from colliding on skill name or marker
    unique_id = uuid.uuid4().hex[:8]
    clean_name = f"{skill_name}-eval-{unique_id}"
    skill_dir = Path(project_root) / ".kiro" / "skills" / clean_name
    skill_file = skill_dir / "SKILL.md"

    # Write a throwaway skill whose only instruction is to emit the marker
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_content = (
        f"---\n"
        f"name: {clean_name}\n"
        f"description: {skill_description}\n"
        f"---\n\n"
        f"If this skill was loaded, respond with exactly: SKILL_TRIGGERED_{unique_id}\n"
    )
    skill_file.write_text(skill_content)

    # Headless run; the query is passed as the final positional argument
    cmd = [
        "kiro-cli", "chat",
        "--no-interactive",
        "--trust-all-tools",
        "--agent", agent or "skill-trigger-tester",
    ]
    if model:
        cmd.extend(["--model", model])
    cmd.append(query)

    result = subprocess.run(cmd, capture_output=True, text=True,
                            cwd=project_root, timeout=timeout)
    # The marker may land on either stream, so search both
    output = result.stdout + result.stderr
    return f"SKILL_TRIGGERED_{unique_id}" in output
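A single boolean per query only becomes useful once it is compared with what should have happened. As a hedged sketch, the helper below scores a set of queries against expected trigger outcomes. The name `trigger_accuracy`, the `(query, expected)` case format, and the stub detector are assumptions for illustration; the real run_eval.py may score differently.

```python
from typing import Callable


def trigger_accuracy(
    cases: list[tuple[str, bool]],
    triggered: Callable[[str], bool],
) -> float:
    """Fraction of queries whose observed trigger matches the expectation."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(triggered(query) == expected for query, expected in cases)
    return hits / len(cases)


# Stub detector standing in for a real headless run
def fake_detector(query: str) -> bool:
    return "youtube" in query.lower()


cases = [
    ("Find trending YouTube videos about AI", True),
    ("Summarize this PDF", False),
    ("What's popular on video sites today?", True),
]
print(trigger_accuracy(cases, fake_detector))
```

In a real run, the `triggered` callable would wrap a function like `run_single_query` with the skill name and description filled in. Including queries that should not trigger matters as much as those that should, since an over-broad description fails on the former.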

The Mapping Table

For anyone doing a similar port, here’s the complete translation:

| Claude Code | Kiro CLI | Why different |
|---|---|---|
| Agent(prompt="...") | subagent → named-agent | Kiro requires pre-registered agents with JSON config |
| Agent() for test execution | shell: kiro-cli chat --no-interactive --agent skill-executor | Target needs its own config/tools/skills; must be idempotent |
| agents/grader.md read at runtime | skill-grader.json with "prompt": "file://../skills/skill-creator/agents/grader.md" | .md is shared, JSON is the identity layer |
| TaskCreate/TaskUpdate | todo_list tool | Same concept, different implementation |
| claude -p in scripts | kiro-cli chat --no-interactive --agent llm-worker | Zero-tool agent for pure text generation |
| Implicit routing (parent decides) | Description-driven + TODO annotations | Kiro needs descriptions; we augment with deterministic TODO |
| No baseline needed (Agent() with custom prompt = no skill) | Dedicated baseline + skill-executor agents | Both sides must be pinned for idempotent evaluation |

Results

First full evaluation run on a youtube-trending skill, with 6 test runs:

  • 3 with-skill runs via --agent skill-executor (default-like agent with skill discovery)
  • 3 baseline runs via --agent baseline (same tools, zero skills)
  • Automated grading via skill-grader subagent
  • Benchmark: 100% pass rate with skill vs 13-20% without (a delta of 80 to 87 percentage points)
  • Full HTML report generated via generate_review.py --static

The pipeline produces a clear discriminating signal: with the skill, the agent runs the bundled Python script and produces structured data tables. Without it, the agent does web searches and returns general commentary — useful but not structured.

The first run exposed two pipeline bugs (documented in Lessons Learned below) that we fixed before getting clean end-to-end execution. The architecture works, but the orchestrator prompt required significant iteration to prevent shortcutting behavior.

Lessons Learned

1. Description is the routing table. In Kiro CLI, if your subagent’s description doesn’t clearly say when to invoke it, the parent will misroute. We learned this when test execution went to researcher instead of shell. The fix: descriptions that state trigger conditions, not just capabilities.

Bad: "AWS helper agent"
Good: "AWS fact-checker. Invoke when blog content makes claims about AWS service features, limits, pricing, or configurations that need verification against official docs."
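The difference between those two descriptions can be caught with a crude lint. The sketch below is only a heuristic written for this post's examples: the phrase list is drawn from the agent descriptions shown earlier ("Invoke when", "Invoke to", "Used as", "Used by"), and the `describes_trigger` helper and its word-count threshold are assumptions, not a Kiro feature.

```python
# Phrases that signal a trigger condition rather than a bare capability
TRIGGER_PHRASES = ("invoke when", "invoke to", "used as", "used by")


def describes_trigger(description: str, min_words: int = 8) -> bool:
    """Heuristic: long enough and says when the agent should be invoked."""
    text = description.lower()
    long_enough = len(text.split()) >= min_words
    has_trigger = any(phrase in text for phrase in TRIGGER_PHRASES)
    return long_enough and has_trigger


bad = "AWS helper agent"
good = (
    "AWS fact-checker. Invoke when blog content makes claims about AWS "
    "service features, limits, pricing, or configurations that need "
    "verification against official docs."
)
print(describes_trigger(bad), describes_trigger(good))
```

A check like this cannot judge whether a description is accurate, but it does flag the ones that give the parent nothing to route on.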

2. TODO templates beat implicit planning. Claude Code’s parent improvises from instructions. Kiro’s meta-builder needs explicit executor annotations ([shell], [subagent:name], [self], [user]) to stay deterministic. The TODO list isn’t just progress tracking — it’s the execution plan.

3. Idempotency requires pinning both sides. We almost shipped with the default agent as baseline before discovering it auto-discovers all workspace skills. Then we almost shipped with --agent omitted for with_skill runs before realizing users can configure custom defaults. The fix: both skill-executor and baseline are explicit named agents — same config every time, regardless of user settings. Reproducibility comes from pinning the entire evaluation environment, not just the control group.

4. Subagents can’t write without explicit permission. On the first run, the grader returned grading data in its summary but couldn’t write grading.json. Adding the write tool and /tmp/** to fs_write.allowedPaths fixed it. The default is deny-all.

5. The .md files are portable. The grader, comparator, and analyzer prompts from Claude Code’s skill-creator work unchanged in Kiro CLI. Only the identity layer (JSON config) needed to be created. The prompts are the reusable asset; the orchestration is what changes.

6. The orchestrator will shortcut if you let it. Our first real evaluation run failed because the meta-builder wrote an inline grading script instead of delegating to skill-grader. It “saved time” by skipping the subagent call, but produced grading.json in the wrong format, which broke aggregate_benchmark.py, which caused it to hand-write benchmark.json — cascading through the entire pipeline. The fix: hard rules in the prompt (“NEVER grade outputs yourself”) plus a validate_workspace.py script that catches structural errors before aggregation. The agent can’t shortcut past a failing validation check.

7. Shell permission parsing is sophisticated. Compound commands (cd X && kiro-cli chat ...) get split by tree-sitter and each sub-command checked independently. This matters for allowedCommands regex design — anchor with ^ and $.
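A simplified model shows why per-part checking changes how the regexes should be written. In the sketch below the patterns are a subset of the meta-builder config. The naive split on `&&` and `;` is a stand-in for real shell parsing (it would mis-split quoted operators), and the allow/deny logic is an assumption about the semantics, not Kiro's actual implementation.

```python
import re

# Subset of the patterns from meta-builder.json
ALLOWED = [
    r"^kiro-cli chat --no-interactive.*$",
    r"^python -m scripts\..*$",
    r"^mkdir -p /tmp/.*$",
]
DENIED = [r"^rm -rf.*$", r"^git push.*$"]


def split_compound(command: str) -> list[str]:
    """Naively split a compound command into its sub-commands."""
    return [part.strip() for part in re.split(r"&&|;", command) if part.strip()]


def is_permitted(command: str) -> bool:
    """Every sub-command must avoid all deny patterns and match an allow pattern."""
    for part in split_compound(command):
        if any(re.search(pattern, part) for pattern in DENIED):
            return False
        if not any(re.search(pattern, part) for pattern in ALLOWED):
            return False
    return True


print(is_permitted("mkdir -p /tmp/ws && kiro-cli chat --no-interactive --agent baseline 'q'"))
print(is_permitted("mkdir -p /tmp/ws && rm -rf /tmp/ws"))
```

Because each part is matched on its own, an anchored pattern cannot be satisfied by hiding a command behind an allowed prefix. In this model, a part that matches no allow pattern is rejected outright; the real tool's handling (for example with autoAllowReadonly) may differ.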

What’s Next

The architecture is working but there’s more to build:

  • Parallel eval runs — Currently sequential; the shell tool blocks. A wrapper script that spawns background processes and polls for completion would cut eval time significantly.
  • Cross-skill benchmarking — Run the same eval suite across multiple skills to find interaction effects.
  • Description optimization at scale — The run_loop.py script iterates on descriptions automatically. We’ve ported it but haven’t stress-tested at scale yet.
  • Prompt hardening — The orchestrator still occasionally attempts shortcuts despite hard rules. We’re exploring whether a pre-tool-use hook that rejects inline grading attempts would be more reliable than prompt instructions alone.
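For the parallel eval runs item, one possible shape for such a wrapper is sketched below. It uses a thread pool around `subprocess.run` rather than background processes with polling, which is a swapped-in technique and not something we have built. The `run_all` helper, worker count, and timeout are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(
    commands: list[list[str]],
    max_workers: int = 4,
    timeout: int = 600,
) -> list[str]:
    """Run commands concurrently and return each stdout in input order."""

    def run(cmd: list[str]) -> str:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout

    # Threads are enough here: each worker just waits on a child process
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

Passing the six kiro-cli invocations (three per agent) as `commands` would let them overlap instead of running back to back. `pool.map` preserves input order, so results can be paired with their test prompts without extra bookkeeping.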

The broader lesson: porting between AI agent frameworks isn’t about syntax translation. It’s about understanding the execution model — who decides what, when, and with what information. Claude Code’s model is “one smart agent with infinite flexibility.” Kiro CLI’s model is “many focused agents with explicit contracts.” Neither is wrong, but they demand different architectures for the same problem. And in the named-agent world, idempotency must be designed in — you can’t assume the default agent is what you think it is.

Try It: Building a Skill with Meta-Builder

If you want to use this architecture in your own project, here’s the quickstart:

Prerequisites

# Enable experimental features needed by the pipeline
kiro-cli settings chat.enableTodoList true
kiro-cli settings chat.enableDelegate true

Step 1: Copy the agents and skill into your project

# Clone or copy these into your .kiro/ directory:
.kiro/agents/meta-builder.json
.kiro/agents/skill-executor.json
.kiro/agents/baseline.json
.kiro/agents/skill-grader.json
.kiro/agents/skill-comparator.json
.kiro/agents/skill-analyzer.json
.kiro/agents/llm-worker.json
.kiro/agents/skill-trigger-tester.json
.kiro/prompts/meta-builder.md
.kiro/skills/skill-creator/          # The full skill-creator skill

Step 2: Start a session with meta-builder

kiro-cli chat --agent meta-builder

Step 3: Tell it what you want

> I want to build a skill that searches YouTube and returns a markdown table with video stats

The meta-builder will:

  1. Interview you (max 3 questions per turn)
  2. Draft the SKILL.md and test cases
  3. Ask for confirmation
  4. Run evaluations: --agent skill-executor (with skill) vs --agent baseline (without)
  5. Grade via skill-grader subagent
  6. Run blind comparison via skill-comparator
  7. Aggregate benchmarks and generate an HTML report
  8. Present results and iterate based on your feedback

What you’ll see

The pipeline produces a /tmp/<skill-name>-workspace/ directory with:

  • Transcripts from both variants
  • Grading JSON with pass/fail + evidence per assertion
  • A benchmark showing the delta (e.g., +80 percentage points in pass rate with skill)
  • A standalone HTML report for reviewing outputs qualitatively

Tips

  • Be specific in your intent — “a skill that does X, outputs Y format, triggered when user says Z” gives better results than vague requests
  • Review the test prompts — the meta-builder drafts them, but you know your users better
  • The first iteration is rarely final — expect 2-3 rounds of eval → feedback → improve
  • Check the HTML report — it shows side-by-side outputs so you can judge quality, not just pass rates
  • Environment variables — if your skill needs API keys, export them before starting the session so eval runs can access them


This post was co-authored by Mladen Trampic and Kiro, demonstrating the collaborative approach to technical content creation.

Authors

Mladen Trampic

Senior Technical Account Manager for AWS Enterprise Support EMEA, helping customers architect for reliability and cost efficiency.

Kiro

AWS's AI-powered coding assistant built in Rust, leveraging Anthropic models for intelligent code assistance and technical guidance.