From Claude Code to Kiro CLI: Porting a Multi-Agent Skill Builder
The Port That Wasn’t a Port #
I thought porting Claude Code’s skill-creator to Kiro CLI would take an afternoon. Read the prompts, translate the Agent() calls to subagent configs, done.
It took a week. What looked like a translation exercise turned into an architecture redesign. The reason: Claude Code and Kiro CLI have fundamentally different models for what an “agent” is, and those differences cascade through every design decision — from how you run tests, to how you isolate a control group, to how the orchestrator decides who does what.
This post walks through the full port: what the original does, why Kiro CLI can’t replicate it directly, and the architecture we built instead. Every config snippet is from the working implementation in this blog’s repository.
The original skill-creator lives in Anthropic’s public skills repo, and the methodology is documented in The Complete Guide to Building Skills for Claude. Our port preserves the evaluation methodology while rebuilding the orchestration layer for Kiro CLI’s named-agent model.
TL;DR — What you’ll learn:
- How Claude Code’s skill-creator uses ephemeral Agent() calls for evaluation
- Why Kiro CLI’s named-agent model requires a different orchestration strategy
- The 8-agent architecture we built (meta-builder, skill-executor, baseline, grader, comparator, analyzer, llm-worker, skill-trigger-tester)
- Key design decisions: subagents for judgment, shell for execution, TODO templates for determinism
- The idempotency problem: why neither the default agent nor omitting --agent works for reproducible evals
- Real benchmark results from the first evaluation run
How Skill-Creator Works in Claude Code #
Claude Code’s skill-creator is elegant in its simplicity. One long-running agent does everything — interviewing the user, drafting skills, running evaluations, grading results, and iterating. When it needs parallel work, it spawns ephemeral workers via the Agent() tool.
The key characteristics:
- One agent does everything, with no pre-configuration needed
- Agent() creates disposable workers with any prompt the parent writes on the fly
- Workers inherit the parent’s tools: full filesystem, bash, everything
- The agents/ folder contains .md files that are read at runtime and passed as prompts to ephemeral Agent() calls
- Scripts use claude -p for headless LLM calls (description optimization, trigger testing)
The scripts handle the mechanical parts:
| Script | Purpose | Uses LLM? |
|---|---|---|
| run_eval.py | Tests if a skill’s description triggers correctly | Yes (claude -p subprocess) |
| run_loop.py | Orchestrates eval → improve → re-eval cycles | Yes (via run_eval + improve_description) |
| improve_description.py | Rewrites description based on failures | Yes (claude -p subprocess) |
| aggregate_benchmark.py | Computes pass rates, mean/stddev, deltas | No |
| generate_report.py | Generates HTML report for description optimization results | No |
Here’s how the scripts chain together in the evaluation loop:
In the diagram, pink steps use an LLM (claude -p in the original, kiro-cli chat --no-interactive --agent llm-worker in our port) and blue steps are pure Python with no LLM calls. Our port also adds validate_workspace.py (pure Python), which validates the workspace structure before aggregation and catches format errors that would otherwise cascade.
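To make the aggregation step concrete, here is a minimal sketch of the kind of computation the table attributes to aggregate_benchmark.py (pass rates, mean/stddev, deltas). The input shape (one list of pass/fail booleans per run) and the function names are assumptions for illustration, not the script's actual schema, and the sample data is made up.

```python
from statistics import mean, pstdev

def pass_rate(assertion_results):
    """Fraction of assertions that passed in one run (0.0 for an empty run)."""
    if not assertion_results:
        return 0.0
    return sum(assertion_results) / len(assertion_results)

def summarize(runs):
    """Mean and population standard deviation of per-run pass rates."""
    rates = [pass_rate(run) for run in runs]
    return {"mean": mean(rates), "stddev": pstdev(rates)}

def benchmark(with_skill_runs, without_skill_runs):
    """Summarize both variants and report the difference in mean pass rate."""
    with_stats = summarize(with_skill_runs)
    without_stats = summarize(without_skill_runs)
    return {
        "with_skill": with_stats,
        "without_skill": without_stats,
        "delta": with_stats["mean"] - without_stats["mean"],
    }

# Illustrative data only: each inner list is one run's assertion outcomes.
with_skill = [[True, True, True, True], [True, True, True, True]]
without_skill = [[True, False, False, False], [False, False, False, False]]
result = benchmark(with_skill, without_skill)
print(result)
```

The delta here is a difference of pass rates, so it is best read in percentage points rather than as a relative change.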
And the subagent prompts (read by the parent, passed to Agent()):
| File | Role |
|---|---|
| agents/grader.md | Grade assertions pass/fail with evidence |
| agents/comparator.md | Blind A/B comparison of two outputs |
| agents/analyzer.md | Surface patterns that aggregate stats hide |
This works beautifully in Claude Code because Agent() is infinitely flexible — you write any prompt, get any behavior, no registration required.
Why Kiro CLI Is Different #
Kiro CLI’s subagent tool looks similar on the surface but operates on fundamentally different principles:
| Dimension | Claude Code Agent() | Kiro CLI subagent |
|---|---|---|
| Identity | Ephemeral, no name | Named, pre-configured JSON |
| Prompt | Written on the fly by parent | Fixed in config (or file:// reference) |
| Tools | Inherits parent’s tools | Own tools, own MCP servers |
| Routing | Parent decides in-context | Description-driven (parent reads descriptions) |
| Lifecycle | Dies after returning | Persistent config, spawned per-call |
| Parallelism | Unlimited | Parallel (DAG-based task graphs) |
The routing problem #
In Claude Code, the parent constructs the perfect prompt for each worker. In Kiro CLI, the parent must choose from a fixed menu of named agents based on their description fields. If descriptions are vague, routing fails silently — the parent picks the wrong agent or skips delegation entirely.
This makes the description field the most critical piece of the architecture. It’s not documentation — it’s the routing table.
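Since descriptions carry the routing, it can help to lint them before the parent ever reads them. The check below is a hypothetical helper, not part of Kiro CLI: the trigger phrases and the minimum word count are assumptions chosen for illustration.

```python
# Phrases that signal a description states WHEN to use the agent (assumed list).
TRIGGER_PHRASES = ("invoke when", "invoke to", "use when", "used as", "used by")
MIN_WORDS = 8  # arbitrary threshold for this sketch

def lint_description(description):
    """Return a list of reasons the description is a weak routing signal."""
    problems = []
    text = description.strip().lower()
    if len(text.split()) < MIN_WORDS:
        problems.append("too short to state when the agent applies")
    if not any(phrase in text for phrase in TRIGGER_PHRASES):
        problems.append("no trigger condition (for example 'Invoke when ...')")
    return problems

vague = "AWS helper agent"
specific = (
    "AWS fact-checker. Invoke when blog content makes claims about AWS "
    "service features, limits, pricing, or configurations that need "
    "verification against official docs."
)
print(lint_description(vague))     # two problems reported
print(lint_description(specific))  # no problems reported
```

A check like this cannot prove that routing will succeed, but it flags descriptions that name a capability without naming a trigger.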
What can’t be a subagent #
Test execution — running the skill under test — can’t use the subagent tool because:
- The skill-executor agent needs its own tools, skills, and MCP servers
- The baseline agent must not see any workspace skills (to be a valid control)
- Subagents inherit the parent’s MCP paths but load their own agent config, so you can’t get a truly clean environment
- Both must be idempotent: the same config every time, regardless of user settings
The solution: kiro-cli chat --no-interactive --agent skill-executor --trust-all-tools and --agent baseline as shell subprocesses. Separate process, separate config, clean isolation, reproducible results.
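As a rough sketch of this approach, the two variants can be launched as separate processes from Python. The flags mirror the command line above, and the prompt is passed as the final argument the same way run_eval.py does later in this post; the helper names and the default timeout are assumptions for illustration.

```python
import subprocess

def build_eval_command(agent, prompt):
    """Build the kiro-cli invocation for one pinned agent config."""
    return [
        "kiro-cli", "chat",
        "--no-interactive",
        "--agent", agent,
        "--trust-all-tools",
        prompt,
    ]

def run_variant(agent, prompt, cwd, timeout=300):
    """Run one variant in its own process and return its combined output."""
    result = subprocess.run(
        build_eval_command(agent, prompt),
        capture_output=True, text=True, cwd=cwd, timeout=timeout,
    )
    return result.stdout + result.stderr

def run_ab_pair(prompt, cwd):
    """Run the same prompt against both pinned agents (requires kiro-cli)."""
    return {
        "with_skill": run_variant("skill-executor", prompt, cwd),
        "without_skill": run_variant("baseline", prompt, cwd),
    }

# Building the command has no side effects, so it is safe to inspect.
print(build_eval_command("baseline", "Summarize trending videos"))
```

Keeping command construction separate from execution makes the exact invocation easy to log alongside each transcript.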
The Kiro CLI Architecture #
Here’s what we built — eight agents, each with a specific role (seven for evaluation orchestration, plus a skill-trigger-tester used by the description optimization scripts):
The orchestrator: meta-builder.json #
The orchestrator’s config shows which component handles each step of the evaluation flow:
{
"name": "meta-builder",
"description": "Kiro CLI skill and agent builder. Invoke when you need to create new skills, scaffold agent configurations, or design custom agents with proper security and tooling. Guides through structured interviews before generating artifacts.",
"prompt": "file://../prompts/meta-builder.md",
"tools": ["read", "write", "shell", "subagent", "todo_list"],
"allowedTools": ["read", "write", "shell", "todo_list", "subagent"],
"resources": [
"skill://.kiro/skills/agent-creator/SKILL.md",
"skill://.kiro/skills/skill-creator/SKILL.md"
],
"hooks": {
"agentSpawn": [
{ "command": "ls .kiro/skills/ 2>/dev/null | head -10 || echo 'No skills found'" },
{ "command": "ls .kiro/agents/ 2>/dev/null | head -10 || echo 'No agents found'" }
]
},
"toolsSettings": {
"fs_write": {
"allowedPaths": [".kiro/skills/**", ".kiro/agents/**", ".kiro/prompts/**", "/tmp/**", "/private/tmp/**"]
},
"execute_bash": {
"allowedCommands": [
"^kiro-cli agent validate.*$",
"^kiro-cli agent list.*$",
"^kiro-cli chat --no-interactive.*$",
"^python -m scripts\\..*$",
"^python .kiro/skills/skill-creator/.*$",
"^mkdir -p /tmp/.*$",
"^cp -r .* /tmp/.*$"
],
"deniedCommands": ["^rm -rf.*$", "^git push.*$"],
"autoAllowReadonly": true
},
"subagent": {
"availableAgents": ["researcher", "skill-grader", "skill-comparator", "skill-analyzer"],
"trustedAgents": ["researcher", "skill-grader", "skill-comparator", "skill-analyzer"]
}
},
"welcomeMessage": "What skill or agent would you like to build? I'll guide you through the process."
}
Notice what’s happening: the meta-builder can shell out to kiro-cli chat --no-interactive (for test execution) and delegate to subagents (for judgment). It can’t rm -rf or git push. The trustedAgents list means it won’t prompt for permission before delegating — these are pre-approved. The hooks.agentSpawn lists workspace skills and agents on every invocation for context. And fs_write.allowedPaths scopes write access to .kiro/ config directories and /tmp/ — nothing else.
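To see how the allow and deny patterns interact, here is a small sketch that evaluates a command against the same regexes. It assumes that a deny match wins over an allow match and that each pattern is tested against the whole command string; Kiro CLI's actual evaluation order may differ, and autoAllowReadonly is ignored here.

```python
import re

# Patterns copied from the meta-builder config above.
ALLOWED = [
    r"^kiro-cli agent validate.*$",
    r"^kiro-cli agent list.*$",
    r"^kiro-cli chat --no-interactive.*$",
    r"^python -m scripts\..*$",
    r"^python .kiro/skills/skill-creator/.*$",
    r"^mkdir -p /tmp/.*$",
    r"^cp -r .* /tmp/.*$",
]
DENIED = [r"^rm -rf.*$", r"^git push.*$"]

def is_command_allowed(command):
    """Assumed semantics: deny patterns win, then an allow pattern must match."""
    if any(re.match(pattern, command) for pattern in DENIED):
        return False
    return any(re.match(pattern, command) for pattern in ALLOWED)

for candidate in [
    "kiro-cli chat --no-interactive --agent baseline --trust-all-tools",
    "python -m scripts.aggregate_benchmark",
    "rm -rf /tmp/workspace",
    "git push origin main",
    "curl https://example.com",
]:
    print(candidate, "->", is_command_allowed(candidate))
```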
The execution pair: skill-executor.json and baseline.json #
These two agents form the A/B pair. They’re identical in capability — the only difference is skill discovery:
{
"name": "skill-executor",
"description": "Full-capability agent with all workspace skills loaded. Used as the 'with_skill' executor in skill evaluations. Mirrors kiro_default behavior — same tools, same prompt style, same skill discovery — but is deterministic regardless of user's default agent settings.",
"prompt": "You are the Kiro CLI agent, bringing the power of AI-assisted development directly to the user's terminal. You help with coding tasks, system operations, AWS management, and development workflows. When a skill's description matches the user's request, read and follow that skill's instructions.",
"tools": ["*"],
"allowedTools": ["*"],
"resources": [
"skill://.kiro/skills/*/SKILL.md",
"skill://~/.kiro/skills/*/SKILL.md",
"file://.kiro/steering/**/*.md"
]
}
{
"name": "baseline",
"description": "Clean agent with no workspace skills loaded. Used as a control in skill evaluation baselines — mirrors kiro_default (same tools, same steering, same prompt style) but without any skill:// resources. This isolates the skill's contribution from tool capability.",
"prompt": "You are the default Kiro CLI agent, bringing the power of AI-assisted development directly to the user's terminal. You help with coding tasks, system operations, AWS management, and development workflows.",
"tools": ["*"],
"allowedTools": ["*"],
"resources": [
"file://AGENTS.md",
"file://README.md",
"file://.kiro/steering/**/*.md"
]
}
Both are pinned configs — they behave the same regardless of what the user has set as their default agent. This makes evaluations idempotent: run them on any machine, get reproducible A/B comparisons.
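One way to guard this invariant is a small check that both configs expose the same tools and that only the executor carries skill:// resources. This is a hypothetical helper, not part of the repository; the dictionaries are trimmed copies of the two configs above.

```python
def skill_resources(config):
    """Return the skill:// entries from an agent config's resources."""
    return [r for r in config.get("resources", []) if r.startswith("skill://")]

def check_ab_pair(executor, baseline):
    """Return a list of violations of the A/B pairing rules."""
    problems = []
    if executor.get("tools") != baseline.get("tools"):
        problems.append("tools differ between executor and baseline")
    if executor.get("allowedTools") != baseline.get("allowedTools"):
        problems.append("allowedTools differ between executor and baseline")
    if not skill_resources(executor):
        problems.append("executor loads no skill:// resources")
    if skill_resources(baseline):
        problems.append("baseline must not load skill:// resources")
    return problems

executor = {
    "name": "skill-executor",
    "tools": ["*"],
    "allowedTools": ["*"],
    "resources": [
        "skill://.kiro/skills/*/SKILL.md",
        "skill://~/.kiro/skills/*/SKILL.md",
        "file://.kiro/steering/**/*.md",
    ],
}
baseline = {
    "name": "baseline",
    "tools": ["*"],
    "allowedTools": ["*"],
    "resources": [
        "file://AGENTS.md",
        "file://README.md",
        "file://.kiro/steering/**/*.md",
    ],
}
print(check_ab_pair(executor, baseline))  # empty list when the pair is valid
```

In practice the two dictionaries would be loaded from the JSON files with json.load, so the check runs against the configs that the evaluation actually uses.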
The judgment agents #
All three share the same pattern — minimal tools, write scoped to /tmp/, prompts loaded from the skill-creator’s agents/ directory:
{
"name": "skill-grader",
"description": "Skill evaluation grader. Invoke to evaluate assertions/expectations against execution transcripts and output files. Grades each assertion as pass/fail with evidence.",
"prompt": "file://../skills/skill-creator/agents/grader.md",
"tools": ["read", "write"],
"allowedTools": ["read", "write"],
"toolsSettings": {
"fs_write": {
"allowedPaths": ["/tmp/**", "/private/tmp/**"]
}
}
}
The comparator adds isolation — it cannot read metadata that would reveal which output came from which variant:
{
"name": "skill-comparator",
"description": "Blind output comparator. Invoke to compare two outputs without knowing which skill produced them. Judges quality and task completion without bias.",
"prompt": "file://../skills/skill-creator/agents/comparator.md",
"tools": ["read", "write"],
"allowedTools": ["read", "write"],
"toolsSettings": {
"fs_write": {
"allowedPaths": ["/tmp/**", "/private/tmp/**"]
},
"fs_read": {
"deniedPaths": ["**/eval_metadata.json", "**/skill-snapshot/**"]
}
}
}
The headless worker: llm-worker.json #
{
"name": "llm-worker",
"description": "Minimal LLM agent for single-turn text generation. No tools, no skills, no MCP. Used by scripts that need raw model output without side effects.",
"prompt": "You are a text generation assistant. Respond only with what is asked. Do not use tools.",
"tools": [],
"allowedTools": [],
"resources": []
}
This replaces claude -p in the original scripts. Zero tools, zero skills — pure text generation for description optimization loops.
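A script-side wrapper for this swap might look like the sketch below. The command matches the invocation named in this post; the function name, the injectable runner parameter, and the stub used for the dry run are assumptions added so the example can run without kiro-cli installed.

```python
import subprocess
from types import SimpleNamespace

def call_llm(prompt, runner=subprocess.run, timeout=120):
    """Send one prompt to the llm-worker agent and return its text output."""
    cmd = [
        "kiro-cli", "chat",
        "--no-interactive",
        "--agent", "llm-worker",
        prompt,
    ]
    result = runner(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()

def fake_runner(cmd, **kwargs):
    """Stand-in for subprocess.run that echoes the prompt back."""
    return SimpleNamespace(stdout=f"echo: {cmd[-1]}\n", stderr="", returncode=0)

# Dry run with the stub; drop the runner argument to call kiro-cli for real.
print(call_llm("Rewrite this description", runner=fake_runner))
```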
Key Design Decisions #
Decision 1: Subagents for judgment, shell for execution #
Subagents run in-process: they are fast and return structured data directly to the parent’s context. But test execution needs a completely separate agent config with different skills loaded, and a shell subprocess is the only way to get a clean execution environment.
The rule: if the task needs the target’s own config, use shell. If it needs the parent’s context, use subagent.
Decision 2: The idempotency problem #
This one almost shipped broken — twice.
First realization: the baseline. The default Kiro agent auto-discovers ALL workspace skills via skill://.kiro/skills/*/SKILL.md. You can’t use it as a “without skill” baseline because it sees the skill you’re testing. We needed baseline.json — "tools": ["*"] with zero skill:// resources.
Second realization: the with_skill side. We initially ran with_skill tests by omitting --agent (using whatever default the user has). But users can configure a custom default agent in kiro-cli settings — different prompt, different tools, maybe fewer skills. That makes evals non-reproducible across machines.
The fix: both sides are explicit named agents:
- skill-executor.json: all tools + skill://.kiro/skills/*/SKILL.md + global skills. Always discovers and loads workspace skills, regardless of user settings.
- baseline.json: all tools + zero skill:// resources. Never sees skills.
{
"name": "skill-executor",
"description": "Full-capability agent with all workspace skills loaded. Used as the 'with_skill' executor in skill evaluations. Mirrors kiro_default behavior — same tools, same prompt style, same skill discovery — but is deterministic regardless of user's default agent settings.",
"prompt": "You are the Kiro CLI agent...",
"tools": ["*"],
"allowedTools": ["*"],
"resources": [
"skill://.kiro/skills/*/SKILL.md",
"skill://~/.kiro/skills/*/SKILL.md",
"file://.kiro/steering/**/*.md"
]
}
Now evaluations are idempotent: same agents, same configs, same results — no matter who runs them or what their personal defaults are.
Decision 3: TODO templates beat implicit planning #
Claude Code’s parent agent figures out what to do from the skill instructions — it reads the methodology and improvises an execution plan. This works because Agent() is infinitely flexible.
Kiro’s meta-builder needs explicit step-by-step TODO templates with executor annotations. Each step specifies WHO executes it:
1. [self] Read skill, understand purpose
2. [shell] Snapshot skill to /tmp/
3. [self] Draft test prompts
4. [user] Confirm
5. [shell] kiro-cli chat --no-interactive --agent skill-executor --trust-all-tools
6. [shell] kiro-cli chat --no-interactive --agent baseline --trust-all-tools
7. [subagent:grader] Grade outputs → grading.json
8. [subagent:comparator] Blind A/B → comparison.json
9. [shell] python -m scripts.validate_workspace (catch errors early)
10. [shell] python -m scripts.aggregate_benchmark
11. [subagent:analyzer] Surface patterns → analyst_notes.json
12. [shell] python generate_review.py --static
13. [user] Review, provide feedback
14. [goto:5] Iterate
Without these annotations, the meta-builder would misroute — we saw test execution go to the researcher subagent because it was the only one with broad tools. The TODO template makes routing deterministic.
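The annotation format is regular enough to parse mechanically. The sketch below turns template lines into structured steps; the regex and the field names are assumptions for illustration, since the post does not show how (or whether) the meta-builder parses the template.

```python
import re

# "<n>. [executor] action" or "<n>. [executor:target] action"
STEP_PATTERN = re.compile(r"^(\d+)\.\s+\[(\w+)(?::([\w-]+))?\]\s+(.*)$")

def parse_step(line):
    """Parse one TODO line into its number, executor, target, and action."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"step has no executor annotation: {line!r}")
    number, executor, target, action = match.groups()
    return {
        "number": int(number),
        "executor": executor,
        "target": target,  # None when the annotation has no ":name" part
        "action": action,
    }

template = """\
1. [self] Read skill, understand purpose
5. [shell] kiro-cli chat --no-interactive --agent skill-executor --trust-all-tools
7. [subagent:grader] Grade outputs → grading.json
14. [goto:5] Iterate
"""
plan = [parse_step(line) for line in template.splitlines()]
for step in plan:
    print(step)
```

Rejecting lines without an annotation is the useful part: a step that names no executor is exactly the kind that gets misrouted.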
Decision 4: Blind comparator isolation #
The skill-comparator does blind A/B comparison — it must not know which output came from which variant. We enforce this with deniedPaths:
"fs_read": {
"deniedPaths": ["**/eval_metadata.json", "**/skill-snapshot/**"]
}
The metadata file maps “Output A” → “with_skill” and “Output B” → “without_skill”. The comparator physically cannot read it. This is stronger than a prompt instruction — it’s a tool-level constraint.
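To get a feel for which files these globs cover, the sketch below matches paths against them with Python's fnmatch, where * also matches path separators. Kiro CLI's own glob semantics may differ, so treat this as an approximation; the workspace name in the sample paths is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the skill-comparator config.
DENIED_PATHS = ["**/eval_metadata.json", "**/skill-snapshot/**"]

def is_read_denied(path):
    """True if the path matches any denied glob (fnmatch semantics)."""
    return any(fnmatchcase(path, pattern) for pattern in DENIED_PATHS)

for path in [
    "/tmp/demo-workspace/iteration-1/trending-ai/eval_metadata.json",
    "/tmp/demo-workspace/skill-snapshot/SKILL.md",
    "/tmp/demo-workspace/iteration-1/trending-ai/comparison.json",
]:
    print(path, "->", "denied" if is_read_denied(path) else "readable")
```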
Decision 5: Write access scoped to /tmp/ #
All evaluation work happens in /tmp/<skill-name>-workspace/. Subagents can write results there but cannot touch the actual skill files in .kiro/. The meta-builder writes to .kiro/ only with user confirmation.
/tmp/<skill-name>-workspace/
├── skill-snapshot/ (copy of skill before changes)
├── iteration-1/
│ ├── <eval-name>/ (descriptive: "trending-ai", "pdf-extraction")
│ │ ├── eval_metadata.json (prompt + assertions)
│ │ ├── with_skill/
│ │ │ ├── transcript.md
│ │ │ ├── outputs/
│ │ │ └── grading.json
│ │ ├── without_skill/
│ │ │ ├── transcript.md
│ │ │ ├── outputs/
│ │ │ └── grading.json
│ │ └── comparison.json (blind A/B from comparator)
│ ├── benchmark.json (generated by aggregate_benchmark.py)
│ ├── analyst_notes.json (generated by skill-analyzer)
│ └── report.html (generated by generate_review.py)
└── iteration-2/
└── ...
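A structural check over this layout could look like the sketch below. The real validate_workspace.py is not shown in this post, so this is a hypothetical stand-in that only verifies the files named in the tree above; the error message wording is made up.

```python
import json
import tempfile
from pathlib import Path

VARIANTS = ("with_skill", "without_skill")

def validate_iteration(iteration_dir):
    """Return a list of structural errors for one iteration directory."""
    errors = []
    iteration_dir = Path(iteration_dir)
    eval_dirs = [p for p in sorted(iteration_dir.iterdir()) if p.is_dir()]
    if not eval_dirs:
        errors.append(f"{iteration_dir.name}: no eval directories")
    for eval_dir in eval_dirs:
        if not (eval_dir / "eval_metadata.json").is_file():
            errors.append(f"{eval_dir.name}: missing eval_metadata.json")
        for variant in VARIANTS:
            grading = eval_dir / variant / "grading.json"
            if not grading.is_file():
                errors.append(f"{eval_dir.name}/{variant}: missing grading.json")
                continue
            try:
                json.loads(grading.read_text())
            except json.JSONDecodeError:
                errors.append(f"{eval_dir.name}/{variant}: grading.json is not valid JSON")
    return errors

# Demo on a throwaway workspace where the baseline side was never graded.
with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp) / "iteration-1" / "trending-ai"
    (eval_dir / "with_skill").mkdir(parents=True)
    (eval_dir / "eval_metadata.json").write_text("{}")
    (eval_dir / "with_skill" / "grading.json").write_text("{}")
    demo_errors = validate_iteration(Path(tmp) / "iteration-1")
    print(demo_errors)
```

Returning a list of errors, rather than raising on the first one, lets the orchestrator see every missing or malformed file in a single pass.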
Decision 6: Scripts ported from claude -p to kiro-cli chat --no-interactive #
The original run_eval.py uses claude -p for headless LLM calls. Our port uses kiro-cli chat --no-interactive --trust-all-tools. The trigger detection creates a temporary skill in .kiro/skills/, runs a query, and checks if the output contains a marker string — proving the skill was auto-discovered and loaded:
import subprocess
import uuid
from pathlib import Path


def run_single_query(query, skill_name, skill_description,
                     timeout, project_root, model=None, agent=None):
    """Return True if the temporary skill was triggered by the query."""
    # A unique suffix keeps concurrent runs from colliding on name or marker.
    unique_id = uuid.uuid4().hex[:8]
    clean_name = f"{skill_name}-eval-{unique_id}"
    skill_dir = Path(project_root) / ".kiro" / "skills" / clean_name
    skill_file = skill_dir / "SKILL.md"
    skill_dir.mkdir(parents=True, exist_ok=True)

    # The skill body only asks for a marker string, so finding the marker in
    # the output shows that the description caused the skill to load.
    skill_content = (
        f"---\n"
        f"name: {clean_name}\n"
        f"description: {skill_description}\n"
        f"---\n\n"
        f"If this skill was loaded, respond with exactly: SKILL_TRIGGERED_{unique_id}\n"
    )
    skill_file.write_text(skill_content)

    cmd = [
        "kiro-cli", "chat",
        "--no-interactive",
        "--trust-all-tools",
        "--agent", agent or "skill-trigger-tester",
    ]
    if model:
        cmd.extend(["--model", model])
    cmd.append(query)

    result = subprocess.run(cmd, capture_output=True, text=True,
                            cwd=project_root, timeout=timeout)
    output = result.stdout + result.stderr
    return f"SKILL_TRIGGERED_{unique_id}" in output
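run_single_query returns a single boolean per query. As a sketch of how a batch of such results could be scored against expectations, the helper below counts correct, false, and missed triggers. The (should_trigger, did_trigger) pair shape and the function name are assumptions, not the actual format used by run_eval.py, and the sample data is made up.

```python
def score_trigger_results(results):
    """Score a list of (should_trigger, did_trigger) boolean pairs."""
    true_pos = sum(1 for expected, actual in results if expected and actual)
    true_neg = sum(1 for expected, actual in results if not expected and not actual)
    false_triggers = sum(1 for expected, actual in results if not expected and actual)
    missed_triggers = sum(1 for expected, actual in results if expected and not actual)
    total = len(results)
    return {
        "accuracy": (true_pos + true_neg) / total if total else 0.0,
        "false_triggers": false_triggers,
        "missed_triggers": missed_triggers,
    }

# Illustrative results: one hit, one miss, one correct skip, one false trigger.
sample = [(True, True), (True, False), (False, False), (False, True)]
print(score_trigger_results(sample))
```

Separating missed triggers from false triggers matters when rewriting a description: the first calls for broader wording, the second for narrower wording.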
The Mapping Table #
For anyone doing a similar port, here’s the complete translation:
| Claude Code | Kiro CLI | Why different |
|---|---|---|
| Agent(prompt="...") | subagent → named-agent | Kiro requires pre-registered agents with JSON config |
| Agent() for test execution | shell: kiro-cli chat --no-interactive --agent skill-executor | Target needs its own config/tools/skills; must be idempotent |
| agents/grader.md read at runtime | skill-grader.json with "prompt": "file://../skills/skill-creator/agents/grader.md" | .md is shared, JSON is the identity layer |
| TaskCreate/TaskUpdate | todo_list tool | Same concept, different implementation |
| claude -p in scripts | kiro-cli chat --no-interactive --agent llm-worker | Zero-tool agent for pure text generation |
| Implicit routing (parent decides) | Description-driven + TODO annotations | Kiro needs descriptions; we augment with deterministic TODO |
| No baseline needed (Agent() with custom prompt = no skill) | Dedicated baseline + skill-executor agents | Both sides must be pinned for idempotent evaluation |
Results #
First full evaluation run on a youtube-trending skill — 6 parallel test runs:
- 3 with-skill runs via --agent skill-executor (default-like agent with skill discovery)
- 3 baseline runs via --agent baseline (same tools, zero skills)
- Automated grading via skill-grader subagent
- Benchmark: 100% pass rate with skill vs 13-20% without (a delta of +80 to +87 percentage points)
- Full HTML report generated via generate_review.py --static
The pipeline produces a clear discriminating signal: with the skill, the agent runs the bundled Python script and produces structured data tables. Without it, the agent does web searches and returns general commentary — useful but not structured.
The first run exposed two pipeline bugs (documented in Lessons Learned below) that we fixed before getting clean end-to-end execution. The architecture works, but the orchestrator prompt required significant iteration to prevent shortcutting behavior.
Lessons Learned #
1. Description is the routing table. In Kiro CLI, if your subagent’s description doesn’t clearly say when to invoke it, the parent will misroute. We learned this when test execution went to researcher instead of shell. The fix: descriptions that state trigger conditions, not just capabilities.
Bad: "AWS helper agent"
Good: "AWS fact-checker. Invoke when blog content makes claims about AWS service features, limits, pricing, or configurations that need verification against official docs."
2. TODO templates beat implicit planning. Claude Code’s parent improvises from instructions. Kiro’s meta-builder needs explicit executor annotations ([shell], [subagent:name], [self], [user]) to stay deterministic. The TODO list isn’t just progress tracking — it’s the execution plan.
3. Idempotency requires pinning both sides. We almost shipped with the default agent as baseline before discovering it auto-discovers all workspace skills. Then we almost shipped with --agent omitted for with_skill runs before realizing users can configure custom defaults. The fix: both skill-executor and baseline are explicit named agents — same config every time, regardless of user settings. Reproducibility comes from pinning the entire evaluation environment, not just the control group.
4. Subagents can’t write without explicit permission. On the first run, the grader returned grading data in its Summary but couldn’t write grading.json. Adding the write tool and /tmp/** to allowedPaths fixed it. The default is deny-all.
5. The .md files are portable. The grader, comparator, and analyzer prompts from Claude Code’s skill-creator work unchanged in Kiro CLI. Only the identity layer (JSON config) needed to be created. The prompts are the reusable asset; the orchestration is what changes.
6. The orchestrator will shortcut if you let it. Our first real evaluation run failed because the meta-builder wrote an inline grading script instead of delegating to skill-grader. It “saved time” by skipping the subagent call, but produced grading.json in the wrong format, which broke aggregate_benchmark.py, which caused it to hand-write benchmark.json — cascading through the entire pipeline. The fix: hard rules in the prompt (“NEVER grade outputs yourself”) plus a validate_workspace.py script that catches structural errors before aggregation. The agent can’t shortcut past a failing validation check.
7. Shell permission parsing is sophisticated. Compound commands (cd X && kiro-cli chat ...) are split by tree-sitter, and each sub-command is checked independently. This matters for allowedCommands regex design: anchor every pattern with ^ and $.
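The effect of per-sub-command checking can be sketched with a naive splitter. Splitting on operators with a regex is only a stand-in for real shell parsing (it would break on operators inside quoted strings), and the deny-wins evaluation order is an assumption; the patterns are a subset of the meta-builder config.

```python
import re

ALLOWED = [r"^kiro-cli chat --no-interactive.*$", r"^mkdir -p /tmp/.*$"]
DENIED = [r"^rm -rf.*$", r"^git push.*$"]

def split_compound(command):
    """Naively split on &&, ||, ; and | (not a real shell parser)."""
    parts = re.split(r"\s*(?:&&|\|\||;|\|)\s*", command)
    return [part.strip() for part in parts if part.strip()]

def is_allowed(sub_command):
    """Assumed semantics: deny patterns win, then an allow pattern must match."""
    if any(re.match(pattern, sub_command) for pattern in DENIED):
        return False
    return any(re.match(pattern, sub_command) for pattern in ALLOWED)

def check_compound(command):
    """A compound command passes only if every sub-command passes on its own."""
    return all(is_allowed(part) for part in split_compound(command))

sneaky = "kiro-cli chat --no-interactive hi && git push origin main"
# Matching the whole string lets the trailing git push ride along.
print(bool(re.match(ALLOWED[0], sneaky)))
# Checking each sub-command catches it.
print(check_compound(sneaky))
```

The sneaky example shows why whole-string matching alone is not enough: a trailing .* in an allow pattern happily swallows a second command.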
What’s Next #
The architecture is working but there’s more to build:
- Parallel eval runs: Currently sequential; the shell tool blocks. A wrapper script that spawns background processes and polls for completion would cut eval time significantly.
- Cross-skill benchmarking: Run the same eval suite across multiple skills to find interaction effects.
- Description optimization at scale: The run_loop.py script iterates on descriptions automatically. We’ve ported it but haven’t stress-tested at scale yet.
- Prompt hardening: The orchestrator still occasionally attempts shortcuts despite hard rules. We’re exploring whether a pre-tool-use hook that rejects inline grading attempts would be more reliable than prompt instructions alone.
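For the parallel-runs item, one possible shape for such a wrapper is sketched below. It uses a thread pool around subprocess.run rather than background processes with polling, so it is an alternative to the approach described, not an implementation of it. The demo commands are stand-ins so the sketch runs anywhere; the function names and defaults are assumptions.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_command(cmd, timeout=600):
    """Run one command and capture its exit code and combined output."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {
        "cmd": cmd,
        "returncode": result.returncode,
        "output": result.stdout + result.stderr,
    }

def run_parallel(commands, max_workers=4):
    """Run commands concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))

# Stand-in commands; a real wrapper would pass kiro-cli invocations for
# the skill-executor and baseline agents instead.
demo_commands = [[sys.executable, "-c", f"print('run {i}')"] for i in range(3)]
demo_results = run_parallel(demo_commands)
for item in demo_results:
    print(item["returncode"], item["output"].strip())
```

Threads are sufficient here because each worker spends its time waiting on a child process, not executing Python code.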
The broader lesson: porting between AI agent frameworks isn’t about syntax translation. It’s about understanding the execution model — who decides what, when, and with what information. Claude Code’s model is “one smart agent with infinite flexibility.” Kiro CLI’s model is “many focused agents with explicit contracts.” Neither is wrong, but they demand different architectures for the same problem. And in the named-agent world, idempotency must be designed in — you can’t assume the default agent is what you think it is.
Try It: Building a Skill with Meta-Builder #
If you want to use this architecture in your own project, here’s the quickstart:
Prerequisites #
# Enable experimental features needed by the pipeline
kiro-cli settings chat.enableTodoList true
kiro-cli settings chat.enableDelegate true
Step 1: Copy the agents and skill into your project #
# Clone or copy these into your .kiro/ directory:
.kiro/agents/meta-builder.json
.kiro/agents/skill-executor.json
.kiro/agents/baseline.json
.kiro/agents/skill-grader.json
.kiro/agents/skill-comparator.json
.kiro/agents/skill-analyzer.json
.kiro/agents/llm-worker.json
.kiro/agents/skill-trigger-tester.json
.kiro/prompts/meta-builder.md
.kiro/skills/skill-creator/ # The full skill-creator skill
Step 2: Start a session with meta-builder #
kiro-cli chat --agent meta-builder
Step 3: Tell it what you want #
> I want to build a skill that searches YouTube and returns a markdown table with video stats
The meta-builder will:
- Interview you (max 3 questions per turn)
- Draft the SKILL.md and test cases
- Ask for confirmation
- Run evaluations: --agent skill-executor (with skill) vs --agent baseline (without)
- Grade via skill-grader subagent
- Run blind comparison via skill-comparator
- Aggregate benchmarks and generate an HTML report
- Present results and iterate based on your feedback
What you’ll see #
The pipeline produces a /tmp/<skill-name>-workspace/ directory with:
- Transcripts from both variants
- Grading JSON with pass/fail + evidence per assertion
- A benchmark showing the delta (e.g., +80 percentage points in pass rate with skill)
- A standalone HTML report for reviewing outputs qualitatively
Tips #
- Be specific in your intent — “a skill that does X, outputs Y format, triggered when user says Z” gives better results than vague requests
- Review the test prompts — the meta-builder drafts them, but you know your users better
- The first iteration is rarely final — expect 2-3 rounds of eval → feedback → improve
- Check the HTML report — it shows side-by-side outputs so you can judge quality, not just pass rates
- Environment variables — if your skill needs API keys, export them before starting the session so eval runs can access them
References #
- The Complete Guide to Building Skills for Claude (PDF) — Anthropic’s official methodology for skill design and evaluation
- skill-creator source (GitHub) — The original Claude Code skill we ported from
- Kiro CLI documentation — Agent configuration, subagents, and hooks reference
This post was co-authored by Mladen Trampic and Kiro, demonstrating the collaborative approach to technical content creation.

