Summary of Multi-Turn Evaluation Results
The multi-turn evaluation suite, designed to simulate real-world development work across several task types (TDD, debugging, planning, refactoring), yielded the following results:
Configurations:
- Superpowers: SessionStart hook + skills.
- Plain Skills: Same skills installed without any hooks or hints.
- CLAUDE.md: Equivalent guidelines written as static rules, always in context.
- CLAUDE.md + Hint: One-liner hint in CLAUDE.md saying "invoke the relevant skill before coding" + skills installed.
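For concreteness, the "Superpowers" setup above can be sketched as a Claude Code `settings.json` fragment that runs a command on session start. This is a minimal illustration, not the article's actual configuration; the echoed reminder text is a hypothetical stand-in for whatever the SessionStart hook injects.

```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo 'Check installed skills and invoke the relevant one before coding.'"
          }
        ]
      }
    ]
  }
}
```

The "CLAUDE.md + Hint" configuration, by contrast, would simply place a similar one-line instruction in the project's CLAUDE.md file with no hook involved.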
Scenarios:
- TDD – Email Validator
- Debugging – Broken LRU Cache
- Planning – Rate Limiter
- Refactoring – Express Middleware
- Mixed – HTTP Client Retry
Key Findings:
- Superpowers: The model consistently invoked skills across all scenarios, leading to a high pass rate (70-90%).
- Plain Skills: The model rarely or never opened the skill drawer on its own, resulting in very low success rates.
- CLAUDE.md: With static rules always in context, the model performed significantly
Read the full article at DEV Community