⏱️ Estimated reading time: 7 min

Skills Were Written by Hand

Agent skill documents, the documents that describe how an agent should perform a specific task, are written by humans almost universally. What steps to follow, which tools to use, how to recover from failure. Writing a good skill document takes experience and time. And it does not end once the document is written. When an agent fails, the skill must be revised, and that too is a human responsibility.

arXiv:2605.23904, “SkillOpt: Executive Strategy for Self-Evolving Agent Skills,” automates this process. It treats skill documents as trainable external components and systematically improves them based on agent execution results.

Text-Space Optimization

The core idea in SkillOpt is a “text-space optimizer.” It optimizes the skill document text itself without touching model weights. This differs from conventional skill authoring in three ways.

Scored rollouts. Every time an agent executes a skill, a rollout (execution trajectory) is recorded and scored. High-scoring rollouts indicate what is working. Low-scoring rollouts point to which parts of the skill are problematic.

Controlled edits. An optimizer model analyzes rollout data and proposes edits to the skill document. There are three edit types: add, delete, and replace. This is constrained transformation, not free rewriting. The constraint matters because the narrower the scope of each edit, the easier it is to trace which change caused which shift in performance.

Validation gate. A proposed edit is not applied immediately. A change is accepted only when the validation score improves. Edits that worsen performance are filtered out automatically.

No Inference-Time Overhead

One practical advantage of SkillOpt is that it adds no inference-time overhead. Skill optimization happens offline. Collecting rollouts, proposing edits, and validating are all separate from agent execution.

A deployed agent simply references the optimized skill document. Since optimization itself is not in the execution path, latency is unaffected. There is no tradeoff of “better skill means slower execution.”

Experimental Results

Two figures appear in the abstract.

Up to +23.5 points (GPT-5.5). Across multiple benchmarks and models, agents using SkillOpt-optimized skills outperformed baselines. The largest single improvement was 23.5 points using GPT-5.5. Which benchmark produced this number requires the full paper.

Cross-model transfer. Skills optimized on one model remain effective when transferred to a different model. This is a significant property for justifying the cost of skill optimization. If skills optimized on GPT-5.5 continue to improve performance when applied to Qwen-family models, a skill optimized once can be reused across multiple models.

Cross-environment transfer is also reported: optimized skills maintain their effect across different execution environments.

Hand-Written Skills vs. SkillOpt Skills

The baseline the paper compares against is skills written by hand or revised informally. This comparison is meaningful because it is realistic. That is exactly how skills are managed in most agent systems.

SkillOpt outperforms hand-written skills for two reasons. First, it optimizes based on actual execution data. Real failure and success patterns are a more accurate signal than a skill author’s intuition. Second, it improves continuously. As new rollouts accumulate, skills are updated. Instead of a document written once and frozen, the skill becomes a living document.

Connection to ThakiCloud’s Skill System

Looking at the skills under ThakiCloud’s .claude/skills/ through SkillOpt’s lens reveals a natural connection.

The current skill management process is manual. selfharness-evolve attempts nightly skill improvement, but how it uses rollout data varies by skill. Applying SkillOpt’s framework would enable a more systematic approach.

Concretely, a pipeline is needed that automatically collects rollouts when a skill is executed and assigns scores to each run. Scores should be defined based on whether the skill achieved its intended goal, whether it completed without errors, and how efficient the run was. Proposing controlled edits from that data, validating them, and applying accepted changes is the SkillOpt cycle.

The sonnet-format-determinism rule already points toward validating skill output quality through code. Using those validation results as SkillOpt-style improvement signals would integrate naturally with the current architecture.

Cross-model transferability matters especially here. ThakiCloud’s platform routes across multiple models including sonnet, opus, and haiku. If skills optimized against one model also perform well on others, the cost of optimization drops substantially.

Caveats

SkillOpt has limits. Even with controlled edits, accumulated changes can drift in unexpected directions. A skill that improves on a specific benchmark while losing generality in the process is the classic failure mode.

The validation gate partially prevents this but not completely. Tracking the full history of skill edits and having a rollback mechanism to a prior version when performance drops must be built alongside.

Closing Thoughts

SkillOpt moves agent skills from “artifacts deployed and frozen” to “living components that improve continuously.” The +23.5 point figure and cross-model transferability are worth serious examination. Any team operating a skill-based agent system should read the full paper.

Original paper: https://arxiv.org/abs/2605.23904