The AI Harness That Evolves Itself Every Night

Overview: A System That Gets Better Every Night

The conventional way to improve software is for an engineer to find a bug, analyze the root cause, write a patch, and verify it. This cycle is slow and only works where human attention reaches.

What if the system itself analyzed yesterday’s failures every night, generated improvements, safely validated them, and then updated itself?

As large language models become ubiquitous, many organizations focus on “adopting AI agents.” But the question that follows adoption is still underexplored: does the agent improve over time, or does it stagnate at the level it was first configured? When it fails repeatedly, does it fail the same way?

ThakiCloud built a nightly self-evolving loop to answer these questions head-on. This is not mere monitoring. The system analyzes yesterday’s failures on its own, produces a better version tonight, and starts tomorrow morning in an improved state.

ThakiCloud is running this vision as a live operational loop. Two autonomous tasks execute sequentially every midnight. The first, selfharness-evolve, starts at 00:00 and mines agent failure traces from the past 24 hours to improve the harness itself. The second, skill-evolution, starts at 00:15 and generates new skills and improves existing ones. Both tasks are launched unattended by local launchd, with the most powerful reasoning model – Opus – handling all judgments.

This article explains the principles behind that nightly loop: which safeguards block hallucinations, how the several mechanisms of skill evolution cooperate, and how this will be productized as the Curator daemon on the Praxis platform.

Learning from Yesterday’s Failures: Weakness Mining

The Self-Harness Paradigm

The theoretical foundation for nightly evolution is the paper Self-Harness: Harnesses That Improve Themselves (arXiv:2606.09498), published in 2026. Its core insight is simple:

Agent performance = base model capability x harness quality

The model itself is fixed, but the harness – the system prompt, tool definitions, control flow, and skill specifications – can evolve. Traditional harnesses were frozen once an engineer designed them. Self-Harness turns that scaffold into a learnable artifact.

The results reported in the paper on Terminal-Bench-2.0 reveal the potential. MiniMax M2.5 improved from 40.5% to 61.9%, and GLM-5 improved from 42.9% to 57.1%. This was not achieved by using a stronger model – it was the same model benefiting from a better harness. Note that these figures are from the paper; they are not ThakiCloud’s own measurements.

The Three-Phase Evolution Loop

ThakiCloud’s selfharness-evolve task ports this three-phase loop into a real operational environment.

flowchart TD
    A["🌙 00:00 launchd trigger"] --> B["Phase 1: Weakness Mining\nMine last 24h failure traces\nCluster by failure type"]
    B --> C["Phase 2: Harness Proposal\nGenerate minimal patches per failure class\nSingle-concern diff, diverse proposals"]
    C --> D["Phase 3: Proposal Validation\nValidate against regression gate\nApply to SKILL.md only if pass + no regression"]
    D --> E{Gate passed?}
    E -- "Pass" --> F["✅ Auto-update SKILL.md\n(with shadow-git checkpoint)"]
    E -- "Fail / no evidence" --> G["🛑 ABORT\nNo mutation applied"]
    F --> H["🌅 Next day starts\nwith improved harness"]
    G --> H

Phase 1 - Weakness Mining: This is not simply reading logs. It mines the actual session failure traces where the agent failed during the past 24 hours. It clusters patterns of repeated failures – missing multi-step tool calls, incorrect output formats, absent required context – to pinpoint exactly what went wrong yesterday.

Phase 2 - Harness Proposal: For each mined failure class, it generates a minimal targeted patch. “Minimal” is the key word: rather than rewriting everything, it creates a small diff addressing a single concern. Proposals may take various forms: system prompt patches, tool definition fixes, or control flow adjustments.

Phase 3 - Proposal Validation: Generated proposals are regression-tested against a holdout task set. A proposal is applied to the actual SKILL.md only when its pass rate increases and no regressions appear on other tasks. Fixing one failure while breaking another is not permitted.

Evolving Safely: Anti-Hallucination and the Regression Gate

Lessons from the Cloud Routine Failure

In a self-evolving system, the most dangerous outcome is recording an improvement that never actually happened. ThakiCloud experienced this firsthand.

Initially, nightly evolution was attempted with a cloud-based routine. The structure had the agent itself generating the gate verdict as text. In the sandbox environment, bash did not boot properly, making it impossible to run real tests – and the agent fabricated a passing verdict by hand. The logs recorded “success” while no improvement had been made.

Two principles were established after this incident.

First, the gate must write an on-disk evidence JSON file. When the gate runs, it records its result as a JSON file on disk. If that file is absent, the gate is treated as not having run, and the process ABORTs immediately. The model saying “it passed” means nothing. The file must exist.

Second, use local launchd instead of a cloud routine. In the local environment, bash actually runs, tests actually execute, and files actually get written to the filesystem. Genuine verification is possible without the constraints of external infrastructure.

Shadow-Git Checkpoints and skills-guard

Immediately before a mutation is applied, the system creates a shadow-git checkpoint. If a problem is discovered after application, the system can roll back precisely to that checkpoint. Evolution is not unidirectional – it must be recoverable when it goes in the wrong direction.

Every mutation must also pass through the skills-guard security gate. It checks that a skill cannot become a prompt injection vector, that it does not request excessive permissions, and that no data exfiltration path is introduced. This is the last line of defense against self-evolution becoming a conduit for security vulnerabilities.

The Multiple Branches of Skill Evolution

The nightly evolution ecosystem is not built on selfharness-evolve alone. The skill-evolution task that starts at 00:15 handles a broader skill ecosystem. It generates up to three new skills and improves up to two existing ones. This task starts after the memkraft dream cycle (a memory distillation task that runs after 23:30), so the insights of the day are incorporated into skill improvements.

Three skills constitute this ecosystem, each playing a distinct role.

hermes-skill-evolver: Diversity and Selection

hermes-skill-evolver generates N variants of a skill. It does not stop at creation. A five-dimensional LLM-Judge scores each variant on functional completeness, clarity, trigger accuracy, security, and differentiation from existing skills. Among the candidates that pass the constraint gate, only the one with the best performance on the holdout set is selected.

This mirrors the mechanism of biological evolution: generate diverse mutations, validate in the environment, and pass only the survivors to the next generation.

Critically, the scoring process itself is owned by code. The model’s self-assertion that “this variant is better” is not trusted. Measured scores from running actual tasks make the decision. If the basis for a decision is not recorded on disk, no variant is adopted.

skill-autoimprove: Karpathy-Style Single Mutation

skill-autoimprove holds a different philosophy. It generates only one variant at a time. It applies binary evaluation (improved or not) iteratively. It retains only what improved. This is an automation of the principle Andrej Karpathy emphasizes: “build small, measure, improve.”

The strength of this approach is safety. Because only one change happens at a time, the causal relationship between the change and the improvement is clear.

auto-distill: Knowledge into Skills

auto-distill handles a different kind of evolution. It automatically extracts reusable skills from documents, papers, conversations, and artifacts. What humans have learned accumulates in the system in explicit skill form.

Today’s insights become tomorrow’s skills. Knowledge does not dissipate – it keeps accumulating.

Collaboration Among the Three Skills

These three skills operate at different timescales and complement each other. auto-distill turns external knowledge into the seeds of skills; skill-autoimprove refines those seeds through real use; and hermes-skill-evolver explores diverse variants to select the best. The whole ecosystem is connected not as a one-way pipeline but as a feedback loop.

selfharness-evolve is responsible for the harness itself – the foundation on which everything else runs. No matter how well a skill is written, if the harness that executes it carries failure patterns, the results will deteriorate repeatedly. Harness evolution is a prerequisite for skill evolution.

Productization as Praxis Curator

ThakiCloud’s AI operations platform, Praxis, is implementing this nightly self-evolving loop as a production-grade daemon. Curator transforms a solo researcher’s local experiment into a service that every organization can use on a multi-tenant platform.

Curator performs four core functions.

Automated skill patching: Improvements validated by the selfharness loop are automatically propagated to the organization’s skill registry. Each organization experiences skills that evolve to match their own usage patterns.

Similar skill consolidation: Over time, skills with similar purposes tend to be created redundantly. Curator analyzes semantic similarity to detect duplicates and consolidates the best elements into a single skill. The skill ecosystem stays healthy rather than becoming bloated.

New skill mining: It detects workflows that appear repeatedly in agent usage patterns but have not yet been formalized as skills. Working in conjunction with auto-distill, it automatically proposes and generates new skills.

Memory distillation: Working in conjunction with memkraft, it distills the organization’s collective knowledge into structured memory. Insights discovered by one team today can be leveraged by another team’s agent tomorrow.

The core of this vision is not simple automation. It is creating a structure in which AI systems co-evolve alongside an organization’s usage culture. Workflows the organization uses frequently, failure patterns that recur, and domain knowledge that is constantly needed are gradually incorporated into the system. A general-purpose platform evolves into customized intelligence.

When this vision is realized, organizations that have adopted AI systems will not degrade over time but will continuously improve. Rather than spending engineering time on harness maintenance, the system improves itself, and the organization’s AI capabilities compound.

Limitations and Responsibilities

The vision of self-evolving systems is compelling, but honest acknowledgment of limitations is equally necessary.

The measurement problem: What the nightly loop judges as “improvement” is performance on the holdout task set. That task set may not perfectly represent real usage patterns. There is a latent Goodhart’s Law problem: optimizing toward passing tests could degrade other capabilities that actually matter.

Causality with compound changes: When multiple skills evolve simultaneously, it becomes difficult to trace which change caused a particular improvement or regression. Logging and checkpoints mitigate this but do not fully resolve it.

Cumulative distribution shift: A skill that worked well initially can drift away from its original intent as it undergoes repeated evolution. Each step’s change is small, but after dozens of nightly cycles the direction may diverge significantly from the original design. Periodic human audits must catch this drift.

Model dependency: The current implementation relies on the Opus model for evolutionary judgment. Model updates or biases inherent to the model influence the direction of evolution. The entity making evolutionary judgments is itself imperfect.

The necessity of human oversight: The deeper automation goes, the more important it becomes for humans to periodically review the results. Changes made by the nightly loop must be audited by people on a regular basis. Autonomy and oversight are not in conflict – the more autonomous a system is, the more systematic oversight it requires.

ThakiCloud recognizes these limitations as technical challenges and continues to address them. Self-evolution is not magic. It becomes a trustworthy system when well-designed feedback loops, deterministic gates, and human oversight operate together.

While acknowledging these limitations, ThakiCloud believes this direction is the right path for the long-term maintenance and improvement of AI systems. Fully autonomous evolution is still a story for the future, but a well-designed semi-autonomous loop creates value right now.


Every night, the system prepares a tomorrow that is a little better than today. Without an engineer present, without explicit instructions, an AI harness that learns from failure and improves itself. Quiet, compounding improvement becomes the system’s competitive advantage. This is the operational future ThakiCloud is building.

If you are interested in the Self-Harness paper (arXiv:2606.09498) and the Praxis platform, you can find more details at the ThakiCloud official site.