← Back to Homepage
February 2026 · Research Blog

Towards Self-Evolving AI: From Code to Agents to You

Three tightly connected projects (CURE, RLAnything, and OpenClaw-RL) that systematically unfreeze every component of the RL pipeline: first the reward, then the reward model and environment, and finally the data itself. The result is AI systems that evolve on their own.

Ling Yang

Gen-Verse · Princeton AI Lab · Princeton University

🧬 CURE · NeurIPS '25 Spotlight 🔄 RLAnything 🦞 OpenClaw-RL

Over the past year, our team has followed a single research thread to its logical conclusion, and the result is three tightly connected projects that, together, define a complete arc for self-evolving AI systems.

The thread starts with a simple observation: reinforcement learning works best when every component in the system can improve. In practice, most RL pipelines freeze the reward, freeze the environment, and freeze the data distribution, training only the policy. We asked: what happens when you systematically unfreeze each of these?

CURE unfreezes the reward. Instead of relying on ground-truth test cases, a coder and a unit tester co-evolve through RL, learning to supervise each other. RLAnything then unfreezes the reward model and the environment, jointly optimizing all three components in a closed loop that works across GUI agents, text-game agents, and coding LLMs. OpenClaw-RL takes the final step: it unfreezes the data itself, turning a user's everyday conversations into training signals so your personal AI agent improves simply because you use it.

Each project inherits the lessons from the one before it. CURE's pairwise reward design informed RLAnything's reward model co-optimization; RLAnything's async training infrastructure became the backbone of OpenClaw-RL's four-loop architecture. This blog walks through the three projects in sequence, with enough technical detail that you can understand and reproduce the key ideas.

The Journey So Far

Jun 2025 · NeurIPS 2025 Spotlight
Step 1: CURE – Unfreeze the Reward
A coder and a unit tester co-evolve through RL with no ground-truth code. The pairwise reward matrix replaces static test cases with a self-improving supervision signal.
Feb 2026 · Preprint
Step 2: RLAnything – Unfreeze the Entire System
Building on CURE's co-evolution insight, we jointly optimize environment, policy, and reward model in a closed loop, demonstrating theoretically and empirically that dynamic systems outperform static ones.
Feb 2026 · Open Source
Step 3: OpenClaw-RL – Unfreeze the Data
Inheriting RLAnything's async training infrastructure, we turn everyday conversations into training signals. Your personal AI agent gets better simply because you use it.

CURE: When Coders and Testers Co-Evolve

NeurIPS 2025 Spotlight · Top 3%
Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang

Motivation

Mathematical reasoning in LLMs has been successfully incentivized via RL with verifiable rewards. In code generation, the standard recipe relies on ground-truth test cases: run the model's code against known tests, get a binary pass/fail reward. But ground-truth tests are expensive to curate and hard to scale. Meanwhile, unit test generation itself is a critical capability: accurate unit tests are essential for self-checking and self-correction during inference, test-time scaling, and agentic coding pipelines.

This raises a natural question: can a coder and a unit tester teach each other through RL, without any ground-truth code as supervision?

Technical Approach: Pairwise Reward Matrix

CURE introduces a self-play framework where a single LLM acts as both a code generator and a unit test generator. The key insight is that during RL training, the coder naturally produces both correct and incorrect solutions. The incorrect ones are gold for the unit tester: they reveal exactly the failure modes that good tests should catch.

Core Algorithm: Reward Design via Interaction Outcomes

Given a coding task, the model generates N code solutions and M unit tests. CURE constructs a pairwise reward matrix R ∈ {0,1}^{N×M}, where R_ij = 1 if code solution i passes unit test j. From this matrix, CURE derives rewards for both the coder and the unit tester.
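To make the matrix concrete, here is a minimal sketch in pure Python. It is illustrative only, not CURE's exact derived objective: the majority-vote proxy for solution correctness and the discrimination-style test reward are simplifications of the paper's analysis.

```python
def pairwise_rewards(passes):
    """Toy rewards from the pairwise pass matrix (a sketch, not CURE's
    exact objective). passes[i][j] = 1 if code solution i passes unit
    test j, for N solutions and M tests."""
    n, m = len(passes), len(passes[0])
    # Proxy for solution quality: fraction of tests each solution passes.
    code_reward = [sum(row) / m for row in passes]
    # Split solutions into likely-correct and likely-incorrect groups.
    cutoff = sorted(code_reward)[n // 2]
    good = [i for i in range(n) if code_reward[i] >= cutoff]
    bad = [i for i in range(n) if code_reward[i] < cutoff]
    # A good test discriminates: it passes likely-correct solutions
    # and fails likely-incorrect ones.
    test_reward = [
        sum(passes[i][j] for i in good) / len(good)
        - (sum(passes[i][j] for i in bad) / len(bad) if bad else 0.0)
        for j in range(m)
    ]
    return code_reward, test_reward

# Three solutions, three tests; solution 2 is buggy and fails everything.
code_r, test_r = pairwise_rewards([[1, 1, 1], [1, 0, 1], [0, 0, 0]])
```

In this toy example, tests 0 and 2 get the top reward because they pass both plausible solutions while failing the buggy one; test 1 scores lower because it also rejects a plausible solution.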

Theoretical Foundation: Reward Precision

We formulate the final objective and introduce the concept of reward precision: the probability that a unit test correctly identifies whether a given code solution is correct. Through our theoretical analysis (Sections 3.1–3.2 of the paper), we derive individual-level rewards for each generated unit test. This is non-trivial: a test that passes all solutions (including buggy ones) is useless, as is a test that fails everything. The optimal tests are those that discriminate, passing correct code while failing incorrect code.
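The definition is easy to compute when ground-truth labels are available for analysis. A hedged sketch (the function name is ours, not the paper's):

```python
def reward_precision(test_verdicts, is_correct):
    """Reward precision of a single unit test (illustrative): the
    fraction of code solutions it judges correctly. A perfect test
    passes every correct solution and fails every incorrect one."""
    hits = sum(v == c for v, c in zip(test_verdicts, is_correct))
    return hits / len(is_correct)

# A test that passes everything has low precision on buggy solutions:
# it cannot distinguish correct code from incorrect code.
loose = reward_precision([True, True, True], [True, False, False])
strict = reward_precision([True, True, False], [True, True, False])
```

Here `strict` is 1.0 (perfect discrimination) while `loose` is only 1/3, which is exactly why the degenerate "pass everything" test earns no reward.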

For long chain-of-thought models, we further introduce a response-length-guided reward transformation (Section 3.4). Long-CoT models tend to produce verbose unit tests; the transformation encourages the model to maintain high test quality while reducing unnecessary reasoning overhead, yielding a 64.8% gain in inference efficiency without sacrificing accuracy.
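One simple way to realize the idea, sketched below under our own assumptions (this is not the paper's exact Section 3.4 formula): among responses that earn positive reward, shorter ones receive a small bonus, so correctness still dominates but verbosity is gently penalized.

```python
import statistics

def length_guided_transform(rewards, lengths, alpha=0.1):
    """Illustrative length-guided reward transform: scale positive
    rewards by how much shorter a response is than the batch mean.
    alpha controls the strength of the length preference."""
    mean_len = statistics.mean(lengths)
    out = []
    for r, n_tokens in zip(rewards, lengths):
        if r > 0:
            # Shorter-than-average responses get a bonus, longer a penalty.
            out.append(r * (1 + alpha * (mean_len - n_tokens) / mean_len))
        else:
            out.append(r)  # never reward incorrect responses for brevity
    return out
```

Keeping the transform multiplicative on positive rewards only is the key design point: a wrong-but-short test must never outrank a correct-but-long one.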

Results

+5.3% code generation accuracy (over Qwen2.5-Instruct)
+9.0% Best-of-N accuracy
+8.1% agentic coding tasks
+25.1% agentic unit test generation

The resulting ReasonFlux-Coder models (7B & 14B) outperform similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder across five benchmarks. Our long-CoT model ReasonFlux-Coder-4B consistently outperforms Qwen3-4B. Perhaps most excitingly, the co-evolved model can serve as an effective reward model for RL training of base models โ€” pointing toward fully self-supervised code optimization where the trained model bootstraps reward signals for the next generation.

RLAnything: Forging the Complete RL System

Preprint · Feb 2026
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

Four Problems That Hold Back Agentic RL

CURE showed that co-evolving two roles within a single model can work remarkably well. But it operated in a relatively clean domain: code has verifiable outputs. Agentic tasks in the wild face four critical problems:

Problems Addressed

Sparse rewards: a long trajectory may yield only a single binary signal at the very end.
Stale reward models: a frozen reward model drifts out of distribution as the policy improves.
Unbalanced task difficulty: tasks that are too easy or too hard provide little learning signal for either the policy or the reward model.
Human annotation bottlenecks: manual evaluation scripts are needed for every task.

Technical Approach: Closed-Loop Dynamic Optimization

RLAnything jointly forges three components through closed-loop optimization:

Component 1: Integrated Feedback for the Policy

The policy receives combined signals from both verifiable outcome rewards and step-wise process rewards from the reward model. This addresses the sparse-reward problem directly: instead of a single binary signal at the end of a 50-step trajectory, each intermediate step receives meaningful supervision. Our experiments show this integrated feedback consistently outperforms outcome-only training across all benchmarks.
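The blending can be sketched as follows. The weighting scheme and `lam` parameter are our illustrative choices, not RLAnything's exact combination:

```python
def integrated_returns(process_rewards, outcome, gamma=1.0, lam=0.5):
    """Blend a trajectory-level outcome reward with step-wise process
    rewards into per-step returns (illustrative weighting; the paper's
    exact combination may differ)."""
    T = len(process_rewards)
    returns = []
    for t in range(T):
        # Discounted sum of future process rewards from step t onward...
        future = sum(gamma ** (k - t) * process_rewards[k] for k in range(t, T))
        # ...mixed with the shared verifiable outcome of the trajectory.
        returns.append(lam * future + (1 - lam) * outcome)
    return returns
```

Every intermediate step now carries a nonzero learning signal even when the only verifiable event is the final outcome, which is the point of integrated feedback.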

Component 2: Self-Improving Reward Model

The reward model is jointly optimized with the policy via consistency feedback, combining outcome supervision with self-consistency signals. As the policy improves, the reward model adapts to the evolving distribution of trajectories. This avoids the staleness problem that plagues frozen reward models in conventional RLHF pipelines. The reward model's step-wise evaluations feed back into policy training, creating a virtuous cycle.
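A minimal sketch of the outcome-side of that feedback, under our own simplifying assumption that consistency is measured by pushing the reward model's aggregated step scores toward the verified trajectory outcome:

```python
def consistency_loss(step_scores, outcome):
    """Illustrative consistency signal (not the paper's exact loss):
    penalize disagreement between the mean of the reward model's
    step-wise scores and the verifiable trajectory outcome."""
    mean_score = sum(step_scores) / len(step_scores)
    return (mean_score - outcome) ** 2
```

Because the loss is computed on fresh trajectories from the current policy, the reward model keeps training on the distribution it is actually asked to judge.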

Component 3: Theory-Motivated Environment Adaptation

We prove theoretically that balancing task difficulty benefits both policy and reward model optimization. When tasks are too easy or too hard, the importance sampling weights become extremely unbalanced, hurting both the policy gradient estimation and the reward model training. Our framework automatically adapts environment task difficulty using critic feedback from both the policy model and the reward model, keeping tasks in the optimal learning zone.

Theoretical Insight: Why Task Difficulty Matters for Reward Models

Beyond the well-known benefit for policy training, we show that reward model training also suffers under extreme task distributions. When the policy succeeds 100% of the time, the reward model sees only positive examples and cannot learn to discriminate quality. When success is 0%, it sees no useful signal at all. Our theoretical result motivates automatic environment adaptation that benefits the entire system: environment tasks leverage critic feedback from both the policy and the reward model to drive targeted task adjustment, enabling active learning from experience.
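The adaptation logic reduces, in its simplest form, to a controller that nudges difficulty toward an informative success rate. This is a sketch under our assumptions (proportional control toward a 50% target), not RLAnything's actual critic-driven mechanism:

```python
def adapt_difficulty(difficulty, success_rate, target=0.5, step=0.1,
                     lo=0.0, hi=1.0):
    """Illustrative difficulty controller: raise difficulty when the
    policy succeeds too often, lower it when it rarely succeeds,
    keeping tasks near an informative success rate."""
    difficulty += step * (success_rate - target)
    return min(hi, max(lo, difficulty))
```

At 100% success the controller ramps difficulty up; at 0% it ramps down, so both the policy and the reward model keep seeing a mix of successes and failures to learn from.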

Results

+9.1% OSWorld (GUI agents, Qwen3-VL-8B)
+18.7% AlfWorld (text-game, Qwen2.5-7B)
+11.9% LiveBench (LLM tasks, Qwen2.5-7B)

Each added dynamic component consistently improves the overall system across computer-use agents, text-based LLM agents, and coding LLMs. A key finding: optimized reward-model signals outperform outcome signals that rely on human labels. This means we can train GUI agents without manual evaluation scripts for every task, a huge step toward truly self-evolving agents that learn from experience without human annotation bottlenecks.

We release both the policy models (RLAnything-7B/8B) and reward models (RLAnything-Reward-8B/14B) as open-source checkpoints for the community.

OpenClaw-RL: Your Agent Gets Better as You Use It

Open Source · Feb 2026
OpenClaw-RL: Empowering OpenClaw with RL. Train a personalized agent simply by talking to it.

Yinjie Wang, Mengdi Wang, Ling Yang

CURE optimizes coding. RLAnything optimizes general agents. But what about optimizing an AI that serves you: one that learns your preferences, your communication style, your workflow, from nothing more than the conversations you already have?

OpenClaw-RL brings our RL research to where it matters most: the personal AI assistant. Built on top of OpenClaw and the Slime RL framework, it turns everyday conversations into training signals through a fully asynchronous architecture. You chat normally; in the background, the system collects trajectories, evaluates them with a process reward model, and updates the policy, all without interrupting your experience.

Architecture: Four Asynchronous Loops

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach, decoupling the system into four independent async processes: conversation serving, trajectory collection, process-reward evaluation, and policy training.

Fully Async 4-Component Architecture

None of these components block one another. The model serves your requests while training runs in the background.
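The four-loop idea can be sketched with `asyncio` queues. The loop names and data shapes below are our assumptions, not OpenClaw-RL's actual module layout; the point is that the tasks run concurrently and hand work off through queues rather than blocking each other:

```python
import asyncio

async def run_pipeline(n_turns=3):
    """Skeleton of four async loops: serving, trajectory collection,
    PRM scoring, and training run as independent tasks connected by
    queues, so no loop blocks another."""
    turns, rollouts, scored = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    updates = []

    async def serve():                     # loop 1: answer user requests
        for t in range(n_turns):
            await turns.put({"turn": t, "reply": "..."})

    async def collect():                   # loop 2: package conversations
        for _ in range(n_turns):
            await rollouts.put(await turns.get())

    async def score():                     # loop 3: PRM labels each trajectory
        for _ in range(n_turns):
            traj = await rollouts.get()
            traj["reward"] = 1.0           # placeholder PRM score
            await scored.put(traj)

    async def train():                     # loop 4: background policy update
        for _ in range(n_turns):
            updates.append(await scored.get())

    await asyncio.gather(serve(), collect(), score(), train())
    return updates

trained = asyncio.run(run_pipeline())
print(len(trained))  # 3 scored trajectories reached the trainer
```

In a real deployment the placeholder bodies are inference, logging, PRM evaluation, and gradient steps, but the queue-decoupled structure is what keeps serving latency independent of training throughput.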

Two Learning Paradigms

OpenClaw-RL offers two complementary optimization methods, each suited to different types of user feedback:

Binary RL (GRPO)

A process reward model scores each assistant turn as good (+1), bad (−1), or neutral (0) based on the subsequent user reaction. The scalar rewards drive GRPO advantage estimation with a PPO-style clipped surrogate loss. This works best when you provide implicit feedback: thumbs up/down, corrections, or simply whether you continue a line of conversation.
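The two ingredients named above, group-relative advantages and the clipped surrogate, can be written in a few lines. This is a standard textbook form, not OpenClaw-RL's exact implementation:

```python
import math
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    scalar reward by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid divide-by-zero
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate loss for one token: clip the
    probability ratio to [1 - eps, 1 + eps] and take the pessimistic
    branch, then negate for gradient descent."""
    ratio = math.exp(logp_new - logp_old)
    return -min(ratio * advantage,
                max(min(ratio, 1 + eps), 1 - eps) * advantage)
```

With the PRM's +1/−1/0 turn scores as the `rewards` group, advantages are computed relative to sibling samples, so no learned value function is needed.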


On-Policy Distillation (OPD)

When your feedback contains rich textual information ("you should have checked the file first", "don't use that library"), the system extracts hindsight hints from the next-state feedback. These hints augment the original prompt to create an "enhanced teacher" whose token-level log-probability gap with the student becomes a directional advantage signal, far richer than any scalar reward.
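The core signal is just a per-token difference of log-probabilities. A hedged sketch (the function name and example values are ours):

```python
def opd_advantages(teacher_logps, student_logps):
    """On-policy distillation signal (illustrative shape): per-token
    log-prob gap between the hint-augmented 'enhanced teacher' and
    the student on the student's own sampled tokens. A positive gap
    marks tokens the hindsight hint made more likely."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]

# Hint made the teacher more confident on token 0, less on token 1,
# so the student is pushed toward token 0 and away from token 1.
gaps = opd_advantages([-0.1, -0.9], [-0.6, -0.4])
```

Unlike a scalar +1/−1 turn reward, this gap assigns a signed weight to every token, which is what makes the signal directional.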


Self-Hosted & Private by Design

The entire stack (model, PRM, training) runs on your own infrastructure. Conversation data never leaves your system. No external API keys are required. The default configuration uses 8 GPUs (4 for the training actor, 2 for rollout, 2 for the PRM), but this is fully configurable. All conversations and PRM evaluations are logged to JSONL for analysis and debugging.

Quick Start

OpenClaw-RL ships as a drop-in enhancement for OpenClaw. You start the RL server (choose Binary RL or OPD), configure OpenClaw to route to your server's OpenAI-compatible endpoint, and start chatting. The system handles the rest:

# Binary RL mode
cd slime && bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

# On-Policy Distillation mode
cd slime && bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
Your agent gets better the more you use it. That's the promise, and with OpenClaw-RL, it's now a reality you can run on your own GPUs.

The Bigger Picture: Stop Treating RL Components as Fixed

Stepping back, these three projects share a common philosophy: stop treating parts of the RL system as fixed.

In CURE, we unfroze the reward signal: instead of relying on ground-truth test cases, the unit tester co-evolves with the coder, learning from the coder's own mistakes to provide increasingly precise reward signals. The theoretical analysis of reward precision gives principled individual-level rewards without human annotation.

In RLAnything, we unfroze the reward model and the environment alongside the policy. The reward model adapts to the evolving policy via consistency feedback; the environment adapts to the evolving capabilities of both via critic feedback. Our theoretical results show why task difficulty balance matters not just for the policy but for the reward model's training dynamics as well.

In OpenClaw-RL, we unfroze the data pipeline itself, making the user's natural behavior the source of learning signal. The fully asynchronous architecture means the feedback loop between user behavior and model improvement is continuous and non-blocking, with two complementary paradigms (scalar rewards via GRPO, directional token-level signals via OPD) covering the full spectrum of user interaction patterns.

Each step removed an assumption that was holding RL back from real-world impact. And each step made the system more autonomous, more adaptive, and more aligned with what users actually need.

Looking Ahead

Our roadmap has two tracks. Track 1 deepens personal agent optimization: broader model family support, best-recipe discovery via large-scale experiments, and extending learning beyond the policy to skills and long-term memory. Track 2 scales up agentic RL infrastructure for general agents, starting with computer-use scenarios.

The goal is a future where AI systems don't just follow instructions; they evolve with you. All code, models, and training recipes are open-sourced. We hope these tools are useful to the community, and we're excited to see what you build with them.

Citation

If you find our work useful, please consider citing:

@article{wang2025cure,
  title={Co-Evolving LLM Coder and Unit Tester via
         Reinforcement Learning},
  author={Wang, Yinjie and Yang, Ling and Tian, Ye
          and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.03136},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and
         Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke
          and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

@misc{openclawrl,
  author={Wang, Yinjie and Wang, Mengdi and Yang, Ling},
  title={OpenClaw-RL},
  year={2026},
  url={https://github.com/Gen-Verse/OpenClaw-RL}
}