Over the past year, our team has followed a single research thread to its logical conclusion, and the result is three tightly connected projects that, together, define a complete arc for self-evolving AI systems.
The thread starts with a simple observation: reinforcement learning works best when every component in the system can improve. In practice, most RL pipelines freeze the reward, freeze the environment, and freeze the data distribution, training only the policy. We asked: what happens when you systematically unfreeze each of these?
CURE unfreezes the reward. Instead of relying on ground-truth test cases, a coder and a unit tester co-evolve through RL, learning to supervise each other. RLAnything then unfreezes the reward model and the environment, jointly optimizing all three components in a closed loop that works across GUI agents, text-game agents, and coding LLMs. OpenClaw-RL takes the final step: it unfreezes the data itself, turning a user's everyday conversations into training signals so your personal AI agent improves simply because you use it.
Each project inherits the lessons from the one before it. CURE's pairwise reward design informed RLAnything's reward model co-optimization; RLAnything's async training infrastructure became the backbone of OpenClaw-RL's four-loop architecture. This blog walks through the three projects in sequence, with enough technical detail that you can understand and reproduce the key ideas.
The Journey So Far
CURE: When Coders and Testers Co-Evolve
Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Motivation
Mathematical reasoning in LLMs has been successfully incentivized via RL with verifiable rewards. In code generation, the standard recipe relies on ground-truth test cases: run the model's code against known tests, get a binary pass/fail reward. But ground-truth tests are expensive to curate and hard to scale. Meanwhile, unit test generation is itself a critical capability: accurate unit tests are essential for enabling self-checking and self-correction during inference, test-time scaling, and agentic coding pipelines.
This raises a natural question: can a coder and a unit tester teach each other through RL, without any ground-truth code as supervision?
Technical Approach: Pairwise Reward Matrix
CURE introduces a self-play framework where a single LLM acts as both a code generator and a unit test generator. The key insight is that during RL training, the coder naturally produces both correct and incorrect solutions. The incorrect ones are gold for the unit tester: they reveal exactly the failure modes that good tests should catch.
Core Algorithm: Reward Design via Interaction Outcomes
Given a coding task, the model generates N code solutions and M unit tests. CURE constructs a pairwise reward matrix R ∈ {0,1}^{N×M}, where R_ij = 1 if code i passes test j. From this matrix:
- Coder reward: Derived from how many tests a solution passes; solutions that pass many tests are likely correct.
- Unit tester reward: Derived from a theoretical analysis of reward precision. We derive individual-level rewards for each test by analyzing its ability to distinguish correct from incorrect code, without ever seeing ground-truth code.
- Mutual supervision: The coder's mistakes create diverse failure modes; the tester learns to catch them. Better tests then provide more accurate rewards for the coder's next iteration.
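To make the reward construction concrete, here is a toy NumPy sketch. The paper derives the tester reward from its precision analysis; the `pairwise_rewards` helper below is our own simplified stand-in that pseudo-labels solutions by pass rate and rewards tests for discriminating between them.

```python
import numpy as np

def pairwise_rewards(R, correct_threshold=0.5):
    """Toy reward computation from an N x M pass/fail matrix.

    R[i, j] = 1 if code solution i passes unit test j.
    Coder reward: each solution's pass rate. Tester reward: a simplified
    discrimination score (pass likely-correct code, fail likely-incorrect
    code), standing in for the paper's precision-derived reward.
    """
    pass_rate = R.mean(axis=1)                       # per-solution pass rate
    likely_correct = pass_rate >= correct_threshold  # pseudo-label solutions
    coder_reward = pass_rate

    # A test is rewarded for passing likely-correct solutions and
    # failing likely-incorrect ones.
    if likely_correct.any():
        pos = R[likely_correct].mean(axis=0)
    else:
        pos = np.zeros(R.shape[1])
    if (~likely_correct).any():
        neg = (1 - R[~likely_correct]).mean(axis=0)
    else:
        neg = np.zeros(R.shape[1])
    tester_reward = 0.5 * (pos + neg)
    return coder_reward, tester_reward

# 3 solutions x 3 tests: solution 1 is buggy, test 1 discriminates best.
R = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])
coder_r, tester_r = pairwise_rewards(R)
```

Note how test 0, which every solution passes, scores lower than test 1, which passes the likely-correct solutions while failing the buggy one: exactly the discrimination property the precision analysis formalizes.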
Theoretical Foundation: Reward Precision
We formulate the final objective and introduce the concept of reward precision: the probability that a unit test correctly identifies whether a given code solution is correct. Through our theoretical analysis (Sections 3.1–3.2 of the paper), we derive individual-level rewards for each generated unit test. This is non-trivial: a test that passes all solutions (including buggy ones) is useless, as is a test that fails everything. The optimal tests are those that discriminate, passing correct code while failing incorrect code.
For long chain-of-thought models, we further introduce a response-length-guided reward transformation (Section 3.4). Long-CoT models tend to produce verbose unit tests; the transformation encourages the model to maintain high test quality while reducing unnecessary reasoning overhead, achieving 64.8% inference efficiency without sacrificing accuracy.
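The paper's exact transformation is given in Section 3.4; as a rough illustration of the idea only, a length-guided transform might shrink positive rewards as responses grow, so correctness still dominates but verbosity is discouraged. The function name and the `alpha` coefficient below are our own illustrative choices.

```python
def length_guided_reward(base_reward, length, max_length, alpha=0.1):
    """Illustrative response-length-guided transformation (not the
    paper's exact formula): keep the sign of the base reward but
    shrink positive rewards for overly long responses, nudging a
    long-CoT model toward concise unit tests."""
    penalty = alpha * min(length / max_length, 1.0)
    if base_reward > 0:
        return base_reward * (1.0 - penalty)
    return base_reward  # never soften penalties for incorrect outputs
```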
Results
(Results reported over Qwen2.5-Instruct base models.)
The resulting ReasonFlux-Coder models (7B & 14B) outperform similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder across five benchmarks. Our long-CoT model ReasonFlux-Coder-4B consistently outperforms Qwen3-4B. Perhaps most excitingly, the co-evolved model can serve as an effective reward model for RL training of base models, pointing toward fully self-supervised code optimization where the trained model bootstraps reward signals for the next generation.
RLAnything: Forging the Complete RL System
Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
Four Problems That Hold Back Agentic RL
CURE showed that co-evolving two roles within a single model can work remarkably well. But it operated in a relatively clean domain: code has verifiable outputs. Agentic tasks in the wild face four critical problems:
Problems Addressed
- Sparse Rewards: Current RL with verifiable rewards (RLVR) relies on binary outcome rewards (success/fail). For agentic tasks with 30–50-step trajectories, this sparse signal is nearly useless: the agent has no idea which step went wrong.
- Static Reward Models: Existing reward models are frozen during RL training. They can't adapt to the evolving policy, leading to outdated and unreliable supervision.
- Fixed Environments: Tasks are either too hard (0% success, so no learning signal) or too easy (100% success, so no challenge). Manual curriculum design doesn't scale.
- Human Annotation Bottleneck: GUI agents need hand-crafted evaluation scripts for every task, which is fundamentally unscalable.
Technical Approach: Closed-Loop Dynamic Optimization
RLAnything jointly forges three components through closed-loop optimization:
Component 1: Integrated Feedback for the Policy
The policy receives combined signals from both verifiable outcome rewards and step-wise process rewards from the reward model. This addresses the sparse-reward problem directly: instead of a single binary signal at the end of a 50-step trajectory, each intermediate step receives meaningful supervision. Our experiments show this integrated feedback consistently outperforms outcome-only training across all benchmarks.
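One simple way to densify the signal along these lines is to broadcast the trajectory-level outcome onto every step and mix it with the per-step process rewards. This is an illustrative sketch, not RLAnything's exact combination rule; the mixing coefficient `beta` is an assumption of ours.

```python
def integrated_step_rewards(process_rewards, outcome_reward, beta=0.5):
    """Combine step-wise process rewards from a reward model with a
    single verifiable outcome reward for the whole trajectory.
    Broadcasting the outcome onto every step and mixing with a
    coefficient beta is one simple densification scheme (illustrative;
    the paper's exact combination may differ).
    """
    return [(1 - beta) * p + beta * outcome_reward for p in process_rewards]

# A 3-step trajectory that ultimately succeeded (outcome = 1.0):
rewards = integrated_step_rewards([0.2, -0.1, 0.5], 1.0, beta=0.5)
```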
Component 2: Self-Improving Reward Model
The reward model is jointly optimized with the policy via consistency feedback, combining outcome supervision with self-consistency signals. As the policy improves, the reward model adapts to the evolving distribution of trajectories. This avoids the staleness problem that plagues frozen reward models in conventional RLHF pipelines. The reward model's step-wise evaluations feed back into policy training, creating a virtuous cycle.
Component 3: Theory-Motivated Environment Adaptation
We prove theoretically that balancing task difficulty benefits both policy and reward model optimization. When tasks are too easy or too hard, the importance sampling weights become extremely unbalanced, hurting both the policy gradient estimation and the reward model training. Our framework automatically adapts environment task difficulty using critic feedback from both the policy model and the reward model, keeping tasks in the optimal learning zone.
Theoretical Insight: Why Task Difficulty Matters for Reward Models
Beyond the well-known benefit for policy training, we show that reward model training also suffers under extreme task distributions. When the policy succeeds 100% of the time, the reward model sees only positive examples and cannot learn to discriminate quality. When success is 0%, it sees no useful signal at all. Our theoretical result motivates automatic environment adaptation that benefits the entire system: environment tasks leverage critic feedback from both the policy and the reward model to drive targeted task adjustment, enabling active learning from experience.
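A minimal sketch of difficulty-based task filtering captures the core idea: keep tasks whose success rates sit in an informative band and flag the rest for adjustment. The thresholds and the `select_tasks` helper are illustrative assumptions, not RLAnything's actual adaptation mechanism, which uses critic feedback from both models.

```python
def select_tasks(success_rates, low=0.2, high=0.8):
    """Keep tasks whose recent success rate lies in a band where both
    the policy gradient and the reward model get informative signal;
    flag the rest for adjustment (e.g. decompose tasks that are too
    hard, extend tasks that are too easy). Thresholds are illustrative.
    """
    keep, adjust = [], []
    for task, rate in success_rates.items():
        (keep if low <= rate <= high else adjust).append(task)
    return keep, adjust

# Task "a" is too hard, "c" too easy; only "b" stays in the pool as-is.
keep, adjust = select_tasks({"a": 0.0, "b": 0.5, "c": 1.0})
```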
Results
(Benchmarks: GUI agents with Qwen3-VL-8B; text games with Qwen2.5-7B; LLM tasks with Qwen2.5-7B.)
Each added dynamic component consistently improves the overall system across computer-use agents, text-based LLM agents, and coding LLMs. A key finding: optimized reward-model signals outperform outcome signals that rely on human labels. This means we can train GUI agents without manual evaluation scripts for every task, a huge step toward truly self-evolving agents that learn from experience without human annotation bottlenecks.
We release both the policy models (RLAnything-7B/8B) and reward models (RLAnything-Reward-8B/14B) as open-source checkpoints for the community.
OpenClaw-RL: Your Agent Gets Better as You Use It
Yinjie Wang, Mengdi Wang, Ling Yang
CURE optimizes coding. RLAnything optimizes general agents. But what about optimizing an AI that serves you, one that learns your preferences, your communication style, and your workflow from nothing more than the conversations you already have?
OpenClaw-RL brings our RL research to where it matters most: the personal AI assistant. Built on top of OpenClaw and the Slime RL framework, it turns everyday conversations into training signals through a fully asynchronous architecture. You chat normally; in the background, the system collects trajectories, evaluates them with a process reward model, and updates the policy โ all without interrupting your experience.
Architecture: Four Asynchronous Loops
Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it decouples the system into four independent async processes:
Fully Async 4-Component Architecture
- Agent Serving: Your self-hosted model is wrapped in OpenClaw as an OpenAI-compatible API, serving real-time requests.
- Rollout Collection: API messages are automatically classified into main-line (trainable) vs. side (non-trainable) turns. The next user message serves as a natural "next-state" signal.
- PRM Judging: A process reward model evaluates each assistant turn asynchronously with majority voting for robust scoring.
- Policy Training: Ready samples are submitted to the trainer as they become available. Weight updates happen gracefully: submission pauses during model updates, then resumes.
None of these components block one another. The model serves your requests while training runs in the background.
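The four-loop decoupling can be sketched with asyncio queues. Everything here is illustrative, not OpenClaw-RL's actual implementation: the turn format, the toy "thanks"-based scoring heuristic, and the function names are all assumptions made for the sketch; real PRM judging and training are far richer.

```python
import asyncio

async def rollout_collector(chat_q, judge_q):
    """Forward trainable turns to the judge; None signals shutdown."""
    while True:
        turn = await chat_q.get()
        if turn is None:
            await judge_q.put(None)
            return
        if turn["role"] == "assistant":   # main-line, trainable turn
            await judge_q.put(turn)

async def prm_judge(judge_q, train_q):
    """Score each turn; a real PRM would use majority voting."""
    while True:
        turn = await judge_q.get()
        if turn is None:
            await train_q.put(None)
            return
        # Toy heuristic standing in for a learned process reward model.
        turn["reward"] = 1.0 if "thanks" in turn.get("next_user", "") else 0.0
        await train_q.put(turn)

async def trainer(train_q, trained):
    """Consume ready samples; appending stands in for a policy update."""
    while True:
        sample = await train_q.get()
        if sample is None:
            return
        trained.append(sample)

async def main():
    chat_q, judge_q, train_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    trained = []
    loops = asyncio.gather(
        rollout_collector(chat_q, judge_q),
        prm_judge(judge_q, train_q),
        trainer(train_q, trained),
    )
    # Simulated serving loop: turns arrive while the other loops run.
    await chat_q.put({"role": "assistant", "next_user": "thanks, that worked"})
    await chat_q.put({"role": "user"})   # side turn, filtered out
    await chat_q.put(None)               # shutdown signal
    await loops
    return trained

trained = asyncio.run(main())
```

The key property the sketch preserves is that no loop blocks another: serving can keep enqueuing turns while judging and training drain their queues at their own pace.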
Two Learning Paradigms
OpenClaw-RL offers two complementary optimization methods, each suited to different types of user feedback:
Binary RL (GRPO)
A process reward model scores each assistant turn as good (+1), bad (-1), or neutral (0) based on the subsequent user reaction. The scalar rewards drive GRPO advantage estimation with a PPO-style clipped surrogate loss. This works best when you provide implicit feedback: thumbs up/down, corrections, or simply whether you continue a line of conversation.
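GRPO's group-relative advantage estimation is straightforward to state: normalize each scalar reward by the group's mean and standard deviation. A minimal sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimation as in GRPO: normalize each
    scalar reward by the group mean and standard deviation. With PRM
    scores in {-1, 0, +1}, this converts per-turn judgments into
    relative advantages; a degenerate group (all equal) yields zeros.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

These per-sample advantages then weight the PPO-style clipped surrogate loss during the policy update.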
Engineering Details: Binary RL
- Session-aware training: Multi-turn conversations are tracked per-session with proper turn ordering.
- At-least-one guarantee: Every session contributes at least one effective training sample, ensuring no conversation is wasted.
- Majority voting: PRM evaluation uses multiple votes for robust scoring, filtering out noisy individual assessments.
On-Policy Distillation (OPD)
When your feedback contains rich textual information ("you should have checked the file first", "don't use that library"), the system extracts hindsight hints from the next-state feedback. These hints augment the original prompt to create an "enhanced teacher" whose token-level log-probability gap with the student becomes a directional advantage signal, far richer than any scalar reward.
Engineering Details: On-Policy Distillation
- Hint quality filtering: Among m majority votes, only the longest, most informative hint is selected. Trivial hints (e.g., "looks good") are discarded.
- Teacher log-prob optimization: Only response-suffix log-probabilities are computed to reduce peak GPU memory during the teacher forward pass.
- Token-level directional signal: The log-prob gap between the enhanced teacher and the current student provides gradient direction at every token position, orders of magnitude richer than a single scalar per turn.
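The token-level gap itself is easy to state. Here is a pure-Python sketch of the core computation, our own stand-in for what a real system would do with batched tensor ops over the response suffix only; the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logit row."""
    m = max(logits)
    z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - z for x in logits]

def opd_token_advantages(teacher_logits, student_logits, response_ids):
    """Token-level directional signal for on-policy distillation: the
    per-token log-probability gap between the hint-augmented 'enhanced
    teacher' and the current student, evaluated on the student's own
    response tokens. Positive where the teacher prefers that token more.
    """
    advantages = []
    for t_row, s_row, tok in zip(teacher_logits, student_logits, response_ids):
        t_logp = log_softmax(t_row)[tok]
        s_logp = log_softmax(s_row)[tok]
        advantages.append(t_logp - s_logp)
    return advantages
```

When teacher and student agree, the gap vanishes; where the hint makes the teacher more confident in a token the student chose, the gap is positive and pushes the student toward it.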
Self-Hosted & Private by Design
The entire stack (model, PRM, and training) runs on your own infrastructure. Conversation data never leaves your system. No external API keys are required. The default configuration uses 8 GPUs (4 for the training actor, 2 for rollout, 2 for the PRM), but this is fully configurable. All conversations and PRM evaluations are logged to JSONL for analysis and debugging.
Quick Start
OpenClaw-RL ships as a drop-in enhancement for OpenClaw. You start the RL server (choose Binary RL or OPD), configure OpenClaw to route to your server's OpenAI-compatible endpoint, and start chatting. The system handles the rest:
# Binary RL mode
cd slime && bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

# On-Policy Distillation mode
cd slime && bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
Your agent gets better the more you use it. That's the promise โ and with OpenClaw-RL, it's now a reality you can run on your own GPUs.
The Bigger Picture: Stop Treating RL Components as Fixed
Stepping back, these three projects share a common philosophy: stop treating parts of the RL system as fixed.
In CURE, we unfroze the reward signal: instead of relying on ground-truth test cases, the unit tester co-evolves with the coder, learning from the coder's own mistakes to provide increasingly precise reward signals. The theoretical analysis of reward precision gives principled individual-level rewards without human annotation.
In RLAnything, we unfroze the reward model and the environment alongside the policy. The reward model adapts to the evolving policy via consistency feedback; the environment adapts to the evolving capabilities of both via critic feedback. Our theoretical results show why task difficulty balance matters not just for the policy but for the reward model's training dynamics as well.
In OpenClaw-RL, we unfroze the data pipeline itself, making the user's natural behavior the source of learning signal. The fully asynchronous architecture means the feedback loop between user behavior and model improvement is continuous and non-blocking, with two complementary paradigms (scalar rewards via GRPO, directional token-level signals via OPD) covering the full spectrum of user interaction patterns.
Each step removed an assumption that was holding RL back from real-world impact. And each step made the system more autonomous, more adaptive, and more aligned with what users actually need.
Looking Ahead
Our roadmap has two tracks. Track 1 deepens personal agent optimization: broader model family support, best-recipe discovery via large-scale experiments, and extending learning beyond the policy to skills and long-term memory. Track 2 scales up agentic RL infrastructure for general agents, starting with computer-use scenarios.
The goal is a future where AI systems don't just follow instructions; they evolve with you. All code, models, and training recipes are open-sourced. We hope these tools are useful to the community, and we're excited to see what you build with them.
Citation
If you find our work useful, please consider citing:
@article{wang2025cure,
  title={Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning},
  author={Wang, Yinjie and Yang, Ling and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.03136},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

@misc{openclawrl,
  author={Wang, Yinjie and Wang, Mengdi and Yang, Ling},
  title={OpenClaw-RL},
  year={2026},
  url={https://github.com/Gen-Verse/OpenClaw-RL}
}