Over the past year, our team has followed a single research thread to its logical conclusion, and the result is three tightly connected projects that, together, define a complete arc for self-evolving AI systems.
The thread starts with a simple observation: reinforcement learning works best when every component in the system can improve. In practice, most RL pipelines freeze the reward, freeze the environment, and freeze the data distribution, training only the policy. We asked: what happens when you systematically unfreeze each of these?
CURE unfreezes the reward. Instead of relying on ground-truth test cases, a coder and a unit tester co-evolve through RL, learning to supervise each other. RLAnything then unfreezes the reward model and the environment, jointly optimizing all three components in a closed loop that works across GUI agents, text-game agents, and coding LLMs. OpenClaw-RL takes the final step: it unfreezes the data itself, turning a user's everyday conversations into training signals so your personal AI agent improves simply because you use it.
Each project inherits the lessons from the one before it. CURE's pairwise reward design informed RLAnything's reward model co-optimization; RLAnything's async training infrastructure became the backbone of OpenClaw-RL's four-loop architecture. This blog walks through the three projects in sequence, with enough technical detail that you can understand and reproduce the key ideas.
The Journey So Far
CURE: When Coders and Testers Co-Evolve
Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Motivation
Mathematical reasoning in LLMs has been successfully incentivized via RL with verifiable rewards. In code generation, the standard recipe relies on ground-truth test cases: run the model's code against known tests, get a binary pass/fail reward. But ground-truth tests are expensive to curate and hard to scale. Meanwhile, unit test generation is itself a critical capability: accurate unit tests are essential for enabling self-checking and self-correction during inference, test-time scaling, and agentic coding pipelines.
This raises a natural question: can a coder and a unit tester teach each other through RL, without any ground-truth code as supervision?
Technical Approach: Pairwise Reward Matrix
CURE introduces a self-play framework where a single LLM acts as both a code generator and a unit test generator. The key insight is that during RL training, the coder naturally produces both correct and incorrect solutions. The incorrect ones are gold for the unit tester: they reveal exactly the failure modes that good tests should catch.
Core Algorithm: Reward Design via Interaction Outcomes
Given a coding task, the model generates N code solutions and M unit tests. CURE constructs a pairwise reward matrix R ∈ {0,1}^{N×M}, where R_ij = 1 if code i passes test j. From this matrix:
- Coder reward: Derived from how many tests a solution passes; solutions that pass many tests are likely correct.
- Unit tester reward: Derived from a theoretical analysis of reward precision. We derive individual-level rewards for each test by analyzing its ability to distinguish correct from incorrect code, without ever seeing ground-truth code.
- Mutual supervision: The coder's mistakes create diverse failure modes; the tester learns to catch them. Better tests then provide more accurate rewards for the coder's next iteration.
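To make the reward construction concrete, here is a toy NumPy sketch. The paper derives the tester reward from its precision analysis; the `pairwise_rewards` helper below is our own simplified stand-in that pseudo-labels solutions by pass rate and rewards tests for discriminating between them.

```python
import numpy as np

def pairwise_rewards(R, correct_threshold=0.5):
    """Toy reward computation from an N x M pass/fail matrix.

    R[i, j] = 1 if code solution i passes unit test j.
    Coder reward: each solution's pass rate. Tester reward: a simplified
    discrimination score (pass likely-correct code, fail likely-incorrect
    code), standing in for the paper's precision-derived reward.
    """
    pass_rate = R.mean(axis=1)                       # per-solution pass rate
    likely_correct = pass_rate >= correct_threshold  # pseudo-label solutions
    coder_reward = pass_rate

    # A test is rewarded for passing likely-correct solutions and
    # failing likely-incorrect ones.
    if likely_correct.any():
        pos = R[likely_correct].mean(axis=0)
    else:
        pos = np.zeros(R.shape[1])
    if (~likely_correct).any():
        neg = (1 - R[~likely_correct]).mean(axis=0)
    else:
        neg = np.zeros(R.shape[1])
    tester_reward = 0.5 * (pos + neg)
    return coder_reward, tester_reward

# 3 solutions x 3 tests: solution 1 is buggy, test 1 discriminates best.
R = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])
coder_r, tester_r = pairwise_rewards(R)
```

Note how test 0, which every solution passes, scores lower than test 1, which passes the likely-correct solutions while failing the buggy one: exactly the discrimination property the precision analysis formalizes.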
Theoretical Foundation: Reward Precision
We formulate the final objective and introduce the concept of reward precision: the probability that a unit test correctly identifies whether a given code solution is correct. Through our theoretical analysis (Sections 3.1–3.2 of the paper), we derive individual-level rewards for each generated unit test. This is non-trivial: a test that passes all solutions (including buggy ones) is useless, as is a test that fails everything. The optimal tests are those that discriminate, passing correct code while failing incorrect code.
For long chain-of-thought models, we further introduce a response-length-guided reward transformation (Section 3.4). Long-CoT models tend to produce verbose unit tests; the transformation encourages the model to maintain high test quality while reducing unnecessary reasoning overhead, achieving 64.8% inference efficiency without sacrificing accuracy.
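The paper's exact transformation is given in Section 3.4; as a rough illustration of the idea only, a length-guided transform might shrink positive rewards as responses grow, so correctness still dominates but verbosity is discouraged. The function name and the `alpha` coefficient below are our own illustrative choices.

```python
def length_guided_reward(base_reward, length, max_length, alpha=0.1):
    """Illustrative response-length-guided transformation (not the
    paper's exact formula): keep the sign of the base reward but
    shrink positive rewards for overly long responses, nudging a
    long-CoT model toward concise unit tests."""
    penalty = alpha * min(length / max_length, 1.0)
    if base_reward > 0:
        return base_reward * (1.0 - penalty)
    return base_reward  # never soften penalties for incorrect outputs
```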
Results
(Results reported over Qwen2.5-Instruct base models.)
The resulting ReasonFlux-Coder models (7B & 14B) outperform similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder across five benchmarks. Our long-CoT model ReasonFlux-Coder-4B consistently outperforms Qwen3-4B. Perhaps most excitingly, the co-evolved model can serve as an effective reward model for RL training of base models, pointing toward fully self-supervised code optimization where the trained model bootstraps reward signals for the next generation.
RLAnything: Forging the Complete RL System
Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
Four Problems That Hold Back Agentic RL
CURE showed that co-evolving two roles within a single model can work remarkably well. But it operated in a relatively clean domain: code has verifiable outputs. Agentic tasks in the wild face four critical problems:
Problems Addressed
- Sparse Rewards: Current RL with verifiable rewards (RLVR) relies on binary outcome rewards (success/fail). For agentic tasks with 30–50-step trajectories, this sparse signal is nearly useless: the agent has no idea which step went wrong.
- Static Reward Models: Existing reward models are frozen during RL training. They can't adapt to the evolving policy, leading to outdated and unreliable supervision.
- Fixed Environments: Tasks are either too hard (0% success, so no learning signal) or too easy (100% success, so no challenge). Manual curriculum design doesn't scale.
- Human Annotation Bottleneck: GUI agents need hand-crafted evaluation scripts for every task, which is fundamentally unscalable.
Technical Approach: Closed-Loop Dynamic Optimization
RLAnything jointly forges three components through closed-loop optimization:
Component 1: Integrated Feedback for the Policy
The policy receives combined signals from both verifiable outcome rewards and step-wise process rewards from the reward model. This addresses the sparse-reward problem directly: instead of a single binary signal at the end of a 50-step trajectory, each intermediate step receives meaningful supervision. Our experiments show this integrated feedback consistently outperforms outcome-only training across all benchmarks.
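One simple way to densify the signal along these lines is to broadcast the trajectory-level outcome onto every step and mix it with the per-step process rewards. This is an illustrative sketch, not RLAnything's exact combination rule; the mixing coefficient `beta` is an assumption of ours.

```python
def integrated_step_rewards(process_rewards, outcome_reward, beta=0.5):
    """Combine step-wise process rewards from a reward model with a
    single verifiable outcome reward for the whole trajectory.
    Broadcasting the outcome onto every step and mixing with a
    coefficient beta is one simple densification scheme (illustrative;
    the paper's exact combination may differ).
    """
    return [(1 - beta) * p + beta * outcome_reward for p in process_rewards]

# A 3-step trajectory that ultimately succeeded (outcome = 1.0):
rewards = integrated_step_rewards([0.2, -0.1, 0.5], 1.0, beta=0.5)
```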
Component 2: Self-Improving Reward Model
The reward model is jointly optimized with the policy via consistency feedback, combining outcome supervision with self-consistency signals. As the policy improves, the reward model adapts to the evolving distribution of trajectories. This avoids the staleness problem that plagues frozen reward models in conventional RLHF pipelines. The reward model's step-wise evaluations feed back into policy training, creating a virtuous cycle.
Component 3: Theory-Motivated Environment Adaptation
We prove theoretically that balancing task difficulty benefits both policy and reward model optimization. When tasks are too easy or too hard, the importance sampling weights become extremely unbalanced, hurting both the policy gradient estimation and the reward model training. Our framework automatically adapts environment task difficulty using critic feedback from both the policy model and the reward model, keeping tasks in the optimal learning zone.
Theoretical Insight: Why Task Difficulty Matters for Reward Models
Beyond the well-known benefit for policy training, we show that reward model training also suffers under extreme task distributions. When the policy succeeds 100% of the time, the reward model sees only positive examples and cannot learn to discriminate quality. When success is 0%, it sees no useful signal at all. Our theoretical result motivates automatic environment adaptation that benefits the entire system: environment tasks leverage critic feedback from both the policy and the reward model to drive targeted task adjustment, enabling active learning from experience.
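A minimal sketch of difficulty-based task filtering captures the core idea: keep tasks whose success rates sit in an informative band and flag the rest for adjustment. The thresholds and the `select_tasks` helper are illustrative assumptions, not RLAnything's actual adaptation mechanism, which uses critic feedback from both models.

```python
def select_tasks(success_rates, low=0.2, high=0.8):
    """Keep tasks whose recent success rate lies in a band where both
    the policy gradient and the reward model get informative signal;
    flag the rest for adjustment (e.g. decompose tasks that are too
    hard, extend tasks that are too easy). Thresholds are illustrative.
    """
    keep, adjust = [], []
    for task, rate in success_rates.items():
        (keep if low <= rate <= high else adjust).append(task)
    return keep, adjust

# Task "a" is too hard, "c" too easy; only "b" stays in the pool as-is.
keep, adjust = select_tasks({"a": 0.0, "b": 0.5, "c": 1.0})
```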
Results
(Benchmarks: GUI agents with Qwen3-VL-8B; text games with Qwen2.5-7B; LLM tasks with Qwen2.5-7B.)
Each added dynamic component consistently improves the overall system across computer-use agents, text-based LLM agents, and coding LLMs. A key finding: optimized reward-model signals outperform outcome signals that rely on human labels. This means we can train GUI agents without manual evaluation scripts for every task, a huge step toward truly self-evolving agents that learn from experience without human annotation bottlenecks.
We release both the policy models (RLAnything-7B/8B) and reward models (RLAnything-Reward-8B/14B) as open-source checkpoints for the community.
OpenClaw-RL: Your Agent Gets Better as You Use It
Yinjie Wang, Mengdi Wang, Ling Yang
CURE optimizes coding. RLAnything optimizes general agents. But what about optimizing an AI that serves you, one that learns your preferences, your communication style, and your workflow from nothing more than the conversations you already have?
OpenClaw-RL brings our RL research to where it matters most: the personal AI assistant. Built on top of OpenClaw and the Slime RL framework, it turns everyday conversations into training signals through a fully asynchronous architecture. You chat normally; in the background, the system collects trajectories, evaluates them with a process reward model, and updates the policy โ all without interrupting your experience.
Architecture: Four Asynchronous Loops
Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it decouples the system into four independent async processes:
Fully Async 4-Component Architecture
- Agent Serving: Your self-hosted model is wrapped in OpenClaw as an OpenAI-compatible API, serving real-time requests.
- Rollout Collection: API messages are automatically classified into main-line (trainable) vs. side (non-trainable) turns. The next user message serves as a natural "next-state" signal.
- PRM Judging: A process reward model evaluates each assistant turn asynchronously with majority voting for robust scoring.
- Policy Training: Ready samples are submitted to the trainer as they become available. Weight updates happen gracefully: submission pauses during model updates, then resumes.
None of these components block one another. The model serves your requests while training runs in the background.
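The four-loop decoupling can be sketched with asyncio queues. Everything here is illustrative, not OpenClaw-RL's actual implementation: the turn format, the toy "thanks"-based scoring heuristic, and the function names are all assumptions made for the sketch; real PRM judging and training are far richer.

```python
import asyncio

async def rollout_collector(chat_q, judge_q):
    """Forward trainable turns to the judge; None signals shutdown."""
    while True:
        turn = await chat_q.get()
        if turn is None:
            await judge_q.put(None)
            return
        if turn["role"] == "assistant":   # main-line, trainable turn
            await judge_q.put(turn)

async def prm_judge(judge_q, train_q):
    """Score each turn; a real PRM would use majority voting."""
    while True:
        turn = await judge_q.get()
        if turn is None:
            await train_q.put(None)
            return
        # Toy heuristic standing in for a learned process reward model.
        turn["reward"] = 1.0 if "thanks" in turn.get("next_user", "") else 0.0
        await train_q.put(turn)

async def trainer(train_q, trained):
    """Consume ready samples; appending stands in for a policy update."""
    while True:
        sample = await train_q.get()
        if sample is None:
            return
        trained.append(sample)

async def main():
    chat_q, judge_q, train_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    trained = []
    loops = asyncio.gather(
        rollout_collector(chat_q, judge_q),
        prm_judge(judge_q, train_q),
        trainer(train_q, trained),
    )
    # Simulated serving loop: turns arrive while the other loops run.
    await chat_q.put({"role": "assistant", "next_user": "thanks, that worked"})
    await chat_q.put({"role": "user"})   # side turn, filtered out
    await chat_q.put(None)               # shutdown signal
    await loops
    return trained

trained = asyncio.run(main())
```

The key property the sketch preserves is that no loop blocks another: serving can keep enqueuing turns while judging and training drain their queues at their own pace.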
Two Learning Paradigms
OpenClaw-RL offers two complementary optimization methods, each suited to different types of user feedback:
Binary RL (GRPO)
A process reward model scores each assistant turn as good (+1), bad (-1), or neutral (0) based on the subsequent user reaction. The scalar rewards drive GRPO advantage estimation with a PPO-style clipped surrogate loss. This works best when you provide implicit feedback: thumbs up/down, corrections, or simply whether you continue a line of conversation.
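GRPO's group-relative advantage estimation is straightforward to state: normalize each scalar reward by the group's mean and standard deviation. A minimal sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimation as in GRPO: normalize each
    scalar reward by the group mean and standard deviation. With PRM
    scores in {-1, 0, +1}, this converts per-turn judgments into
    relative advantages; a degenerate group (all equal) yields zeros.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

These per-sample advantages then weight the PPO-style clipped surrogate loss during the policy update.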
Engineering Details: Binary RL
- Session-aware training: Multi-turn conversations are tracked per-session with proper turn ordering.
- At-least-one guarantee: Every session contributes at least one effective training sample, ensuring no conversation is wasted.
- Majority voting: PRM evaluation uses multiple votes for robust scoring, filtering out noisy individual assessments.
On-Policy Distillation (OPD)
When your feedback contains rich textual information ("you should have checked the file first", "don't use that library"), the system extracts hindsight hints from the next-state feedback. These hints augment the original prompt to create an "enhanced teacher" whose token-level log-probability gap with the student becomes a directional advantage signal, far richer than any scalar reward.
Engineering Details: On-Policy Distillation
- Hint quality filtering: Among m majority votes, only the longest, most informative hint is selected. Trivial hints (e.g., "looks good") are discarded.
- Teacher log-prob optimization: Only response-suffix log-probabilities are computed to reduce peak GPU memory during the teacher forward pass.
- Token-level directional signal: The log-prob gap between the enhanced teacher and the current student provides gradient direction at every token position, orders of magnitude richer than a single scalar per turn.
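The token-level gap itself is easy to state. Here is a pure-Python sketch of the core computation, our own stand-in for what a real system would do with batched tensor ops over the response suffix only; the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logit row."""
    m = max(logits)
    z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - z for x in logits]

def opd_token_advantages(teacher_logits, student_logits, response_ids):
    """Token-level directional signal for on-policy distillation: the
    per-token log-probability gap between the hint-augmented 'enhanced
    teacher' and the current student, evaluated on the student's own
    response tokens. Positive where the teacher prefers that token more.
    """
    advantages = []
    for t_row, s_row, tok in zip(teacher_logits, student_logits, response_ids):
        t_logp = log_softmax(t_row)[tok]
        s_logp = log_softmax(s_row)[tok]
        advantages.append(t_logp - s_logp)
    return advantages
```

When teacher and student agree, the gap vanishes; where the hint makes the teacher more confident in a token the student chose, the gap is positive and pushes the student toward it.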
Self-Hosted & Private by Design
The entire stack (model, PRM, and training) runs on your own infrastructure. Conversation data never leaves your system. No external API keys are required. The default configuration uses 8 GPUs (4 for the training actor, 2 for rollout, 2 for the PRM), but this is fully configurable. All conversations and PRM evaluations are logged to JSONL for analysis and debugging.
Quick Start
OpenClaw-RL ships as a drop-in enhancement for OpenClaw. You start the RL server (choose Binary RL or OPD), configure OpenClaw to route to your server's OpenAI-compatible endpoint, and start chatting. The system handles the rest:
# Binary RL mode
cd slime && bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

# On-Policy Distillation mode
cd slime && bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
Your agent gets better the more you use it. That's the promise โ and with OpenClaw-RL, it's now a reality you can run on your own GPUs.
The Bigger Picture: Stop Treating RL Components as Fixed
Stepping back, these three projects share a common philosophy: stop treating parts of the RL system as fixed.
In CURE, we unfroze the reward signal: instead of relying on ground-truth test cases, the unit tester co-evolves with the coder, learning from the coder's own mistakes to provide increasingly precise reward signals. The theoretical analysis of reward precision gives principled individual-level rewards without human annotation.
In RLAnything, we unfroze the reward model and the environment alongside the policy. The reward model adapts to the evolving policy via consistency feedback; the environment adapts to the evolving capabilities of both via critic feedback. Our theoretical results show why task difficulty balance matters not just for the policy but for the reward model's training dynamics as well.
In OpenClaw-RL, we unfroze the data pipeline itself, making the user's natural behavior the source of learning signal. The fully asynchronous architecture means the feedback loop between user behavior and model improvement is continuous and non-blocking, with two complementary paradigms (scalar rewards via GRPO, directional token-level signals via OPD) covering the full spectrum of user interaction patterns.
Each step removed an assumption that was holding RL back from real-world impact. And each step made the system more autonomous, more adaptive, and more aligned with what users actually need.
Looking Ahead
Our roadmap has two tracks. Track 1 deepens personal agent optimization: broader model family support, best-recipe discovery via large-scale experiments, and extending learning beyond the policy to skills and long-term memory. Track 2 scales up agentic RL infrastructure for general agents, starting with computer-use scenarios.
The goal is a future where AI systems don't just follow instructions; they evolve with you. All code, models, and training recipes are open-sourced. We hope these tools are useful to the community, and we're excited to see what you build with them.
Citation
If you find our work useful, please consider citing:
@article{wang2025cure,
  title={Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning},
  author={Wang, Yinjie and Yang, Ling and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.03136},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

@misc{openclawrl,
  author={Wang, Yinjie and Wang, Mengdi and Yang, Ling},
  title={OpenClaw-RL},
  year={2026},
  url={https://github.com/Gen-Verse/OpenClaw-RL}
}