You correct your AI agent: "no, check the file first." The agent calls a tool and it returns an error. You ask a follow-up because the first answer was wrong. These are all next-state signals: free, naturally occurring training signals that every agent interaction produces. Yet no existing agentic RL system recovers them as a live, online learning source.
That observation motivated OpenClaw-RL: a fully asynchronous RL framework that turns everyday conversations into gradient updates, so your agent evolves simply by being used. The same framework also scales to terminal, GUI, SWE, and tool-call settings for general-purpose agent training.
But OpenClaw-RL is not an isolated project. It is the natural convergence point of several research lines our team has pursued over the past two years. This post tells the story of that convergence: how separate ideas about reward signals, structured feedback, and async systems came together into one unified framework.
Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang
Three Research Lines, One Framework
OpenClaw-RL sits at the intersection of three questions our team has been pursuing independently. Each question led to a series of projects; OpenClaw-RL is where all three answers meet.
From Process Rewards to Conversational Judgment
Good process rewards are the prerequisite for agentic RL to work at all. From ReasonFlux-PRM's chain-of-thought process rewards to RLAnything's co-evolving reward model that adapts as the policy improves, we have repeatedly validated one principle: step-level reward signals vastly outperform sparse outcome-only signals. In OpenClaw-RL, we bring this insight to its most natural form: the user's next utterance is the judgment on the agent's last turn. A PRM converts these conversational reactions into binary rewards via majority voting. No human annotation required.
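As a minimal sketch of the majority-voting step, assuming the PRM is sampled several times per turn and emits "good"/"bad" labels (the function name, labels, and tie-breaking rule here are illustrative, not OpenClaw-RL's API):

```python
from collections import Counter

def binary_reward_from_votes(votes):
    """Collapse several PRM judgments of one assistant turn into a
    single binary reward by majority vote.

    `votes` is a list of "good"/"bad" labels, e.g. from k independent
    PRM samples. Ties count as bad here, which is an assumption."""
    counts = Counter(votes)
    return 1.0 if counts["good"] > counts["bad"] else 0.0

# Example: 3 of 5 PRM samples judge the turn as good.
print(binary_reward_from_votes(["good", "bad", "good", "good", "bad"]))  # 1.0
```

Sampling the judge multiple times and voting is what makes a noisy conversational signal usable as a reward: any single PRM judgment can misread sarcasm or a topic change, but the majority is far more stable.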
From Thought Templates to Token-Level Distillation
A scalar reward tells the model "wrong," but it doesn't tell the model how to change. From Buffer of Thoughts and SuperCorrect, which distill structured thought templates into smaller models, to ReasonFlux's hierarchical reasoning library, this line has always been about converting rich feedback into learnable signal. OpenClaw-RL's On-Policy Distillation (OPD) is the culmination: extract textual hints from the next state, construct an enhanced teacher context, and provide a token-level directional advantage at every position. Not just "you were wrong," but "here's how each token should have been different."
From Co-Evolution to Zero-Blocking Infrastructure
Nobody wants to wait for an agent that's "still training." From CURE's co-evolving coder-tester system to RLAnything's closed-loop optimization of environment, policy, and reward model, we progressively learned that decoupling is everything. OpenClaw-RL takes this to its logical endpoint: four fully independent async loops (SGLang serving, environment rollout, PRM judging, Megatron training), each running at its own pace with zero coordination overhead. The model keeps serving live requests while training happens in the background.
The Core Insight: Next-State Signals Are Universal
Every agent interaction generates a next-state signal: the user's reply, a tool output, a terminal state change, a GUI transition. These signals are ubiquitous yet wasted: no existing system recovers them as live learning sources. OpenClaw-RL is built on a simple but powerful observation: these next-state signals encode two complementary forms of information, and you need both.
Evaluative Signals: Binary RL
The first form is evaluative: how well did the action perform? A PRM judge reads the next-state signal and assigns a scalar reward. In the personal-agent setting, the user's next message itself is the judgment: a correction means the last turn was bad; continued engagement means it was good. The scalar rewards drive GRPO advantage estimation with a PPO-style clipped surrogate loss. This covers broad interaction patterns with reliable, if coarse, supervision.
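The scalar path can be sketched in a few lines. This is a generic GRPO/PPO illustration under standard definitions, not OpenClaw-RL's implementation; the helper names are made up:

```python
import math

def grpo_advantages(rewards):
    """GRPO: normalize each rollout's scalar reward against the mean
    and std of its group (no learned value function)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(adv, logp_new, logp_old, eps=0.2):
    """PPO-style clipped surrogate objective for one token
    (to be maximized; negate for a loss)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped * adv)

# A group of 4 rollouts where 3 succeeded and 1 failed:
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))
```

The failed rollout gets a negative advantage and the successes a positive one; every token in a turn then shares that single per-turn scalar, which is exactly the coarseness the directive signal below addresses.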
Directive Signals: On-Policy Distillation
The second form is directive: how should the action have been different? When the next state contains rich textual information ("you should have checked the file first," or a tool returning a detailed error trace), the system extracts hindsight hints and uses them to construct an enhanced teacher context. The log-probability gap between this enhanced teacher and the current student becomes a token-level directional advantage signal, orders of magnitude richer than any scalar reward.
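The core of that gap computation fits in one function. A minimal sketch, assuming both models have already scored the same sampled student tokens (the function name and the exact use of the raw gap are assumptions; the released recipe may normalize or clip it):

```python
def opd_token_advantages(student_logps, teacher_logps):
    """Per-token directional advantage for On-Policy Distillation.

    Both inputs are log-probs assigned to the SAME tokens sampled from
    the student: `student_logps` under the current student context, and
    `teacher_logps` under the enhanced teacher context that has seen the
    hindsight hint. Positive values mark tokens the hint-aware teacher
    finds more likely than the student did."""
    return [t - s for s, t in zip(student_logps, teacher_logps)]

# Token 3 was very unlikely for the student but likely once the hint
# is visible, so it receives the strongest positive push.
print(opd_token_advantages([-1.2, -0.5, -3.0], [-1.0, -0.4, -0.2]))
```

Because the teacher is the same model with extra context rather than a separate larger model, the signal stays on-policy: it corrects the student's own sampled tokens position by position instead of scoring the whole turn at once.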
Why Both Signals Are Essential
Binary RL has broad coverage (every turn can receive a score) but the signal is coarse: a single scalar per turn. OPD provides dense, token-level guidance, but only when the next state contains sufficiently informative hints, making it sparser in coverage. The two are fundamentally complementary: evaluative signals tell the model what went wrong; directive signals tell it how to fix it.
The Combination: Better Together
The complementarity is not just theoretical. When we combine Binary RL and OPD in a unified training recipe, the results speak for themselves:
[Results, personalization score: the two individual recipes reach 0.17 and 0.24; the combined recipe reaches 0.76.]
The jump from 0.17/0.24 to 0.76 is striking. Binary RL provides broad scalar supervision across all turns; OPD injects dense token-level corrections where rich feedback is available. The combination leverages the coverage of one and the precision of the other. Evaluative signal and directive signal are two sides of the same next-state coin: you need both.
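One plausible way to read "combined" is a per-turn mixing rule: every turn contributes the Binary-RL term, and turns whose next state yielded a usable hint additionally contribute the OPD term. This gating and the `alpha` weight are assumptions for illustration, not the published recipe:

```python
def combined_loss(rl_loss, opd_loss, has_hint, alpha=1.0):
    """Hypothetical per-turn objective mixing the two signals.

    `rl_loss` is the Binary-RL surrogate for the turn (always present),
    `opd_loss` the token-level distillation term, and `has_hint` whether
    the next state produced an extractable hindsight hint. `alpha` is an
    assumed trade-off weight."""
    return rl_loss + (alpha * opd_loss if has_hint else 0.0)

print(combined_loss(1.0, 0.5, has_hint=True))   # 1.5
print(combined_loss(1.0, 0.5, has_hint=False))  # 1.0
```

The gating matches the coverage argument above: the scalar term supervises every turn, while the dense term fires only where the next state is informative enough to support it.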
Architecture: Four Async Loops, Zero Blocking
The engineering lesson from CURE to RLAnything to OpenClaw-RL has been consistent: decouple everything. OpenClaw-RL runs four fully independent async components built on the Slime RL framework:
Fully Decoupled 4-Component Architecture
- Agent Serving (SGLang): The policy model is wrapped in OpenClaw as an OpenAI-compatible API, serving live user requests in real time.
- Rollout Collection: Multi-turn conversations are automatically tracked per-session. API messages are classified into main-line (trainable) vs. side (non-trainable) turns. The next user/tool/environment response serves as the natural next-state signal.
- PRM Judging: A process reward model evaluates each assistant turn asynchronously. Majority voting across multiple assessments ensures robust scoring.
- Policy Training (Megatron): Ready samples are submitted to the trainer as they become available. Weight updates are graceful: submission pauses during checkpointing, then resumes automatically.
None of these components block one another. Your agent keeps serving requests while training runs in the background. This is the same design principle we validated in CURE's co-evolving system and RLAnything's closed-loop optimization, but taken to its engineering extreme.
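The zero-blocking idea reduces to independent loops connected only by queues. A toy sketch: it collapses serving and rollout collection into one stage, replaces SGLang, the PRM, and Megatron with stand-ins, and uses a `None` sentinel for shutdown (all names and the dummy reward are illustrative):

```python
import asyncio

async def serve(rollouts):
    """Stand-in for the serving + rollout loop: emits finished turns."""
    for turn in ["t1", "t2", "t3"]:
        await rollouts.put(turn)
    await rollouts.put(None)  # end-of-stream sentinel

async def judge(rollouts, scored):
    """Stand-in for the PRM loop: scores turns as they arrive."""
    while (turn := await rollouts.get()) is not None:
        await scored.put((turn, 1.0))  # dummy reward
    await scored.put(None)

async def train(scored, seen):
    """Stand-in for the trainer loop: consumes ready samples."""
    while (sample := await scored.get()) is not None:
        seen.append(sample)

async def main():
    rollouts, scored, seen = asyncio.Queue(), asyncio.Queue(), []
    # The loops run concurrently; each waits only on its own input
    # queue, never on another loop's internal progress.
    await asyncio.gather(serve(rollouts), judge(rollouts, scored),
                         train(scored, seen))
    return seen

print(asyncio.run(main()))  # [('t1', 1.0), ('t2', 1.0), ('t3', 1.0)]
```

The key property is that each stage backs off only on an empty queue, so a slow trainer never stalls serving, which is the behavior the four-loop architecture scales up across processes and machines.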
Personal Agents and General Agents, Unified
A key design decision in OpenClaw-RL is that personal and general agent training share the same infrastructure. The difference is not in the system; it's in the source of next-state signals.
Track 1: Personal Agent Optimization
In the personal-agent setting, the model is deployed as your everyday coding assistant via OpenClaw. You chat normally; the system intercepts live multi-turn conversations in the background. Your re-queries, corrections, and explicit feedback are all automatically recovered as training signals. The agent improves simply by being used: no manual labeling, no batch data collection, no training interruption. Everything runs on your own infrastructure, fully self-hosted and private.
Track 2: General Agent Training at Scale
The same async infrastructure powers scalable RL for general-purpose agents. We provide ready-to-use implementations across four real-world settings (terminal, GUI, SWE, and tool-call) with large-scale environment parallelization. In these settings, next-state signals come from environment state changes rather than user feedback, but the learning loop is identical: serve, collect, judge, train.
Supported Real-World Settings (Track 2)
- Terminal RL: Agents interact with real shell environments; terminal output serves as next-state signal.
- GUI RL: Agents operate desktop environments (via OSWorld); screenshot state transitions provide feedback.
- SWE RL: Agents resolve real GitHub issues; test suite results and code review signals drive learning.
- Tool-Call RL: Agents invoke external APIs; tool outputs and error traces inform optimization.
Looking Back: A Natural Convergence
When we started CURE, the question was narrow: can a coder and a unit tester teach each other without ground truth? When we built RLAnything, the question expanded: can we jointly optimize the environment, reward model, and policy? When we developed ReasonFlux and Buffer of Thoughts, the question was: how do we convert structured reasoning feedback into learnable signals?
In hindsight, all of these were different angles on the same problem: how to extract maximal learning signal from every interaction, and do it without stopping the system. OpenClaw-RL is where these answers converge โ process rewards from the ReasonFlux line, structured feedback distillation from the BoT/SuperCorrect line, and fully async co-evolution from the CURE/RLAnything line, all fused into a single framework that serves both personal and general agents.
Every conversation is a training opportunity. Every tool error is a reward signal. Every correction is a distillation target. OpenClaw-RL just makes sure none of them go to waste.
Getting Started
OpenClaw-RL ships multiple training recipes. Pick the one that matches your setup:
```shell
# Personal agent: Binary RL
cd slime && bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

# Personal agent: On-Policy Distillation
cd slime && bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

# Personal agent: Combined (recommended)
cd slime && bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

# General agent: Terminal RL
cd slime && bash ../terminal-rl/terminal_qwen3_8b_rl.sh

# General agent: Tool-Call RL
cd slime && bash ../toolcall-rl/retool_qwen3_4b_rl.sh
```
LoRA training and Tinker cloud deployment are also supported. See the GitHub repository for full documentation.
Citation
If you find our work useful, please consider citing:
@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2025cure,
  title={Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning},
  author={Wang, Yinjie and Yang, Ling and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.03136},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}