← All Posts
March 2026 · Research Blog

OpenClaw-RL: Let Your Agent Evolve Through Conversation

Every interaction with an agent produces a free training signal: a correction, a tool error, a follow-up question. OpenClaw-RL is the framework that finally recovers all of them. This post tells the story of how our team's research lines in process rewards, structured feedback, and async RL systems converge into a single, unified framework.

Ling Yang

Gen-Verse · Princeton AI Lab · Princeton University

🦞 OpenClaw-RL · #1 HF Daily Papers 🎯 ReasonFlux-PRM 💡 Buffer of Thoughts ⚡ RLAnything

You correct your AI agent: "no, check the file first." The agent calls a tool; it returns an error. You ask a follow-up because the first answer was wrong. These are all next-state signals: free, naturally occurring training signals that every agent interaction produces. Yet no existing agentic RL system recovers them as a live, online learning source.

That observation motivated OpenClaw-RL: a fully asynchronous RL framework that turns everyday conversations into gradient updates, so your agent evolves simply by being used. The same framework also scales to terminal, GUI, SWE, and tool-call settings for general-purpose agent training.

But OpenClaw-RL is not an isolated project. It is the natural convergence point of several research lines our team has pursued over the past two years. This post tells the story of that convergence: how separate ideas about reward signals, structured feedback, and async systems came together into one unified framework.

Technical Report · March 2026 · #1 HF Daily Papers
OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

Three Research Lines, One Framework

OpenClaw-RL sits at the intersection of three questions our team has been pursuing independently. Each question led to a series of projects; OpenClaw-RL is where all three answers meet.

Line 1: How to give reward signals

From Process Rewards to Conversational Judgment

Good process reward is the prerequisite for agentic RL to work at all. From ReasonFlux-PRM's chain-of-thought process rewards, to RLAnything's co-evolving reward model that adapts as the policy improves, we have repeatedly validated one principle: step-level reward signals vastly outperform sparse outcome-only signals. In OpenClaw-RL, we bring this insight to its most natural form: the user's next utterance is the judgment on the agent's last turn. A PRM converts these conversational reactions into binary rewards via majority voting. No human annotation required.
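As a concrete sketch of that voting step, here is a minimal, hypothetical illustration of how several PRM judgments of one turn could collapse into a single binary reward. The `judge_turn` helper and the "good"/"bad" labels are our own invention, not OpenClaw-RL's actual API:

```python
from collections import Counter

def judge_turn(prm_votes: list[str]) -> int:
    """Convert several PRM readings of the user's next utterance
    into one binary reward via majority voting.

    Each vote is 'good' (user continued normally) or 'bad'
    (user issued a correction or re-query). Labels and the
    voting scheme are illustrative, not OpenClaw-RL's exact API.
    """
    counts = Counter(prm_votes)
    return 1 if counts["good"] > counts["bad"] else 0

# Three independent PRM samples on the same turn; the majority wins.
print(judge_turn(["good", "good", "bad"]))  # -> 1
print(judge_turn(["bad", "bad", "good"]))   # -> 0
```

Sampling the judge several times and voting is a standard way to make a noisy LLM-based reward more stable than any single judgment.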

Line 2: How to use structured feedback

From Structured Reasoning to Hindsight Hints

Not every piece of feedback reduces to a score. From Buffer of Thoughts' structured thought templates to the SuperCorrect line's teacher-guided correction, we kept asking the same question: how do we convert structured reasoning feedback into learnable signals? OpenClaw-RL's answer is on-policy distillation: when the next state carries rich textual feedback, the system extracts hindsight hints and distills them back into the policy token by token.

Line 3: The system must be fully asynchronous

From Co-Evolution to Zero-Blocking Infrastructure

Nobody wants to wait for an agent that's "still training." From CURE's co-evolving coder-tester system to RLAnything's closed-loop optimization of environment, policy, and reward model, we progressively learned that decoupling is everything. OpenClaw-RL takes this to its logical endpoint: four fully independent async loops (SGLang serving, environment rollout, PRM judging, Megatron training), each running at its own pace with zero coordination overhead. The model keeps serving live requests while training happens in the background.
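The four-loop design can be miniaturized with `asyncio` queues. This is an illustrative stand-in, not OpenClaw-RL code: the loop bodies are placeholders, but the shape is the point, since each stage consumes from the previous one at its own pace and never blocks the others:

```python
import asyncio

async def serve(rollout_q):                 # stand-in for the SGLang serving loop
    for turn in ["fix bug", "add test"]:
        await rollout_q.put({"prompt": turn, "response": "..."})
    await rollout_q.put(None)               # sentinel: no more turns

async def rollout(rollout_q, judge_q):      # stand-in for the environment rollout loop
    while (item := await rollout_q.get()) is not None:
        await judge_q.put(item)
    await judge_q.put(None)

async def judge(judge_q, train_q):          # stand-in for the PRM judging loop
    while (item := await judge_q.get()) is not None:
        item["reward"] = 1                  # placeholder score
        await train_q.put(item)
    await train_q.put(None)

async def train(train_q, out):              # stand-in for the Megatron training loop
    while (item := await train_q.get()) is not None:
        out.append(item["reward"])

async def main():
    qs = [asyncio.Queue() for _ in range(3)]
    out = []
    await asyncio.gather(serve(qs[0]), rollout(qs[0], qs[1]),
                         judge(qs[1], qs[2]), train(qs[2], out))
    return out

print(asyncio.run(main()))  # -> [1, 1]
```

In the real system each stage would run on its own hardware at its own clock rate; the queues are what let a slow trainer coexist with a fast server.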

The Core Insight: Next-State Signals Are Universal

Every agent interaction generates a next-state signal: the user's reply, a tool output, a terminal state change, a GUI transition. These signals are ubiquitous yet wasted: no existing system recovers them as live learning sources. OpenClaw-RL is built on a simple but powerful observation: these next-state signals encode two complementary forms of information, and you need both.

Evaluative Signals → Binary RL

The first form is evaluative: how well did the action perform? A PRM judge reads the next-state signal and assigns a scalar reward. In the personal-agent setting, the user's next message itself is the judgment: a correction means the last turn was bad; continued engagement means it was good. The scalar rewards drive GRPO advantage estimation with PPO-style clipped surrogate loss. This covers broad interaction patterns with reliable, if coarse, supervision.
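To make the GRPO-plus-clipping step concrete, here is a minimal sketch, with illustrative numbers of our own choosing, of group-relative advantage estimation and a PPO-style clipped surrogate for a single token:

```python
import math

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: normalize each rollout's
    reward by the mean and std of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8            # epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one token: take the more
    pessimistic of the unclipped and ratio-clipped terms."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# A group of 4 rollouts with binary rewards from the PRM judge.
advs = grpo_advantages([1, 0, 1, 1])
print([round(a, 2) for a in advs])  # -> [0.58, -1.73, 0.58, 0.58]
```

Because the advantage is computed within a sampled group rather than against a learned value function, a single binary reward per turn is enough to produce a usable gradient signal.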

Directive Signals → On-Policy Distillation

The second form is directive: how should the action have been different? When the next state contains rich textual information, such as "you should have checked the file first" or a tool returning a detailed error trace, the system extracts hindsight hints and uses them to construct an enhanced teacher context. The log-probability gap between this enhanced teacher and the current student becomes a token-level directional advantage signal, orders of magnitude richer than any scalar reward.
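The log-probability gap can be sketched in a few lines. The probabilities below are invented for illustration; in the real system they would come from teacher and student forward passes over the same response tokens:

```python
import math

def opd_token_advantages(student_logprobs, teacher_logprobs):
    """Per-token directional signal: how much more likely the
    hint-enhanced teacher made each token than the current student.
    A positive gap reinforces the token; a negative gap suppresses it."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Hypothetical per-token probabilities for one agent turn.
student = [math.log(0.2), math.log(0.5), math.log(0.9)]
teacher = [math.log(0.6), math.log(0.5), math.log(0.3)]
print([round(a, 2) for a in opd_token_advantages(student, teacher)])
# -> [1.1, 0.0, -1.1]
```

Unlike a scalar reward that scores the whole turn at once, this signal says, token by token, which parts of the response the hint would have changed.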

Why Both Signals Are Essential

Binary RL has broad coverage (every turn can receive a score), but the signal is coarse: a single scalar per turn. OPD provides dense, token-level guidance, but only when the next state contains sufficiently informative hints, making it sparser in coverage. The two are fundamentally complementary: evaluative signals tell the model what went wrong; directive signals tell it how to fix it.

The Combination: Better Together

The complementarity is not just theoretical. When we combine Binary RL and OPD in a unified training recipe, the results speak for themselves:

Binary RL alone: 0.17 (personalization score)
OPD alone: 0.24 (personalization score)
Combined: 0.76 (personalization score)

The jump from 0.17/0.24 to 0.76 is striking. Binary RL provides broad scalar supervision across all turns; OPD injects dense token-level corrections where rich feedback is available. The combination leverages the coverage of one and the precision of the other. Evaluative signal and directive signal are two sides of the same next-state coin: you need both.
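One simple way to picture the combination is a per-token advantage that falls back to the scalar GRPO signal when no hint is available. The blending rule and the `beta` weight here are hypothetical, not the paper's actual recipe:

```python
def combined_token_advantage(scalar_adv, opd_adv, beta=0.5):
    """Blend the turn-level GRPO advantage (broadcast to every token)
    with the token-level OPD signal when a hint is available.
    beta is a hypothetical mixing weight, not a documented default."""
    if opd_adv is None:                 # no informative hint: Binary RL only
        return scalar_adv
    return (1 - beta) * scalar_adv + beta * opd_adv

print(combined_token_advantage(0.58, None))            # -> 0.58
print(round(combined_token_advantage(0.58, 1.1), 2))   # -> 0.84
```

The fallback branch is what gives the combined recipe the coverage of Binary RL, while the blended branch injects OPD's precision wherever the next state is informative enough.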

Architecture: Four Async Loops, Zero Blocking

The engineering lesson from CURE to RLAnything to OpenClaw-RL has been consistent: decouple everything. OpenClaw-RL runs four fully independent async components built on the Slime RL framework:

Fully Decoupled 4-Component Architecture

None of these components block one another. Your agent keeps serving requests while training runs in the background. This is the same design principle we validated in CURE's co-evolving system and RLAnything's closed-loop optimization, but taken to its engineering extreme.

Personal Agents and General Agents, Unified

A key design decision in OpenClaw-RL is that personal and general agent training share the same infrastructure. The difference is not in the system but in the source of next-state signals.

Track 1: Personal Agent Optimization

In the personal-agent setting, the model is deployed as your everyday coding assistant via OpenClaw. You chat normally; the system intercepts live multi-turn conversations in the background. Your re-queries, corrections, and explicit feedback are all automatically recovered as training signals. The agent improves simply by being used: no manual labeling, no batch data collection, no training interruption. Everything runs on your own infrastructure, fully self-hosted and private.

Track 2: General Agent Training at Scale

The same async infrastructure powers scalable RL for general-purpose agents. We provide ready-to-use implementations across four real-world settings (terminal, GUI, SWE, and tool-call) with large-scale environment parallelization. In these settings, next-state signals come from environment state changes rather than user feedback, but the learning loop is identical: serve, collect, judge, train.

Supported Real-World Settings (Track 2)

Looking Back: A Natural Convergence

When we started CURE, the question was narrow: can a coder and a unit tester teach each other without ground truth? When we built RLAnything, the question expanded: can we jointly optimize the environment, reward model, and policy? When we developed ReasonFlux and Buffer of Thoughts, the question was: how do we convert structured reasoning feedback into learnable signals?

In hindsight, all of these were different angles on the same problem: how to extract maximal learning signal from every interaction, and do it without stopping the system. OpenClaw-RL is where these answers converge โ€” process rewards from the ReasonFlux line, structured feedback distillation from the BoT/SuperCorrect line, and fully async co-evolution from the CURE/RLAnything line, all fused into a single framework that serves both personal and general agents.

Every conversation is a training opportunity. Every tool error is a reward signal. Every correction is a distillation target. OpenClaw-RL just makes sure none of them go to waste.

Getting Started

OpenClaw-RL ships multiple training recipes. Pick the one that matches your setup:

# Personal agent: Binary RL
cd slime && bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

# Personal agent: On-Policy Distillation
cd slime && bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

# Personal agent: Combined (recommended)
cd slime && bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

# General agent: Terminal RL
cd slime && bash ../terminal-rl/terminal_qwen3_8b_rl.sh

# General agent: Tool-Call RL
cd slime && bash ../toolcall-rl/retool_qwen3_4b_rl.sh

LoRA training and Tinker cloud deployment are also supported. See the GitHub repository for full documentation.

Citation

If you find our work useful, please consider citing:

@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong
          and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2025cure,
  title={Co-Evolving LLM Coder and Unit Tester via
         Reinforcement Learning},
  author={Wang, Yinjie and Yang, Ling and Tian, Ye
          and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.03136},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and
         Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke
          and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}