You correct your AI agent: "no, check the file first." The agent calls a tool and it returns an error. You ask a follow-up because the first answer was wrong. These are all next-state signals: free, naturally occurring training signals that every agent interaction produces. Yet no existing agentic RL system recovers them as a live, online learning source.
That observation motivated OpenClaw-RL: a fully asynchronous RL framework that turns everyday conversations into gradient updates, so your agent evolves simply by being used. The same framework also scales to terminal, GUI, SWE, and tool-call settings for general-purpose agent training.
But OpenClaw-RL is not an isolated project. It is the natural convergence point of several research lines our team has pursued over the past two years. This post tells the story of that convergence: how separate ideas about reward signals, structured feedback, and async systems came together into one unified framework.
Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang
Three Research Lines, One Framework
OpenClaw-RL sits at the intersection of three questions our team has been pursuing independently. Each question led to a series of projects; OpenClaw-RL is where all three answers meet.
From Process Rewards to Conversational Judgment
Good process rewards are the prerequisite for agentic RL to work at all. From ReasonFlux-PRM's chain-of-thought process rewards to RLAnything's co-evolving reward model that adapts as the policy improves, we have repeatedly validated one principle: step-level reward signals vastly outperform sparse outcome-only signals. In OpenClaw-RL, we bring this insight to its most natural form: the user's next utterance is the judgment on the agent's last turn. A PRM converts these conversational reactions into binary rewards via majority voting. No human annotation required.
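As a minimal sketch of the majority-voting step, assuming the PRM is sampled several times per turn and emits "good"/"bad" labels (the function name, labels, and tie-breaking rule here are illustrative, not OpenClaw-RL's API):

```python
from collections import Counter

def binary_reward_from_votes(votes):
    """Collapse several PRM judgments of one assistant turn into a
    single binary reward by majority vote.

    `votes` is a list of "good"/"bad" labels, e.g. from k independent
    PRM samples. Ties count as bad here, which is an assumption."""
    counts = Counter(votes)
    return 1.0 if counts["good"] > counts["bad"] else 0.0

# Example: 3 of 5 PRM samples judge the turn as good.
print(binary_reward_from_votes(["good", "bad", "good", "good", "bad"]))  # 1.0
```

Sampling the judge multiple times and voting is what makes a noisy conversational signal usable as a reward: any single PRM judgment can misread sarcasm or a topic change, but the majority is far more stable.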
From Thought Templates to Token-Level Distillation
A scalar reward tells the model "wrong," but it doesn't tell the model how to change. From Buffer of Thoughts and SuperCorrect, which distill structured thought templates into smaller models, to ReasonFlux's hierarchical reasoning library, this line has always been about converting rich feedback into learnable signal. OpenClaw-RL's On-Policy Distillation (OPD) is the culmination: extract textual hints from the next state, construct an enhanced teacher context, and provide a token-level directional advantage at every position. Not just "you were wrong," but "here's how each token should have been different."
From Co-Evolution to Zero-Blocking Infrastructure
Nobody wants to wait for an agent that's "still training." From CURE's co-evolving coder-tester system to RLAnything's closed-loop optimization of environment, policy, and reward model, we progressively learned that decoupling is everything. OpenClaw-RL takes this to its logical endpoint: four fully independent async loops (SGLang serving, environment rollout, PRM judging, Megatron training), each running at its own pace with zero coordination overhead. The model keeps serving live requests while training happens in the background.
The Core Insight: Next-State Signals Are Universal
Every agent interaction generates a next-state signal: the user's reply, a tool output, a terminal state change, a GUI transition. These signals are ubiquitous yet wasted: no existing system recovers them as live learning sources. OpenClaw-RL is built on a simple but powerful observation: these next-state signals encode two complementary forms of information, and you need both.
Evaluative Signals: Binary RL
The first form is evaluative: how well did the action perform? A PRM judge reads the next-state signal and assigns a scalar reward. In the personal-agent setting, the user's next message itself is the judgment: a correction means the last turn was bad; continued engagement means it was good. The scalar rewards drive GRPO advantage estimation with a PPO-style clipped surrogate loss. This covers broad interaction patterns with reliable, if coarse, supervision.
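The scalar path can be sketched in a few lines. This is a generic GRPO/PPO illustration under standard definitions, not OpenClaw-RL's implementation; the helper names are made up:

```python
import math

def grpo_advantages(rewards):
    """GRPO: normalize each rollout's scalar reward against the mean
    and std of its group (no learned value function)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(adv, logp_new, logp_old, eps=0.2):
    """PPO-style clipped surrogate objective for one token
    (to be maximized; negate for a loss)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped * adv)

# A group of 4 rollouts where 3 succeeded and 1 failed:
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))
```

The failed rollout gets a negative advantage and the successes a positive one; every token in a turn then shares that single per-turn scalar, which is exactly the coarseness the directive signal below addresses.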
Directive Signals: On-Policy Distillation
The second form is directive: how should the action have been different? When the next state contains rich textual information ("you should have checked the file first," or a tool returning a detailed error trace), the system extracts hindsight hints and uses them to construct an enhanced teacher context. The log-probability gap between this enhanced teacher and the current student becomes a token-level directional advantage signal, orders of magnitude richer than any scalar reward.
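The core of that gap computation fits in one function. A minimal sketch, assuming both models have already scored the same sampled student tokens (the function name and the exact use of the raw gap are assumptions; the released recipe may normalize or clip it):

```python
def opd_token_advantages(student_logps, teacher_logps):
    """Per-token directional advantage for On-Policy Distillation.

    Both inputs are log-probs assigned to the SAME tokens sampled from
    the student: `student_logps` under the current student context, and
    `teacher_logps` under the enhanced teacher context that has seen the
    hindsight hint. Positive values mark tokens the hint-aware teacher
    finds more likely than the student did."""
    return [t - s for s, t in zip(student_logps, teacher_logps)]

# Token 3 was very unlikely for the student but likely once the hint
# is visible, so it receives the strongest positive push.
print(opd_token_advantages([-1.2, -0.5, -3.0], [-1.0, -0.4, -0.2]))
```

Because the teacher is the same model with extra context rather than a separate larger model, the signal stays on-policy: it corrects the student's own sampled tokens position by position instead of scoring the whole turn at once.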
Why Both Signals Are Essential
Binary RL has broad coverage (every turn can receive a score) but the signal is coarse: a single scalar per turn. OPD provides dense, token-level guidance, but only when the next state contains sufficiently informative hints, making it sparser in coverage. The two are fundamentally complementary: evaluative signals tell the model what went wrong; directive signals tell it how to fix it.
The Combination: Better Together
The complementarity is not just theoretical. When we combine Binary RL and OPD in a unified training recipe, the results speak for themselves:
[Results, personalization score: the two individual recipes reach 0.17 and 0.24; the combined recipe reaches 0.76.]
The jump from 0.17/0.24 to 0.76 is striking. Binary RL provides broad scalar supervision across all turns; OPD injects dense token-level corrections where rich feedback is available. The combination leverages the coverage of one and the precision of the other. Evaluative signal and directive signal are two sides of the same next-state coin: you need both.
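One plausible way to read "combined" is a per-turn mixing rule: every turn contributes the Binary-RL term, and turns whose next state yielded a usable hint additionally contribute the OPD term. This gating and the `alpha` weight are assumptions for illustration, not the published recipe:

```python
def combined_loss(rl_loss, opd_loss, has_hint, alpha=1.0):
    """Hypothetical per-turn objective mixing the two signals.

    `rl_loss` is the Binary-RL surrogate for the turn (always present),
    `opd_loss` the token-level distillation term, and `has_hint` whether
    the next state produced an extractable hindsight hint. `alpha` is an
    assumed trade-off weight."""
    return rl_loss + (alpha * opd_loss if has_hint else 0.0)

print(combined_loss(1.0, 0.5, has_hint=True))   # 1.5
print(combined_loss(1.0, 0.5, has_hint=False))  # 1.0
```

The gating matches the coverage argument above: the scalar term supervises every turn, while the dense term fires only where the next state is informative enough to support it.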
Architecture: Four Async Loops, Zero Blocking
The engineering lesson from CURE to RLAnything to OpenClaw-RL has been consistent: decouple everything. OpenClaw-RL runs four fully independent async components built on the Slime RL framework:
Fully Decoupled 4-Component Architecture
- Agent Serving (SGLang): The policy model is wrapped in OpenClaw as an OpenAI-compatible API, serving live user requests in real time.
- Rollout Collection: Multi-turn conversations are automatically tracked per-session. API messages are classified into main-line (trainable) vs. side (non-trainable) turns. The next user/tool/environment response serves as the natural next-state signal.
- PRM Judging: A process reward model evaluates each assistant turn asynchronously. Majority voting across multiple assessments ensures robust scoring.
- Policy Training (Megatron): Ready samples are submitted to the trainer as they become available. Weight updates are graceful: submission pauses during checkpointing, then resumes automatically.
None of these components block one another. Your agent keeps serving requests while training runs in the background. This is the same design principle we validated in CURE's co-evolving system and RLAnything's closed-loop optimization, but taken to its engineering extreme.
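The zero-blocking idea reduces to independent loops connected only by queues. A toy sketch: it collapses serving and rollout collection into one stage, replaces SGLang, the PRM, and Megatron with stand-ins, and uses a `None` sentinel for shutdown (all names and the dummy reward are illustrative):

```python
import asyncio

async def serve(rollouts):
    """Stand-in for the serving + rollout loop: emits finished turns."""
    for turn in ["t1", "t2", "t3"]:
        await rollouts.put(turn)
    await rollouts.put(None)  # end-of-stream sentinel

async def judge(rollouts, scored):
    """Stand-in for the PRM loop: scores turns as they arrive."""
    while (turn := await rollouts.get()) is not None:
        await scored.put((turn, 1.0))  # dummy reward
    await scored.put(None)

async def train(scored, seen):
    """Stand-in for the trainer loop: consumes ready samples."""
    while (sample := await scored.get()) is not None:
        seen.append(sample)

async def main():
    rollouts, scored, seen = asyncio.Queue(), asyncio.Queue(), []
    # The loops run concurrently; each waits only on its own input
    # queue, never on another loop's internal progress.
    await asyncio.gather(serve(rollouts), judge(rollouts, scored),
                         train(scored, seen))
    return seen

print(asyncio.run(main()))  # [('t1', 1.0), ('t2', 1.0), ('t3', 1.0)]
```

The key property is that each stage backs off only on an empty queue, so a slow trainer never stalls serving, which is the behavior the four-loop architecture scales up across processes and machines.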
Personal Agents and General Agents, Unified
A key design decision in OpenClaw-RL is that personal and general agent training share the same infrastructure. The difference is not in the system; it's in the source of next-state signals.
Track 1: Personal Agent Optimization
In the personal-agent setting, the model is deployed as your everyday coding assistant via OpenClaw. You chat normally; the system intercepts live multi-turn conversations in the background. Your re-queries, corrections, and explicit feedback are all automatically recovered as training signals. The agent improves simply by being used: no manual labeling, no batch data collection, no training interruption. Everything runs on your own infrastructure, fully self-hosted and private.
Track 2: General Agent Training at Scale
The same async infrastructure powers scalable RL for general-purpose agents. We provide ready-to-use implementations across four real-world settings (terminal, GUI, SWE, and tool-call) with large-scale environment parallelization. In these settings, next-state signals come from environment state changes rather than user feedback, but the learning loop is identical: serve, collect, judge, train.
Supported Real-World Settings (Track 2)
- Terminal RL: Agents interact with real shell environments; terminal output serves as next-state signal.
- GUI RL: Agents operate desktop environments (via OSWorld); screenshot state transitions provide feedback.
- SWE RL: Agents resolve real GitHub issues; test suite results and code review signals drive learning.
- Tool-Call RL: Agents invoke external APIs; tool outputs and error traces inform optimization.
Looking Back: A Natural Convergence
When we started CURE, the question was narrow: can a coder and a unit tester teach each other without ground truth? When we built RLAnything, the question expanded: can we jointly optimize the environment, reward model, and policy? When we developed ReasonFlux and Buffer of Thoughts, the question was: how do we convert structured reasoning feedback into learnable signals?
In hindsight, all of these were different angles on the same problem: how to extract maximal learning signal from every interaction, and do it without stopping the system. OpenClaw-RL is where these answers converge โ process rewards from the ReasonFlux line, structured feedback distillation from the BoT/SuperCorrect line, and fully async co-evolution from the CURE/RLAnything line, all fused into a single framework that serves both personal and general agents.
Every conversation is a training opportunity. Every tool error is a reward signal. Every correction is a distillation target. OpenClaw-RL just makes sure none of them go to waste.
Getting Started
OpenClaw-RL ships multiple training recipes. Pick the one that matches your setup:
```shell
# Personal agent: Binary RL
cd slime && bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

# Personal agent: On-Policy Distillation
cd slime && bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

# Personal agent: Combined (recommended)
cd slime && bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

# General agent: Terminal RL
cd slime && bash ../terminal-rl/terminal_qwen3_8b_rl.sh

# General agent: Tool-Call RL
cd slime && bash ../toolcall-rl/retool_qwen3_4b_rl.sh
```
LoRA training and Tinker cloud deployment are also supported. See the GitHub repository for full documentation.
Citation
If you find our work useful, please consider citing:
@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2025cure,
  title={Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning},
  author={Wang, Yinjie and Yang, Ling and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.03136},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}