The idea behind this research line is deceptively simple: humans don't solve hard problems from scratch every time — they recall high-level strategies, adapt them to the specific problem, and refine their approach when things go wrong. Can we teach LLMs to do the same?
We call these high-level strategies thought templates: abstract, reusable reasoning blueprints that tell a model how to approach a problem class rather than how to solve any single instance. In today's agent era, the same concept goes by a different name: skills, reusable high-level procedures that guide how agents act. We demonstrated the value of this structured-guideline representation well before agents became the dominant paradigm.
The journey began in mid-2024 with Buffer of Thoughts, a prompting framework that stores and retrieves reusable thought templates — the earliest form of "skills for reasoning." The results were striking — an 8B model equipped with BoT could rival a 70B model — but the templates lived outside the model. This naturally led to two follow-up questions: (1) can we bake these skills into the model's weights through training? and (2) can we use RL to learn optimal sequences of skills for complex, multi-step problems?
SuperCorrect answers the first question. It distills hierarchical thought templates — reasoning skills — from a large teacher model into a smaller student model via SFT, and uses a novel cross-model collaborative DPO to teach the student to self-correct using the teacher's error-correction traces. ReasonFlux answers the second. It builds a structured library of ~500 thought templates and trains the model via hierarchical RL to plan optimal skill trajectories — sequences of high-level strategies that decompose complex problems into manageable sub-problems.
Each project inherits and extends the core concept from the one before it. BoT's thought templates became SuperCorrect's distilled reasoning skills; SuperCorrect's skill-guided reasoning became ReasonFlux's skill trajectory optimization. This blog traces that evolution — and connects it forward to the agent era.
Buffer of Thoughts: The First Reasoning Skills
A thought-augmented reasoning framework that stores distilled high-level thought templates in a meta-buffer, retrieves them for new problems, and dynamically updates the buffer as more tasks are solved. NeurIPS 2024 Spotlight.
The Problem: Reasoning from Scratch Every Time
Existing LLM reasoning methods fall into two camps. Single-query methods (like Chain-of-Thought) require manually designed exemplars for each task type and lack generalization. Multi-query methods (like Tree of Thoughts or Graph of Thoughts) explore multiple reasoning paths but are computationally expensive due to recursive expansion. Both approaches share a deeper limitation: they build reasoning structures from scratch for every problem, without leveraging the common patterns across similar problems.
Meta-Buffer and Thought Templates
BoT's core innovation is the meta-buffer — a lightweight library of "thought templates" distilled from previously solved problems. Each thought template captures a high-level reasoning strategy (e.g., "formulate as a constraint satisfaction problem," "reduce to a graph traversal," "apply backward induction") that generalizes across similar problem types. Templates are not specific solutions but abstract reasoning patterns that can be instantiated with problem-specific details.
Template distillation uses carefully designed in-context examples of two types: in-task examples (templates from the same problem domain) and cross-task examples (templates from one domain used to solve problems in another, e.g., a code-related template applied to a math problem). Cross-task distillation is critical for generalization — it ensures templates capture truly abstract reasoning patterns rather than domain-specific tricks.
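To make the idea concrete, here is a minimal sketch of what a thought template and the meta-buffer might look like as data structures. The field names and example content are illustrative assumptions, not BoT's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtTemplate:
    # A reusable, abstract reasoning strategy (hypothetical schema).
    name: str                  # e.g. "backward_induction"
    strategy: str              # high-level description, not a concrete solution
    source_domains: list = field(default_factory=list)  # domains it was distilled from

@dataclass
class MetaBuffer:
    # Lightweight library of distilled templates.
    templates: list = field(default_factory=list)

buffer = MetaBuffer()
buffer.templates.append(ThoughtTemplate(
    name="backward_induction",
    strategy="Reason from the goal state back to the initial state.",
    source_domains=["game_theory", "planning"],
))
```

Note that a template distilled from game-theory problems lists multiple source domains: cross-task distillation means the same abstract strategy can later be retrieved for, say, a planning problem.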
Problem Distiller + Instantiation
Before reasoning begins, a problem distiller extracts key information, variables, and constraints from the input using a meta-prompt. This separates the extraction and comprehension stages from the reasoning stage, reducing the cognitive load on the LLM during actual problem-solving.
Given the distilled problem representation, BoT retrieves the most relevant thought template from the meta-buffer and adaptively instantiates it — filling in the abstract template with concrete, problem-specific reasoning steps. This instantiation process generates the final reasoning chain with a single LLM query, achieving the accuracy benefits of multi-query methods at single-query cost.
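The retrieve-then-instantiate step can be sketched as follows. This toy version scores relevance with cosine similarity over word counts, a dependency-free stand-in for the learned embeddings a real system would use, and builds the single instantiation prompt; the prompt wording and template strings are hypothetical:

```python
from collections import Counter
import math

def similarity(a: str, b: str) -> float:
    # Cosine similarity over bag-of-words counts; a real system would
    # use learned embeddings, but this keeps the sketch self-contained.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_and_instantiate(distilled_problem: str, templates: dict) -> str:
    # Pick the most relevant template, then build the single-query
    # prompt that asks the LLM to fill it with problem-specific steps.
    best = max(templates, key=lambda name: similarity(distilled_problem, templates[name]))
    return (
        f"Template ({best}): {templates[best]}\n"
        f"Problem: {distilled_problem}\n"
        "Instantiate the template with problem-specific reasoning steps."
    )

templates = {
    "graph_traversal": "reduce the problem to a search over graph nodes and edges",
    "backward_induction": "reason from the goal state back to the initial state",
}
prompt = retrieve_and_instantiate(
    "find shortest path between nodes in a maze graph", templates)
```

The key design point survives even in this toy form: retrieval plus one instantiation call replaces the recursive expansion of multi-query methods.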
Buffer Manager: Dynamic Self-Improvement
The meta-buffer is not static. A buffer manager continuously refines it: when a new problem is solved, the manager decides whether the solution contains a genuinely new reasoning pattern worth distilling into a template, or whether an existing template should be updated. To avoid redundancy while preserving new insights, the manager computes similarity scores between candidate templates and existing ones, only adding templates that are sufficiently novel.
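The manager's add-or-skip decision can be sketched as a novelty check against the existing buffer. The similarity measure (token-set Jaccard) and the 0.7 threshold are illustrative assumptions, not the paper's actual values:

```python
def jaccard(a: str, b: str) -> float:
    # Token-set overlap as a cheap stand-in for the manager's similarity score.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def maybe_add_template(existing: list, candidate: str, threshold: float = 0.7) -> bool:
    # Add the candidate only if it is sufficiently novel relative to
    # every template already in the buffer; otherwise treat it as redundant.
    if all(jaccard(candidate, t) < threshold for t in existing):
        existing.append(candidate)
        return True
    return False

existing = ["reduce the problem to a graph traversal"]
added = maybe_add_template(existing, "apply backward induction from the goal state")
```

A near-duplicate of an existing template would fail the novelty check and be rejected, which is exactly the redundancy-avoidance behavior described above.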
BoT demonstrates significant improvements across 10 reasoning-intensive tasks: +11% on Game of 24, +20% on Geometric Shapes, +51% on Checkmate-in-One, all while requiring only 12% of the cost of multi-query methods on average. Most strikingly, Llama3-8B equipped with BoT shows the potential to match or surpass the Llama3-70B model — demonstrating that the right reasoning scaffolding can close a 10× parameter gap.
SuperCorrect: Distilling Skills into Model Weights
A two-stage framework that distills hierarchical thought templates from a teacher model and uses cross-model collaborative DPO to teach smaller models to self-correct. Published at ICLR 2025.
From Retrieved Skills to Internalized Skills
BoT showed that thought templates — reasoning skills — dramatically improve LLM problem-solving, but the skills lived in an external buffer. They needed to be retrieved at inference time, and the model itself didn't internalize them. SuperCorrect asks: what if we distill these skills directly into the model's weights?
Stage 1: Hierarchical Thought SFT
SuperCorrect deepens BoT's template concept by introducing hierarchical thought templates with two levels of abstraction. For each problem, a large teacher model (e.g., GPT-4) generates: (1) a high-level thought — a generalized solution strategy applicable to similar problems (analogous to BoT's thought templates), and (2) a detailed solution — a step-by-step explanation of the critical reasoning steps with fine-grained intermediate justifications.
Compared to standard CoT or BoT-style templates, these hierarchical templates offer deeper reasoning insights because the high-level strategy provides direction while the detailed steps provide the precision needed for error detection and correction. The student model is fine-tuned on these hierarchical template-solution pairs via SFT, learning to produce both the strategic overview and the detailed execution for new problems.
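The shape of the Stage 1 training data can be sketched as follows: each teacher output is packed into one prompt-completion pair whose target contains both abstraction levels. The section markers and field names are hypothetical, not SuperCorrect's released data format:

```python
def build_sft_example(problem: str, high_level_thought: str,
                      detailed_solution: str) -> dict:
    # Pack a teacher-generated hierarchical template into one SFT pair:
    # the student learns to emit both abstraction levels for new problems.
    target = (
        f"High-level thought: {high_level_thought}\n"
        f"Detailed solution: {detailed_solution}"
    )
    return {"prompt": problem, "completion": target}

example = build_sft_example(
    problem="Solve x^2 - 5x + 6 = 0.",
    high_level_thought="Factor the quadratic into two linear terms.",
    detailed_solution="x^2 - 5x + 6 = (x - 2)(x - 3), so x = 2 or x = 3.",
)
```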
Stage 2: Cross-Model Collaborative DPO
A well-known limitation of LLM self-correction is that models struggle to identify their own errors — they are biased by their own reasoning context. SuperCorrect solves this with cross-model collaborative DPO, which pairs two types of correction traces:
- Self-correction traces: The student model's own (often failed) attempt to locate and fix errors in its reasoning.
- Cross-model correction traces: The teacher model's correction of the student's errors, identifying the exact thought step where reasoning went wrong and providing the corrected reasoning path.
These paired traces form the preference data for DPO training: the teacher's cross-model corrections are treated as "chosen" responses, and the student's self-corrections as "rejected" responses. This teaches the student to adopt the teacher's error-correction strategies — effectively breaking the bottleneck of the student's own reasoning capacity by injecting the teacher's skills and knowledge through preference learning.
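The preference objective can be illustrated numerically. Given sequence log-probabilities of the chosen (teacher) and rejected (student) corrections under the policy and a frozen reference model, the standard per-pair DPO loss is a sketch of what the training step minimizes (the specific log-prob values below are made up for illustration):

```python
import math

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    # Standard DPO objective: -log sigmoid(beta * implicit reward margin),
    # where each implicit reward is the policy-vs-reference log-ratio.
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# One preference pair: the teacher's cross-model correction is "chosen",
# the student's failed self-correction is "rejected".
loss_before = dpo_loss(-20.0, -20.0, -20.0, -20.0)  # untrained: zero margin
loss_after = dpo_loss(-5.0, -40.0, -20.0, -20.0)    # policy now prefers chosen
```

At initialization the policy equals the reference, the margin is zero, and the loss is log 2; as the student shifts probability mass toward the teacher's correction style, the margin grows and the loss falls.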
SuperCorrect-7B achieves 70.2% on MATH and 89.5% on GSM8K, establishing new SOTA among all 7B models at the time of publication. Three model variants are released: SuperCorrect-Qwen-7B, SuperCorrect-DeepSeek-7B, and SuperCorrect-Llama-7B, along with the training datasets for both stages.
ReasonFlux: Learning to Plan with Skills via RL
A hierarchical reasoning framework that trains a model via RL to plan optimal sequences of thought templates, with an inference-time scaling system that adaptively expands template trajectories. Trained on 8 GPUs; surpasses o1-preview. NeurIPS 2025 Spotlight.
From Single Skills to Skill Trajectories
BoT retrieves one skill per problem; SuperCorrect distills a two-level skill per problem. But truly complex reasoning — competition-level math, multi-step proofs, olympiad problems — often requires composing multiple skills in sequence: first formulate the problem algebraically, then identify a symmetry, then apply an inequality bound, then verify edge cases. ReasonFlux makes this explicit by optimizing over skill trajectories: ordered sequences of thought templates, where each template guides one phase of the overall solution.
Structured Skill Library
ReasonFlux constructs a library of approximately 500 high-level thought templates — each one a reusable reasoning skill (e.g., "apply Cauchy-Schwarz inequality," "reduce to modular arithmetic," "use pigeonhole principle on the complement"). Skills are structured for efficient retrieval with metadata about applicable problem types, prerequisite conditions, and expected outcomes. Unlike BoT's dynamically growing buffer, this library is curated and fixed, making it more stable for RL training.
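A library entry and its retrieval filter might look roughly like this; the metadata fields mirror the description above, but the concrete schema, skill names, and prerequisite labels are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    # One entry in the fixed library (field names are illustrative).
    name: str
    problem_types: tuple   # where the skill applies
    prerequisites: tuple   # conditions that must hold before use
    expected_outcome: str

LIBRARY = (
    Skill("cauchy_schwarz", ("inequality",), ("sum_of_products_form",),
          "upper bound on the target expression"),
    Skill("modular_reduction", ("number_theory",), (),
          "smaller equivalent problem"),
)

def candidates(problem_type: str, satisfied: set) -> list:
    # Retrieve skills whose type matches and whose prerequisites are met.
    return [s for s in LIBRARY
            if problem_type in s.problem_types and set(s.prerequisites) <= satisfied]

hits = candidates("inequality", {"sum_of_products_form"})
```

Filtering on problem type and prerequisites before any model call is what makes retrieval over ~500 skills cheap and stable.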
Hierarchical Reinforcement Learning over Skills
The key innovation is performing RL not on raw token-level CoT trajectories (which can be thousands of tokens long) but on the high-level skill trajectory. For a given problem, the model proposes a sequence of skills — a "plan" — and then instantiates each skill to produce detailed reasoning steps. RL optimizes over the skill sequence using preference pairs: multiple candidate trajectories are sampled, evaluated against the ground-truth answer, and the model learns to prefer trajectories that lead to correct solutions.
This hierarchical approach has two major advantages. First, it dramatically reduces the search space: instead of exploring the combinatorial space of all possible token sequences, RL operates over a much smaller space of skill orderings. Second, the resulting reasoning is more explainable — each step in the solution is labeled with its high-level strategy, making the reasoning structure transparent and interpretable.
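The core loop of building preference data over skill sequences can be sketched as follows. For readability this toy enumerates the whole candidate space and uses a stand-in correctness check; a real system samples trajectories from the policy and verifies instantiated solutions against the ground-truth answer:

```python
from itertools import product

def preference_pairs(is_correct, skills: list, length: int = 2) -> list:
    # Enumerate candidate skill trajectories (ordered sequences of skill
    # names), score each by whether instantiating it solves the problem,
    # and pair correct (chosen) with incorrect (rejected) trajectories.
    trajs = list(product(skills, repeat=length))
    correct = [t for t in trajs if is_correct(t)]
    wrong = [t for t in trajs if not is_correct(t)]
    return [(c, w) for c in correct for w in wrong]

skills = ["formulate_algebraically", "exploit_symmetry",
          "apply_bound", "verify_edges"]
# Toy evaluator: pretend a trajectory succeeds iff it starts by
# formulating the problem algebraically.
pairs = preference_pairs(lambda t: t[0] == "formulate_algebraically", skills)
```

Even this toy shows the search-space reduction: 4 skills over 2 steps give only 16 candidate plans, versus the astronomically larger space of raw token sequences of comparable solutions.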
Inference-Time Skill Scaling
ReasonFlux introduces a novel inference scaling system that dynamically adjusts the number and complexity of skills at test time. For simpler problems, one or two skills suffice; for competition-level problems, the system may allocate many skills with deeper instantiation at each step. This adaptive scaling achieves a better exploration-exploitation trade-off: the model explores broadly when uncertain (sampling diverse skill combinations) and exploits confidently when a clear strategy emerges (committing to a specific skill trajectory and refining execution).
Concretely, the system retrieves candidate skills for each sub-problem, scores them using the RL-trained policy, and either commits to the top-scoring skill or explores alternatives based on a confidence threshold. This produces skill trajectories that are more explainable than the raw long-CoT outputs of models like DeepSeek-R1, because each reasoning phase has an explicit strategic label.
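The commit-or-explore decision can be sketched as a threshold rule over policy scores; the threshold value, score scale, and skill names here are illustrative assumptions:

```python
def select_skills(scored: dict, confidence: float = 0.6) -> list:
    # scored maps candidate skill name -> policy score in [0, 1].
    # Commit to the top skill when its score clears the threshold
    # (exploit); otherwise return all candidates, best first, for
    # further exploration (e.g. sampling multiple instantiations).
    best = max(scored, key=scored.get)
    if scored[best] >= confidence:
        return [best]
    return sorted(scored, key=scored.get, reverse=True)

commit = select_skills({"cauchy_schwarz": 0.85, "rearrangement": 0.30})
explore = select_skills({"cauchy_schwarz": 0.45, "rearrangement": 0.40})
```

A confident policy yields a single committed skill; an uncertain one keeps several candidates alive, which is the adaptive allocation behavior described above.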
ReasonFlux-32B achieves 91.2% on MATH (surpassing o1-preview by 6.7%) and solves an average of 56.7% of AIME problems (surpassing o1-preview by 27% and DeepSeek-V3 by 45%). These results were achieved with only 8 GPUs for training, demonstrating the efficiency gains from operating in skill space rather than raw token space. The model family has since expanded to include ReasonFlux-F1-7B/14B/32B (SFT distillations of ReasonFlux-Zero trajectories), ReasonFlux-Coder, and ReasonFlux-PRM (trajectory-aware process reward models).
The Common Thread: Skills All the Way Down
The best reasoning systems don't just think harder — they think more strategically, by reusing and composing high-level skills rather than generating every reasoning step from scratch.
Across these three projects, one idea evolves through three levels of integration. BoT introduces thought templates — reasoning skills — as an external retrieval system at inference time: powerful but separate from the model. SuperCorrect internalizes these skills into the model's weights through distillation, and adds the ability to self-correct using teacher-derived error traces. ReasonFlux makes skills first-class objects in the training loop itself, using RL to learn which sequences of skills lead to correct solutions on the hardest problems.
Each step brings skills closer to the core of the model's reasoning process: from external retrieval (BoT) to weight-level internalization (SuperCorrect) to RL-optimized planning (ReasonFlux). And each step produces stronger results: BoT closes a 10× parameter gap; SuperCorrect sets 7B SOTA; ReasonFlux surpasses o1-preview on competition math with 8 GPUs. The lesson is clear: teaching models which skills to apply and when — not just giving them more tokens to reason with — is a remarkably efficient path to stronger AI.
Beyond Reasoning: From Reasoning Skills to Agent Skills
The thought template idea did not stop at mathematical reasoning. In retrospect, what we discovered with BoT, SuperCorrect, and ReasonFlux is a general design principle: high-level structured skills — abstract blueprints that tell a system how to approach a class of problems — are a uniquely powerful representation for improving AI systems. The specific domain (math, code, tool use) is secondary; what matters is the level of abstraction.
In today's agent ecosystem, skills are everywhere: "how to debug a failing test," "how to navigate a complex UI form," "how to structure a multi-step API call." These agent skills guide behavior at a strategic level, just as our thought templates guide reasoning at a strategic level. The structural parallel is exact: both are high-level, composable, transferable across instances, and far more efficient than learning every behavior from scratch. Thought templates were reasoning skills before the word "skill" became the standard vocabulary.
And the insight carries forward directly. In our recent work on OpenClaw-RL, the On-Policy Distillation (OPD) mechanism extracts textual hints from next-state signals — corrections, tool errors, environment feedback — and uses them to construct enhanced teacher contexts for token-level distillation. These hindsight hints are, in essence, on-the-fly skills generated from live interactions: high-level structured guidance that tells the model not just that it was wrong, but how each token should have been different.
The evolution is clear: from static skill retrieval (BoT) → weight-level skill distillation (SuperCorrect) → RL-optimized skill planning (ReasonFlux) → live, interaction-driven skill generation (OpenClaw-RL OPD). Each step makes skills more dynamic, more tightly integrated with the learning loop, and more broadly applicable — from math problems to agent behaviors to personalized AI assistants.
Citation
If you find our work useful, please consider citing:
@article{yang2024bot,
  title={Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Cao, Shiyi and Xu, Minkai and Zhang, Wentao and Gonzalez, Joseph E. and Cui, Bin},
  journal={NeurIPS 2024 Spotlight},
  year={2024}
}

@article{yang2024supercorrect,
  title={SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Xu, Minkai and Gonzalez, Joseph E. and Cui, Bin and Yan, Shuicheng},
  journal={ICLR 2025},
  year={2024}
}

@article{yang2025reasonflux,
  title={ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates},
  author={Yang, Ling and Yu, Zhaochen and Cui, Bin and Wang, Mengdi},
  journal={arXiv preprint arXiv:2502.06772},
  year={2025}
}