2024 – 2025

Scaling LLM Reasoning via Thought Templates

Three tightly connected projects — Buffer of Thoughts, SuperCorrect, and ReasonFlux — that evolve a single idea from prompting-time retrieval to training-time distillation to RL-optimized hierarchical reasoning, culminating in a 32B model that surpasses o1-preview on MATH and AIME.

Ling Yang

Gen-Verse · Princeton AI Lab · Princeton University


The idea behind this research line is deceptively simple: humans don't solve hard problems from scratch every time — they recall high-level strategies, adapt them to the specific problem, and refine their approach when things go wrong. Can we teach LLMs to do the same?

We started exploring this question in mid-2024 with Buffer of Thoughts, a prompting framework that stores and retrieves reusable "thought templates." The results were striking — an 8B model equipped with BoT could rival a 70B model — but the templates lived outside the model, as a retrieval system at inference time. This naturally led to two follow-up questions: (1) can we bake these templates into the model's weights through training? and (2) can we use RL to learn optimal sequences of templates for complex, multi-step problems?

SuperCorrect answers the first question. It distills hierarchical thought templates from a large teacher model into a smaller student model via SFT, and uses a novel cross-model collaborative DPO to teach the student to self-correct using the teacher's error-correction traces. ReasonFlux answers the second. It builds a structured library of ~500 thought templates and trains the model via hierarchical RL to plan optimal template trajectories — sequences of high-level strategies that decompose complex problems into manageable sub-problems.

Each project inherits and extends the core concept from the one before it. BoT's thought templates became SuperCorrect's hierarchical templates; SuperCorrect's template-guided reasoning became ReasonFlux's template trajectory optimization. This blog traces that evolution in detail.

Jun 2024 · NeurIPS 2024 Spotlight
Step 1: Buffer of Thoughts — Templates as a Retrieval System
Introduces "thought templates" — high-level reasoning strategies distilled from solved problems and stored in a meta-buffer for retrieval. Achieves +51% on Checkmate-in-One at 12% of the cost of Tree of Thoughts.
Oct 2024 · ICLR 2025
Step 2: SuperCorrect — Templates Distilled into Weights
Building on BoT's template concept, distills hierarchical thought templates from a teacher into a student via SFT, then adds cross-model collaborative DPO for self-correction. SuperCorrect-7B surpasses DeepSeekMath-7B by 7.8% on MATH.
Feb 2025 · NeurIPS 2025 Spotlight
Step 3: ReasonFlux — RL over Template Trajectories
Culmination of the line: hierarchical RL optimizes sequences of thought templates rather than raw token chains. ReasonFlux-32B trained on 8 GPUs surpasses o1-preview by 6.7% on MATH and solves 56.7% of AIME problems.

Buffer of Thoughts: Learning to Reuse Reasoning Strategies

Buffer of Thoughts
Thought-Augmented Reasoning with Large Language Models

A thought-augmented reasoning framework that stores distilled high-level thought templates in a meta-buffer, retrieves them for new problems, and dynamically updates the buffer as more tasks are solved. NeurIPS 2024 Spotlight.

The Problem: Reasoning from Scratch Every Time

Existing LLM reasoning methods fall into two camps. Single-query methods (like Chain-of-Thought) require manually designed exemplars for each task type and lack generalization. Multi-query methods (like Tree of Thoughts or Graph of Thoughts) explore multiple reasoning paths but are computationally expensive due to recursive expansion. Both approaches share a deeper limitation: they build reasoning structures from scratch for every problem, without leveraging the common patterns across similar problems.

Meta-Buffer and Thought Templates

BoT's core innovation is the meta-buffer — a lightweight library of "thought templates" distilled from previously solved problems. Each thought template captures a high-level reasoning strategy (e.g., "formulate as a constraint satisfaction problem," "reduce to a graph traversal," "apply backward induction") that generalizes across similar problem types. Templates are not specific solutions but abstract reasoning patterns that can be instantiated with problem-specific details.
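As a concrete illustration, a thought template can be represented as a small structured record. This is a hypothetical sketch; the field names are assumptions, not BoT's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ThoughtTemplate:
    name: str          # short identifier for the strategy
    problem_type: str  # coarse category used during retrieval
    strategy: str      # abstract, problem-agnostic reasoning pattern

# Example: an abstract strategy, not a specific solution.
csp_template = ThoughtTemplate(
    name="constraint_satisfaction",
    problem_type="combinatorial",
    strategy=(
        "1. Identify the variables and their domains. "
        "2. List all constraints relating the variables. "
        "3. Search assignments, pruning any that violate a constraint."
    ),
)
```

The point of the abstraction is that the `strategy` text mentions no problem-specific quantities; instantiation fills those in later.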

Template distillation uses carefully designed in-context examples of two types: in-task examples (templates from the same problem domain) and cross-task examples (templates from one domain used to solve problems in another, e.g., a code-related template applied to a math problem). Cross-task distillation is critical for generalization — it ensures templates capture truly abstract reasoning patterns rather than domain-specific tricks.

Problem Distiller + Instantiation

Before reasoning begins, a problem distiller extracts key information, variables, and constraints from the input using a meta-prompt. This separates the extraction and comprehension stages from the reasoning stage, reducing the cognitive load on the LLM during actual problem-solving.

Given the distilled problem representation, BoT retrieves the most relevant thought template from the meta-buffer and adaptively instantiates it — filling in the abstract template with concrete, problem-specific reasoning steps. This instantiation process generates the final reasoning chain with a single LLM query, achieving the accuracy benefits of multi-query methods at single-query cost.
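The distill-retrieve-instantiate flow described above might be sketched as follows, with `embed` and `call_llm` as stand-ins for an embedding model and an LLM API (both assumptions; BoT's actual retrieval uses its own similarity machinery):

```python
# Hypothetical sketch of BoT's single-query flow: distill the problem,
# retrieve the most relevant template, then instantiate it with one LLM call.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve_template(distilled_problem, meta_buffer, embed):
    """Pick the template whose strategy text is closest to the distilled problem."""
    query = embed(distilled_problem)
    return max(meta_buffer, key=lambda t: cosine(query, embed(t["strategy"])))

def solve_with_bot(problem, meta_buffer, embed, call_llm):
    # Stage 1: problem distiller extracts key info via a meta-prompt.
    distilled = call_llm(f"Extract key variables and constraints:\n{problem}")
    # Stage 2: retrieve the most relevant template from the meta-buffer.
    template = retrieve_template(distilled, meta_buffer, embed)
    # Stage 3: single query instantiates the abstract template with specifics.
    return call_llm(
        f"Strategy: {template['strategy']}\n"
        f"Problem info: {distilled}\n"
        "Instantiate the strategy step by step and give the final answer."
    )
```

Only two LLM calls occur per problem (one for distillation, one for instantiation), which is where the cost advantage over multi-query tree search comes from.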

Buffer Manager: Dynamic Self-Improvement

The meta-buffer is not static. A buffer manager continuously refines it: when a new problem is solved, the manager decides whether the solution contains a genuinely new reasoning pattern worth distilling into a template, or whether an existing template should be updated. To avoid redundancy while preserving new insights, the manager computes similarity scores between candidate templates and existing ones, only adding templates that are sufficiently novel.
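The manager's novelty check can be sketched with a simple similarity gate. The Jaccard word-overlap measure and the 0.7 threshold here are illustrative stand-ins for whatever similarity metric the real buffer manager uses:

```python
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def maybe_add_template(meta_buffer, candidate, threshold=0.7):
    """Add `candidate` unless an existing template is >= threshold similar."""
    for existing in meta_buffer:
        if jaccard(existing, candidate) >= threshold:
            return False  # redundant: an equivalent strategy already exists
    meta_buffer.append(candidate)
    return True
```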

+51% on Checkmate-in-One · 12% of the cost of ToT/GoT · Llama3-8B + BoT ≈ Llama3-70B

BoT demonstrates significant improvements across 10 reasoning-intensive tasks: +11% on Game of 24, +20% on Geometric Shapes, +51% on Checkmate-in-One, all while requiring only 12% of the cost of multi-query methods on average. Most strikingly, Llama3-8B equipped with BoT shows the potential to match or surpass the Llama3-70B model — demonstrating that the right reasoning scaffolding can close a 10× parameter gap.

SuperCorrect: Baking Templates into Weights

SuperCorrect
Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction

A two-stage framework that distills hierarchical thought templates from a teacher model and uses cross-model collaborative DPO to teach smaller models to self-correct. Published at ICLR 2025.

From Retrieval to Distillation

BoT showed that thought templates dramatically improve reasoning, but the templates lived in an external buffer — they needed to be retrieved at inference time, and the model itself didn't internalize the reasoning patterns. SuperCorrect asks: what if we distill these templates directly into the model's weights?

Stage 1: Hierarchical Thought SFT

SuperCorrect deepens BoT's template concept by introducing hierarchical thought templates with two levels of abstraction. For each problem, a large teacher model (e.g., GPT-4) generates: (1) a high-level thought — a generalized solution strategy applicable to similar problems (analogous to BoT's thought templates), and (2) a detailed solution — a step-by-step explanation of the critical reasoning steps with fine-grained intermediate justifications.

Compared to standard CoT or BoT-style templates, these hierarchical templates offer deeper reasoning insights because the high-level strategy provides direction while the detailed steps provide the precision needed for error detection and correction. The student model is fine-tuned on these hierarchical template-solution pairs via SFT, learning to produce both the strategic overview and the detailed execution for new problems.
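A hypothetical sketch of how one such training example might be assembled (the prompt/target layout is an assumption, not the released dataset's schema):

```python
def format_sft_example(problem, high_level_thought, detailed_solution):
    """Pair a high-level strategy with its detailed execution as one SFT target."""
    prompt = f"Problem: {problem}\n"
    target = (
        "High-level thought:\n"
        f"{high_level_thought}\n\n"
        "Detailed solution:\n"
        f"{detailed_solution}"
    )
    return {"prompt": prompt, "target": target}
```

Training on both levels jointly is what lets the student later produce the strategic overview and the fine-grained steps in a single generation.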

Stage 2: Cross-Model Collaborative DPO

A well-known limitation of LLM self-correction is that models struggle to identify their own errors — they are biased by their own reasoning context. SuperCorrect addresses this with cross-model collaborative DPO, which pairs two types of correction traces: (1) self-correction traces, in which the student attempts to locate and fix the errors in its own flawed solutions, and (2) cross-model correction traces, in which the teacher locates and corrects the errors in those same solutions.

These paired traces form the preference data for DPO training: the teacher's cross-model corrections are treated as "chosen" responses, and the student's self-corrections as "rejected" responses. This teaches the student to adopt the teacher's error-correction strategies, effectively breaking the bottleneck of the student's own reasoning capacity by injecting the teacher's skills and knowledge through preference learning.
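Under the mapping described above, assembling the preference data might look like the following sketch; the record field names are hypothetical:

```python
def build_dpo_pairs(records):
    """Turn correction traces into DPO preference pairs.

    Each record is a dict with problem, flawed_solution, teacher_correction,
    and student_self_correction (illustrative field names)."""
    pairs = []
    for r in records:
        prompt = (
            f"Problem: {r['problem']}\n"
            f"Draft solution: {r['flawed_solution']}\n"
            "Identify and correct any errors."
        )
        pairs.append({
            "prompt": prompt,
            "chosen": r["teacher_correction"],        # teacher's cross-model fix
            "rejected": r["student_self_correction"], # student's biased attempt
        })
    return pairs
```

The resulting `prompt`/`chosen`/`rejected` triples are the standard input shape for DPO-style preference optimization.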

+7.8% vs DeepSeekMath-7B on MATH · +15.1% vs Qwen2.5-Math-7B on GSM8K · 70.2% MATH accuracy (7B SOTA)

SuperCorrect-7B achieves 70.2% on MATH and 89.5% on GSM8K, establishing new SOTA among all 7B models at the time of publication. Three model variants are released: SuperCorrect-Qwen-7B, SuperCorrect-DeepSeek-7B, and SuperCorrect-Llama-7B, along with the training datasets for both stages.

ReasonFlux: Hierarchical RL over Template Trajectories

ReasonFlux
Hierarchical LLM Reasoning via Scaling Thought Templates

A hierarchical reasoning framework that trains a model via RL to plan optimal sequences of thought templates, with an inference-time scaling system that adaptively expands template trajectories. Trained on 8 GPUs; surpasses o1-preview. NeurIPS 2025 Spotlight.

From Single Templates to Template Trajectories

BoT retrieves one template per problem; SuperCorrect distills a two-level template per problem. But truly complex reasoning — competition-level math, multi-step proofs, olympiad problems — often requires composing multiple strategies in sequence: first formulate the problem algebraically, then identify a symmetry, then apply an inequality bound, then verify edge cases. ReasonFlux makes this explicit by optimizing over template trajectories: ordered sequences of thought templates, where each template guides one phase of the overall solution.

Structured Template Library

ReasonFlux constructs a library of approximately 500 high-level thought templates, each representing a reusable reasoning strategy (e.g., "apply Cauchy-Schwarz inequality," "reduce to modular arithmetic," "use pigeonhole principle on the complement"). Templates are structured for efficient retrieval with metadata about applicable problem types, prerequisite conditions, and expected outcomes. Unlike BoT's dynamically growing buffer, this library is curated and fixed, making it more stable for RL training.
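One library entry, with the metadata described above, might be represented like this (field names are paraphrases, not the paper's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class LibraryTemplate:
    name: str
    strategy: str
    applicable_types: list                       # problem categories this suits
    prerequisites: list = field(default_factory=list)
    expected_outcome: str = ""

def candidates_for(library, problem_type):
    """Retrieve templates whose metadata matches the problem category."""
    return [t for t in library if problem_type in t.applicable_types]
```

Because the library is fixed rather than dynamically growing, this retrieval target stays stable across RL training.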

Hierarchical Reinforcement Learning

The key innovation is performing RL not on raw token-level CoT trajectories (which can be thousands of tokens long) but on the high-level template trajectory. For a given problem, the model proposes a sequence of templates — a "plan" — and then instantiates each template to produce detailed reasoning steps. RL optimizes over the template sequence using preference pairs: multiple candidate trajectories are sampled, evaluated against the ground-truth answer, and the model learns to prefer trajectories that lead to correct solutions.
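The sample-score-prefer loop over template trajectories can be sketched as follows, with `propose` standing in for the policy model and `instantiate_and_check` standing in for template instantiation plus answer checking (both hypothetical stand-ins):

```python
def collect_trajectory_preferences(problem, answer, propose,
                                   instantiate_and_check, n_samples=4):
    """Sample candidate template trajectories and build preference pairs.

    A trajectory is an ordered list of template names, e.g.
    ["algebraic_setup", "apply_inequality"]."""
    scored = []
    for _ in range(n_samples):
        trajectory = propose(problem)
        correct = instantiate_and_check(problem, trajectory, answer)
        scored.append((trajectory, correct))
    chosen = [t for t, ok in scored if ok]
    rejected = [t for t, ok in scored if not ok]
    # Pair each correct trajectory with each incorrect one; these pairs
    # drive preference optimization over the (small) template space.
    return [(c, r) for c in chosen for r in rejected]
```

Note that preferences are expressed over short template sequences, not over thousands of raw tokens, which is exactly the search-space reduction the paragraph above describes.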

This hierarchical approach has two major advantages. First, it dramatically reduces the search space: instead of exploring the combinatorial space of all possible token sequences, RL operates over a much smaller space of template orderings. Second, the resulting reasoning is more explainable — each step in the solution is labeled with its high-level strategy, making the reasoning structure transparent and interpretable.

Inference-Time Template Scaling

ReasonFlux introduces a novel inference scaling system that dynamically adjusts the number and complexity of templates at test time. For simpler problems, one or two templates suffice; for competition-level problems, the system may allocate many templates with deeper instantiation at each step. This adaptive scaling achieves a better exploration-exploitation trade-off: the model explores broadly when uncertain (sampling diverse template combinations) and exploits confidently when a clear strategy emerges (committing to a specific template trajectory and refining execution).

Concretely, the system retrieves candidate templates for each sub-problem, scores them using the RL-trained policy, and either commits to the top-scoring template or explores alternatives based on a confidence threshold. This produces template trajectories that are more explainable than the raw long-CoT outputs of models like DeepSeek-R1, because each reasoning phase has an explicit strategic label.
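The commit-or-explore decision might be sketched with a simple score-margin rule; the margin threshold and `keep` count are illustrative assumptions, not the system's actual mechanism:

```python
def select_templates(candidates, score, confidence_threshold=0.2, keep=3):
    """Commit to the top template if its lead is decisive, else keep exploring."""
    ranked = sorted(candidates, key=score, reverse=True)
    if len(ranked) == 1:
        return ranked  # nothing to compare against: commit
    margin = score(ranked[0]) - score(ranked[1])
    if margin >= confidence_threshold:
        return ranked[:1]    # exploit: commit to the clear winner
    return ranked[:keep]     # explore: keep top alternatives for sampling
```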

91.2% on MATH (+6.7% vs o1-preview) · 56.7% on AIME (+27% vs o1-preview) · trained on 8 GPUs

ReasonFlux-32B achieves 91.2% on MATH (surpassing o1-preview by 6.7%) and solves an average of 56.7% of AIME problems (surpassing o1-preview by 27% and DeepSeek-V3 by 45%). These results were achieved with only 8 GPUs of training compute, demonstrating the efficiency gains from operating in template space rather than raw token space. The model family has since expanded to include ReasonFlux-F1-7B/14B/32B (SFT distillations of ReasonFlux-Zero trajectories), ReasonFlux-Coder, and ReasonFlux-PRM (trajectory-aware process reward models).

The Common Thread

The best reasoning systems don't just think harder — they think more strategically, by reusing and composing high-level problem-solving patterns rather than generating every reasoning step from scratch.

Across these three projects, one idea evolves through three levels of integration. BoT introduces thought templates as an external retrieval system at inference time — powerful but separate from the model. SuperCorrect integrates templates into the model's weights through distillation, and adds the ability to self-correct using teacher-derived error traces. ReasonFlux makes templates first-class objects in the training loop itself, using RL to learn which sequences of templates lead to correct solutions on the hardest problems.

Each step brings templates closer to the core of the model's reasoning process: from external retrieval (BoT) to weight-level internalization (SuperCorrect) to RL-optimized planning (ReasonFlux). And each step produces stronger results: BoT closes a 10× parameter gap; SuperCorrect sets 7B SOTA; ReasonFlux surpasses o1-preview on competition math with 8 GPUs. The lesson is clear: teaching models how to reason — not just giving them more tokens to reason with — is a remarkably efficient path to stronger AI.

Citation

If you find our work useful, please consider citing:

@article{yang2024bot,
  title={Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Cao, Shiyi and Xu, Minkai and Zhang, Wentao and Gonzalez, Joseph E. and Cui, Bin},
  journal={NeurIPS 2024 Spotlight},
  year={2024}
}

@article{yang2024supercorrect,
  title={SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Xu, Minkai and Gonzalez, Joseph E. and Cui, Bin and Yan, Shuicheng},
  journal={ICLR 2025},
  year={2024}
}

@article{yang2025reasonflux,
  title={ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates},
  author={Yang, Ling and Yu, Zhaochen and Cui, Bin and Wang, Mengdi},
  journal={arXiv preprint arXiv:2502.06772},
  year={2025}
}