Building Diffusion Language Models That Reason, See, and Generate
Ling Yang, Princeton University

Diffusion language models (dLLMs) represent a fundamentally different way to generate text: instead of predicting tokens left to right, they denoise an entire sequence in parallel. This architectural choice opens the door to genuinely new capabilities — bidirectional reasoning, parallel multimodal generation, and flexible inference-time compute allocation — but it also demands a new training and post-training stack built from the ground up.

Over the past year, our team has built that stack, one layer at a time. The three projects in this blog form a single, continuous engineering effort. TraceRL solves the foundational problem of how to do RL for dLLMs at all — aligning post-training with the model's actual inference trajectory. MMaDA then uses this RL infrastructure to build the first unified multimodal diffusion model that can reason over text, understand images, and generate images under a single architecture. MMaDA-Parallel takes the final step: it enables text and images to be denoised simultaneously, with bidirectional attention at every step, solving the error propagation problem that plagues sequential thinking-aware generation.

Each project directly depends on the one before it. TraceRL's trajectory-aware training became MMaDA's UniGRPO algorithm; MMaDA's unified discrete diffusion architecture became MMaDA-Parallel's backbone; and MMaDA-Parallel's ParaRL strategy extends the RL toolkit to dense trajectory-level rewards. This blog walks through the three projects in that order.

Sep 2025 · Preprint

Step 1: TraceRL / TraDo — RL Infrastructure for dLLMs

A trajectory-aware RL framework that aligns post-training with diffusion inference traces, plus a diffusion-based value model for stability. Produces SOTA dLLMs (TraDo-4B/8B) that beat 7B AR models.

May 2025 · NeurIPS 2025

Step 2: MMaDA — Unified Multimodal Diffusion Foundation

Building on TraceRL's RL toolkit, MMaDA unifies text reasoning, multimodal understanding, and image generation under a single diffusion architecture with shared probabilistic formulation and UniGRPO post-training.

Nov 2025 · Preprint

Step 3: MMaDA-Parallel — Parallel Thinking-Aware Generation

Inheriting MMaDA's architecture, MMaDA-Parallel enables bidirectional text-image denoising at every step. ParaRL applies semantic rewards along the trajectory to enforce cross-modal consistency.

TraceRL & TraDo: Reinventing RL for Diffusion Language Models

TraceRL / TraDo

Trajectory-Aware RL Framework for dLLMs

A comprehensive framework for building, training, and deploying diffusion LLMs across full-attention and block-attention architectures, with SOTA models TraDo-4B/8B.

The Core Problem: Train-Inference Mismatch

Traditional dLLM training corrupts tokens with random masking, but inference follows a structured, sequential unmasking trajectory — high-confidence tokens are revealed first, then progressively harder ones. This creates a fundamental mismatch: the model is trained on uniformly random corruption patterns but must perform well on the structured patterns it encounters during inference.
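To make the mismatch concrete, here is a minimal sketch contrasting the two corruption patterns. The function names, the `<mask>` string, and the per-token confidence scores are illustrative assumptions, not part of the TraceRL codebase; a real model would produce the confidences itself.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumption)

def random_mask(tokens, ratio, rng):
    """Training-style corruption: mask a uniformly random subset of positions."""
    k = round(len(tokens) * ratio)
    masked = set(rng.sample(range(len(tokens)), k))
    return [MASK if i in masked else t for i, t in enumerate(tokens)]

def confidence_ordered_states(tokens, confidence, per_step):
    """Inference-style trajectory: reveal the most confident positions first."""
    order = sorted(range(len(tokens)), key=lambda i: -confidence[i])
    state = [MASK] * len(tokens)
    states = [list(state)]  # fully masked starting point
    for start in range(0, len(order), per_step):
        for i in order[start:start + per_step]:
            state[i] = tokens[i]
        states.append(list(state))
    return states

tokens = ["the", "answer", "is", "42"]
print(random_mask(tokens, 0.5, random.Random(0)))
for state in confidence_ordered_states(tokens, [0.9, 0.4, 0.8, 0.2], per_step=2):
    print(state)
```

The random mask can hide any two positions, while the inference trajectory only ever visits the partially masked states dictated by the confidence order. Those are the states the model actually has to handle at test time.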

TraceRL: Trajectory-Aware Training

TraceRL closes this gap by incorporating the model's preferred inference trajectory directly into RL post-training. During each training iteration, the framework first samples a complete inference trace — the sequence of denoising steps the model would actually execute — then uses this trace to construct the training objective. This means the model is optimized on the exact corruption-to-clean patterns it will encounter at test time, rather than on random masks.
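One way to picture "using the trace to construct the training objective" is to replay a recorded trace and emit one training pair per denoising step. The sketch below assumes the trace is stored as a list of position groups (`reveal_steps`), which is a hypothetical representation chosen for illustration; the actual framework may record traces differently.

```python
MASK = "<mask>"  # placeholder mask symbol (assumption)

def trace_to_training_pairs(final_tokens, reveal_steps):
    """Replay a recorded inference trace.

    reveal_steps[k] lists the positions that were unmasked at step k.
    Returns one (partially masked input, {position: target token}) pair per step.
    """
    state = [MASK] * len(final_tokens)
    pairs = []
    for positions in reveal_steps:
        targets = {i: final_tokens[i] for i in positions}
        pairs.append((list(state), targets))  # input is the state *before* this step
        for i in positions:
            state[i] = final_tokens[i]
    return pairs

pairs = trace_to_training_pairs(["a", "b", "c"], [[0, 2], [1]])
for noisy_input, targets in pairs:
    print(noisy_input, "->", targets)
```

Each pair asks the model to predict exactly the tokens it chose to reveal at that step, given exactly the context it had at that step.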

A key innovation is the shrinkage parameter s, which controls how many trajectory steps are used in each training update. Backpropagating through every denoising step is prohibitively expensive, so TraceRL selects a sparse subset of steps along the trajectory, cutting training cost by roughly a factor of s while preserving gradient quality.
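The text does not spell out the exact selection rule, so the following is only a sketch under one simple assumption: keep every s-th step and always include the final step. The function name `shrink_trajectory` is hypothetical.

```python
def shrink_trajectory(num_steps, s):
    """Return 0-based indices of the denoising steps kept for one update.

    Assumed rule: keep every s-th step, and always keep the last step so the
    fully denoised output still contributes to the objective.
    """
    if s < 1 or num_steps < 1:
        raise ValueError("num_steps and s must be positive")
    kept = list(range(s - 1, num_steps, s))
    if not kept or kept[-1] != num_steps - 1:
        kept.append(num_steps - 1)
    return kept

print(shrink_trajectory(12, 4))
print(shrink_trajectory(10, 4))
```

With 12 steps and s = 4, three steps are trained on instead of twelve, which is where the roughly s-fold saving comes from.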

Diffusion-Based Value Model

Vanilla policy-gradient methods suffer from high variance in the diffusion setting because a single trajectory contains dozens of denoising steps with correlated noise. TraceRL introduces a diffusion-based value model that estimates the expected return at each step of the denoising trajectory, providing per-step baselines that dramatically reduce gradient variance. This value model also naturally accommodates process reward models (PRMs), enabling fine-grained supervision at intermediate denoising steps.
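The role of a per-step baseline can be sketched with plain return-minus-value advantages. This is a generic illustration, not the TraceRL estimator: the value predictions are hard-coded numbers standing in for the diffusion-based value model, and the discount `gamma` is an assumed parameter.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted sum of rewards from each denoising step to the end."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def step_advantages(rewards, values, gamma=1.0):
    """Advantage at each step: observed return minus the value baseline."""
    return [g - v for g, v in zip(returns_to_go(rewards, gamma), values)]

# Outcome-only reward: nothing until the final step.
print(step_advantages([0.0, 0.0, 1.0], values=[0.5, 0.7, 0.9]))
# Process rewards simply appear as non-zero entries at intermediate steps.
print(step_advantages([0.2, 0.0, 1.0], values=[0.5, 0.7, 0.9]))
```

Because the baseline is evaluated per step, each step's gradient is scaled by how much better the outcome was than expected from that particular partially denoised state.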

Engineering: The dLLM-RL Framework

Beyond the algorithm, the release includes a full-stack open-source framework supporting: (1) multiple RL methods — TraceRL, coupled RL, and random-masking RL; (2) both full-attention (LLaDA-style) and block-attention (SDAR-style) dLLM architectures; (3) accelerated inference via improved KV-cache and JetEngine integration; (4) SFT pipelines including block SFT, semi-AR SFT, and long-CoT fine-tuning with multi-node support; (5) block-size adaptation — TraceRL can adapt a model trained on block size B=4 to B=8, improving sampling flexibility without retraining from scratch.

+6.1% · vs Qwen2.5-7B on math

+51.3% · vs Llama3.1-8B on math

+18.1% · 1st long-CoT dLLM on MATH500

TraDo-4B-Instruct, despite being smaller than 7B-scale AR models, consistently outperforms them on complex math reasoning. TraDo-8B-Thinking is the first long-CoT diffusion language model, trained via curriculum learning that progressively extends reasoning length. These results establish that diffusion language models are not just a theoretical curiosity — with the right RL framework, they can match and beat the best autoregressive models at reasoning.

MMaDA: One Architecture, Three Modalities

MMaDA

Multimodal Large Diffusion Language Models

The first unified diffusion foundation model that achieves strong performance on textual reasoning, multimodal understanding, and text-to-image generation — all under a single architecture and training objective.

Unified Discrete Diffusion Architecture

Most multimodal models bolt together modality-specific components — separate encoders, decoders, and loss functions for text and images. MMaDA takes a radically different approach: both text and images are represented as discrete tokens, and a single masked diffusion process operates over the combined token space. The model's training objective is simply to predict masked tokens, regardless of whether they represent words or image patches. This eliminates modality-specific engineering and enables genuine cross-modal reasoning during denoising.

Shared Probabilistic Formulation

Text is tokenized using the LLaDA tokenizer; images are encoded into discrete visual tokens via a VQ codebook. Both are arranged into a single sequence and corrupted by the same masked diffusion forward process. The reverse process — a single Transformer predicting all masked tokens — recovers both text and image content simultaneously. This shared formulation ensures that the model learns cross-modal interactions naturally during pretraining, without explicit alignment losses or contrastive objectives.
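A toy version of this shared forward process is sketched below. The vocabulary size, the mask id, and the choice to offset image codes past the text vocabulary are assumptions made for illustration, not MMaDA's actual token layout.

```python
import random

TEXT_VOCAB_SIZE = 1000  # hypothetical text vocabulary size
MASK_ID = -1            # hypothetical id for the mask token

def build_joint_sequence(text_ids, image_codes):
    """Concatenate text ids and VQ codes, offsetting image codes so ids never collide."""
    return list(text_ids) + [TEXT_VOCAB_SIZE + c for c in image_codes]

def forward_mask(seq, t, rng):
    """Masked-diffusion forward process: mask each token independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in seq]

def loss_positions(noisy):
    """The mask-prediction loss is taken only over masked positions, text or image alike."""
    return [i for i, tok in enumerate(noisy) if tok == MASK_ID]

seq = build_joint_sequence([5, 7], [0, 3])
noisy = forward_mask(seq, t=0.5, rng=random.Random(0))
print(seq, noisy, loss_positions(noisy))
```

Nothing in `forward_mask` or `loss_positions` looks at modality, which is the point of the shared formulation: one corruption rule and one prediction target for the whole sequence.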

Mixed Long-CoT Fine-Tuning

To enable complex reasoning after pretraining, MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy. The key insight is that CoT reasoning should be unified across modalities: the model learns to produce step-by-step reasoning traces for math problems, image understanding questions, and even world-knowledge-aware image generation prompts, all in the same format. Diverse reasoning trajectories are generated using open-source LLMs and VLMs, then filtered by SOTA verifiers to retain only high-quality, long-form CoT samples.

This mixed-CoT stage serves a dual purpose: it teaches the model to reason, and it provides a strong initialization for the subsequent RL stage — a "cold start" that gives the RL algorithm a reasonable policy to improve upon rather than optimizing from random exploration.

UniGRPO: Unified RL for Diffusion Models

Building directly on TraceRL's insights, MMaDA introduces UniGRPO — a unified policy-gradient-based RL algorithm tailored for multimodal diffusion models. UniGRPO extends GRPO to the diffusion setting with diversified reward modeling: different reward functions for different task types (math verifiers for reasoning, CLIP-based rewards for image generation, VLM-based rewards for multimodal understanding), all optimized under the same policy gradient framework. The random-masking RL variant from the dLLM-RL toolkit is used here, with rewards computed over the generated outputs and gradients propagated through the masked diffusion objective.
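The GRPO ingredient that carries over is the group-relative advantage: sample several outputs for one prompt, score them, and normalize within the group. The sketch below shows that step with a per-task reward lookup. The reward function is a stub and every name is hypothetical; real verifiers, CLIP scorers, or VLM judges would replace it. Using the population standard deviation is also an assumption.

```python
import statistics

def score_group(task_type, outputs, reward_fns):
    """Score every sampled output with the reward function for this task type."""
    reward = reward_fns[task_type]
    return [reward(o) for o in outputs]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Stub verifier standing in for a real math checker (assumption).
reward_fns = {"math": lambda out: 1.0 if out.strip() == "42" else 0.0}

rewards = score_group("math", ["42", "41", "42 ", "7"], reward_fns)
print(rewards)
print(group_relative_advantages(rewards))
```

Because the advantages depend only on scalar rewards, the same normalization applies whether the reward came from a math verifier or an image-text similarity score.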

parameters, one architecture

> LLaMA-3 on text reasoning

> SDXL on image generation

MMaDA-8B surpasses LLaMA-3-7B and Qwen2-7B on textual reasoning, outperforms Show-o and SEED-X on multimodal understanding, and exceeds SDXL and Janus on text-to-image generation. The three-stage pipeline — pretraining, mixed-CoT SFT, UniGRPO — is fully open-sourced, with checkpoints released at each stage: MMaDA-8B-Base, MMaDA-8B-MixCoT, and MMaDA-8B-Max (after RL).

MMaDA-Parallel: Thinking and Generating in Lockstep

MMaDA-Parallel

Parallel Thinking-Aware Image Editing and Generation

A parallel multimodal diffusion framework that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory, solving the error propagation problem of sequential thinking.

The Problem: Sequential Thinking Can Hurt

"Thinking before generating" sounds intuitive: let the model reason in text, then produce an image conditioned on that reasoning. But MMaDA-Parallel identifies a critical failure mode: in sequential, autoregressive pipelines, errors in the reasoning text propagate irreversibly into the generated image. If the CoT produces a wrong spatial description, the image generator has no way to recover. The authors introduce ParaBench, a new benchmark that evaluates both text and image outputs, and show that the resulting performance degradation correlates strongly with poor alignment between the generated reasoning and the final image.

Parallel Denoising Architecture

The solution is architectural: instead of generating text first and then images, MMaDA-Parallel denoises both modalities simultaneously. Text and image tokens are arranged in an interleaved sequence with bidirectional attention, and a single mask predictor operates over the full combined sequence at each denoising step. At step t, the partially denoised text can therefore attend to the partially denoised image and vice versa, creating a continuous feedback loop in which semantic concepts in the text and their visual counterparts co-evolve.

The authors observe a striking emergent behavior: during parallel denoising, the image region corresponding to a specific semantic concept is often refined simultaneously with its textual counterpart. The model naturally learns to synchronize cross-modal information without explicit supervision of this alignment.
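A single parallel denoising step can be sketched as "reveal the k most confident masked positions, regardless of modality". The predictor here is a toy lookup table standing in for the real mask predictor, and the top-k reveal rule is an assumption about the sampler rather than a description of the released code.

```python
MASK = None  # placeholder for a masked position (assumption)

def make_toy_predictor(target, confidence):
    """Toy stand-in for the mask predictor: proposes (token, confidence) per masked slot."""
    def predict(seq):
        return {i: (target[i], confidence[i]) for i, tok in enumerate(seq) if tok is MASK}
    return predict

def parallel_denoise_step(seq, predict, k):
    """Reveal the k most confident masked positions across text and image tokens together."""
    proposals = predict(seq)
    best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:k]
    out = list(seq)
    for pos, (tok, _) in best:
        out[pos] = tok
    return out

# Positions 0-1 hold text tokens, positions 2-3 hold image tokens.
target = ["red", "cube", "img_17", "img_42"]
predict = make_toy_predictor(target, confidence=[0.9, 0.3, 0.8, 0.1])

state = [MASK] * 4
state = parallel_denoise_step(state, predict, k=2)
print(state)  # one text token and one image token are revealed in the same step
state = parallel_denoise_step(state, predict, k=2)
print(state)
```

In the first step the word "red" and an image token are committed together, so each later prediction, text or image, is conditioned on both.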

ParaRL: Reinforcement Learning Along the Trajectory

Standard SFT and conventional RL algorithms optimize only for the final output quality. ParaRL goes further: it applies semantic rewards at multiple points along the denoising trajectory, enforcing cross-modal consistency not just in the final result but throughout the generation process.

Since computing rewards at every single step is prohibitively expensive, ParaRL adopts a sparse optimization strategy: during each online rollout, a fixed subset of step indices S ⊂ {1, ..., |τ|} is pre-selected, and rewards are computed only at those steps. The method adapts a diffusion GRPO objective with token-level likelihood ratios, standardizing advantages across the sparsely sampled steps to maintain gradient quality.
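The sparse strategy can be sketched in two parts: fix the step indices up front, then standardize the rewards collected at those steps. Evenly spaced indices and pooling the standardization over all rollouts and steps are both assumptions made for this illustration; the text only states that the subset is fixed and that advantages are standardized.

```python
import statistics

def select_steps(total_steps, num_selected):
    """Pre-select evenly spaced 1-based step indices, ending at the final step."""
    if not 1 <= num_selected <= total_steps:
        raise ValueError("need 1 <= num_selected <= total_steps")
    return [(total_steps * (j + 1)) // num_selected for j in range(num_selected)]

def standardize_step_rewards(rewards_by_rollout, eps=1e-8):
    """Standardize rewards pooled over all rollouts and all selected steps.

    rewards_by_rollout[i][j] is the semantic reward of rollout i at the
    j-th selected step.
    """
    flat = [r for row in rewards_by_rollout for r in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(r - mean) / (std + eps) for r in row] for row in rewards_by_rollout]

steps = select_steps(total_steps=12, num_selected=3)
print(steps)
print(standardize_step_rewards([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]))
```

Only the selected steps require a reward-model call, so the cost scales with the size of S rather than with the trajectory length |τ|.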

Data Curation for Parallel Thinking

Training a parallel thinking-aware model requires quadruplets — (input image, instruction, reasoning trace, output image) — that don't exist in standard datasets. The team curates this data by taking existing image editing datasets and using Qwen-2.5-VL to generate plausible reasoning traces connecting instructions to output images. This synthetic data is used for supervised fine-tuning on the MMaDA backbone before applying ParaRL.
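The quadruplet and the curation step can be pictured as a small record type plus a function that fills in the missing reasoning trace. The field names, the dictionary layout of an editing example, and the `generate_trace` callable (a stand-in for a vision-language model call) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThinkingAwareSample:
    """One training quadruplet; images are referenced by path for simplicity."""
    input_image: str
    instruction: str
    reasoning_trace: str
    output_image: str

def build_sample(edit_example, generate_trace):
    """Turn an (input image, instruction, output image) triple into a quadruplet.

    generate_trace is any callable that returns a reasoning string connecting
    the instruction to the output image.
    """
    trace = generate_trace(
        edit_example["input_image"],
        edit_example["instruction"],
        edit_example["output_image"],
    )
    return ThinkingAwareSample(
        input_image=edit_example["input_image"],
        instruction=edit_example["instruction"],
        reasoning_trace=trace,
        output_image=edit_example["output_image"],
    )

# Stub generator used only to show the data flow.
def stub_trace(src, instruction, dst):
    return f"To satisfy '{instruction}', change {src} into {dst}."

example = {"input_image": "cat.png", "instruction": "add a hat", "output_image": "cat_hat.png"}
print(build_sample(example, stub_trace))
```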

Two model variants are released: MMaDA-Parallel-A (based on Amused-VQ tokenizer, trained from Lumina-DiMOO) and MMaDA-Parallel-M (based on MagVITv2, trained from MMaDA), offering different quality-robustness trade-offs.

+6.9% · Output Alignment vs Bagel

SOTA · on ParaBench (open-source)

released 8B model variants

The Common Thread

Diffusion language models are not autoregressive models with a different sampling strategy — they are a genuinely new computational paradigm that demands new training algorithms, new multimodal architectures, and new approaches to thinking-aware generation.

Across these three projects, a consistent philosophy emerges: take what works for autoregressive LLMs and reimagine it natively for the diffusion setting. TraceRL reimagines RL training by respecting the denoising trajectory. MMaDA reimagines multimodal modeling by using a single diffusion process over discrete tokens. MMaDA-Parallel reimagines thinking-aware generation by replacing sequential reasoning with parallel co-denoising.

The result is a complete, open-source stack — from low-level RL algorithms to high-level multimodal applications — that demonstrates dLLMs can match or exceed autoregressive models across text reasoning, multimodal understanding, image generation, and thinking-aware editing. And because diffusion models decode in parallel, they open capabilities (like bidirectional cross-modal attention at every step) that are simply impossible in the autoregressive paradigm.

Citation

If you find our work useful, please consider citing:

@article{wang2025tracerl,
  title={Revolutionizing Reinforcement Learning Framework
         for Diffusion Large Language Models},
  author={Wang, Yinjie and Yang, Ling and Li, Bowen
          and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2509.06949},
  year={2025}
}

@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion
         Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen
          and Zhang, Xinchen and Shen, Ke
          and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}

@article{tian2025mmadaparallel,
  title={MMaDA-Parallel: Multimodal Large Diffusion
         Language Models for Thinking-Aware Editing
         and Generation},
  author={Tian, Ye and Yang, Ling and Yang, Jiongfan
          and Wang, Anran and Tian, Yu and Zheng, Jiani
          and Wang, Haochen and Teng, Zhiyang
          and Wang, Zhuochen and Wang, Yinjie
          and Tong, Yunhai and Wang, Mengdi
          and Li, Xiangtai},
  journal={arXiv preprint arXiv:2511.09611},
  year={2025}
}
