DiffusionGemma: Google's 4x Faster Text Generation Model

Google's DiffusionGemma generates entire blocks of text at once, hitting 1,000+ tokens per second on an H100 — up to 4x faster than standard autoregressive Gemma models.

What is DiffusionGemma?

DiffusionGemma is an experimental open model from Google. It uses discrete text diffusion to generate entire blocks of text at once, rather than one token at a time. It is built on the 26B (4B active) Mixture-of-Experts (MoE) Gemma 4 architecture and released under an Apache 2.0 license. Google built it for researchers and developers who need fast, interactive local workflows — things like in-line editing and rapid iteration.

How Fast Is DiffusionGemma?

According to Google's blog post, DiffusionGemma is up to 4x faster than autoregressive Gemma models on GPUs. Reported speeds are 1,000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.
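To put those figures in terms of waiting time, the arithmetic below converts the reported rates into decode time for one response. It is a rough sketch: the 500-token response length is a made-up example, and the autoregressive baseline is only what the "up to 4x" claim would imply at its upper bound, not a number Google reported.

```python
# Reported decode rates from the article (tokens per second).
REPORTED_TOKENS_PER_SECOND = {
    "NVIDIA H100": 1000,
    "NVIDIA GeForce RTX 5090": 700,
}
MAX_SPEEDUP = 4  # "up to 4x", so an upper bound rather than a typical value


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Decode time only; ignores prompt processing and model loading."""
    return num_tokens / tokens_per_second


def implied_baseline_rate(diffusion_rate: float, speedup: float = MAX_SPEEDUP) -> float:
    """Autoregressive rate implied if the full speedup applied (an assumption)."""
    return diffusion_rate / speedup


if __name__ == "__main__":
    response_tokens = 500  # hypothetical response length
    for gpu, rate in REPORTED_TOKENS_PER_SECOND.items():
        fast = seconds_to_generate(response_tokens, rate)
        slow = seconds_to_generate(response_tokens, implied_baseline_rate(rate))
        print(f"{gpu}: {fast:.2f} s vs about {slow:.2f} s autoregressive")
```

Under these assumptions, a 500-token reply on an H100 takes about half a second instead of roughly two seconds.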

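The speed comes from parallel decoding. The model works on one 256-token block at a time, and each forward pass updates every position in that block in parallel. Every token attends to all others at once via bi-directional attention. Because each pass does far more work per weight read than single-token decoding, the bottleneck shifts from memory bandwidth toward compute.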
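One rough way to see the parallelism is to count tokens finalized per forward pass. The sketch below uses the 256-token block size and the denoising step counts this article cites (12 to 16 typical, 48 maximum). It assumes every pass costs the same, which is a simplification: a pass over a full block is heavier than a single-token autoregressive pass.

```python
BLOCK_TOKENS = 256  # canvas size stated in the article


def tokens_per_forward_pass(denoising_steps: int, block_tokens: int = BLOCK_TOKENS) -> float:
    """Average tokens finalized per forward pass when a block takes N passes."""
    if denoising_steps < 1:
        raise ValueError("denoising_steps must be at least 1")
    return block_tokens / denoising_steps


if __name__ == "__main__":
    # For comparison, an autoregressive decoder finalizes 1 token per forward pass.
    for steps in (12, 16, 48):
        print(f"{steps} steps -> {tokens_per_forward_pass(steps):.1f} tokens per pass")
```

Even at the 48-step ceiling the block averages more than five tokens per pass, which is where the headroom over one-token-at-a-time decoding comes from.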

How Does Discrete Text Diffusion Work?

Standard large language models produce tokens one at a time, left to right. DiffusionGemma uses a different method called block-autoregressive multi-canvas sampling. It starts with a noisy block of tokens and refines them in parallel until the block is complete.

The official model documentation describes this as "iteratively denoising blocks of tokens (a 'canvas') in parallel to dramatically boost decoding speeds." An autoregressive encoder first processes and caches the prompt. The diffusion decoder then refines the output canvas using bi-directional attention.
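The loop described above can be illustrated with a toy sampler. This is a minimal sketch, not Google's implementation: it uses confidence-based unmasking, a common sampler for masked discrete diffusion, and the article does not confirm that DiffusionGemma samples this way. The `toy_denoiser` function, the `<mask>` token, and all other names are hypothetical, and the "model" simply peeks at a known target instead of running a neural network.

```python
import random

MASK = "<mask>"  # stand-in for a "noisy" position; an assumption of this sketch


def toy_denoiser(canvas, target, rng):
    """Hypothetical model: proposes a (token, confidence) pair for every masked position.

    A real denoiser would be a neural network attending to the whole canvas
    plus the cached prompt; here it simply peeks at a known target.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def denoise_block(target, steps, rng):
    """Refine one canvas from all-masked to complete in at most `steps` passes."""
    canvas = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    passes = 0
    while MASK in canvas:
        proposals = toy_denoiser(canvas, target, rng)  # one parallel "forward pass"
        # Commit the most confident proposals; the rest stay masked for later passes.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:per_step]
        for i in best:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes


def generate(target_tokens, block_size, steps, seed=0):
    """Block-autoregressive outer loop: blocks are produced left to right,
    while tokens inside each block are filled in parallel."""
    rng = random.Random(seed)
    output, total_passes = [], 0
    for start in range(0, len(target_tokens), block_size):
        block, passes = denoise_block(target_tokens[start:start + block_size], steps, rng)
        output.extend(block)
        total_passes += passes
    return output, total_passes


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog again and again".split()
    tokens, passes = generate(text, block_size=4, steps=2)
    print(" ".join(tokens))
    print(f"{len(tokens)} tokens in {passes} forward passes")
```

With 12 tokens, blocks of 4, and 2 steps per block, the toy run needs 6 passes where a token-by-token decoder would need 12. The outer loop stays sequential across blocks; only the work inside each block is parallel.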

What Hardware Does DiffusionGemma Need?

DiffusionGemma activates only 3.8B parameters during inference, even though its total parameter count is 26B. When quantized, it fits within 18GB VRAM. That puts it within reach of consumer GPUs for local use.
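A back-of-the-envelope calculation shows why quantization matters here. The sketch below counts weight memory only and sizes it from the 26B total rather than the 3.8B active parameters, on the assumption that all experts stay loaded in VRAM, as is typical for MoE inference without offloading. The article does not say which quantization scheme yields the 18GB figure, so the bit widths below are illustrative.

```python
TOTAL_PARAMS = 26e9    # total parameters, per the article
ACTIVE_PARAMS = 3.8e9  # parameters used per token, per the article


def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
```

At 4 bits per parameter the weights alone come to about 13 GB, which would leave a few gigabytes of an 18GB budget for activations and caches. At 16 bits the same weights need about 52 GB, well beyond a single consumer GPU.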

Google recommends a maximum of 48 denoising steps per canvas. Adaptive early stopping usually cuts that to 12–16 steps, depending on the task. Simpler tasks — like structured code generation — finish in fewer steps.
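The article does not spell out the stopping rule, so the following is only one plausible shape for adaptive early stopping: halt once the canvas has stopped changing for a couple of consecutive passes, with the 48-step recommendation as a hard ceiling. The `patience` parameter, the convergence test, and the toy denoiser are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

MAX_STEPS = 48  # recommended ceiling quoted in the article


def denoise_with_early_stop(
    canvas: List[str],
    step_fn: Callable[[List[str]], List[str]],
    max_steps: int = MAX_STEPS,
    patience: int = 2,
) -> Tuple[List[str], int]:
    """Run step_fn until the canvas stops changing or max_steps is reached.

    `patience` is the number of consecutive unchanged passes required
    before stopping; both it and the criterion are assumptions.
    """
    unchanged = 0
    for step in range(1, max_steps + 1):
        new_canvas = step_fn(canvas)
        unchanged = unchanged + 1 if new_canvas == canvas else 0
        canvas = new_canvas
        if unchanged >= patience:
            return canvas, step
    return canvas, max_steps


def make_toy_step(target: List[str], fixes_per_step: int) -> Callable[[List[str]], List[str]]:
    """Hypothetical denoiser: corrects up to `fixes_per_step` wrong positions per pass."""
    def step(canvas: List[str]) -> List[str]:
        out = list(canvas)
        wrong = [i for i, tok in enumerate(out) if tok != target[i]]
        for i in wrong[:fixes_per_step]:
            out[i] = target[i]
        return out
    return step


if __name__ == "__main__":
    target = [f"tok{i}" for i in range(256)]
    noisy = ["?"] * 256
    _, steps_used = denoise_with_early_stop(noisy, make_toy_step(target, 32))
    print(f"stopped after {steps_used} of {MAX_STEPS} steps")
```

In this toy run the canvas converges after 8 passes and the loop exits after 10, once two unchanged passes confirm it. A task the denoiser resolves faster, modeled here by a larger `fixes_per_step`, exits in fewer steps.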

What Are the Tradeoffs vs. Gemma 4?

Google is direct about the limits. DiffusionGemma scores below Gemma 4 on standard benchmarks. Google frames this as an acceptable cost for speed-critical use cases.

The parallel decoding advantage also shrinks under heavy cloud workloads. As Google's documentation notes, the throughput benefit is strongest at low-to-medium batch sizes on a single accelerator. That makes it a local-inference tool, not a cloud-serving replacement.

Where Can Developers Get DiffusionGemma?

Model weights are available on Hugging Face, Kaggle, and Google's Vertex AI Model Garden. Developers can fine-tune it using Google's open-source hackable_diffusion repository on GitHub. The model accepts text, image, and video inputs. Audio input is not supported.

DiffusionGemma also includes a Thinking Mode — configurable reasoning channels that let it work through a problem step by step before producing a final answer.

Sources: Google AI Developer Docs · Google Blog · Let's Data Science

Frequently asked questions

What is DiffusionGemma?
DiffusionGemma is an experimental open model from Google that uses discrete text diffusion to generate 256-token blocks in parallel, rather than producing one token at a time. It is built on the 26B MoE Gemma 4 architecture and released under an Apache 2.0 license.

How fast is DiffusionGemma compared to standard models?
Google reports up to 4x faster text generation on GPUs versus autoregressive Gemma models, with throughput of 1,000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.

How much VRAM does DiffusionGemma require?
When quantized, DiffusionGemma fits within 18GB VRAM. It activates only 3.8B of its 26B total parameters during inference, making it compatible with consumer GPUs for local use.

Sources

  1. Google's blog post (blog.google)
  2. Official model documentation (ai.google.dev)
  3. Let's Data Science (letsdatascience.com)
