What is DiffusionGemma?
DiffusionGemma is an experimental open model from Google. It uses discrete text diffusion to generate entire blocks of text at once, rather than one token at a time. It is built on the 26B (4B active) Mixture-of-Experts (MoE) Gemma 4 architecture and released under an Apache 2.0 license. Google built it for researchers and developers who need fast, interactive local workflows — things like in-line editing and rapid iteration.
How Fast Is DiffusionGemma?
According to Google's blog post, DiffusionGemma is up to 4x faster than autoregressive Gemma models on GPUs. Reported speeds are 1,000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.
The speed comes from parallel decoding. The model generates a 256-token block in one forward pass. Every token attends to all others at once via bi-directional attention. This moves the compute bottleneck away from memory bandwidth.
How Does Discrete Text Diffusion Work?
Standard large language models produce tokens one at a time, left to right. DiffusionGemma uses a different method called block-autoregressive multi-canvas sampling. It starts with a noisy block of tokens and refines them in parallel until the block is complete.
You might also like
The official model documentation describes this as "iteratively denoising blocks of tokens (a 'canvas') in parallel to dramatically boost decoding speeds." An autoregressive encoder first processes and caches the prompt. The diffusion decoder then refines the output canvas using bi-directional attention.
What Hardware Does DiffusionGemma Need?
DiffusionGemma activates only 3.8B parameters during inference, even though its total parameter count is 26B. When quantized, it fits within 18GB VRAM. That puts it within reach of consumer GPUs for local use.
Google recommends a maximum of 48 denoising steps per canvas. Adaptive early stopping usually cuts that to 12–16 steps, depending on the task. Simpler tasks — like structured code generation — finish in fewer steps.
What Are the Tradeoffs vs. Gemma 4?
Google is direct about the limits. DiffusionGemma scores below Gemma 4 on standard benchmarks. Google frames this as an acceptable cost for speed-critical use cases.
The parallel decoding advantage also shrinks under heavy cloud workloads. As Google's documentation notes, the throughput benefit is strongest at low-to-medium batch sizes on a single accelerator. That makes it a local-inference tool, not a cloud-serving replacement.
Where Can Developers Get DiffusionGemma?
Model weights are available on Hugging Face, Kaggle, and Google's Vertex AI Model Garden. Developers can fine-tune it using Google's open-source hackable_diffusion repository on GitHub. The model accepts text, image, and video inputs. Audio input is not supported.
DiffusionGemma also includes a Thinking Mode — configurable reasoning channels that let it work through a problem step by step before producing a final answer.
Sources: Google AI Developer Docs · Google Blog · Let's Data Science

0 Comments
Log in to comment
Not a member yet? Join the community
Pick a meme
KlipyHave a great take?
Drop your email — we'll send a magic link so you can post it. No password.
Not a member of the community? Join today.
Join the community →