The Modular Pipeline Had a Good Run

For the last few years, if you were building a multimodal AI system, the architecture basically wrote itself. You'd grab a vision encoder (something pretrained to turn images into vectors), wire it up to your language model, maybe add a projection layer so the shapes matched, and call it done. If you needed audio, you added an audio encoder to the front. The pipeline was modular, composable, and well understood.

It worked. And Google just quietly retired it.

What Gemma 4 12B actually changed

When Google released Gemma 4 12B on June 3rd, most coverage focused on the benchmark numbers and the fact that it runs on a laptop. Both are real. But the architectural change buried in the release notes is more interesting than either.

Gemma 4 12B has no multimodal encoders. Not a smaller one. Not a faster one. None.

For vision, Google replaced the encoder with a single matrix multiplication plus positional embeddings and normalization. The LLM backbone takes over from there. For audio, they went further still: the raw audio signal is projected directly into the same embedding space as the text tokens, and the backbone treats the result as just more tokens.
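To make that concrete, here is a minimal NumPy sketch of what an encoder-free front end could look like. It is not Gemma's actual code. The patch size, frame length, model width, the choice of RMS normalization, and the function names `embed_image` and `embed_audio` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only to keep the example small.
D_MODEL = 64   # width of the backbone's token embeddings
PATCH = 8      # image patch edge length in pixels
FRAME = 160    # raw audio samples per frame


def rms_norm(x, eps=1e-6):
    # Rescale each token vector to (approximately) unit root-mean-square.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def embed_image(image, w_patch, pos_emb):
    # Cut an (H, W, C) image into non-overlapping PATCH x PATCH tiles
    # and flatten each tile into one row vector.
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    tiles = image[: gh * PATCH, : gw * PATCH].reshape(gh, PATCH, gw, PATCH, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(gh * gw, PATCH * PATCH * c)
    tokens = tiles @ w_patch              # the single matrix multiplication
    tokens = tokens + pos_emb[: gh * gw]  # positional embeddings
    return rms_norm(tokens)               # normalization


def embed_audio(waveform, w_audio):
    # Chop the raw waveform into fixed-size frames and project each frame
    # straight into the token embedding space.
    n = len(waveform) // FRAME
    frames = waveform[: n * FRAME].reshape(n, FRAME)
    return frames @ w_audio


# Random stand-ins for real inputs and learned parameters.
image = rng.random((32, 32, 3))
waveform = rng.standard_normal(1600)
text_tokens = rng.standard_normal((5, D_MODEL))
w_patch = rng.standard_normal((PATCH * PATCH * 3, D_MODEL)) * 0.02
pos_emb = rng.standard_normal((16, D_MODEL)) * 0.02
w_audio = rng.standard_normal((FRAME, D_MODEL)) * 0.02

image_tokens = embed_image(image, w_patch, pos_emb)
audio_tokens = embed_audio(waveform, w_audio)

# One flat sequence: the backbone sees text, image, and audio as rows
# of the same matrix.
sequence = np.concatenate([text_tokens, image_tokens, audio_tokens], axis=0)
print(sequence.shape)
```

The point of the sketch is how little sits in front of the backbone: a reshape, one matrix product per modality, and a concatenation. Everything that a pretrained encoder used to do has to be learned by the layers that come after.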

That's not a minor optimization. That's a different philosophy about where perception should happen.

Why the encoder pipeline existed in the first place

The modular approach made sense when it was invented. Foundation vision models like CLIP had been trained on billions of image-text pairs to build rich visual representations. Why throw that away and start from scratch? You wouldn't. You'd plug it in.

The problem is that plugging it in comes with costs that compound as systems get more ambitious. Separate encoders add memory footprint — you're loading multiple pretrained models, not one. They add latency — each modality gets processed independently before anything meaningful can happen between them. And they add a subtle alignment problem: the encoder learned to represent vision in service of its training objective, not yours. At the fusion point, you're hoping the representations line up well enough. Usually they do. Sometimes they don't, in ways that are hard to diagnose.

When the encoder is gone and the backbone handles everything, cross-modal reasoning can happen at every layer of the network, not just at the point where the outputs get stitched together. The model can attend to a word and an image region in the same pass, at the same depth, with the same machinery. That's a different kind of understanding.
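A toy illustration of that idea, again in NumPy and again a sketch rather than anything taken from the model: one single-head self-attention layer run over a sequence that mixes text and image tokens. The token counts, width, random weights, and the lack of a causal mask are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32                  # hypothetical model width
N_TEXT, N_IMAGE = 4, 6  # hypothetical token counts


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, wq, wk, wv):
    # Plain single-head scaled dot-product attention, no mask.
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


# Text tokens first, image tokens after, all in one sequence.
x = np.concatenate([
    rng.standard_normal((N_TEXT, D)),
    rng.standard_normal((N_IMAGE, D)),
])
wq, wk, wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

out, weights = self_attention(x, wq, wk, wv)

# Share of the first text token's attention that lands on image tokens.
text_to_image = weights[0, N_TEXT:].sum()
print(weights.shape, float(text_to_image))
```

Every row of `weights` spans both modalities, so a text token can draw directly on image tokens inside this layer. Stack many such layers and that exchange happens at every depth, rather than once at a fusion point.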

What this means for the pipelines being built right now

If you're designing a multimodal system today, you're probably reaching for familiar components. A vision encoder here. A projection layer there. Maybe an audio encoder if you need it. That pattern is well-documented, has clear failure modes, and is supported by every major framework.

It's also the pattern that's being phased out.

I don't mean this as a warning to panic and rebuild everything. The modular pipeline will run in production for years — there's nothing wrong with it that demands an emergency migration. But if you're starting something new, or making architecture choices that will last a while, the direction is clear. The field is moving toward unified backbones that handle all modalities natively, and the tradeoffs are shifting in that direction too: less memory, less latency, fewer seams where things can go wrong.

The encoder isn't dying because it failed. It's retiring because the backbone got good enough to do the job itself.

The pattern underneath the pattern

There's a recurring shape to how ML architecture evolves. A specialized component does something the general model can't. It gets bolted on. It becomes standard. The general model trains longer, on more data, with better techniques — and eventually absorbs the specialized component entirely. The bolt-on becomes a footnote.

We saw it with attention mechanisms absorbing recurrence. We saw it with in-context learning absorbing fine-tuning for certain tasks. Now we're watching it happen with multimodal perception.

The lesson isn't that modular architectures are bad. They're often the right tool when the backbone isn't strong enough to handle everything. The lesson is that "the backbone isn't strong enough yet" has a shelf life.

Gemma 4 12B is sitting on a laptop with 16GB of RAM, processing images and audio without a single dedicated encoder, and hitting benchmark numbers that match much larger models built the old way.

The encoder had a good run. The backbone is ready now.
