How much did Harvey reduce inference costs in its test with Fireworks AI?

Harvey cut inference costs by 3x without any reduction in quality, using a combination of Claude Opus and Fireworks AI's GLM 5.1.

What did Brian Armstrong predict about cheaper AI models?

Armstrong wrote on X that 80% of AI workloads will run on 99% cheaper models within 12–18 months, with 20% still using the latest models for tasks where maximum capability matters.

Why does the shift to smaller models matter for OpenAI and Anthropic?

Both companies are heading toward IPOs. A large-scale move to cheaper models would reduce demand for their most advanced offerings, dealing a financial blow at a critical moment.

Is the cheaper-model trend mainly about open-source vs. proprietary AI?

No. TechCrunch reports the real divide is between large models and small ones. Savings come from switching to a smaller model regardless of whether it is open-weight or proprietary.

Cheaper AI Models Gain Ground as Costs Rise

Legal AI startup Harvey cut inference costs by 3x in a live test with Fireworks AI. Coinbase's Brian Armstrong predicts 80% of workloads will run on 99% cheaper models within 12–18 months.

What Harvey and Fireworks AI Found

Legal AI startup Harvey ran a test with inference platform Fireworks AI. The result: a 3x reduction in inference costs with no drop in quality, according to TechCrunch.

The test combined Claude Opus and Fireworks' GLM 5.1. The most intensive tasks were routed to Opus. Everything else ran on the cheaper model. Server time and overall cost dropped sharply.
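The routing idea described above can be sketched in a few lines of Python. This is only an illustration: the article does not say how Harvey decided which tasks counted as "most intensive," so the difficulty score, the 0.8 threshold, the task names, the model labels, and the per-task costs below are all hypothetical placeholders rather than details from the test.

```python
from dataclasses import dataclass

# Hypothetical model labels and per-task costs in arbitrary units.
# These are assumptions for illustration, not real prices.
FRONTIER = "frontier-model"
CHEAP = "cheaper-model"
COST_PER_TASK = {FRONTIER: 3.0, CHEAP: 0.3}


@dataclass
class Task:
    name: str
    difficulty: float  # assumed score: 0.0 (routine) to 1.0 (most intensive)


def route(task: Task, threshold: float = 0.8) -> str:
    """Send only tasks at or above the threshold to the frontier model."""
    return FRONTIER if task.difficulty >= threshold else CHEAP


def total_cost(tasks: list[Task], threshold: float = 0.8) -> float:
    """Sum the assumed per-task cost under the routing rule."""
    return sum(COST_PER_TASK[route(t, threshold)] for t in tasks)


# Made-up example workload.
tasks = [
    Task("summarize filing", 0.2),
    Task("extract clause dates", 0.1),
    Task("draft multi-jurisdiction memo", 0.9),
    Task("classify document type", 0.05),
]

routed = total_cost(tasks)
all_frontier = len(tasks) * COST_PER_TASK[FRONTIER]
print(f"routed: {routed:.2f}, all-frontier: {all_frontier:.2f}")
```

With these made-up numbers, one task in four goes to the expensive model and the total bill drops to roughly a third of the all-frontier figure. The real saving depends entirely on the price gap between the two models and on how much of the workload is hard enough to need the larger one.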

Harvey co-founder Gabe Pereyra explained the thinking behind it:

"Quality comes first, and in legal it always will. However, the definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently."

What Brian Armstrong Predicted

Coinbase co-founder Brian Armstrong posted a specific forecast on X:

"Demand for intelligence is near infinite, but 80% of workloads will be running on 99% cheaper models within 12-18 months. 20% of workloads will still run on latest gen models where IQ maxing is important."

That would be a large shift. Until recently, most AI companies competed on quality by defaulting to the most advanced model. If cheaper models can do the same work, much of the resulting savings would come directly out of the big labs' revenues.
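To get a feel for the size of that shift, Armstrong's two figures can be combined into a rough blended-cost estimate. The sketch below rests on assumptions he did not state: that every workload costs the same today, that the remaining 20% keeps paying today's price, and that total volume stays fixed (even though he also says demand is near infinite). The function name and parameters are illustrative.

```python
def blended_cost(cheap_share: float, cheap_discount: float) -> float:
    """Relative spend after moving `cheap_share` of workloads to a model
    that is `cheap_discount` cheaper (0.99 means 99% cheaper).

    Baseline spend is 1.0, assuming every workload costs the same today.
    """
    frontier_part = 1.0 - cheap_share                # still at full price
    cheap_part = cheap_share * (1.0 - cheap_discount)  # discounted share
    return frontier_part + cheap_part


relative = blended_cost(cheap_share=0.80, cheap_discount=0.99)
print(f"spend falls to {relative:.1%} of baseline, about {1 / relative:.1f}x lower")
```

Under those assumptions, total spend falls to about 20.8% of today's level (0.20 + 0.80 × 0.01), or roughly 4.8x lower. Nearly all of the remaining bill comes from the 20% of workloads that stay on the latest models.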

Who Stands to Lose

The stakes are high for OpenAI and Anthropic. Both are heading toward IPOs. A mass move to smaller models would hit them financially at a critical time.

The cost pressure is new. Token prices were heavily subsidized by investors for years. Enterprise clients had little reason to pick anything other than the top model. That is changing as subsidies slow and prices rise.

It's Not About Open-Source vs. Proprietary

A common framing puts this as a fight between Western proprietary models and Chinese or open-weight ones. TechCrunch pushes back on that.

The real divide is between large models and small ones. Switching from GPT-5.5 to DeepSeek's V4 Flash saves money — but so does switching to GPT-5.4-mini. There is an active price war between big-lab inference and independently served open-weight models. For the large-vs.-small question, it does not matter which type of small model wins.

What Enterprises Could Do Instead

Enterprises facing higher costs have options beyond switching models. They could make fewer API calls, use less context, or drop the least promising deployments. Whether cost pressure actually drives a shift to smaller models is still an open question.
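These options are interchangeable on the bill because inference spend is roughly a product of three factors: number of calls, tokens per call, and price per token. A minimal sketch, with volumes and prices that are made-up placeholders rather than figures from the article:

```python
def monthly_cost(calls: int, tokens_per_call: int, price_per_million: float) -> float:
    """Spend = calls x tokens per call x price per token."""
    return calls * tokens_per_call * price_per_million / 1_000_000


# All figures are hypothetical placeholders, not real prices or volumes.
baseline = monthly_cost(calls=1_000_000, tokens_per_call=4_000, price_per_million=10.0)

# Three different levers, each halving one factor.
fewer_calls = monthly_cost(calls=500_000, tokens_per_call=4_000, price_per_million=10.0)
less_context = monthly_cost(calls=1_000_000, tokens_per_call=2_000, price_per_million=10.0)
smaller_model = monthly_cost(calls=1_000_000, tokens_per_call=4_000, price_per_million=5.0)

for label, cost in [
    ("baseline", baseline),
    ("fewer calls", fewer_calls),
    ("less context", less_context),
    ("smaller model", smaller_model),
]:
    print(f"{label}: {cost:,.0f}")
```

Halving any one factor halves the bill, so a company under cost pressure can reach the same saving without changing models at all.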

What This Means for Frontier Model Training

The AI industry has followed a scaling-first approach for years, training the most compute-intensive models possible. This was shaped by what researchers call "the bitter lesson," the argument from Rich Sutton's 2019 essay that general methods which scale with compute tend to beat hand-engineered ones. Labs pushed hard to train bigger models, and investors subsidized the cost so clients never had to think about it.

If most workloads can run on smaller models without any quality loss, that raises a hard question: how do you justify the cost of training a frontier model at all?

The most concrete data point right now is Harvey's 3x cost reduction, confirmed in a live test with Fireworks AI using Claude Opus and GLM 5.1.

Sources

TechCrunch (techcrunch.com)
