OpenAI Cuts Inference Costs in Half on Key Models

OpenAI engineers quietly developed an optimization cutting inference costs by more than half. Logged-out ChatGPT traffic now runs on just a few hundred GPUs.

What did OpenAI's engineers actually discover?

OpenAI engineers cut inference costs by more than half on the models the optimization touched, according to The Information. Inference cost is the computational expense a model incurs when responding to a user query — the metric most directly tied to AI service pricing and margins.

The engineers shared the finding internally with colleagues earlier this month. OpenAI has not issued a public statement confirming the development, and the exact technical method has not been disclosed.

How did this show up in the real world?

The first measurable result appeared in logged-out ChatGPT traffic. After the optimization was applied to that segment, the number of GPUs needed to power it dropped to just a few hundred, as reported by The Information via Digg.

The earlier GPU count was not disclosed, so the exact size of the drop is unknown. Logged-out traffic covers every visitor who uses ChatGPT without signing in to a free or paid account, a meaningful slice of overall usage.

What is the optimization, exactly?

The source did not disclose specific technical details. Speculation from observers includes quantization, KV caching, smarter query batching, or routing simpler queries to cheaper models — but none of these have been confirmed.
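
To make one of the speculated techniques concrete, here is a minimal Python sketch of routing simpler queries to a cheaper model. It is purely illustrative and is not OpenAI's method, which remains undisclosed. The tier names, the relative costs, and the word-count heuristic are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_query: float  # hypothetical relative cost units, not real prices


# Hypothetical tiers: names and costs are placeholders for illustration only.
SMALL = ModelTier("small-model", 1.0)
LARGE = ModelTier("large-model", 10.0)


def route(query: str, max_simple_words: int = 12) -> ModelTier:
    """Pick a tier, using word count as a crude stand-in for a difficulty classifier."""
    return SMALL if len(query.split()) <= max_simple_words else LARGE


def blended_cost(queries: list[str]) -> float:
    """Average cost per query once each query has been routed to a tier."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(route(q).cost_per_query for q in queries) / len(queries)


sample = [
    "What is the capital of France?",
    "Convert 5 miles to kilometers",
    "Compare three database indexing strategies and explain which one suits "
    "a write-heavy workload with frequent range scans",
]
print(blended_cost(sample))   # average cost per query with routing
print(LARGE.cost_per_query)   # cost per query if everything used the large tier
```

In this toy setup, two of the three sample queries land on the cheap tier, so the average cost falls well below the large-tier cost. A production router would need a far better difficulty signal than word count, and would have to weigh any quality loss on misrouted queries.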

What is confirmed: the gains came from improved utilization of existing server resources, not from deploying additional computing chips. That distinction matters. It means OpenAI squeezed more out of hardware it already owns.
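
The link between utilization and GPU count can be sketched with simple arithmetic. The Python snippet below assumes that the number of GPUs needed is peak demand divided by the effective throughput of one GPU. Every number in it is hypothetical: the reporting gives neither the traffic volume, the per-GPU throughput, nor the earlier GPU count.

```python
def gpus_needed(peak_requests_per_sec: int,
                per_gpu_capacity: int,
                utilization_pct: int) -> int:
    """GPUs required to serve a peak load at a given average utilization.

    per_gpu_capacity is the requests/sec one GPU could serve at 100% utilization.
    Integer math is used so the ceiling division is exact.
    """
    if not 0 < utilization_pct <= 100:
        raise ValueError("utilization_pct must be in (0, 100]")
    demand_x100 = peak_requests_per_sec * 100
    effective_x100 = per_gpu_capacity * utilization_pct
    return -(-demand_x100 // effective_x100)  # ceiling division


# Hypothetical workload: 9,000 requests/sec, 50 requests/sec per fully used GPU.
before = gpus_needed(9000, 50, utilization_pct=30)
after = gpus_needed(9000, 50, utilization_pct=60)
print(before, after)
```

Under these made-up figures, doubling utilization from 30% to 60% halves the fleet for the same traffic, with no new chips involved. That is the general shape of the claim, not a reconstruction of OpenAI's actual numbers.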

Why does this matter for OpenAI's business?

OpenAI ended Q1 2026 with a 39% gross margin and has a stated target of 52% by year-end. Lower inference costs give the company room to improve margins, raise usage limits, or ease API pricing pressure on developers.
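
A rough worked example shows how the margin figures relate to a cost cut. The Python sketch below assumes revenue stays constant and that the optimization cuts cost by exactly half on whatever share of cost of revenue it touches (the report says "more than half", so this is conservative). The share itself is unknown; the function names and the simplified cost model are assumptions for illustration.

```python
def margin_after_cut(margin: float, affected_share: float, cut: float) -> float:
    """Gross margin after cutting `cut` of the cost on `affected_share` of cost of revenue.

    Assumes revenue is unchanged and costs outside the affected share stay flat.
    """
    cost = 1.0 - margin                      # cost of revenue as a fraction of revenue
    new_cost = cost * (1.0 - affected_share * cut)
    return 1.0 - new_cost


def share_needed(margin: float, target: float, cut: float) -> float:
    """Share of cost of revenue that must receive the cut to reach the target margin."""
    cost = 1.0 - margin
    return (target - margin) / (cost * cut)


# Figures from the article: 39% margin at end of Q1 2026, 52% year-end target.
print(round(share_needed(0.39, 0.52, cut=0.5), 3))
print(round(margin_after_cut(0.39, affected_share=1.0, cut=0.5), 3))
```

Under these assumptions, a halving would need to reach roughly 43% of cost of revenue to lift the margin from 39% to 52% on its own. If it somehow applied to all of it, the margin would rise to about 69.5%. Real cost of revenue includes items other than inference, so the actual effect depends on details the reporting does not provide.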

The optimization has only been applied to logged-out ChatGPT traffic so far. How much headroom remains for paid traffic, the API, or other products is still an open question.

Who else is working on this?

OpenAI is not alone. Anthropic and Google are also working on server-level efficiency gains, according to Digg's reporting. No public benchmarks or technical details have emerged from either company.

For context on how rivals are competing: Claude's paid users have grown 75% since January 2026, and California has partnered with Anthropic for state AI services. Both suggest that inference cost competitiveness carries real commercial stakes.

What about OpenAI's hardware moves?

Separately, OpenAI showcased a keyboard called the Codex Micro at the AI Engineer World Fair in San Francisco on June 29. The device was co-developed with accessories company Work Louder. OpenAI spokesperson Dominik Kundel introduced it at the event, describing it as designed "to dramatically improve how people use Codex efficiently."

The Codex Micro debut is a separate development from the inference optimization story, but both arrived within days of each other. This also connects to broader moves in the developer tooling space — see Meta's ban on Claude Code & Codex for how rivals are responding to OpenAI's Codex push.

What are the limits of this news?

Several important caveats apply:

  • OpenAI has not officially confirmed the inference cost reduction
  • The finding has so far only been shared by engineers with select colleagues inside the company
  • The optimization has only been applied to logged-out ChatGPT traffic so far
  • No timeline has been given for broader rollout or API pricing changes
  • The exact technical method remains undisclosed

For developers watching OpenAI GPT model updates closely, the gap between an internal engineering win and a productized cost reduction is still significant.

Before and after: what the sources confirm

| Metric | Before | After |
| --- | --- | --- |
| Inference cost (models touched) | Baseline | Reduced by more than 50% |
| GPUs for logged-out ChatGPT traffic | Undisclosed baseline | A few hundred |
| Source of gains | | Existing server efficiency, not new chips |
| Official confirmation from OpenAI | | None as of reporting |

The central reported claim: OpenAI engineers told colleagues this month that their optimization cut inference costs by more than half, and logged-out ChatGPT traffic now runs on just a few hundred GPUs as a result. OpenAI itself has not confirmed this.

Frequently asked questions

**How much did OpenAI reduce its inference costs?**

OpenAI engineers developed an optimization that cut inference costs by more than half on the models it was applied to. The Information first reported the internal disclosure. The exact technical method has not been made public, and OpenAI has not officially confirmed the development. The gains came from better use of existing server resources rather than new hardware.

**How many GPUs is OpenAI using for logged-out ChatGPT traffic now?**

After the optimization was applied to logged-out ChatGPT traffic (visitors without a free or paid account), the number of GPUs needed to power that traffic dropped to just a few hundred. This was the first visible result of the internal optimization that engineers shared with colleagues earlier in June 2026.

**What technique did OpenAI use to cut inference costs?**

The specific technical method has not been disclosed. Observers have suggested possibilities including quantization, KV caching, query batching, or routing simpler queries to cheaper models. What is confirmed is that the improvement came from better utilization of existing server infrastructure, not from deploying additional computing chips or acquiring new hardware.

**Has OpenAI officially confirmed the inference cost reduction?**

No. As of the reporting date, OpenAI had not issued an official public statement confirming the inference cost reduction. The information came from engineers sharing findings with select colleagues internally. The development remains unconfirmed by the company and has not yet been tied to any announced API pricing changes or product updates.

**What is the Codex Micro keyboard OpenAI showed at the AI Engineer World Fair?**

The Codex Micro is a specialized keyboard OpenAI co-developed with accessories company Work Louder. OpenAI spokesperson Dominik Kundel introduced it at the AI Engineer World Fair in San Francisco on June 29, 2026, describing it as designed to dramatically improve how people use Codex efficiently. Work Louder had teased the co-branded product earlier that same week.

Verified claims

Each key claim below was checked against its source — the exact supporting passage is quoted so you can confirm it yourself.

  1. OpenAI engineers cut inference costs by more than half on the models the optimization touched.

    cut inference costs by more than half
    Verified digg.com
  2. The gains came from improved utilization of existing server resources, not from deploying additional computing chips.

    existing server resources
    Verified odaily.news
  3. Anthropic and Google are also working on server-level efficiency gains.

    Anthropic and Google
    Verified digg.com

Sources

  1. digg.com: The Information's report, as relayed by Digg
  2. odaily.news: gains came "not from deploying additional computing chips"
