On 23 April 2026, OpenAI released GPT 5.5. This is the first fully retrained base model since GPT-4.5: new architecture, new training data, new capabilities. Not a point release. Not a fine-tune on top of an existing base. A ground-up retrain.
I build AI systems on the Claude API. I am certified across nine Anthropic courses. That is my bias, and I am stating it upfront. But if you build things with AI, you need to understand what just shipped, because this model moves several benchmarks meaningfully forward, and it changes the competitive picture in ways that matter for production decisions.
What is GPT 5.5
GPT 5.5 is OpenAI's new flagship. It is a fully retrained base model, not an incremental update on top of GPT 5.4. The key specs:
- 1 million token context window. Up from GPT 5.4's context limit. This puts it in the same territory as Gemini's long-context models.
- New base architecture. OpenAI has not disclosed full details, but the retraining from scratch means this is not the same foundation as GPT-4.5 with more RLHF on top. Different weights, different training distribution.
- Agentic coding focus. The model was optimised for multi-step, tool-using agent workflows. OpenAI's own Codex platform runs on it.
- Multimodal reasoning. Improved performance on vision-language tasks, spatial reasoning (OSWorld), and scientific benchmarks.
OpenAI positions this as their "most capable model ever." That claim checks out on several benchmarks. It does not check out on all of them.
What actually changed
The best way to understand what GPT 5.5 can do is to look at the demo footage. The launch showcased several capabilities that go beyond what GPT 5.4 could handle.
Agentic coding demos. OpenAI showed GPT 5.5 working inside Codex on multi-file, multi-step coding tasks. The model plans before it writes, creates and runs tests, and iterates on failures. The Terminal-Bench 2.0 score (82.7%) reflects this. For context, GPT 5.4 scored 75.1% on the same benchmark and Opus 4.7 scores 69.4%.
Deep research. The model demonstrated extended research sessions where it browses the web, synthesises multiple sources, and produces structured reports. This is not new in concept (Perplexity, Gemini Deep Research), but the depth and accuracy shown in the demos were notable.
Medical reasoning. OpenAI highlighted cancer detection results: 3 out of 4 cancer cases correctly identified in a diagnostic task. The brain tumour results were less impressive (0 out of 6). These are cherry-picked demo numbers, not peer-reviewed benchmarks, so treat them accordingly. But the direction is clear: OpenAI is pushing into specialised reasoning domains.
Anti-hallucination tests. This is where the picture gets complicated. OpenAI demonstrated scenarios where GPT 5.5 correctly refused to answer questions it did not have information about. But the SimpleQA hallucination benchmark tells a different story, which I will cover in the benchmarks section below.
Benchmarks: GPT 5.5 vs Opus 4.7 vs GPT 5.4
Here is the full comparison across every benchmark where I could find reliable data for at least two of the three models. Data sourced from Artificial Analysis, OpenAI's announcement, and Anthropic's published benchmarks.
| Benchmark | GPT 5.5 | Opus 4.7 | GPT 5.4 |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | 75.1% |
| ARC-AGI-2 | 85.0% | — | 73.3% |
| GDPval | 84.9% | — | 83.0% |
| OSWorld-Verified | 78.7% | — | 75.0% |
| MCP Atlas | 75.3% | 77.3% | 67.2% |
| AA Intelligence Index | 60 | 57 | — |
| SWE-Bench Pro | 58.6% | 64.3% | — |
| MMMLU (multilingual) | 83.2% | 91.5% | — |
| Hallucination rate (lower = better) | 86% | 36% | — |
The table tells a nuanced story. GPT 5.5 dominates on agentic coding (Terminal-Bench), abstract reasoning (ARC-AGI-2), and the Artificial Analysis composite index. But Opus 4.7 holds clear leads on real-world software engineering (SWE-Bench Pro), multilingual understanding, and factual accuracy.
Where Opus 4.7 still wins
Opus 4.7 stays ahead on four benchmarks; three of those leads are large enough to matter in practice.
Opus 4.7 leads
- SWE-Bench Pro: 64.3% vs 58.6%
- MMMLU (multilingual): 91.5% vs 83.2%
- Hallucination rate: 36% vs 86%
- MCP Atlas: 77.3% vs 75.3%
GPT 5.5 leads
- Terminal-Bench 2.0: 82.7% vs 69.4%
- ARC-AGI-2: 85.0% (no Opus data)
- AA Intelligence Index: 60 vs 57
- OSWorld-Verified: 78.7% (no Opus data)
SWE-Bench Pro measures autonomous software engineering on real GitHub issues. This is arguably the most practical coding benchmark: can the model take a real bug report and produce a working fix? Opus 4.7 at 64.3% versus GPT 5.5 at 58.6% is a meaningful gap. Terminal-Bench measures something different (terminal-based coding tasks in a sandboxed environment), which is why the two benchmarks tell different stories.
MMMLU (Multilingual Massive Multitask Language Understanding) tests reasoning across dozens of languages. Opus 4.7 at 91.5% versus GPT 5.5 at 83.2% is a substantial difference. If you build for international markets or non-English users, this matters.
Hallucination rate. This is the number that should get your attention. On SimpleQA (a factual accuracy benchmark), GPT 5.5 hallucinates on 86% of questions it gets wrong, compared to 36% for Opus 4.7. Put differently: when Opus 4.7 does not know something, it is far more likely to say so. When GPT 5.5 does not know something, it is far more likely to make something up. For any system where factual accuracy matters (legal, medical, financial, compliance), this gap is critical.
Codex and agentic coding
OpenAI launched Codex alongside GPT 5.5 as their agentic coding platform. The architecture is different from what most people expect from a "coding assistant."
- Multiple agents. Codex can spin up multiple GPT 5.5 instances working on different parts of a codebase simultaneously. Each agent gets its own sandboxed environment.
- 400K context per agent. While the base model supports 1M tokens, each Codex agent operates with a 400K context window. This is still substantial, enough for most full repositories.
- Full environment access. Agents can read files, write code, run tests, install dependencies, execute shell commands, and iterate on failures. This is closer to Claude Code's model than to a chat-based copilot.
- GitHub integration. Codex connects to repositories, understands project structure, and can work across branches. It creates pull requests directly.
The Codex approach is essentially what Claude Code does: give the model tools and let it work autonomously on real engineering tasks. The 82.7% Terminal-Bench score reflects this capability. Whether that translates to reliable production use depends on the hallucination rate and the specific type of work you throw at it.
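For a sense of what that loop looks like in practice, here is a minimal plan-act-observe sketch using the OpenAI Python SDK's tool calling. The `gpt-5.5` model identifier, the `run_tests` tool, and the prompts are my own assumptions for illustration; the real Codex toolset and orchestration are not public.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool the agent can call; the actual Codex sandbox tools are not documented.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

def run_tests(path: str) -> str:
    # Placeholder: a real agent would execute the suite inside a sandbox.
    return "2 passed, 1 failed: test_parser.py::test_empty_input"

messages = [
    {"role": "system", "content": "You are a coding agent. Plan first, then use tools to verify your work."},
    {"role": "user", "content": "Fix the failing parser test in this repository."},
]

# Minimal plan -> act -> observe loop; "gpt-5.5" is an assumed model identifier.
for _ in range(5):
    response = client.chat.completions.create(model="gpt-5.5", messages=messages, tools=tools)
    msg = response.choices[0].message
    if not msg.tool_calls:
        print(msg.content)          # final answer: the proposed fix
        break
    messages.append(msg)            # keep the assistant's tool request in the history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = run_tests(**args)  # execute the tool and feed the result back
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The important property is the feedback loop: the model sees the test output and iterates, rather than emitting code once and hoping.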
Read the full GPT 5.5 announcement on the OpenAI blog.
Pricing and access
GPT 5.5 is not cheap. Pricing is roughly 2x GPT 5.4 across both tiers.
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| GPT 5.5 Thinking | $5 | $30 |
| GPT 5.5 Pro | $30 | $180 |
| GPT 5.4 (reference) | $2.50 | $15 |
| Claude Opus 4.7 (reference) | $5 | $25 |
GPT 5.5 Thinking is the standard tier at $5/$30. This puts it at the same input price as Opus 4.7 ($5) but with a 20% premium on output ($30 vs $25). The Pro tier at $30/$180 is for maximum reasoning depth, comparable to what Anthropic offers with extended thinking on Opus.
Access is limited at launch. GPT 5.5 is available to Plus, Pro, Business, and Enterprise ChatGPT subscribers. Free tier users do not have access. API access is available through the standard OpenAI API.
The pricing matters because it affects system design decisions. At $30 per million output tokens, running GPT 5.5 Thinking on high-volume workflows is expensive. For comparison, Claude Sonnet 4.6 delivers strong performance at $3/$15 (input/output), and Claude Haiku 4.5 at $0.80/$4. The right model for a given task is rarely the most powerful one. It is the one that clears the quality bar at the lowest cost.
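To make that trade-off concrete, here is a quick back-of-the-envelope estimate using the prices in the table above. The monthly token volumes are hypothetical; plug in your own.

```python
# Pricing per 1M tokens from the table above (USD): (input, output).
PRICES = {
    "gpt-5.5-thinking": (5.00, 30.00),
    "gpt-5.5-pro":      (30.00, 180.00),
    "opus-4.7":         (5.00, 25.00),
    "sonnet-4.6":       (3.00, 15.00),
    "haiku-4.5":        (0.80, 4.00),
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Estimate monthly API spend for a given token volume (raw tokens, not millions)."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Hypothetical workload: 500M input tokens and 100M output tokens per month.
for model in PRICES:
    print(f"{model:18s} ${monthly_cost(model, 500e6, 100e6):>10,.2f}/month")
```

On that assumed workload, GPT 5.5 Thinking lands around $5,500/month against $800/month for Haiku 4.5, which is why "clears the quality bar at the lowest cost" is the right framing.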
What this means if you build things
I run production AI systems on the Claude API. I use GPT models when they are the better fit for specific tasks. Here is my honest take on what GPT 5.5 changes.
The Terminal-Bench score is real. 82.7% is a meaningful jump. For sandboxed coding tasks, GPT 5.5 is the current leader. If you are building developer tools, code generation pipelines, or agentic coding workflows, this model deserves evaluation. The Codex platform makes it easy to test.
The hallucination rate is a problem. 86% on SimpleQA is not something you can prompt-engineer away. If your system generates content that users trust as factual (legal summaries, medical information, financial analysis, compliance reports), this model needs heavy guardrails or a verification layer. Opus 4.7's 36% is not perfect either, but the gap between 36% and 86% is the difference between "needs spot-checking" and "needs systematic verification."
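What a verification layer looks like depends on the task, but a minimal version is a second-model pass that checks every claim in the generated output against the source document and routes anything unsupported to human review. The model identifiers (`gpt-5.5`, `claude-opus-4-7`), file names, and prompts below are assumptions for illustration, not production values.

```python
import anthropic
from openai import OpenAI

openai_client = OpenAI()
claude_client = anthropic.Anthropic()

def generate_summary(source_text: str) -> str:
    # Draft pass; "gpt-5.5" is an assumed model identifier.
    response = openai_client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": f"Summarise the key facts:\n\n{source_text}"}],
    )
    return response.choices[0].message.content

def verify_against_source(summary: str, source_text: str) -> str:
    # Verification pass: a second model checks each claim against the source document only.
    # "claude-opus-4-7" is an assumed model identifier.
    response = claude_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "List any claim in the summary that is not supported by the source. "
                "Reply 'UNSUPPORTED: <claim>' per line, or 'SUPPORTED' if everything checks out.\n\n"
                f"SOURCE:\n{source_text}\n\nSUMMARY:\n{summary}"
            ),
        }],
    )
    return response.content[0].text

source = open("filing.txt").read()   # hypothetical source document
summary = generate_summary(source)
report = verify_against_source(summary, source)
if "UNSUPPORTED" in report:
    print("Route to human review:\n", report)
```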
SWE-Bench Pro still favours Opus. Terminal-Bench and SWE-Bench Pro measure different things. Terminal-Bench tests sandboxed terminal coding tasks. SWE-Bench Pro tests real software engineering on actual GitHub issues. If your use case looks more like "fix this real bug in a real codebase," Opus 4.7 at 64.3% still leads GPT 5.5 at 58.6%.
Multilingual? Not close. Opus 4.7 at 91.5% MMMLU versus GPT 5.5 at 83.2% is an 8-point gap. For international products, multilingual customer support, or any system serving non-English users, this matters.
The 1M context is useful but not unique. Gemini has offered long context for over a year. The practical question is not "how many tokens can it hold" but "how well does it reason over them." Early reports suggest GPT 5.5 handles long context well, but independent needle-in-a-haystack evaluations are still emerging.
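If you would rather run your own check than wait for those evaluations, a basic needle-in-a-haystack probe is easy to script: bury a known fact at different depths in filler text and see whether the model retrieves it. The filler, the needle, and the `gpt-5.5` identifier below are placeholders; scale the filler up to test the context lengths you actually care about.

```python
from openai import OpenAI

client = OpenAI()

NEEDLE = "The deployment password for the staging cluster is plum-otter-42."
FILLER = "The quarterly report covered routine operational updates. " * 200  # increase for longer contexts

def needle_test(depth: float, model: str = "gpt-5.5") -> bool:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) and check recall."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + "\n" + NEEDLE + "\n" + FILLER[cut:]
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{haystack}\n\nWhat is the deployment password for the staging cluster?",
        }],
    )
    return "plum-otter-42" in response.choices[0].message.content

# Probe a few depths; a proper evaluation repeats this across document lengths as well.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"depth={depth:.2f} recalled={needle_test(depth)}")
```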
Cost matters. GPT 5.5 Thinking at $5/$30 is comparable to Opus 4.7 at $5/$25 on input, slightly more expensive on output. For most production systems, the model choice should not be about which one tops a leaderboard. It should be about which one clears the quality bar for your specific task at the lowest cost. Sometimes that is GPT 5.5. Sometimes that is Opus. Often it is a smaller model entirely.
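One way to operationalise that is a cheapest-first cascade: try the smaller model, validate the output, and escalate only when it misses the bar. This is a sketch under assumed model identifiers and a placeholder quality check; real systems use task-specific validation and usually mix providers across the ladder.

```python
from openai import OpenAI

client = OpenAI()

# Cheapest-first ladder; the identifiers are assumptions, and in practice the ladder
# would mix providers (e.g. Haiku or Sonnet via the Claude API on the lower rungs).
LADDER = ["gpt-5.4", "gpt-5.5"]

def clears_quality_bar(answer: str) -> bool:
    # Placeholder check; real systems use schema validation, test suites,
    # a grader model, or sampled human review.
    return bool(answer.strip()) and "I don't know" not in answer

def answer_with_cheapest(prompt: str) -> tuple[str, str]:
    for model in LADDER:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        if clears_quality_bar(answer):
            return model, answer
    return model, answer  # fall through: return the strongest model's attempt

model_used, answer = answer_with_cheapest("Summarise the attached incident report in three bullets.")
print(model_used, answer)
```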
If you are building AI systems and need help evaluating where GPT 5.5, Opus 4.7, or a smaller model fits your specific use case, the AI Consulting and Roadmapping service is built around exactly this kind of assessment. And if you need a system that works across providers and handles model transitions without breaking, the AI Automation Systems service builds that architecture from the start.
Frequently asked questions
What is GPT 5.5?
GPT 5.5 is OpenAI's latest flagship language model, released on 23 April 2026. It is the first fully retrained base model since GPT-4.5, featuring a 1 million token context window, improved agentic coding capabilities, and new benchmarks across reasoning, code, and multimodal tasks.
How much does GPT 5.5 cost?
GPT 5.5 Thinking costs $5 per million input tokens and $30 per million output tokens. GPT 5.5 Pro costs $30 per million input tokens and $180 per million output tokens. Both tiers are roughly 2x the price of GPT 5.4.
Is GPT 5.5 better than Claude Opus 4.7?
It depends on the task. GPT 5.5 leads on Terminal-Bench 2.0 (82.7% vs 69.4%), ARC-AGI-2, and the Artificial Analysis Intelligence Index. Opus 4.7 leads on SWE-Bench Pro (64.3% vs 58.6%), multilingual reasoning MMMLU (91.5% vs 83.2%), and has a significantly lower hallucination rate (36% vs 86% on SimpleQA).
Who can access GPT 5.5?
GPT 5.5 is available to ChatGPT Plus, Pro, Business, and Enterprise subscribers. API access is available through the OpenAI API. It is not available on the free ChatGPT tier at launch.
What is the context window for GPT 5.5?
GPT 5.5 supports a 1 million token context window, a major increase over GPT 5.4. This makes it viable for processing entire codebases, long documents, and complex multi-step agent workflows without truncation.
Sources and credits
- OpenAI: Introducing GPT 5.5 (official announcement)
- GPT 5.5 review and benchmark analysis (YouTube)
- Artificial Analysis (independent benchmark data)
- Claude Opus 4.7 release coverage (JQ AI SYSTEMS)
- Claude vs ChatGPT for Business Automation (JQ AI SYSTEMS)