What is Claude Opus 4's SWE-bench score?

Claude Opus 4 achieved a 72.5% score on SWE-bench Verified, which Anthropic presented as the leading result at launch. SWE-bench evaluates AI models on their ability to resolve real-world software engineering issues from open-source repositories, making it one of the most rigorous tests of practical coding ability.

How does Claude Sonnet 4 compare to Opus 4?

Claude Sonnet 4 matches or closely approaches Opus 4's performance on most practical benchmarks at approximately one-fifth the cost. For the majority of production workloads, Sonnet 4 delivers 95% of Opus 4's capability, making it the more cost-effective choice for most engineering teams.

What is Shopify Catalog?

Shopify Catalog is an AI-powered product enrichment system that automatically generates product descriptions, categorises items, optimises search metadata, and creates structured data for large product catalogues. It significantly reduces the manual effort required to maintain product content at scale.

What is Shopify Sidekick?

Shopify Sidekick is a voice-enabled AI assistant for Shopify merchants. It can answer questions about store performance, suggest marketing strategies, and execute administrative tasks through natural conversation, transforming how merchants interact with their store data.

How should engineering teams adopt AI coding tools?

Adoption typically progresses through four levels: autocomplete (10–20% productivity gain), chat-based assistance (20–40%), autonomous task completion (40–70% for routine tasks), and sustained autonomous work on complex multi-step tasks. Most enterprise teams are at Level 2, with the technology supporting Level 3–4.

May 15, 2025 | 10 min read | AI & Technology

Claude Opus 4: World's Best Coding Model, and Shopify's AI Shopping Agents

Anthropic releases Claude Opus 4 with a 72.5% SWE-bench Verified score, presented as the leading result at launch, while Claude Sonnet 4 matches it at one-fifth the price. Shopify launches its AI-powered product Catalog and voice Sidekick, signalling that agentic commerce is no longer theoretical. May 2025 was the month AI became the best programmer in the room and started selling things.

Tags: Claude, Anthropic, Opus 4, Sonnet 4, Software Engineering, Shopify, Agentic Commerce, AI Coding, SWE-bench

Giovanni van Dam

IT & Business Development Consultant

Claude Opus 4: 72.5% on SWE-bench — A New Standard for AI Coding

On 22 May 2025, Anthropic released Claude Opus 4 with a headline benchmark result: 72.5% on SWE-bench Verified, which Anthropic presented as the leading score at launch. SWE-bench evaluates an AI model's ability to resolve real-world software engineering issues from open-source repositories: not toy problems, but actual bugs, feature requests, and refactoring tasks from production codebases.

Opus 4 was not merely incrementally better. It demonstrated sustained autonomous coding capability, working through complex multi-file changes, maintaining context across large codebases, and producing code that passed existing test suites without human intervention. Anthropic positioned it as the world's best coding model, and at launch no competing model's published results challenged that claim.

For software teams, the implications were immediate. Code review, bug triage, test generation, and refactoring — tasks that consume 40–60% of senior developer time — could now be meaningfully delegated to an AI agent. Not as a suggestion engine, but as an autonomous contributor that writes, tests, and submits code.

Claude Sonnet 4: The Same Capability at One-Fifth the Price

Perhaps more significant than Opus 4 itself was the simultaneous release of Claude Sonnet 4. Sonnet 4 matched or closely approached Opus 4's coding performance at approximately one-fifth the cost. On many practical benchmarks, the difference between the two models was within the margin of error.

This pricing structure reflected Anthropic's strategic bet that the market was bifurcating: a premium tier for the most demanding, highest-stakes workloads, and a high-performance tier that would capture the vast majority of production usage. For most engineering teams, Sonnet 4 would be the right choice — delivering 95% of Opus 4's capability at 20% of the cost.
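To make the trade-off concrete, the figures quoted above (roughly 95% of the capability at 20% of the cost) can be turned into a capability-per-cost ratio. The sketch below is illustrative only: the `ModelTier` class, its field names, and the idea of reducing "capability" to a single number are assumptions made for the arithmetic, not anything Anthropic publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """Hypothetical container for the relative figures quoted in this article."""

    name: str
    relative_capability: float  # Opus 4 = 1.0 (assumed single-number proxy)
    relative_cost: float        # Opus 4 = 1.0

    def capability_per_cost(self) -> float:
        # Capability delivered per unit of spend, relative to the baseline tier.
        return self.relative_capability / self.relative_cost


# Figures taken from the article: 95% of the capability at 20% of the cost.
opus = ModelTier("Opus 4", relative_capability=1.0, relative_cost=1.0)
sonnet = ModelTier("Sonnet 4", relative_capability=0.95, relative_cost=0.20)

advantage = sonnet.capability_per_cost() / opus.capability_per_cost()
print(f"{sonnet.name}: about {advantage:.2f}x the capability per unit of spend")
```

On these assumed figures, Sonnet 4 works out to about 4.75 times the capability per unit of spend, which is why the remaining 5% only justifies the premium tier where it is decisive for the workload.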

The competitive pressure this placed on the rest of the market was severe. OpenAI's models, Google's Gemini, and open-weight alternatives all had to contend with a model that was simultaneously the best coder available and aggressively priced for production deployment.

Shopify Catalog and Sidekick: AI Agents Enter Commerce

While Anthropic was transforming software engineering, Shopify was transforming commerce. At its Spring 2025 announcements, Shopify introduced two significant AI capabilities:

Shopify Catalog: An AI-powered product enrichment system that automatically generates descriptions, categorises products, optimises search metadata, and creates structured data for millions of SKUs. For merchants with large catalogues, this eliminated hundreds of hours of manual content creation.
Sidekick (Voice): A voice-enabled AI assistant for Shopify merchants that could answer questions about store performance, suggest marketing strategies, and execute administrative tasks through natural conversation. "What were my top-selling products last week?" became a voice query rather than a dashboard drill-down.

These were not experimental features. Shopify serves over 2 million merchants globally, and embedding AI directly into the merchant experience normalised agentic commerce for a massive user base. The message was clear: AI is not a feature you add to commerce — it is becoming the commerce platform itself.

AI Coding in the Enterprise: From Copilot to Colleague

Opus 4's SWE-bench performance accelerated a transition that had been building throughout 2024: the shift from AI as a coding assistant (suggesting completions, answering questions) to AI as a coding colleague (autonomously completing tasks, submitting pull requests, resolving issues).

The practical adoption pattern emerging in enterprise software teams followed a clear progression:

Level 1 — Autocomplete: AI suggests code as you type. Productivity gain: 10–20%. This is table stakes by mid-2025.
Level 2 — Chat-based assistance: AI answers questions, explains code, generates tests. Productivity gain: 20–40%.
Level 3 — Autonomous task completion: AI receives an issue, writes the code, runs the tests, opens a pull request. Productivity gain: 40–70% for routine tasks.
Level 4 — Sustained autonomous work: AI works on complex, multi-step engineering tasks for hours with minimal human oversight. This is where Opus 4 operates.

Most enterprise teams in May 2025 were at Level 2, with early adopters pushing into Level 3. The gap between where most teams were and where the technology allowed them to be represented an enormous productivity opportunity. Discuss how to structure AI-augmented development for your engineering team.

Agentic Commerce Is No Longer Theoretical

The convergence of Opus 4's coding capability and Shopify's commerce AI pointed to a broader truth about May 2025: agentic AI had moved from research demos to production deployments. AI agents were writing code, managing product catalogues, answering merchant questions, and automating customer interactions — not in laboratory conditions, but at the scale of millions of users.

For businesses in e-commerce, retail, and technology, the competitive clock was now ticking. Merchants using Shopify's AI tools would produce better product content faster. Engineering teams using Opus 4 or Sonnet 4 would ship features more quickly. The productivity gap between AI-adopting and non-adopting businesses would widen with each quarter.

The strategic imperative was no longer to evaluate AI — it was to deploy it. Start with the highest-leverage workflows, measure the impact, and expand systematically. Learn how embedded technology leadership accelerates AI deployment across commerce and engineering.

