Grok 4 Dominates Benchmarks, But Not Agentic Thinking

July 11, 2025

I’ve been a happy user of Grok for some time, and Wednesday’s Grok 4 announcement left me both amazed and a bit let down. As Elon Musk put it during the live stream: “With respect to academic questions, Grok 4 is better than PhD level in every subject, no exceptions.”

That’s insane, and the performance backs it up!

Superior Benchmark Metrics

Beyond excelling in academics, Grok 4 shines in practical, long-horizon tasks too. It topped Andon Labs’ Vending-Bench, a simulated environment that tests an AI agent’s ability to run a vending-machine business over 5-10 hours, with ~2,000 interactions and ~25 million tokens per run. Grok 4 outperformed the others with a net worth of $4094.15 and 4569 units sold, and it showed the steepest growth curve, more than double the runner-up, Claude Opus 4.

What Grok 4 Offers Developers

For developers, Grok 4 brings some long-awaited updates. Token caching is finally supported, reducing the cost to $0.75 per 1 million cached input tokens (versus $3.00 standard). Additionally, Grok 4 introduces advanced tool calling, enabling integration with external tools and services.
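To make the caching discount concrete, here is a minimal sketch of the arithmetic. It uses only the two prices quoted above; the function name, the 25-million-token workload, and the cache hit rates are illustrative assumptions, not figures from xAI.

```python
# Prices quoted in the post, in USD per 1 million input tokens.
STANDARD_PRICE_PER_M = 3.00
CACHED_PRICE_PER_M = 0.75


def input_cost(total_tokens: int, cached_fraction: float) -> float:
    """Return the input-token cost in USD.

    cached_fraction is the share of input tokens served from the cache
    (an assumed value; real hit rates depend on the workload).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = total_tokens * cached_fraction
    uncached = total_tokens - cached
    total = cached * CACHED_PRICE_PER_M + uncached * STANDARD_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    tokens = 25_000_000  # hypothetical input volume for one long agent run
    for hit_rate in (0.0, 0.5, 0.8):
        print(f"hit rate {hit_rate:.0%}: ${input_cost(tokens, hit_rate):.2f}")
```

The saving scales with the hit rate, so long agent sessions that keep resending the same prefix (system prompt, tool definitions, earlier turns) should benefit the most.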

As someone passionate about AI agents, I often assess LLMs through the lens of multi-agent system support. A well-designed multi-agent architecture lets a system scale with complexity and specialization. In the Vending-Bench test, Grok 4 operated as a single-agent system. There’s no indication it used subagents, role delegation, or any hierarchical reasoning. In short, this benchmark tested Grok 4 as a super agent, but not as a multi-agent system.
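The distinction between a single super agent and role delegation can be sketched in a few lines. This is a toy illustration only: the role names, the tasks, and both functions are hypothetical, and nothing here reflects how Vending-Bench or Grok 4 is actually implemented.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str]  # (role, task description)


def pricing_agent(task: str) -> str:
    # Hypothetical specialist: would wrap an LLM call with a pricing prompt.
    return f"[pricing] handled: {task}"


def inventory_agent(task: str) -> str:
    # Hypothetical specialist: would wrap an LLM call with an inventory prompt.
    return f"[inventory] handled: {task}"


SUBAGENTS: Dict[str, Callable[[str], str]] = {
    "pricing": pricing_agent,
    "inventory": inventory_agent,
}


def orchestrator(tasks: List[Task]) -> List[str]:
    """Multi-agent style: delegate each task to the subagent for its role."""
    return [SUBAGENTS[role](task) for role, task in tasks]


def single_agent(tasks: List[Task]) -> List[str]:
    """Single-agent style: one generalist handles every task itself."""
    return [f"[generalist] handled: {task}" for _, task in tasks]


if __name__ == "__main__":
    work = [("pricing", "set cola price"), ("inventory", "restock chips")]
    print(orchestrator(work))
    print(single_agent(work))
```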

The Missing Piece: Interleaved Thinking

Grok 4 supports multi-agent setups. But one crucial feature I look for is interleaved thinking, where a model thinks between tool calls and reasons in more depth after receiving tool results. It is the ability to alternate between internal reasoning and external actions (like tool use) within a single session. It seems that Grok 4 doesn’t support it. Among today’s top models, only a few provide native support:

Google Gemini API: Gemini 2.5 Pro and 2.5 Flash are “thinking models” with dynamic thinking (adjustable via a thinkingBudget). They can interleave tool call results and the next tool calls fluidly in a single turn. The Gemini API also lets you request thought summaries, which are synthesized versions of the model’s internal reasoning process and give developers visibility into how the model interprets tool results and makes decisions.
Anthropic Claude API: Claude Opus 4, Sonnet 4, and Sonnet 3.7 support “extended thinking” with internal reasoning blocks and interleaved thinking with tools, chaining multiple calls with reasoning steps in between, and making more nuanced decisions based on intermediate results.
OpenAI’s Assistants API: Mimics interleaved thinking with multi-step tool use, memory, and function calls, though the ChatGPT API itself requires agent frameworks for this behavior (not native support).
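The control flow behind interleaved thinking can be sketched without any vendor SDK. The example below is a toy under stated assumptions: `toy_model` is a scripted stand-in for an LLM, the two vending tools and the `name:arg` call format are invented for illustration, and none of it corresponds to the Gemini, Claude, OpenAI, or xAI APIs. What it shows is the shape of the loop: the model emits a reasoning step, calls a tool, sees the result, and reasons again before choosing its next action, all within one turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str     # "thinking", "tool_call", "tool_result", or "answer"
    content: str


# Hypothetical tools for a vending-machine scenario.
def get_stock(item: str) -> int:
    return {"cola": 2, "chips": 40}.get(item, 0)


def order_stock(item: str, qty: int) -> str:
    return f"ordered {qty} x {item}"


def run_tool(call: str) -> str:
    """Dispatch a 'name:arg[:arg]' string to the matching toy tool."""
    name, *args = call.split(":")
    if name == "get_stock":
        return str(get_stock(args[0]))
    if name == "order_stock":
        return order_stock(args[0], int(args[1]))
    raise ValueError(f"unknown tool: {name}")


def toy_model(transcript: List[Step]) -> List[Step]:
    """Scripted stand-in for an LLM: picks the next steps from what it has seen."""
    results = [s for s in transcript if s.kind == "tool_result"]
    if not results:
        return [Step("thinking", "I need the cola stock before deciding."),
                Step("tool_call", "get_stock:cola")]
    if len(results) == 1:
        stock = int(results[0].content)
        if stock < 10:
            # Reasoning that depends on the tool result just received.
            return [Step("thinking", f"Stock is {stock}, below 10, so reorder."),
                    Step("tool_call", "order_stock:cola:24")]
        return [Step("thinking", f"Stock is {stock}, which is enough."),
                Step("answer", "No reorder needed.")]
    return [Step("thinking", "The order went through; I can report back."),
            Step("answer", f"Done: {results[-1].content}")]


def run_turn(max_steps: int = 5) -> List[Step]:
    """Alternate model reasoning and tool execution until an answer appears."""
    transcript: List[Step] = []
    for _ in range(max_steps):
        steps = toy_model(transcript)
        transcript.extend(steps)
        if steps[-1].kind == "answer":
            break
        transcript.append(Step("tool_result", run_tool(steps[-1].content)))
    return transcript


if __name__ == "__main__":
    for step in run_turn():
        print(f"{step.kind:>11}: {step.content}")
```

Without interleaving, the second decision (whether to reorder) would have to be made either before the stock lookup or in a separate turn driven by an external agent framework.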

Unfortunately, there’s no definitive evidence that the Grok 4 API supports interleaved thinking. I cannot find anything about it in xAI’s documentation as of July 10, 2025.

Elon Musk noted during the live stream, “At times, it may lack common sense, and it has not yet invented new technologies or discovered new physics, but that is just a matter of time.” While this hints at future potential, for developers like me, interleaved thinking and agentic capabilities remain a critical missing piece today.

#grok4 #ai #AiAgent #vending-bench #gemini #claude #chatGpt #LLM


Hi, I’m Nicole Hu, founder of Foundry One Lab and former engineering leader at Google,