Cheaper tokens, bigger bills: The new math of AI infrastructure


Presented by Nutanix


As enterprises move from AI experimentation into production deployment, the primary cost driver has shifted away from foundation model training and toward the infrastructure required to run thousands of concurrent inference workloads at scale, with agentic AI as the accelerant.

Where early enterprise AI projects involved a handful of large, scheduled training jobs, production agentic environments require continuous support for short-lived, unpredictable requests that consume GPU, networking, and storage resources in ways traditional infrastructure was never designed to handle. For enterprise technology leaders, that shift is turning infrastructure efficiency into a make-or-break factor in AI economics.

"Every employee with an AI assistant, every automated workflow, every agent pipeline needs models for inferencing and generates a lot of tokens," says Anindo Sengupta, VP of products at Nutanix. "Those inferencing requests land on a GPU infrastructure, traverse specialized networks, and pull data from storage systems purpose built to support these AI workloads."

Why cost per token is becoming a core infrastructure metric

Inference costs per token have dropped by roughly an order of magnitude over the past two years, driven by model efficiency improvements and competitive pressure among cloud providers. The natural expectation would be that enterprise AI is getting cheaper. Instead, total costs are rising, Sengupta says, pointing to what economists call the Jevons paradox: when a resource becomes cheaper to use, consumption tends to increase faster than the price drops.

So while the cost per token has fallen by nearly an order of magnitude over the last couple of years, consumption has risen more than 100X. The result is that cost per token and GPU utilization are becoming primary operational metrics for enterprise IT, sitting alongside traditional measures like uptime and throughput.
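The arithmetic is simple but easy to underweight. A back-of-the-envelope sketch, with hypothetical numbers rather than figures from Nutanix:

```python
# Jevons paradox illustration (all numbers hypothetical).
# Price per token falls 10x, but consumption grows 100x.

price_then = 10.00 / 1_000_000   # $ per token two years ago (illustrative)
tokens_then = 1_000_000_000      # tokens consumed per month (illustrative)

price_now = price_then / 10      # roughly an order of magnitude cheaper
tokens_now = tokens_then * 100   # consumption up more than 100x

bill_then = price_then * tokens_then
bill_now = price_now * tokens_now

print(f"Monthly bill then: ${bill_then:,.0f}")   # $10,000
print(f"Monthly bill now:  ${bill_now:,.0f}")    # $100,000 -- 10x higher
```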

"Cost per token is really about the total cost of ownership for serving inference models," Sengupta says. "Utilization is about making sure that once you have GPU assets, you're getting maximum return from them. These metrics will be critical for enterprise IT leaders."

What makes this difficult is the number of variables involved. Token costs shift depending on which models an organization runs, where workloads execute, and how prompts are structured.

"There are too many variables in cost to manage intuitively," Sengupta adds. "Optimizing it is an engineering problem, and one that requires continuous tuning."

Agentic workloads expose the limits of traditional infrastructure

Production agentic AI introduces a workload profile that traditional enterprise infrastructure was not designed to handle. Classic data center deployments are built around predictable loads and long planning cycles. Agentic environments produce unpredictable, high-frequency bursts of short inference requests, place new demands on networking and storage, and change faster than most procurement cycles allow.

The infrastructure supporting agentic AI is also structurally different from CPU-based computing. GPU topology, high-speed interconnects, parallel storage systems for agent memory and KV cache, and networking architectures capable of handling DPU offloading all represent new capabilities that require new operational skills.

Siloed infrastructure compounds these challenges. When GPU resources, networking, and data access are managed independently, scheduling inefficiencies accumulate, utilization drops, and costs climb. Organizations running fragmented stacks tend to underutilize expensive GPU assets while simultaneously bottlenecking on storage and network throughput.
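To make the utilization-to-cost link concrete, here is a toy calculation, with hypothetical throughput and pricing, showing how cost per million tokens moves as GPU utilization changes:

```python
# Hypothetical illustration: cost per token as amortized GPU cost
# divided by tokens actually served.

HOURLY_GPU_COST = 4.00        # $ per GPU-hour, fully loaded (illustrative)
PEAK_TOKENS_PER_SEC = 5_000   # one GPU's throughput at full load (illustrative)

def cost_per_million_tokens(utilization: float) -> float:
    tokens_per_hour = PEAK_TOKENS_PER_SEC * 3600 * utilization
    return HOURLY_GPU_COST / tokens_per_hour * 1_000_000

for util in (0.15, 0.40, 0.80):
    print(f"utilization {util:.0%} -> ${cost_per_million_tokens(util):.2f} per 1M tokens")
```

Under these assumed numbers, a GPU idling at 15% utilization serves tokens at roughly five times the unit cost of one running at 80%, which is the penalty fragmented stacks quietly pay.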

Integrated stacks and the case for full-stack architecture

The response emerging among infrastructure vendors is a move toward tightly integrated, validated full-stack platforms designed specifically for production AI workloads. The premise is that end-to-end optimization across compute, networking, storage, and software layers produces better utilization and lower per-token costs than assembling best-of-breed components from separate vendors.

Nutanix's Agentic AI solution represents one approach to this problem. Built on the Nutanix AHV hypervisor, Nutanix Enterprise AI, and the Nutanix Kubernetes Platform, the solution is designed to manage both the traditional compute layer where agent orchestration runs and the accelerated compute layer where inference executes. The company has introduced NVIDIA topology-aware enhancements to AHV that automatically optimize how GPUs, CPUs, memory, and DPUs are allocated to virtual machines, and has offloaded Nutanix Flow Virtual Networking to BlueField DPUs to free GPU cycles and sustain throughput without compromising security.

The solution supports instant deployment of NVIDIA NIM microservices and open-source models including Nemotron, and integrates an AI gateway that governs access to frontier cloud LLMs from Anthropic, Google, OpenAI, and others. The gateway also implements the Model Context Protocol (MCP), allowing agents to connect to enterprise data with granular access controls. The solution runs on Cisco infrastructure, so organizations can deploy on hardware they already operate.
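The article doesn't specify the gateway's interface, but conceptually it sits between agents and models, enforcing policy on each request. The sketch below is purely illustrative; the agent names, policy table, and route function are invented for this example and are not the Nutanix AI gateway API:

```python
# Hypothetical sketch of gateway-style access control; not a real gateway API.

ALLOWED_MODELS = {
    # agent identity -> models its requests may be routed to (illustrative policy)
    "hr-assistant":     {"local/nemotron"},                        # sensitive data stays on-prem
    "research-copilot": {"local/nemotron", "cloud/frontier-llm"},  # may also call cloud LLMs
}

def route(agent: str, model: str, prompt: str) -> str:
    """Admit or reject an inference request based on the agent's policy."""
    if model not in ALLOWED_MODELS.get(agent, set()):
        raise PermissionError(f"{agent} is not permitted to call {model}")
    # A real gateway would forward the request, log it, and meter tokens here.
    return f"routed {len(prompt)} chars from {agent} to {model}"

print(route("hr-assistant", "local/nemotron", "Summarize this policy document."))
```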

"By integrating everything from the AHV hypervisor and Flow Virtual Networking up to the Kubernetes platform, you remove the silos that slow down AI projects," Sengupta explains.

Platform teams and developer agility cannot be traded off against each other

One organizational tension that scales with agentic AI adoption is the relationship between platform teams managing shared infrastructure and the developers building and running agent applications on top of it. These groups have historically operated with different tooling, different priorities, and different time horizons, but Sengupta argues that the core dynamic hasn't changed even as the technology has.

"Platform teams will continue to deliver a catalog of self-service AI capabilities that are also compliant to business needs, that they can serve to agentic AI builders," Sengupta says. "Mature AI teams will do a great job not just in GPU utilization, but in creating an operating model that enables fast AI infrastructure delivery to meet the pace of innovation that developers want. That's what is very critical to success."

The organizations that are managing GPU utilization most effectively tend to be further along in their AI adoption journey, with more established operating models and clearer cost accountability. For organizations earlier in that journey, the infrastructure design and operating model decisions being made now will determine whether AI projects can move from pilot to production without cost or complexity becoming the limiting factor.

The AI factory operating model

The emerging framework for enterprise AI infrastructure is the AI factory, a purpose-built environment for producing and running AI workloads at scale. The challenge is that most organizations will need to operate both traditional compute and accelerated compute simultaneously for years, requiring a common operating model that spans both technology paradigms without sacrificing agility.

With Nutanix running on Cisco as part of Cisco AI Pods, powered by Intel and optimized for the NVIDIA reference architecture, organizations get a production-ready, full-stack foundation: AI factories that can be securely and efficiently shared by thousands of agents to achieve the lowest cost per token. The solution bridges the gap between the infrastructure and platform engineering teams who manage the hardware and the AI engineering and agentic AI developer teams who build and run agentic applications, making it affordable to run AI at massive scale.

"The metrics that will determine whether an organization can sustain and scale its AI investment — cost per token, GPU utilization, scheduling efficiency — are infrastructure metrics," Sengupta says. "Managing them well is increasingly a precondition for making AI viable, not just functional."

Secure and scale your AI factory — explore the full-stack approach here.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected].
