Liquid-cooled AI systems expose the limits of traditional storage architecture
Presented by Solidigm
Liquid cooling is rewriting the rules of AI infrastructure, but most deployments have not fully crossed the line. GPUs and CPUs have moved to liquid cooling, while storage still depends on airflow, creating an operationally inefficient hybrid architecture.
What appears to be a pragmatic transition strategy is, in practice, a structural liability.
“A hybrid cooling approach is an operationally inefficient situation,” explains Hardeep Singh, thermal-mechanical hardware team manager at Solidigm. “You’re paying for and maintaining two entirely separate, expensive cooling infrastructures, and could be exposed to the worst-of-both-worlds problems.”
While liquid cooling requires pumps, fluid manifolds, and coolant distribution units (CDUs), air-cooled components require CRAC units, cold aisles, and evaporative cooling towers. Organizations moving to a hybrid solution by just adding some liquid cooling are absorbing the cost premium without capturing the full TCO benefit.
The thermal physics makes things worse. Bulky liquid-cooling cold plates, thick hoses, and manifolds physically obstruct airflow inside the GPU server chassis. This concentrates thermal stress on the remaining air-cooled components, including storage drives, memory, and network cards, because server fans cannot push adequate airflow around the liquid plumbing. The components most reliant on fans end up in the worst possible thermal environment.
Water consumption is an equally serious problem that is all but ignored. Traditional air-cooled components rely on server fans to move heat into ambient air, which is then absorbed by a water loop and pumped to evaporative cooling towers. These systems can consume millions of gallons of water over time. As rack power densities continue to climb to support modern AI workloads, the evaporative water penalty becomes, as Singh puts it, “environmentally and economically indefensible.”
As AI infrastructure evolves toward liquid-cooled and fanless GPU systems, the true constraints on scale are shifting from compute performance to system-level thermal design. Modern AI platforms are no longer built server by server; they are engineered as tightly integrated rack- and pod-level systems where power delivery, cooling distribution, and component placement are inseparable.
In this environment, storage architectures designed for airflow-dependent data centers are becoming a limiting factor. As GPU platforms move fully into shared liquid-cooling domains, anchored by rack-level CDUs, every component in the system must operate natively within the same thermal and mechanical design. Storage can no longer rely on isolated cooling paths or bespoke thermal assumptions without introducing inefficiency, complexity, or density trade-offs at the system level.
Why storage is no longer a passive subsystem
For infrastructure leaders, this marks a fundamental transition. Storage is no longer a passive subsystem attached to compute, but instead an active participant in system-level cooling, serviceability, and GPU utilization. The ability to scale AI now depends on whether storage can integrate cleanly into liquid-cooled GPU systems, without fragmenting cooling architectures or constraining rack-level design.
And the race to scale AI is no longer just about who has the most GPUs, but instead about who can keep them cool, says Scott Shadley, director of leadership narrative and evangelist at Solidigm.
“Finding a way to enable liquid-cooled storage while still making it user serviceable has been one of the biggest challenges in designing fanless system solutions,” Shadley says. “As AI workloads evolve, the pressure on storage will only intensify.”
Techniques like KV cache offload, which move data between GPU memory and high-speed storage during inference, make storage latency and thermal performance directly relevant to model serving efficiency. In these architectures, a storage subsystem that throttles under thermal load because of inadequate airflow slows down both reads and the model itself.
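To make the throttling effect concrete, here is a minimal sketch of how long it takes to read an offloaded KV cache back from storage. The function name, the cache size, the throughput figure, and the throttle factor are all hypothetical values chosen for illustration; they are not Solidigm or NVIDIA measurements.

```python
def kv_reload_seconds(cache_gb: float, read_gbps: float,
                      throttle_factor: float = 1.0) -> float:
    """Estimate the time to read an offloaded KV cache back from storage.

    cache_gb:        size of the offloaded cache in gigabytes (assumed)
    read_gbps:       nominal sustained read throughput in GB/s (assumed)
    throttle_factor: fraction of nominal throughput the drive retains
                     when thermally throttled (1.0 means no throttling)
    """
    if not 0.0 < throttle_factor <= 1.0:
        raise ValueError("throttle_factor must be in (0, 1]")
    if cache_gb < 0 or read_gbps <= 0:
        raise ValueError("cache_gb must be >= 0 and read_gbps must be > 0")
    return cache_gb / (read_gbps * throttle_factor)


# Hypothetical numbers: an 8 GB cache on a drive rated at 10 GB/s.
nominal = kv_reload_seconds(8.0, 10.0)          # drive running at full speed
throttled = kv_reload_seconds(8.0, 10.0, 0.5)   # drive throttled to half speed
print(f"nominal: {nominal:.2f} s, throttled: {throttled:.2f} s")
```

Under these assumptions, a drive throttled to half its throughput doubles the reload time, and the GPU waits for that data before it can continue serving the request.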
Moving to integrated liquid cooling
Moving from traditional air-cooled GPU servers to integrated liquid-cooled racks improves power usage effectiveness (PUE) and reduces operational costs for the data center. It also replaces the noisy computer room air handler (CRAH) with a modern, efficient liquid CDU, with the potential to eliminate chillers if racks can be cooled at a liquid temperature of 45° Celsius.
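PUE is conventionally defined as total facility power divided by the power delivered to IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The sketch below applies that ratio to two hypothetical facilities; the kilowatt figures are illustrative assumptions, not data from the article.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Return PUE: total facility power divided by IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("it_equipment_kw must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT power")
    return total_facility_kw / it_equipment_kw


# Hypothetical facilities, each with 1,000 kW of IT load.
air_cooled = pue(total_facility_kw=1500.0, it_equipment_kw=1000.0)
liquid_cooled = pue(total_facility_kw=1200.0, it_equipment_kw=1000.0)
print(f"air-cooled PUE: {air_cooled:.2f}, liquid-cooled PUE: {liquid_cooled:.2f}")
```

In this made-up comparison, the same IT load needs 300 kW less overhead in the liquid-cooled facility, which is the kind of gap the operational cost argument rests on.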
When storage is cooled by liquid in the absence of fans, it must also remain serviceable with no liquid leakage. The move to integrated liquid cooling also creates a new requirement that many infrastructure teams are only beginning to grapple with: every component in the rack must operate natively within the same cooling architecture.
Storage as an active participant in system design
Storage design is no longer an isolated engineering problem. It is a direct variable in GPU utilization, system reliability, and operational efficiency. The solution is to redesign storage from the ground up for liquid-cooled, fanless environments. This is harder than it sounds. Traditional SSD design assumes airflow for thermal management and places components on both sides of a thermally insulating PCB. Neither assumption holds in a CDU-anchored architecture.
“SSDs need to be designed with a best-in-class thermal solution to specifically conduct heat from internal components efficiently and transfer it to fluid,” says Singh. “The design must include a low-resistance path for heat to transfer to a single cold plate attached on one side.”
At the same time, drives must support serviceability without liquid leakage during insertion and removal, and without degrading the thermal interface between the drive and the cold plate.
Solidigm has worked with NVIDIA to address SSD liquid-cooling challenges, such as hot-swappability and single-side cooling, reducing the thermal footprint of storage within the shared liquid loop and ensuring that GPUs receive their proportional share of coolant.
“If storage is not designed for a liquid-cooled environment efficiently, it will either throttle to lower performance or require more liquid volume,” he says. “Which directly and indirectly leads to under-utilization of GPU capability.”
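One simplified way to see the coolant trade-off is the steady-state heat balance Q = m·c_p·ΔT: the flow needed to carry away a given heat load rises as the allowable coolant temperature rise shrinks. The sketch below assumes plain water (real loops often use water-glycol mixtures with different properties), and the 25 W drive load and temperature rises are hypothetical values for illustration only.

```python
WATER_CP_J_PER_KG_K = 4186.0   # specific heat of water, approximate
WATER_DENSITY_KG_PER_L = 1.0   # density of water, approximate


def coolant_lpm(heat_w: float, delta_t_k: float,
                cp: float = WATER_CP_J_PER_KG_K,
                density: float = WATER_DENSITY_KG_PER_L) -> float:
    """Coolant flow in liters per minute needed to remove heat_w watts
    with a coolant temperature rise of delta_t_k kelvin (Q = m * cp * dT)."""
    if heat_w < 0 or delta_t_k <= 0:
        raise ValueError("heat_w must be >= 0 and delta_t_k must be > 0")
    mass_flow_kg_per_s = heat_w / (cp * delta_t_k)
    return mass_flow_kg_per_s / density * 60.0


# Hypothetical 25 W drive: an efficient thermal path that tolerates a 10 K
# coolant rise versus a poor path that only tolerates a 5 K rise.
efficient = coolant_lpm(heat_w=25.0, delta_t_k=10.0)
inefficient = coolant_lpm(heat_w=25.0, delta_t_k=5.0)
print(f"efficient: {efficient:.4f} L/min, inefficient: {inefficient:.4f} L/min")
```

Halving the allowable temperature rise doubles the flow the drive needs. In a shared loop with a fixed CDU flow budget, that extra flow is no longer available to the GPUs.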
Alignment on standards and the path to interoperability
Solidigm is not working on this in isolation. The broader industry is coalescing around standards to ensure liquid-cooled AI systems are interoperable rather than a patchwork of custom solutions. The SNIA and the Open Compute Project (OCP) are the primary bodies driving this work.
Solidigm led the industry standard for liquid cooling in SFF-TA-1006 for the E1.S form factor and is an active participant in OCP work streams covering rack design, thermal management, and sustainability. Custom, bespoke cooling solutions for storage are giving way to standards-aligned, production-ready designs that integrate cleanly into liquid-cooled GPU platforms.
“There are several organizations involved in this work,” says Shadley, who is also a SNIA board member. “They started with component-level solutions, driven heavily by SNIA and the SFF TA TWG. The next level is solution-level work, which is currently being heavily driven by OCP.”
Solidigm’s roadmap is leading the way
The design rules for system-level architectures have changed with the advent of liquid and immersion cooling technologies, which allow for new design approaches and remove some barriers. The ability to build NVMe SSD-only platforms also removes the platter-based box constraint that exists with HDD solutions, Shadley says.
“Solidigm customers have an active and lead role in roadmap decisions for our products due to their deep technical alignment with the ecosystem,” he says. “We do not simply make and sell products, we integrate, co-design, co-develop, and innovate with and alongside our partners, customers, and their customers.”
Adds Singh: “Solidigm’s key strength is innovation and customer-inspired system level engineering. This will continue to aggressively lead the way for liquid cooling adoption for storage.”
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected].