AI is driving demand for high-performance data centers, and the shift is not only about scale and capability but about how they are built. New server form factors keep delivering greater density and compute power, with no sign of slowing down. Data center design needs a rethink.
One issue merits special attention: thermal management. Each new generation of CPUs and GPUs brings higher thermal density. When NVIDIA unveiled its latest roadmap (Vera Rubin, followed by Rubin Ultra), it set the stage for data center racks requiring up to 600 kilowatts of power by 2027. That is a dramatic jump from the 10 to 20 kilowatts of an average cloud deployment, or the 40 to 100+ kilowatts of today's AI and HPC workloads.
The stark reality is that at those power densities, liquid cooling is the only solution. You simply cannot cool GPUs effectively without it.
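To put those densities in perspective, here is a back-of-the-envelope sketch comparing the volume of air versus water needed to carry away 600 kW of rack heat. The 15 K coolant temperature rise is an illustrative assumption; the fluid properties are standard room-temperature values.

```python
# Back-of-the-envelope: fluid flow needed to remove 600 kW of rack heat.
# Assumes a 15 K inlet-to-outlet temperature rise (illustrative).

RACK_HEAT_W = 600_000  # the 600 kW rack figure from the roadmap above
DELTA_T_K = 15         # assumed coolant temperature rise

FLUIDS = {
    # name: (density kg/m^3, specific heat J/(kg*K)), at ~25 C
    "air":   (1.2, 1005),
    "water": (997, 4186),
}

for name, (rho, cp) in FLUIDS.items():
    # Q = rho * V_dot * cp * dT  ->  V_dot = Q / (rho * cp * dT)
    v_dot = RACK_HEAT_W / (rho * cp * DELTA_T_K)  # m^3/s
    print(f"{name:>5}: {v_dot:.4f} m^3/s")

# air:   ~33 m^3/s, far beyond what rack fans can realistically move
# water: ~0.0096 m^3/s (about 9.6 L/s), a modest pumped loop
```

Water carries roughly 3,500 times more heat per unit volume than air, which is why a 600 kW rack is tractable with a pumped liquid loop and effectively impossible with airflow alone.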
Cloud service providers (CSPs), hyperscalers, and enterprises are making unprecedented investments in AI compute to secure a competitive edge. But those investments lose value if the compute cannot be driven at full utilization. Throttling from hitting a thermal ceiling, not only on GPUs and CPUs but on other components in the rack, can significantly impact return on invested capital (ROIC).
Without next-generation cooling solutions, these massive capital expenditures may yield only 'good enough' compute instead of delivering the full promise of faster, more efficient performance. Put another way, at this scale thermal management isn't just a performance issue; it directly affects capital efficiency.
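To make the capital-efficiency point concrete, consider a deliberately simple, hypothetical example. The cluster cost, amortization window, and throttled utilization below are assumptions chosen only to show the arithmetic.

```python
# Illustrative only: how thermal throttling inflates the effective
# cost of delivered compute. All inputs are hypothetical.

capex_usd = 50_000_000          # hypothetical GPU cluster cost
lifetime_hours = 3 * 365 * 24   # 3-year amortization window

for utilization in (1.00, 0.85):  # unthrottled vs. thermally throttled
    delivered_hours = lifetime_hours * utilization
    cost_per_hour = capex_usd / delivered_hours
    print(f"utilization {utilization:.0%}: "
          f"${cost_per_hour:,.2f} per delivered compute-hour")
```

In this sketch, losing 15 points of utilization to thermal ceilings raises the effective cost of every delivered compute-hour by about 18 percent, value destroyed by the cooling system rather than the silicon.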
Tank immersion offers impressive thermal efficiency, but it is challenging to operate in a production environment and requires considerable space. Direct-to-chip solutions, meanwhile, have yet to demonstrate they can scale to the power and heat output of next-generation silicon.
New liquid cooling technologies are tackling these thermal management challenges head-on. Combining the best of today's direct-to-chip and tank immersion systems, next-generation liquid cooling takes cooling GPUs and CPUs to the next level in a hybrid form factor that is more flexible and easier to install. These new designs remove the tank but deliver the same level of cooling efficiency.
This new approach to cooling AI data centers goes beyond the chip, cooling all the IT in the chassis while using significantly less power and nearly no water. A hybrid approach to liquid cooling can, for example, increase compute density four to six times compared with other cooling methods. Even more compelling, it can allow servers to run with zero throttling, so compute workloads operate at 100 percent utilization and can even support overclocking.
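As a rough sketch of where a density multiple like that can come from, compare servers per rack under different thermal ceilings. The ceilings and per-server power draw below are illustrative assumptions, not measured figures for any specific product.

```python
# Illustrative: servers per rack as a function of the rack's thermal
# ceiling. Ceilings and server power draw are assumed values.

SERVER_KW = 10  # hypothetical 8-GPU AI server drawing ~10 kW

RACK_CEILINGS_KW = {
    "air-cooled":           20,   # assumed air-cooled rack limit
    "hybrid liquid-cooled": 100,  # assumed liquid-cooled ceiling
}

baseline = RACK_CEILINGS_KW["air-cooled"] // SERVER_KW
for method, ceiling_kw in RACK_CEILINGS_KW.items():
    servers = ceiling_kw // SERVER_KW
    print(f"{method}: {servers} servers/rack "
          f"({servers / baseline:.0f}x density)")
```

Under these assumptions the liquid-cooled rack hosts five times as many servers in the same footprint, squarely within the four-to-six-times range.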
Designed for scalability and ease of maintenance, new hybrid liquid cooling racks can also reduce capital burden. Because they work with standard servers, there is no need to replace existing hardware or find additional space, as there is with tank immersion cooling. For colocation data centers, this gives tenants with diverse hardware setups access to liquid cooling that adapts to a wide range of needs.
For CSPs, hyperscalers, and enterprises, maximizing compute per watt is the common denominator. For enterprises, the metric is harder to measure and varies by industry and application. Hyperscalers face massive demand and need to deliver the most compute at the best value; if they have to invest in a build-out, it eats into time to market. CSPs may face supply constraints yet still need to maximize delivered compute at a lower cost per FLOP. Across the board, despite the differing pressures, the goal is the same: the best compute power at the lowest wattage.
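That goal can be made concrete with Power Usage Effectiveness (PUE), the ratio of total facility power to IT power. Under a fixed utility feed, every watt of cooling overhead directly displaces compute. The feed size and PUE values below are illustrative assumptions.

```python
# Illustrative: under a fixed utility feed, cooling overhead (PUE)
# directly displaces power available for compute. Values are assumed.

FACILITY_MW = 10.0  # hypothetical fixed utility feed

for label, pue in (("air-cooled hall", 1.5),
                   ("efficient liquid cooling", 1.1)):
    it_mw = FACILITY_MW / pue  # PUE = total facility power / IT power
    print(f"{label} (PUE {pue}): {it_mw:.2f} MW available for compute")
```

Dropping PUE from 1.5 to 1.1 frees roughly 36 percent more power for compute from the same feed, with no new construction.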
It’s more important than ever to think beyond cooling just the chip and prepare a thermal management strategy that won’t hit the ceiling. Data center leaders looking to move beyond the limitations of immersion and direct-to-chip cooling should consider taking a hybrid approach. Every watt wasted on inefficient cooling is power lost for compute.