For years, AI progress was fueled by two predictable forces: faster chips and denser transistors. But those tailwinds have weakened. As Moore’s Law fades and models scale into the hundreds of billions of parameters, the limits of today’s software stack have become the dominant barrier to performance. The way we build, optimize and deploy AI systems simply wasn’t designed for this world.
The result is a paradox in which AI hardware grows more powerful while utilization often drops. Many teams report low utilization across clusters containing thousands of graphics processing units (GPUs), with a significant share of compute capacity remaining idle due to data movement, memory bottlenecks and brittle, hand-tuned kernels.
To understand how we got here and what comes next, we need to examine the structural issues in the modern AI stack.
The Fragmented Stack Beneath AI
Today’s AI software ecosystem has become a sprawling, brittle tower of abstractions. Each hardware vendor builds its own CUDA-like environment with hundreds of libraries. On top of that sit PyTorch, JAX, TensorFlow, MXNet, vLLM, SGLang, serving engines and orchestration layers. The integration surface area is massive.
Worse still, full coverage is practically impossible to maintain. Every additional chip vendor, library and model architecture multiplies the number of combinations that need tuning. Even with extreme automation, it is not feasible to hand-tune kernels for every permutation.
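A back-of-the-envelope count illustrates the problem. The figures below are hypothetical, not a survey of the real ecosystem, but the multiplicative growth is the point:

```python
# Rough estimate of how kernel-tuning work multiplies across the stack.
# All counts are illustrative assumptions, not measured ecosystem data.
hardware_targets = 6      # e.g. GPU generations and vendor accelerators
frameworks = 5            # e.g. PyTorch, JAX, TensorFlow, vLLM, SGLang
operator_families = 40    # attention, matmul, norms, embeddings, ...
precisions = 4            # fp32, fp16, bf16, fp8
shape_regimes = 10        # batch sizes / sequence lengths worth tuning separately

variants = hardware_targets * frameworks * operator_families * precisions * shape_regimes
print(f"Kernel variants to hand-tune: {variants:,}")  # 48,000 in this sketch
```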
This is how we ended up with a two-tier developer ecosystem:
- A small group of kernel specialists who can extract performance
- Everyone else, who must accept suboptimal defaults
That divide is widening as hardware grows more heterogeneous.
Three Forces Breaking the Current Model
A closer look at the state of AI infrastructure reveals three forces that have pushed the current software stack to its limits.
1. The End of Moore’s Law
Transistors are no longer becoming meaningfully cheaper or faster, and this has changed how performance gains are achieved. Chipmakers now combine many specialized compute units inside a single package, including tensor cores, matrix engines, and other accelerators. This level of heterogeneity requires far more complex software just to operate the hardware effectively.
2. The Memory Wall
Data movement now costs orders of magnitude more than computation. Because modern AI workloads rarely fit inside the memory of a single device, data must constantly flow across interconnects and memory hierarchies. Much of a system's execution time goes to managing that movement rather than performing useful work, and even a well-written kernel can stall on exposed memory latency, making overall performance worse than the raw compute figures suggest.
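A rough roofline-style calculation makes the imbalance concrete. The peak throughput and bandwidth figures below are illustrative, round numbers for a modern accelerator; what matters is the ratio between them:

```python
# Roofline-style sketch: is an operation limited by compute or by memory traffic?
# Peak numbers are illustrative assumptions, not any specific device's spec sheet.
PEAK_FLOPS = 1000e12      # 1,000 TFLOP/s of low-precision matrix throughput
PEAK_BW = 3e12            # 3 TB/s of HBM bandwidth

def bound_by(flops: float, bytes_moved: float) -> str:
    """Classify an op by comparing its arithmetic intensity to the machine balance."""
    intensity = flops / bytes_moved        # FLOPs the op performs per byte it moves
    balance = PEAK_FLOPS / PEAK_BW         # FLOPs per byte the machine needs (~333 here)
    return "compute-bound" if intensity >= balance else "memory-bound"

# Element-wise add over 1B fp16 values: 1 FLOP per element, ~6 bytes moved per element.
n = 1_000_000_000
print("elementwise add:", bound_by(n, 6 * n))                 # memory-bound

# Large matmul (4096^3): 2*N^3 FLOPs, ~3 fp16 matrices of N^2 elements moved.
N = 4096
print("4096x4096 matmul:", bound_by(2 * N**3, 3 * 2 * N**2))  # compute-bound
```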
3. Extreme Scaling
Training requirements have grown exponentially. What once ran on two GPUs now demands tens of thousands, or even hundreds of thousands. This level of scale is unprecedented in computing, yet the software that coordinates these clusters was never designed to handle systems of this size.
These forces have collectively turned kernels into the new assembly language. They remain essential for performance, but they are deeply tied to specific devices, memory layouts and topologies. Any meaningful change in hardware often requires a rewrite. That approach cannot support the future of AI at scale.
Why Kernels Are the Wrong Abstraction for the AI Era
Hand-tuned kernels provide peak performance only under narrow conditions. They were designed for workloads that are compute-bound, limited to a single device and largely embarrassingly parallel.
Modern AI workloads look very different. They are memory-bound, span hundreds or thousands of GPUs, rely on overlapping communication and compute, and must accommodate dynamic graphs, sparse access patterns and multiple forms of model parallelism.
Most organizations spend months rewriting and refining kernels just to deploy a model. By the time a system is optimized for one hardware stack, the hardware may have already changed.
This is why so many AI teams report low utilization. In many large AI systems, the majority of execution time is consumed by moving data through the memory hierarchy rather than performing actual computation. Memory operations and synchronization often dominate the workload, leaving only a small fraction devoted to the mathematical work that produces results.
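Using the same illustrative hardware figures as the roofline sketch above, a short estimate shows how little of the peak math throughput a bandwidth-limited step can reach:

```python
# Sketch: why a bandwidth-limited step leaves most of the math units idle.
# Hardware peaks are illustrative assumptions, not measurements.
PEAK_FLOPS = 1000e12   # 1,000 TFLOP/s
PEAK_BW = 3e12         # 3 TB/s

def achievable_utilization(arithmetic_intensity: float) -> float:
    """Fraction of peak FLOPs reachable when every byte must stream from memory."""
    attainable = min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW)
    return attainable / PEAK_FLOPS

# A memory-bound phase with ~2 FLOPs per byte moved (an assumed, plausible value)
print(f"{achievable_utilization(2.0):.1%} of peak FLOPs")  # 0.6%
```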
The software stack has become the bottleneck, not the GPUs.
Toward a New Software Foundation
The limitations of hardware-specific kernels have made it clear that future performance gains must come from the software layer itself. Many teams are shifting toward models in which the compiler, rather than the developer, becomes the primary engine of optimization. This approach focuses on building a hardware-aware, parallel optimizing compiler architecture that can adapt to any system.
A next-generation model of this kind has several defining characteristics.
Hardware Abstraction
Instead of hard-coding kernels for each device, developers describe intent and rely on software that can understand the underlying hardware. The system evaluates parameters such as core counts, bandwidth, memory capacity and topology, then automatically selects the best execution strategy.
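A minimal sketch of what "describe intent, let the software decide" might look like in practice. The device descriptor, thresholds and strategy names below are hypothetical illustrations, not any vendor's API:

```python
# Intent-based dispatch sketch: the developer states what to compute, and a
# hardware-aware layer chooses how. All fields and heuristics are hypothetical.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    compute_tflops: float      # peak matrix throughput
    mem_bandwidth_tbps: float  # peak memory bandwidth
    mem_capacity_gb: float
    num_devices: int           # devices reachable over the local interconnect

def plan_matmul(m: int, n: int, k: int, bytes_per_elem: int, dev: Device) -> str:
    """Pick an execution strategy from the problem shape and device characteristics."""
    working_set_gb = (m * k + k * n + m * n) * bytes_per_elem / 1e9
    if working_set_gb > dev.mem_capacity_gb:
        return f"shard across {dev.num_devices} devices (tensor parallel)"
    intensity = (2 * m * n * k) / ((m * k + k * n + m * n) * bytes_per_elem)
    balance = (dev.compute_tflops * 1e12) / (dev.mem_bandwidth_tbps * 1e12)
    if intensity < balance:
        return "fuse with neighboring ops to raise arithmetic intensity"
    return "single-device tiled kernel"

gpu = Device("accelerator-x", compute_tflops=1000, mem_bandwidth_tbps=3,
             mem_capacity_gb=80, num_devices=8)
print(plan_matmul(8192, 8192, 8192, 2, gpu))  # single-device tiled kernel
print(plan_matmul(1, 8192, 65536, 2, gpu))    # fuse with neighboring ops ...
```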
Graph-Level Optimization
Rather than viewing computation as a sequence of isolated kernels, the entire workload is represented as a parallel graph. This enables fusion, partitioning, operator reuse and communication-aware scheduling. Each of these steps is tailored to the actual characteristics of the hardware.
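To make fusion concrete, here is a minimal sketch assuming a toy graph IR with only two op kinds. Real compilers reason about many more properties (shapes, layouts, device placement), but the principle of collapsing memory-bound chains is the same:

```python
# Graph-level fusion sketch: chains of element-wise ops are merged so their
# intermediates never round-trip through device memory. The tiny IR here is
# hypothetical, not any framework's actual representation.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str   # "matmul" or "elementwise" in this toy IR

def fuse_elementwise(graph: list[Op]) -> list[str]:
    """Greedily merge consecutive element-wise ops into single fused kernels."""
    fused, run = [], []
    for op in graph:
        if op.kind == "elementwise":
            run.append(op.name)
            continue
        if run:
            fused.append("fused(" + "+".join(run) + ")")
            run = []
        fused.append(op.name)
    if run:
        fused.append("fused(" + "+".join(run) + ")")
    return fused

graph = [Op("matmul0", "matmul"), Op("bias_add", "elementwise"),
         Op("gelu", "elementwise"), Op("matmul1", "matmul"),
         Op("residual_add", "elementwise"), Op("dropout", "elementwise")]
print(fuse_elementwise(graph))
# ['matmul0', 'fused(bias_add+gelu)', 'matmul1', 'fused(residual_add+dropout)']
```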
Communication-Optimal Code Generation
Performance at scale depends less on faster arithmetic and more on smarter data movement. The software layer must generate code that stages data efficiently, minimizes synchronization, overlaps communication with compute and adapts to both multi-device and multi-node environments. These decisions cannot be made reliably by hand, especially in large systems.
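A simple timing model shows why overlap matters. The per-chunk costs below are assumed values, and the schedule is illustrative rather than a call into a real communication library:

```python
# Timing-model sketch: overlapping communication with compute via pipelining.
# Costs per chunk are illustrative assumptions.

def step_time(chunks: int, compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Total time to process `chunks`, each needing one transfer and one compute."""
    if not overlap:
        # Serial schedule: wait for each transfer, then compute on it.
        return chunks * (comm_ms + compute_ms)
    # Pipelined schedule: prefetch chunk i+1 while computing on chunk i,
    # so steady-state cost is the larger of the two phases.
    steady_state = (chunks - 1) * max(comm_ms, compute_ms)
    return comm_ms + steady_state + compute_ms

print("serial:   ", step_time(16, compute_ms=2.0, comm_ms=1.5, overlap=False), "ms")  # 56.0
print("pipelined:", step_time(16, compute_ms=2.0, comm_ms=1.5, overlap=True), "ms")   # 33.5
```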
Dynamic Runtime
Large, heterogeneous clusters require a runtime that can make fine-grained decisions in real time. Static scheduling breaks down at scale, so the runtime must respond to bandwidth availability, topology differences and memory pressures as they occur.
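As a sketch, a runtime placement decision driven by live telemetry might look like the following; the fields, weights and thresholds are hypothetical:

```python
# Dynamic-runtime sketch: place the next unit of work using live telemetry
# rather than a static schedule. All fields and cost weights are hypothetical.
from dataclasses import dataclass

@dataclass
class WorkerState:
    device_id: int
    free_memory_gb: float
    link_utilization: float   # 0.0 (idle interconnect) .. 1.0 (saturated)
    queue_depth: int          # tasks already waiting on this device

def place_task(workers: list[WorkerState], task_memory_gb: float) -> int:
    """Choose a device for the next task, penalizing memory pressure and busy links."""
    eligible = [w for w in workers if w.free_memory_gb >= task_memory_gb]
    if not eligible:
        raise RuntimeError("no device has enough free memory; spill or reshard")
    def cost(w: WorkerState) -> float:
        return w.queue_depth + 2.0 * w.link_utilization
    return min(eligible, key=cost).device_id

workers = [WorkerState(0, 4.0, 0.9, 3),
           WorkerState(1, 12.0, 0.2, 4),
           WorkerState(2, 30.0, 0.1, 1)]
print(place_task(workers, task_memory_gb=8.0))  # device 2
```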
Together, these elements point toward a new abstraction for AI compute. Developers need a single software layer that can program any system, at any scale, through a unified and hardware-aware interface. This is the direction the industry must pursue to overcome the limits of traditional kernel-based performance models.
The Path Forward for AI Infrastructure
The AI industry is rapidly approaching an inflection point. Hardware diversity will continue to grow as new accelerators enter the market. Models will get larger, more dynamic and more specialized. And data movement, not raw floating-point operations per second (FLOPs), will continue to be the dominant constraint on scaling.
The future will not be won through more kernel libraries or deeper vendor lock-in. It will be won through software that can understand hardware, reason about it and optimize for it in ways humans cannot.
If the last decade of AI was about accelerated computing, the next decade will be about accelerated software—software that is intelligent, adaptable and capable of translating high-level intent into optimal execution across any device.
That shift is already underway. And it may be the key to unlocking the next generation of AI performance.
