Inside Aegaeon, how Alibaba Cloud cut GPU needs for AI
Alibaba Cloud says it has sharply reduced the number of Nvidia graphics processors required to serve large language models at scale, thanks to a new system called Aegaeon. In a production beta across its model marketplace, the company reports that Aegaeon cut the number of Nvidia H20 GPUs needed to handle dozens of models, including ones up to 72 billion parameters, from 1,192 to 213. That is an 82 percent reduction in deployed accelerators for the same level of service. The results were presented at the Symposium on Operating Systems Principles in Seoul, a leading venue for computer systems research. For cloud buyers, the claim translates to potential savings on cost and power, better availability of scarce chips, and a more consistent experience when calling different models through the same platform.
- What problem is Aegaeon built to solve?
- How the system works
- Measured results from production beta
- What it means for cloud customers and developers
- Nvidia, China and the AI chip backdrop
- How Aegaeon compares with other approaches
- Open technical questions and next steps
- Quick Facts
The achievement targets a growing pain point in the era of model marketplaces. Cloud providers now serve thousands of models to developers at once. A few models, such as Alibaba's Qwen family and DeepSeek's open models, receive most of the traffic. Many others sit idle for long stretches, then receive brief bursts of requests. Traditional serving stacks dedicate full GPUs or large GPU slices to each model instance. When a model is quiet, its reserved GPU sits underused. When calls arrive without enough warm replicas, the platform spins up capacity and pays the price in load time and lower throughput. Aegaeon attempts to rewrite this math by pooling GPU resources across many models, then scheduling work in finer time slices so each accelerator stays busy.
Alibaba Cloud frames Aegaeon as a way to serve concurrent LLM workloads without waste. The system blends aggressive multi model scheduling with mechanisms to keep memory hot and to switch work between models with minimal delay. The result, according to the research team from Peking University and Alibaba Cloud, is higher throughput, lower queuing, and fewer chips parked on lightly used services.
What problem is Aegaeon built to solve?
Modern AI platforms face a long tail of demand. A small set of popular models attracts most inference traffic. Less popular models still matter to developers, but they receive sporadic calls. Alibaba Cloud measured that 17.7 percent of GPUs in its marketplace had been allocated to serve only 1.35 percent of requests. With models pinned to dedicated accelerators, the cloud ends up paying for idle time and for frequent cold starts, especially when traffic patterns are bursty or unpredictable.
Serving inference looks simple from the outside, but the internals are intricate. A large language model holds weights and runtime state in GPU memory. During generation, the system maintains a key value cache that stores intermediate activations for already produced tokens. Loading model weights from disk or across the network is slow. Warming the key value cache takes time. Switching a GPU from one model to another traditionally meant expensive unloading and reloading. That overhead discouraged frequent sharing, which kept most accelerators dedicated to one or two models at best. Aegaeon is designed to manage these transitions far more efficiently so that GPUs can time share across many models without a heavy penalty.
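To see why naive switching is so costly, it helps to run a rough back-of-the-envelope estimate of how long it takes just to copy a model's weights onto an accelerator. The sketch below uses assumed, illustrative numbers rather than figures from the paper: FP16 weights at 2 bytes per parameter and an effective host-to-device copy rate of about 50 GB/s.

```python
# Rough, illustrative estimate of weight-copy time during a model swap.
# The bandwidth and precision below are assumptions, not measured values.

def reload_seconds(params_billion: float, bytes_per_param: int = 2,
                   copy_gbps: float = 50.0) -> float:
    """Time to move a model's weights to GPU memory at an assumed rate."""
    weight_gb = params_billion * bytes_per_param   # e.g. 7B params at FP16 ~ 14 GB
    return weight_gb / copy_gbps                   # seconds at copy_gbps GB/s

for size in (7, 14, 72):
    print(f"{size}B params: ~{reload_seconds(size):.1f} s just to copy weights")
```

Even this optimistic lower bound reaches several seconds for the largest models, and a real cold start adds disk or network reads, allocator work, and key value cache warm-up on top of the copy, which is why frequent unloading and reloading makes fine-grained sharing impractical without careful system support.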
How the system works
Aegaeon is a multi model serving stack that treats the GPU as a shared pool and decides what to run at very fine granularity. Instead of scaling at the request level or at the model replica level, it makes decisions per token. The scheduler can interleave the generation of tokens for different users and different models on the same GPU, while keeping the high value data structures ready in memory. The goal is to maximize useful work per second without hurting response time beyond a tight budget.
Token level auto scaling
The core idea is to scale capacity at the token level. That means Aegaeon can start serving a request for one model, then opportunistically schedule tokens for another model when there is headroom, folding the extra work into the same batches and kernel launches. The paper reports that Aegaeon sustained 2 to 2.5 times higher request arrival rates, or delivered between 1.5 and 9 times more goodput, compared with alternative pools that only multiplex two or three models per GPU. By operating at token resolution, the system avoids the long pauses associated with starting new replicas and can respond quickly to spikes on any one model.
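The following sketch is a minimal illustration of what per-token interleaving looks like, written as plain Python with invented names such as Request and decode_one_token; it is not Aegaeon's scheduler, which additionally batches requests per model, preempts work, and enforces latency targets.

```python
# Minimal sketch of token-level interleaving across models on one shared GPU.
# Request and decode_one_token are illustrative stand-ins, not Aegaeon's API.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    model: str        # which model this request targets
    remaining: int    # tokens still to generate
    output: list      # tokens produced so far

def decode_one_token(model: str) -> str:
    """Stand-in for running a single decode step of `model` on the GPU."""
    return f"<{model}-token>"

def serve(requests: list[Request]) -> None:
    """Round-robin one token at a time so no model monopolizes the device."""
    queue = deque(requests)
    while queue:
        req = queue.popleft()
        req.output.append(decode_one_token(req.model))
        req.remaining -= 1
        if req.remaining > 0:
            queue.append(req)  # re-queue so other models get a turn in between

serve([Request("qwen-72b", 3, []), Request("small-lm", 2, []),
       Request("qwen-72b", 1, [])])
```

Because every request yields the device after a single decode step, a burst on one model cannot stall the others for more than a token's worth of work, which is the intuition behind scaling at token granularity.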
Memory reuse and KV cache coordination
Aegaeon pushes overhead down by reusing components, managing memory explicitly, and synchronizing the key value cache across models at a fine level. The researchers say this reduces auto scaling overhead by 97 percent compared with naive approaches. With smarter memory management and cache handling, the same GPU can keep multiple models ready to run without constant reloading. That enables one accelerator to support up to seven models concurrently, compared with a typical limit of two or three in earlier multi model systems.
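Keeping several models warm on one device can be pictured with a simple residency cache. The sketch below is an illustration built on assumptions, not Aegaeon's memory manager: the ResidentModels class, the least-recently-used eviction rule, and the placeholder state dictionary are invented here, while the default capacity of seven merely echoes the per-GPU model count reported above.

```python
# Illustrative residency cache for keeping several models warm on one GPU.
# The class, eviction rule, and placeholder state are assumptions for clarity.
from collections import OrderedDict

class ResidentModels:
    def __init__(self, capacity: int = 7):
        self.capacity = capacity
        self.models = OrderedDict()  # model name -> cached state (weights, KV blocks)

    def acquire(self, name: str) -> dict:
        """Return a model's cached state, loading and evicting as needed."""
        if name in self.models:
            self.models.move_to_end(name)  # recently used, no reload required
            return self.models[name]
        if len(self.models) >= self.capacity:
            evicted, _ = self.models.popitem(last=False)  # drop least recently used
            print(f"evicting {evicted} to free GPU memory")
        state = {"weights": f"{name}-weights", "kv_blocks": []}  # placeholder load
        self.models[name] = state
        return state

pool = ResidentModels(capacity=3)  # small capacity here just to show eviction
for name in ["qwen-7b", "model-a", "model-b", "qwen-7b", "model-c"]:
    pool.acquire(name)
```

In a real deployment the expensive part is what happens on a miss; the reported 97 percent reduction in auto scaling overhead comes from making those transitions cheap, not from the bookkeeping sketched here.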
Measured results from production beta
Alibaba Cloud says it beta-tested Aegaeon in its model marketplace for more than three months. The deployment served dozens of models, including models up to 72 billion parameters. Over that period, the team reports that the GPUs required to meet the same service load dropped from 1,192 Nvidia H20 units to 213, an 82 percent resource saving. The system also reduced the latency penalty for switching between models by 97 percent, which helps when workloads alternate between popular and long tail models throughout the day.
The marketplace environment is a strong test bed because it sees real world concurrency, a mix of request sizes, and a wide range of models. In this setting, Aegaeon kept accelerators busier, cut idle time on rarely used services, and preserved responsiveness on high demand models by scheduling at finer granularity. The combination of per token scheduling and careful memory reuse appears to be the key to earning more useful work per GPU hour.
The research team from Peking University and Alibaba Cloud argued that industry practice had been leaving money on the table by dedicating GPUs to models with sporadic demand. They presented their conclusion directly in the paper.
"Aegaeon is the first work to reveal the excessive costs associated with serving concurrent LLM workloads on the market."
That contention aligns with patterns seen across model marketplaces. Providers have long recognized that the long tail ties up capacity and drives up cost for everyone. Aegaeon’s approach demonstrates that a scheduling and memory design tuned for multi model service can tackle that waste.
What it means for cloud customers and developers
For customers, fewer GPUs per unit of throughput can translate into lower prices, better quotas, and more predictable performance. If a cloud can serve the same traffic with roughly one fifth the accelerators, it can unlock capacity for peak events and shorten waitlists for new deployments. Developers who stitch together several models in one application could also see fewer bottlenecks from cold starts and replica churn when traffic shifts across endpoints.
There are practical concerns that matter to buyers. Multi model sharing must preserve isolation so one tenant’s workload cannot degrade the experience or leak data to another. The scheduler must honor priority and fairness rules so popular models keep their snappy latency while slower models still make progress. The paper indicates that Aegaeon’s per token choices are designed with service quality targets in mind. Even so, clouds will need robust observability and throttling to keep tail latency in check as they turn up pooling across larger fleets.
Nvidia, China and the AI chip backdrop
Aegaeon arrives amid tight global supply for high end accelerators and ongoing export controls on advanced AI chips to China. Nvidia built the H20 for the Chinese market to comply with those rules, and many Chinese providers rely on that device for inference. At the same time, Chinese designers such as Huawei and Cambricon are investing in domestic alternatives. Software efficiency that lets each GPU serve more work is valuable in any market, but it is especially useful where access to the most advanced chips is constrained.
Nvidia’s leadership has communicated that its near term financial guidance assumes minimal or zero revenue from China’s advanced chip segment because of policy limits. That backdrop makes Alibaba Cloud’s results more consequential. If large providers can stretch their installed base with software, they can continue scaling services even when the supply of top tier accelerators is limited.
How Aegaeon compares with other approaches
Until recently, most multi model serving systems could safely multiplex only two or three models on one GPU before overheads outweighed the benefits. Each additional model added more context switching, more cache misses, and more risk of jitter that can break latency targets. Aegaeon’s design focuses on shrinking those switching costs and keeping core data structures resident across models, which is why one accelerator can now sustain up to seven models under the reported conditions.
Dedicated instances remain appropriate for the very highest throughput endpoints or for workloads with strict isolation requirements. At the other extreme, serverless inference that spins up full model replicas on demand can be economical for low traffic models, but it suffers from longer start times. Aegaeon takes a middle path. It pools capacity and keeps models ready, then schedules work in small slices so overall utilization rises without a steep latency trade off.
Open technical questions and next steps
The beta ran in Alibaba Cloud’s model marketplace, which is a substantial environment but not the entire service footprint. Rolling Aegaeon into more regions and into varied product lines, such as enterprise dedicated clusters or on premises stacks, would test the design under different constraints. Another question is how the scheduler performs as models grow in context length and as multimodal models mix language with vision and audio. Those scenarios will stress memory and cache policies in new ways.
There is also a research path in sharing across heterogeneous accelerators and across nodes. If a scheduler can pool not just one GPU but a set of GPUs across a fabric, it might unlock additional gains, provided that interconnect overhead stays low and memory placement remains predictable. The published evidence so far points to large wins on single device pooling, which is already a major practical advance for today’s inference workloads.
Quick Facts
- Aegaeon reduced GPUs for Alibaba Cloud’s marketplace service from 1,192 Nvidia H20 units to 213, an 82 percent cut.
- The system served dozens of models, including models up to 72 billion parameters, during a beta that ran for more than three months.
- A single GPU supported up to seven models in the reported tests, compared with two or three for prior systems.
- Per token scheduling, memory reuse, and key value cache synchronization cut auto scaling overhead by 97 percent.
- Experiments showed 2 to 2.5 times higher arrival rates sustained, or 1.5 to 9 times more goodput, versus alternatives.
- The approach targets the long tail problem in model marketplaces, where a minority of requests tie up a large share of GPUs.
- The results arrive as export controls limit access to advanced AI chips in China and as domestic GPU efforts accelerate.