Red Hat launches llm-d community & project

Red Hat has announced the launch of llm-d, a new open source project designed to address generative AI's most pressing future need: inference at scale.

Powered by a native Kubernetes architecture, llm-d features vLLM-based distributed inference and intelligent AI-aware network routing to enable large language model (LLM) inference clouds to meet the most demanding production service-level objectives (SLOs). The project aims to make production generative AI as omnipresent as Linux.

While training remains vital, Red Hat says that the “true impact” of generative AI hinges on more efficient and scalable inference – the engine that transforms AI models into actionable insights and user experiences. 

According to analyst house Gartner, “By 2028, as the market matures, more than 80% of datacentre workload accelerators will be specifically deployed for inference as opposed to training use.” 

This underscores that the future of gen AI lies in the ability to execute. The escalating resource demands of increasingly sophisticated and larger reasoning models limit the viability of centralised inference and threaten to bottleneck AI innovation with prohibitive costs and crippling latency.

The need for scalable gen AI inference

Red Hat and its industry partners are directly confronting this challenge with llm-d, a project that amplifies the power of vLLM to transcend single-server limitations and unlock production at scale for AI inference.

“Using the proven orchestration prowess of Kubernetes, llm-d integrates advanced inference capabilities into existing enterprise IT infrastructures. This unified platform empowers IT teams to meet the diverse serving demands of business-critical workloads, all while deploying innovative techniques to maximise efficiency and dramatically minimise the total cost of ownership (TCO) associated with high-performance AI accelerators,” noted the company, in a press statement.

Let’s make note here of vLLM, which has become the de facto standard open source inference server, providing day 0 support for frontier models and a broad range of accelerators, now including Google Cloud Tensor Processing Units (TPUs). Building on vLLM, llm-d adds prefill and decode disaggregation, which separates the input context (prefill) and token generation (decode) phases of inference into discrete operations that can then be distributed across multiple servers.
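For context, the single-server vLLM serving path that llm-d distributes looks roughly like the minimal sketch below; the model name, prompt and sampling settings are illustrative placeholders rather than anything llm-d prescribes.

```python
# Minimal single-server vLLM usage; llm-d's contribution is spreading this
# serving path (prefill, decode, KV cache) across many such servers.
from vllm import LLM, SamplingParams

# Model name and sampling settings are illustrative placeholders.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(
    ["Explain prefill/decode disaggregation in one sentence."], params
)
print(outputs[0].outputs[0].text)
```

In a disaggregated deployment, the compute-heavy prefill step and the latency-sensitive decode step shown above no longer need to run on the same box, which is what lets each be scaled and scheduled independently.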

KV (key-value) Cache Offloading, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.
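To make the offloading idea concrete, here is a deliberately simplified Python sketch: completed KV blocks are copied from scarce GPU memory to larger host memory and reloaded on reuse. The class and method names are invented for illustration and are not LMCache's or llm-d's actual interfaces.

```python
# Conceptual sketch of KV cache offloading (not LMCache's real API):
# hot KV tensors live on the GPU, cold copies live in cheaper CPU memory.
import torch


class OffloadingKVCache:
    def __init__(self):
        self.gpu_cache = {}  # prefix hash -> (K, V) tensors on the accelerator
        self.cpu_cache = {}  # prefix hash -> (K, V) tensors in host RAM

    def store(self, prefix_hash, k, v):
        # Keep the hot copy on the accelerator and a cold copy in host memory.
        self.gpu_cache[prefix_hash] = (k, v)
        self.cpu_cache[prefix_hash] = (
            k.to("cpu", non_blocking=True),
            v.to("cpu", non_blocking=True),
        )

    def evict_from_gpu(self, prefix_hash):
        # Free accelerator memory; the offloaded copy survives for later reuse.
        self.gpu_cache.pop(prefix_hash, None)

    def fetch(self, prefix_hash, device="cuda"):
        # Serve from the accelerator if resident, otherwise reload the
        # offloaded copy, which is far cheaper than recomputing the prefill.
        if prefix_hash in self.gpu_cache:
            return self.gpu_cache[prefix_hash]
        if prefix_hash in self.cpu_cache:
            k, v = self.cpu_cache[prefix_hash]
            kv = (k.to(device), v.to(device))
            self.gpu_cache[prefix_hash] = kv
            return kv
        return None  # cache miss: the prefill must be recomputed


# Example (CPU-only for portability): store, evict, then re-fetch.
k, v = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
cache = OffloadingKVCache()
cache.store("prefix-123", k, v)
cache.evict_from_gpu("prefix-123")
print(cache.fetch("prefix-123", device="cpu")[0].shape)
```

The design trade-off is the one the project describes: a reload from CPU or network storage costs some transfer time, but avoids both recomputing the prefill and holding every past conversation's cache in expensive accelerator memory.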

Not constrained by infrastructure

According to Red Hat, “The future of AI must be defined by limitless opportunity, not constrained by infrastructure silos. Red Hat sees a horizon where organisations can deploy any model, on any accelerator, across any cloud, delivering an exceptional, more consistent user experience without exorbitant costs. To unlock the true potential of gen AI investments, enterprises require a universal inference platform – a standard for more seamless, high-performance AI innovation, both today and in the years to come.”

This technology includes AI-aware network routing for scheduling incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations.
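A toy example of what such cache-aware routing can look like is sketched below: each replica is scored by how much of the request's token prefix it already holds in its KV cache, with queue depth as a tie-breaker. The endpoint fields, block hashing and scoring rule are assumptions made for illustration, not llm-d's actual scheduler interface.

```python
# Illustrative cache-aware request routing: prefer the replica whose KV
# cache already contains the request's prefix blocks ("hot cache").
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    queue_depth: int                                     # requests waiting
    cached_prefixes: set = field(default_factory=set)    # hashes of cached prefix blocks


def prefix_block_hashes(tokens, block_size=16):
    """Hash each aligned prefix of the prompt, mimicking block-level KV caching."""
    return {hash(tuple(tokens[:i])) for i in range(block_size, len(tokens) + 1, block_size)}


def pick_endpoint(tokens, endpoints):
    """Pick the replica with the largest warm-prefix overlap, then the shortest queue."""
    request_blocks = prefix_block_hashes(tokens)

    def score(ep):
        overlap = len(request_blocks & ep.cached_prefixes)
        return (overlap, -ep.queue_depth)  # more overlap wins; shorter queue breaks ties

    return max(endpoints, key=score)


# Replica "b" already holds this prompt's prefix, so it wins even though
# its queue is slightly longer than replica "a"'s.
prompt = list(range(64))
a = Endpoint("a", queue_depth=1)
b = Endpoint("b", queue_depth=3, cached_prefixes=prefix_block_hashes(prompt))
print(pick_endpoint(prompt, [a, b]).name)  # -> "b"
```

Routing to a warm replica means the expensive prefill for the shared prefix can be skipped entirely, which is where the latency and cost savings described above come from.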

“The launch of the llm-d community, backed by a vanguard of AI leaders, marks a pivotal moment in addressing the need for scalable gen AI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realise a vision of limitless AI potential,” said Brian Stevens, senior vice president and AI CTO, Red Hat.

This new open source project has already garnered the support of a coalition of leading gen AI model providers, AI accelerator pioneers and AI cloud platforms. CoreWeave, Google Cloud, IBM Research and GPU company Nvidia are founding contributors, with AMD, Cisco, Intel, Lambda and Mistral AI as partners.

The llm-d community is further joined by founding supporters: the Sky Computing Lab at the University of California, Berkeley, originators of vLLM; and the LMCache Lab at the University of Chicago, originators of LMCache.