Snowflake goes massive on Meta LLM to make an open source inference difference

Snowflake says it will now host the Llama 3.1 collection of multilingual open source large language models (LLMs) in Snowflake Cortex AI for developers to build AI applications.
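For developers, access comes through the functions Cortex AI already exposes from SQL and Python. The sketch below is illustrative rather than Snowflake's official example: it assumes the Snowpark Python connector, placeholder credentials and the 'llama3.1-405b' model identifier, so the exact names should be checked against the Cortex AI documentation for a given account and region.

```python
# Hedged sketch: calling the hosted Llama 3.1 405B model through the
# SNOWFLAKE.CORTEX.COMPLETE SQL function from Snowpark Python.
# Connection details and the model identifier are assumptions to verify.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
}
session = Session.builder.configs(connection_parameters).create()

# COMPLETE(model, prompt) returns the model's text response.
row = session.sql(
    "SELECT SNOWFLAKE.CORTEX.COMPLETE("
    "'llama3.1-405b', 'Summarise the key risks in our Q2 sales pipeline.') AS answer"
).collect()[0]

print(row["ANSWER"])
```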

The Data Cloud company, now restyled as the AI Data Cloud company, has detailed developments that see it host an offering including Meta’s largest open source LLM, Llama 3.1 405B.

Going further, Snowflake points to its work developing and open sourcing an inference system stack intended to enable real-time, high-throughput inference.

It’s all about the drive to democratise natural language processing and generative AI applications. Snowflake’s AI research team says it has optimised Llama 3.1 405B for both inference and fine-tuning, supporting a massive 128K context window from day one, while enabling real-time inference with up to 3x lower end-to-end latency and 1.4x higher throughput.

Cortex AI

Snowflake’s optimised stack also allows the massive model to be fine-tuned on a single GPU node within Cortex AI.
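As a rough illustration of what that looks like in practice, Cortex AI also exposes a FINETUNE function for managed fine-tuning jobs. The sketch below is not Snowflake's published example: the base model identifier, database, table and column names are placeholders, and whether the 405B model can be selected as a base model may vary by account and region.

```python
# Hedged sketch: launching a managed fine-tuning job with the
# SNOWFLAKE.CORTEX.FINETUNE function from Snowpark Python.
# `session` is a Snowpark Session such as the one created in the earlier sketch;
# all object names and the base model are illustrative assumptions.
job_id = session.sql(
    "SELECT SNOWFLAKE.CORTEX.FINETUNE("
    "'CREATE', "                                 # start a new fine-tuning job
    "'my_db.my_schema.support_llama', "          # name for the tuned model
    "'llama3.1-405b', "                          # base model (availability varies)
    "'SELECT prompt, completion FROM my_db.my_schema.train_data', "
    "'SELECT prompt, completion FROM my_db.my_schema.val_data')"
).collect()[0][0]

# Poll the job's status with the DESCRIBE operation.
status = session.sql(
    f"SELECT SNOWFLAKE.CORTEX.FINETUNE('DESCRIBE', '{job_id}')"
).collect()[0][0]
print(job_id, status)
```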

Snowflake’s AI research team says it is dedicated to making regular contributions to the AI community and to being transparent about how it builds LLM technologies.

In tandem with the launch of Llama 3.1 405B, Snowflake’s AI Research Team is now open sourcing its Massive LLM Inference and Fine-Tuning System Optimization Stack in collaboration with DeepSpeed, Hugging Face, vLLM and the broader AI community. 

This, says Snowflake, establishes a new state-of-the-art for open source inference and fine-tuning systems for multi-hundred billion parameter models.

“We’re not just bringing Meta’s models directly to our customers through Snowflake Cortex AI. We’re arming enterprises and the AI community with new research and open source code that supports 128K context windows, multi-node inference, pipeline parallelism and 8-bit floating point quantization to advance AI for the broader ecosystem,” said Vivek Raghunathan, VP of AI engineering at Snowflake.

Massive model scale and memory requirements pose challenges for users aiming to achieve low-latency inference for real-time use cases, high throughput for cost-effectiveness and long context support for enterprise-grade generative AI use cases. The memory needed to store model and activation states also makes fine-tuning extremely challenging, with the large GPU clusters required to fit the model states for training often inaccessible to data scientists.
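A back-of-envelope calculation makes the scale concrete. At 16-bit precision, the 405 billion parameters alone occupy roughly 810 GB, an order of magnitude more than the memory of a single high-end GPU, before any activation or KV-cache memory is counted; 8-bit floating point roughly halves that. The figures below are approximations for illustration only.

```python
# Rough arithmetic on why a 405B-parameter model strains single-GPU memory.
# All numbers are approximations for illustration only.
params = 405e9        # Llama 3.1 405B parameter count
gpu_mem_gb = 80       # typical HBM capacity of a high-end data-centre GPU

for label, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1)]:
    weights_gb = params * bytes_per_param / 1e9
    gpus_for_weights = weights_gb / gpu_mem_gb
    print(f"{label}: ~{weights_gb:.0f} GB of weights, "
          f"~{gpus_for_weights:.1f} GPUs just to hold them")
# fp16/bf16: ~810 GB of weights, ~10.1 GPUs just to hold them
# fp8: ~405 GB of weights, ~5.1 GPUs just to hold them
```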

Massive LLM Inference

Snowflake is confident that its Massive LLM Inference and Fine-Tuning System Optimization Stack addresses these challenges. By using advanced parallelism techniques and memory optimizations, Snowflake enables fast and efficient AI processing without needing complex and expensive infrastructure. For Llama 3.1 405B, Snowflake’s system stack delivers real-time, high-throughput performance on just a single GPU node and supports a massive 128K context window across multi-node setups.

This flexibility is said to extend to both next-generation and legacy hardware, making it accessible to a broader range of businesses. 
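The collaborators named alongside the release give a sense of how such a deployment is assembled. The sketch below uses vLLM's public API; the parallelism sizes, FP8 setting and context length are illustrative assumptions rather than Snowflake's released configuration, and downloading the 405B weights requires multi-GPU hardware plus acceptance of Meta's licence.

```python
# Hedged sketch: serving Llama 3.1 405B with vLLM using the kinds of techniques
# described above. Parallelism sizes, quantization and context length are
# illustrative assumptions, not Snowflake's published configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # Hugging Face model id
    tensor_parallel_size=8,       # split each layer across the GPUs in a node
    pipeline_parallel_size=2,     # pipeline stages across nodes for more memory
    quantization="fp8",           # 8-bit floating point weights to cut memory use
    max_model_len=131072,         # the 128K-token context window
)

outputs = llm.generate(
    ["Explain pipeline parallelism in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```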

Moreover, data scientists can fine-tune Llama 3.1 405B using mixed precision techniques on fewer GPUs, eliminating the need for large GPU clusters. As a result, organizations can adapt and deploy powerful enterprise-grade generative AI applications easily, efficiently, and safely.
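One common way to do this, shown here as a hedged sketch rather than Snowflake's documented recipe, is parameter-efficient fine-tuning with low-rank adapters in bf16 mixed precision using the Hugging Face libraries named among the collaborators. The smaller 8B checkpoint stands in for 405B so the example fits on modest hardware, and the hyperparameters are assumptions.

```python
# Hedged sketch: mixed-precision, parameter-efficient fine-tuning with LoRA.
# Model id, LoRA rank and other settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # stand-in for the 405B model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # mixed precision keeps memory use down
    device_map="auto",           # spread layers across available GPUs
)

# LoRA trains only small adapter matrices, so optimiser state stays tiny and
# the job fits on far fewer GPUs than full-parameter fine-tuning would need.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```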

“Safety and trust are a business imperative when it comes to harnessing generative AI, and Snowflake provides us with the assurances we need to innovate and leverage industry-leading large language models at scale,” said Ryan Klapper, an AI leader at data infrastructure company Hakkoda. “The combination of Meta’s Llama models within Snowflake Cortex AI unlocks even more opportunities for us to service internal RAG-based applications. These applications empower our stakeholders to interact seamlessly with comprehensive internal knowledge bases, ensuring they have access to accurate and relevant information whenever needed.”

Snowflake’s AI Research Team has also developed optimized infrastructure for fine-tuning, including model distillation, safety guardrails, retrieval augmented generation (RAG) and synthetic data generation, so that enterprises can easily get started with these use cases within Cortex AI.
