
Report: NVIDIA to unveil 'new inference chip' at next month's GTC conference, incorporating Groq LPU design

wallstreetcn ·  Feb 28 16:07

NVIDIA is set to launch a new inference chip integrating Groq's LPU technology, featuring SRAM integration and 3D stacking and optimized for the latency and memory-bandwidth bottlenecks of large-model inference; it may be based on the next-generation Feynman architecture. This month also saw NVIDIA's first large-scale deployment of a pure-CPU solution, running specific inference tasks for Meta as differentiated support. OpenAI has committed $30 billion in purchases and investment.

NVIDIA (NVDA.US) plans to unveil a new inference chip incorporating Groq's 'Language Processing Unit' (LPU) technology at next month's GTC developer conference, signaling an accelerated shift toward inference computing to meet customers' urgent demand for high-performance, cost-effective computing.

According to The Wall Street Journal, this entirely new system, which NVIDIA CEO Jensen Huang described as 'unprecedented in the world,' is specifically designed to accelerate query responses for AI models. Its launch is expected to reshape the current AI computational power market landscape and directly impact cloud service providers and enterprise-level investors seeking more cost-efficient alternatives.

As a significant indicator of initial market recognition for this technology, OpenAI, the developer behind ChatGPT, has agreed to become one of the largest customers for this new processor and announced plans to purchase large-scale 'dedicated inference capacity' from NVIDIA. This move not only solidifies NVIDIA’s core customer base but also sends a clear signal to the market: the underlying infrastructure supporting autonomous AI agents is shifting from extensive pre-training to efficient inference.

Amid fierce competition from Google, Amazon, and numerous startups, NVIDIA is breaking away from its sole reliance on traditional Graphics Processing Units (GPUs). By introducing new architectural designs and exploring deployment models based on pure Central Processing Units (CPUs), the company aims to consolidate its market dominance as the AI industry enters its next evolutionary phase.

Integration of LPU Design: Tackling Bottlenecks in Large Model Inference

As the AI sector transitions from model training to practical deployment, inference computing has become the central focus. AI inference involves two stages, prefill and decode, and the decode stage is particularly slow for large models. To break through this bottleneck, NVIDIA has chosen to integrate external technology rather than rely solely on its own designs.
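The prefill/decode split can be illustrated with a toy sketch. The lookup table below stands in for a real model, and the function names are illustrative, not any framework's API: prefill absorbs the whole prompt in one pass, while decode must generate tokens one at a time, each step depending on the last.

```python
# Toy illustration of the two AI inference stages. "Model" here is a
# trivial next-token lookup table, standing in for a large transformer.

NEXT = {"the": "cat", "cat": "sat", "sat": "down"}  # hypothetical toy model

def prefill(prompt_tokens):
    """Prefill: process the whole prompt in one pass. For a real model this
    is a single parallel forward over all prompt positions (compute-bound)."""
    return list(prompt_tokens)  # the "context" (stands in for a KV cache)

def decode(context, max_new_tokens):
    """Decode: emit one token per step, each step depending on the previous
    token. For a large model every step streams all weights from memory,
    which is why this serial stage is slow and memory-bandwidth-bound."""
    out = []
    for _ in range(max_new_tokens):
        nxt = NEXT.get((out or context)[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out

ctx = prefill(["the"])
print(decode(ctx, 3))  # -> ['cat', 'sat', 'down']
```

The key structural point is the loop in `decode`: no matter how fast the hardware, each iteration waits on the one before it, so per-step memory traffic sets the throughput ceiling.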

According to The Wall Street Journal, NVIDIA spent $20 billion at the end of last year to acquire critical technology licenses from the startup Groq and absorbed an executive team, including founder Jonathan Ross, through a large-scale 'core hiring' deal. Groq's 'Language Processing Unit' (LPU) adopts an architecture fundamentally different from traditional GPUs, demonstrating exceptionally high efficiency in handling inference tasks.

Industry analysis suggests that the upcoming release may involve the potentially disruptive next-generation Feynman architecture. According to a previous Wall Street News article, the Feynman architecture may adopt broader SRAM integration, even deeply integrating LPUs through 3D stacking. This would specifically target the two major inference bottlenecks, latency and memory bandwidth, significantly reducing the energy consumption and cost of running AI agents.
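To see why wider on-chip SRAM targets the decode bottleneck, consider the rough bandwidth arithmetic: at batch size one, each generated token must stream the full set of weights from memory, so throughput is bounded by bandwidth divided by model size. The figures below (an HBM-class versus an SRAM-class aggregate bandwidth) are illustrative assumptions, not disclosed specifications of any NVIDIA or Groq product.

```python
# Illustrative arithmetic only: all bandwidth and model-size figures
# below are assumptions, not NVIDIA or Groq specifications.

def bound_tokens_per_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode throughput: every generated
    token requires streaming the full set of weights once."""
    return bandwidth_bytes_per_s / model_bytes

model = 70e9 * 2   # hypothetical 70B-parameter model in FP16 (~140 GB)
hbm   = 3.35e12    # ~3.35 TB/s, HBM3-class accelerator (illustrative)
sram  = 80e12      # ~80 TB/s, aggregate on-chip SRAM-class (illustrative)

for name, bw in [("HBM-class", hbm), ("SRAM-class", sram)]:
    print(f"{name}: up to {bound_tokens_per_s(model, bw):.0f} tokens/s per stream")
```

Under these assumptions the SRAM-class bound is over 20x higher, which is the basic argument for keeping more of the model in on-chip memory via wider SRAM and 3D stacking.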

Expanding Pure CPU Deployment: Offering Diversified Computing Options

While introducing the LPU architecture, NVIDIA is also flexibly adjusting its traditional processor usage patterns. NVIDIA’s conventional approach involved bundling Vera CPUs with its powerful Rubin GPUs in data center servers. However, this configuration has proven too costly and insufficiently energy-efficient when handling certain specific AI agent workloads.

Some large enterprise clients have found that a pure CPU environment is more efficient when running specific AI tasks. In response to this trend, NVIDIA announced earlier this month the expansion of its collaboration with Meta Platforms, marking its first large-scale deployment of a pure CPU setup to support Meta's ad-targeting AI agents. This partnership is seen by the market as an early indication of a strategic shift for NVIDIA, demonstrating that the company is moving beyond a singular focus on GPU sales and attempting to secure different segments of the AI market through a diversified hardware portfolio.

Shifting Market Demand: Competitive Pressures Continue to Intensify

This evolution in underlying hardware design stems directly from the explosive growth in demand for AI agent applications within the tech industry. Many companies building and operating AI agents have discovered that traditional GPUs are too costly and are not always the optimal choice for running models in practice.

OpenAI’s recent moves highlight this trend. In addition to committing to purchase NVIDIA's new systems to enhance its rapidly growing Codex tool, OpenAI also entered into a multi-billion-dollar computing partnership last month with the startup Cerebras. According to Cerebras CEO Andrew Feldman, their inference-focused chips outperform NVIDIA’s GPUs in terms of speed. Furthermore, OpenAI has signed a major agreement to utilize Amazon's Trainium chips.

It is not only startups; major cloud service providers are also accelerating their proprietary chip efforts. Anthropic's Claude Code, widely regarded as the leader in the auto-coding market, currently relies primarily on chips designed by Amazon Web Services (AWS) and Alphabet's Google Cloud rather than NVIDIA's products. Facing competition from all sides, Jensen Huang emphasized in an interview with wccftech that NVIDIA is transitioning from a mere chip supplier to a builder of a comprehensive AI ecosystem spanning semiconductors, data centers, cloud services, and applications. For investors, next month's GTC conference will be a critical juncture for assessing whether NVIDIA can sustain its roughly 90% market-share dominance in the inference era.

Editor/Doris


