DeepSeek R1 AI Test: NVIDIA Blackwell's throughput per megawatt is 50 times that of Hopper
IT Home, February 18 — Nvidia announced in a blog post on February 16 that its Blackwell Ultra AI architecture (GB300 NVL72) has achieved significant breakthroughs in energy efficiency and cost. In tests with the DeepSeek-R1 model, it delivered 50 times the throughput per megawatt of the previous-generation Hopper GPUs and cut the cost per million tokens to one thirty-fifth.
In addition, Nvidia previewed the next-generation Rubin platform, which is expected to deliver a tenfold increase in throughput per megawatt over Blackwell, further advancing AI infrastructure development.
IT Home note: Throughput per megawatt (tokens per MW) is a key metric of AI chip energy efficiency, indicating how many tokens (units of text) can be processed per megawatt of power consumed. The higher the value, the more efficient the chip and the lower the operating cost.
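To make the metric concrete, here is a minimal back-of-the-envelope sketch; the token rate and power figures below are assumptions for illustration, not Nvidia's published numbers.

```python
# Illustrative numbers only (not from Nvidia's blog): convert a rack's measured
# token rate and power draw into throughput per megawatt.
tokens_per_second = 1_000_000   # assumed aggregate decode rate of a rack
power_megawatts = 0.12          # assumed rack power draw in MW

throughput_per_mw = tokens_per_second / power_megawatts
print(f"{throughput_per_mw:,.0f} tokens/s per MW")

# A 50x generational gain means the same power budget serves 50x the tokens:
print(f"50x gain -> {50 * throughput_per_mw:,.0f} tokens/s per MW")
```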
Nvidia stated in the blog that the key to the performance leap is the upgraded technical architecture. Blackwell Ultra connects 72 GPUs into a unified computing unit via NVLink, with interconnect bandwidth of up to 130 TB/s, far surpassing the eight-GPU designs of the Hopper era. The new NVFP4 precision format, combined with hardware-software co-design optimizations, further consolidates its lead in throughput.
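Low-precision formats like NVFP4 save memory and bandwidth by storing 4-bit values alongside shared per-block scale factors. The NumPy sketch below illustrates the general idea of block-scaled FP4 (E2M1) quantization; the block size, scale encoding, and rounding here are simplified assumptions, not the actual NVFP4 specification.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_dequantize_fp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Round-trip a 1-D array (length a multiple of `block`) through
    block-scaled FP4 quantization."""
    blocks = x.reshape(-1, block)
    # One scale per micro-block so the largest magnitude maps onto the grid max (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales[scales == 0] = 1.0  # avoid division by zero on all-zero blocks
    scaled = blocks / scales
    # Snap each scaled value to the nearest representable FP4 magnitude, keeping sign.
    nearest = E2M1_GRID[np.argmin(np.abs(np.abs(scaled)[..., None] - E2M1_GRID), axis=-1)]
    return (np.sign(scaled) * nearest * scales).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
x_hat = quantize_dequantize_fp4(x)
print("mean abs error:", np.abs(x - x_hat).mean())
```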
In terms of AI inference costs, the new platform reduces the cost per million tokens to one thirty-fifth of the Hopper architecture's; even against the previous Blackwell generation (GB200), GB300 cuts per-token costs in long-context tasks by a factor of 1.5 and doubles attention-mechanism processing speed, making it well suited to high-load scenarios such as codebase maintenance.
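Plugging the article's two ratios into a hypothetical baseline shows how they play out; the dollar figures below are assumptions for illustration only, not published prices.

```python
# Hypothetical baseline to make the ratios concrete (not a published price):
hopper_cost_per_m_tokens = 3.50                        # assumed $/1M tokens on Hopper

blackwell_ultra_cost = hopper_cost_per_m_tokens / 35   # "one thirty-fifth" of Hopper
print(f"Blackwell Ultra: ${blackwell_ultra_cost:.3f} per 1M tokens")

# GB300 vs GB200 on long-context work: a 1.5x reduction in cost per token.
gb200_cost = 0.30                                      # assumed $/1M tokens on GB200
gb300_cost = gb200_cost / 1.5
print(f"GB300 long-context: ${gb300_cost:.3f} per 1M tokens")
```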
OpenRouter’s “Inference Status Report” indicates that AI query volume related to software programming has surged over the past year, with its share rising from 11% to about 50%. These applications typically require AI agents to maintain real-time responses across multi-step workflows and to reason over long contexts spanning entire code repositories.
To address this challenge, Nvidia has further improved inference throughput for mixture-of-experts (MoE) models through continuous optimization of TensorRT-LLM, Dynamo, and related software. For example, improvements to the TensorRT-LLM library increased GB200’s performance on low-latency workloads fivefold in just four months.
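For readers unfamiliar with TensorRT-LLM, the sketch below shows its high-level Python LLM API, assuming a recent release that ships it; the model ID is a small illustrative placeholder, and a real deployment would point at a model such as DeepSeek-R1 instead.

```python
# Minimal sketch of TensorRT-LLM's high-level LLM API (recent releases).
# The model ID is an illustrative placeholder, not taken from the article.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds/loads an optimized engine
params = SamplingParams(max_tokens=128, temperature=0.6)

outputs = llm.generate(["Summarize what a mixture-of-experts model is."], params)
for out in outputs:
    print(out.outputs[0].text)
```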