Sugon releases a “standard edition” supernode: what is the future form of AI inference compute?
Source: Titanium Media
OpenClaw’s sudden rise to virality is not only an inevitable breakout for the AI Agent track; it is also a stress test for the AI inference compute market.
At the 2026 Zhongguancun Forum, Sugon (formerly Dawning) released the scaleX40, billed as the world’s first cable-free, enclosure-style supernode. Until now, supernodes have been massive machines, often at the scale of hundreds or even thousands of cards, including Sugon’s scaleX640, NVIDIA’s NVL72, and Huawei’s Ascend 384.
These top-tier supernodes are built specifically for training ultra-large-scale models. They deliver strong performance, but the deployment threshold is extremely high: customized server racks, complex cabling, and professional operations teams, with investments running from tens of millions to over a hundred million yuan. In practice they serve only a small number of leading players, such as internet giants and large state-owned enterprises, both central and local.
On the “opposite side” of supernodes sits the traditional, mainstream 8-GPU server used in the inference market. These machines are flexible to deploy and cost-controllable, but their performance falls short of rapidly escalating AI inference demand.
“From today’s perspective, 8-GPU machines are already far behind. Even if you extend the interconnect over Ethernet to 16 GPUs, it still can’t keep pace with the development of model inference services,” said Li Bin, Senior Vice President of Sugon. “The compute infrastructure supporting AI development is gradually shifting from the old ‘compute factory’ to a ‘Token factory.’ The primary role of compute systems has changed: from mainly supporting model training in the past to mainly serving inference today.”
In the training era, the core metric for evaluating a compute system was how much raw compute it delivered; in the inference era, the key metric becomes how to produce Tokens at the lowest cost.
(Image: AI-generated)
AI demand is diverging; inference compute is far from meeting it
Judging from current market demand, the structure of AI compute is stratifying. Industry forecasts expect global investment in AI infrastructure to keep growing quickly, but new demand is gradually shifting away from ultra-large-scale clusters and toward enterprise-level and industry application scenarios.
Under this trend, compute allocation no longer simply chases the upper limit of scale; it increasingly balances performance, cost, and flexibility. The industry broadly agrees that a few dozen cards are enough for most industry-scenario model training, inference, and development testing: the “greatest common denominator” range that balances efficiency and investment.
Demand at the AI application layer, however, is evolving very quickly. The explosive popularity of AI Agents, exemplified by OpenClaw, is changing traditional industry workflows while also forcing the compute supply side to restructure its entire system.
First is the communication bottleneck. With MoE models now in the mix, communication has become the core chokepoint for raising compute utilization: because expert routing is data-dependent and uncertain, it generates heavy cross-card and cross-machine traffic that overwhelms the architecture of traditional 8-GPU servers (see the first sketch after this list).
Second is the memory bottleneck. As context windows keep expanding, OpenClaw’s need for long-context memory drives growing demand for large memory and KV Cache capacity, another limitation that traditional 8-GPU servers cannot easily overcome (see the second sketch after this list).
Third is the compute utilization bottleneck. Utilization and inference deployment cost are almost inversely related, and traditional clusters commonly run at low utilization. The core challenge is not simply piling on hardware but jointly improving system efficiency and effective compute through coordinated breakthroughs in hardware architecture, systems engineering, and optimization engineering.
Fourth is the ecosystem bottleneck. China’s domestic compute ecosystem is complex, involves many vendors, and has a long industrial chain, so industrial collaboration is not easy. It calls for an open compute architecture that connects the whole chain, from chips to models to applications, to build a compute foundation that is open, easy to use, ready out of the box, and economically accessible.
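To make the first two bottlenecks concrete, here are two minimal Python sketches. All model shapes and sizes in them are hypothetical illustrations, not figures from the article.

The first shows why MoE routing stresses the interconnect: expert choice is data-dependent, and with experts sharded across GPUs, most token dispatches leave their home device.

```python
# Minimal sketch (hypothetical shapes): MoE routing creates cross-device
# traffic because expert choice is data-dependent and experts are sharded.
import numpy as np

rng = np.random.default_rng(0)
num_gpus, experts_per_gpu, top_k = 8, 8, 2
num_experts = num_gpus * experts_per_gpu
tokens_per_gpu = 4096

scores = rng.random((num_gpus, tokens_per_gpu, num_experts))  # router scores
chosen = np.argsort(scores, axis=-1)[..., -top_k:]            # top-k experts per token
dest_gpu = chosen // experts_per_gpu                          # GPU hosting each expert

for g in range(num_gpus):
    cross = (dest_gpu[g] != g).mean()
    print(f"GPU {g}: {cross:.0%} of token dispatches cross devices")  # ~88% here
```

The second is back-of-envelope arithmetic for the memory bottleneck: KV Cache grows linearly with context length and concurrency, so long-context agent workloads quickly exhaust a conventional server’s memory.

```python
# KV Cache size for a hypothetical 64-layer model with grouped-query
# attention (8 KV heads of dim 128) and an FP16 cache: one K and one V
# entry per layer per token.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# 32 concurrent 128k-token contexts already need roughly a terabyte of cache:
print(f"{kv_cache_gb(64, 8, 128, 128_000, 32):.0f} GB")  # -> 1074 GB
```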
Sugon wants to answer the market with a 40-card “standard” supernode. “That 40-card sweet spot is something we arrived at through surveys and research across all kinds of customers,” said Li Liu, Vice President of Sugon. “Given the parameter scale and usage scenarios of mainstream models, 32 to 40 cards can already cover most industry needs while balancing cost and performance.”
The scaleX40 integrates 40 GPUs in a single node, with total compute exceeding 28 PFLOPS (FP8 precision), HBM capacity exceeding 5TB, and memory bandwidth exceeding 80TB/s. System reliability has been raised to 99.99%.
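As a sanity check, dividing the published aggregates by the card count gives rough per-card figures (a simple approximation, not vendor-stated per-card specs). The “exceeding 5TB” capacity is also consistent with the 144GB per-card memory mentioned below, since 40 × 144GB ≈ 5.76TB.

```python
# Simple division of the published scaleX40 aggregates into per-card
# figures (an approximation; actual per-card specs may differ).
cards = 40
total_pflops_fp8 = 28   # total FP8 compute, PFLOPS (stated as "exceeding")
total_hbm_tb = 5        # total HBM capacity, TB (stated as "exceeding")
total_bw_tbs = 80       # total memory bandwidth, TB/s (stated as "exceeding")

print(f"~{total_pflops_fp8 / cards * 1000:.0f}+ TFLOPS FP8 per card")  # ~700+
print(f"~{total_hbm_tb * 1000 / cards:.0f}+ GB HBM per card")          # ~125+
print(f"~{total_bw_tbs / cards:.1f}+ TB/s memory bandwidth per card")  # ~2.0+
```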
At this scale, the scaleX40 can support large-model training and inference without imposing excessive investment pressure. It scales down to 32 cards to cover small- and medium-scale training, inference, and development testing, and it scales up into larger clusters.
Li Bin ran the numbers: “The traditional investment of piling together five 8-GPU machines, plus all the associated costs, is roughly on par with a scaleX40, but the scaleX40 improves training performance by 120%, and inference performance by up to 330%.”
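The comparison holds card count fixed: five 8-GPU machines total the same 40 GPUs as one scaleX40. A minimal sketch of the claimed economics follows (these are vendor figures from the quote above, not independent measurements):

```python
# Vendor-claimed comparison: five 8-GPU machines vs. one scaleX40,
# both totaling 40 cards at roughly equal total investment.
cards = 5 * 8                  # 40 cards either way
training_speedup = 1 + 1.20    # "+120%" claimed training gain
inference_speedup = 1 + 3.30   # "up to +330%" claimed inference gain

# At roughly equal cost, performance per unit cost scales with the speedup:
print(f"training perf per unit cost:  {training_speedup:.1f}x")   # 2.2x
print(f"inference perf per unit cost: {inference_speedup:.1f}x")  # 4.3x
```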
From DeepSeek to OpenClaw: a new compute inflection point
“Tokens require compute to produce, but the dimensions and metrics for evaluating that compute have multiplied,” Li Bin argues. “Ordinary users care about response speed: when they ask a question, does it answer quickly? Operators of compute systems must consider how many users can access it concurrently while basic experience requirements are still met.”
Ao Yulong, head of the AI Framework R&D Department at the Zhiyuan Research Institute, made a similar point: “For the compute supply side going forward, the key metric is how to convert compute into effective Tokens rather than ineffective ones. Whoever brings that cost down is the real winner.”
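An illustrative calculation of this inference-era tradeoff (all numbers below are hypothetical, not from the article): a fixed system token budget is split between per-user speed and concurrency, and the operator’s economics reduce to cost per token.

```python
# Hypothetical numbers: a node's decode throughput is shared among users,
# so per-user speed and concurrency trade off; cost is judged per token.
system_tokens_per_s = 50_000    # assumed node-level decode throughput
min_user_tokens_per_s = 25      # assumed floor for acceptable responsiveness
node_cost_per_hour = 40.0       # assumed operating cost, currency units/hour

max_concurrent_users = system_tokens_per_s // min_user_tokens_per_s
cost_per_million_tokens = node_cost_per_hour / (system_tokens_per_s * 3600 / 1e6)

print(f"max concurrent users: {max_concurrent_users}")         # 2000
print(f"cost per 1M tokens:   {cost_per_million_tokens:.3f}")  # 0.222
```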
The scaleX40’s design targets exactly these new demands. A 144GB large-memory configuration supports long context windows; a multi-level KV Cache mechanism meets the large-memory needs of inference scenarios; and the 40-card high-bandwidth interconnect domain keeps expert all-to-all traffic inside a single node. All of these features aim to maximize Token output per unit of compute while keeping costs under control.
The cable-free, enclosure-style design is another major differentiator of the scaleX40. A core pain point of traditional supernodes is deployment complexity. NVIDIA’s NVL72, for example, uses a copper-cable interconnect scheme that requires large numbers of cables between racks; this imposes harsh requirements on the data center environment, lengthens deployment cycles, and raises failure rates in later operations and maintenance.
The scaleX40’s approach resembles the latest scheme NVIDIA announced at this year’s GTC conference: Scale-up expansion via bus technology, using a cable-free orthogonal interconnect in which compute nodes and switch nodes plug directly together.
This design brings multiple benefits. First, the bus technology delivers more than 10 times the performance of traditional NDR networks and supports memory semantics and unified memory addressing. Second, with a single-layer network, P2P one-way latency falls below one hundred nanoseconds; compared with a two-layer network, latency drops by more than 30% and the failure rate by 30% to 50%.
Third, the scaleX40 uses a standard 19-inch enclosure, only 16U tall, that slots directly into mainstream server racks and is compatible with existing data center environments, with no additional retrofitting required.
“Many past products were either too big for their cabinets, or non-standard, or required very complex data center retrofits,” Li Liu said. “The scaleX40 goes into standard racks and connects to standard data center power and cooling, greatly lowering the threshold for deployment and use.”
Wang Zixiao, head of intelligent computing network technology at the China Telecom Research Institute, added: “Delivering inference in a supernode form factor improves performance by about 2.6 times over a traditional single 8-GPU machine. Supernodes’ out-of-the-box usability is significantly enhanced, and the configuration complexity of the scale-out network drops by an order of magnitude, which matters greatly for large-scale industry adoption.”
Looking deeper, the launch of the scaleX40 also reflects the maturing of China’s domestic compute ecosystem. From chips to system software, from storage to networking, from operator libraries to communication libraries, a complete industrial chain is taking shape. As Li Bin put it: “Across China’s domestic AI compute ecosystem, from chips to system software to top-layer models and applications, we are pursuing vertical cross-layer collaboration, squeezing out efficiency through coupling and coordination across the stack.”
When supernodes can be deployed and used more simply, and thousands of industries can obtain high-end compute at reasonable cost, China’s large-scale AI applications may truly have taken the key step forward. (Author: Zhang Shuai; Editor: Yang Lin)