An Interview with Huawei: Launch of The Industry’s First CCIX-Enabled Host Platform

By CCIX Consortium / Huawei | March 15, 2019

This blog is the first in a series where we ask our CCIX Consortium members about the benefits they see in incorporating the CCIX Standard in their own products.
In early January, CCIX Consortium member and promoter Huawei announced the Kunpeng 920 CPU and TaiShan Server, the industry’s first CCIX-enabled Arm-based Host platform. We took the opportunity to connect with them and learn more about Huawei’s plans and their vision for CCIX deployment.


CCIX Consortium – The CCIX standard is relatively new. What were two or three key factors in the decision to include support for the CCIX interface when designing the Kunpeng 920 CPU?

Huawei – There were three factors in our decision. First and foremost, Cache Coherent Acceleration. Huawei has a broad array of technologies and services, spanning AI, cloud, storage and server workloads, that may involve accelerators. While many workloads and accelerators work well in existing hardware and software environments, some classes of workloads can benefit from cache coherent acceleration. Some applications, graph traversal for example, exhibit small-granularity, random traffic patterns that operate better in a cache-coherent load/store model than in a DMA model; the latter works well for bulk transfers of data sets and results. Also, imagine an application with hundreds or thousands of parallel accelerated threads that need to be synchronized: managing such synchronization is complicated and slower in a DMA model, while the load/store model of CC-SVM is much simpler and faster.
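To make the load/store contrast concrete, here is a minimal sketch (not Huawei code) of the synchronization pattern CC-SVM enables. An ordinary pthread stands in for a hypothetical accelerator thread that shares the host's virtual address space; with a coherent interconnect such as CCIX, the device could update the same flag with a plain store instead of signalling completion through a driver and interrupt path.

```c
/*
 * Minimal sketch of load/store synchronization in a shared virtual
 * address space.  A pthread is a stand-in for an accelerator that
 * participates in coherence; nothing here is a real device API.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1024

struct task {
    int input[N];
    int output[N];
    atomic_int done;              /* 0 = pending, 1 = results visible */
};

/* Stand-in for accelerator work: reads and writes shared memory directly. */
static void *accelerator_stub(void *arg)
{
    struct task *t = arg;
    for (int i = 0; i < N; i++)
        t->output[i] = t->input[i] * 2;
    atomic_store_explicit(&t->done, 1, memory_order_release);
    return NULL;
}

int main(void)
{
    struct task t = { .done = 0 };
    for (int i = 0; i < N; i++)
        t.input[i] = i;

    pthread_t acc;
    pthread_create(&acc, NULL, accelerator_stub, &t);

    /* The host watches a flag in shared memory instead of waiting on a
     * DMA completion delivered through the driver stack. */
    while (atomic_load_explicit(&t.done, memory_order_acquire) == 0)
        ;

    printf("output[3] = %d\n", t.output[3]);
    pthread_join(acc, NULL);
    return 0;
}
```

The point is the shape of the interaction, a single release store observed by an acquire load on shared memory, rather than descriptor rings, doorbells and completion interrupts.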

The second factor is that the CCIX Consortium made a sensible decision in building the CCIX protocol to run over PCIe. This substantially lowers the chip and platform design challenges, as well as the software investment. It does not add new pins to our SoC, and it allows customers to decide whether a workload is better suited to PCIe mode or CCIX mode, with a seamless migration path in the same hardware environment.

The third factor is that CCIX supports Memory Expansion, which targets SCM (Storage Class Memory) use cases. While SCM is still an emerging technology itself, it is closely monitored by datacenter, storage, OS/filesystem and database developers. The CCIX Memory Expansion use case can be easily supported, with potential value-add in enabling SCM in the future.

What were the industry challenges you were trying to overcome by using CCIX in the Kunpeng 920 CPU?

The industry needs a way to achieve open, ISA-independent, highly scalable, heterogeneous acceleration platforms to meet the growing and accelerating compute performance needs of AI, cloud, multimedia, IoT and many emerging and disruptive 5G use cases. The industry needs to address this problem quickly, before data sizes, data traffic and workloads outgrow the available computing throughput, and without hitting the power wall.

We see CCIX as one of the enablers of that vision, so we built CCIX into the Kunpeng 920 and TaiShan servers as one of the enablers of the CCIX ecosystem. With a lot of diligent effort from our engineering teams, we are proud to see the Kunpeng 920 and TaiShan server become the industry's first CCIX-enabled Arm-based Host platform and help drive CCIX deployment.

The industry has progressed without an open cache coherent interconnect standard in the past. Why now?

Indeed, many have asked why we need CCIX. Let's look at it this way. All advanced CPUs are multi-core designs, and no one would question the importance of, and need for, an on-chip cache coherent fabric between the CPU cores and other processing elements such as SIMD, encryption and compression accelerators closely coupled to the CPU cores. This is because when multiple cores run the same application, they share a virtual memory address space (SVM), share data sets and results, and need inter-process communication and synchronization.

CCIX is effectively an extension of that “internal cache coherent fabric” to the external interface, enabling an open platform to work with a broad array of external accelerators. If you agree that a multi-core CPU plus on-die accelerators needs a cache coherent fabric, it should be easy to see why we want a multi-core CPU plus external accelerators to be connected in the same cache coherent fashion, as if they were “closely coupled on-die” with each other. CCIX calls this “Seamless Acceleration” and we like that vision.

It is also much more natural to migrate applications running on a CPU with multiple cores and on-die accelerators to a CCIX-based CPU with external accelerators, since doing so retains the CC-SVM (Cache-Coherent Shared Virtual Memory) load/store and cache coherent programming model. Without CC-SVM, application developers need to substantially change their programming and synchronization models to use DMA and APIs that involve the OS and drivers, with latencies in the multi-millisecond range.
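As a rough illustration of that migration cost, the sketch below contrasts the two offload flows. The submit/wait helpers are hypothetical stand-ins, not any real driver API; the point is only what the application has to do around them.

```c
/*
 * Contrast sketch of DMA-style vs CC-SVM-style offload.
 * All submit_*/wait_* functions are hypothetical placeholders.
 */
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* --- DMA model: data is staged into device-visible buffers and copied back. --- */
static void submit_dma_job(void *dma_buf, size_t len) { (void)dma_buf; (void)len; }
static void wait_dma_completion(void)                 { /* interrupt + driver wakeup */ }

void offload_dma(float *data, size_t n)
{
    float *dma_buf = malloc(n * sizeof *data);   /* staging copy for the device */
    memcpy(dma_buf, data, n * sizeof *data);     /* host -> device buffer */
    submit_dma_job(dma_buf, n * sizeof *data);   /* descriptor/doorbell via driver */
    wait_dma_completion();                       /* OS/driver latency sits here */
    memcpy(data, dma_buf, n * sizeof *data);     /* device buffer -> host */
    free(dma_buf);
}

/* --- CC-SVM model: the accelerator works on the application's own pointers. --- */
static void submit_svm_job(float *data, size_t n) { (void)data; (void)n; }
static void wait_svm_completion(void)             { /* e.g. poll a shared flag */ }

void offload_ccix(float *data, size_t n)
{
    submit_svm_job(data, n);     /* pass the virtual address as-is */
    wait_svm_completion();       /* results are already in place on return */
}
```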

Many readers may realize CCIX is not the first such cache coherent interconnect standard. Many technology players have their own CPU-to-accelerator cache coherent interconnects, but we opted to support CCIX because it is open, ISA-independent and layered over PCIe, which makes implementation, validation, deployment and application development much easier.

From a design implementation point of view, how easy was it to incorporate CCIX technology?

It was much easier than building an entirely new hardware/software stack from scratch. As I mentioned above, since CCIX runs over PCIe, our designers can leverage the mature PCIe infrastructure, including design tools, verification tools and test equipment. CCIX shares the same PCIe 4.0 CEM connector, so CCIX accelerator products can be designed in parallel. With PCIe as a baseline, engineers can bring up the interface in PCIe mode, which substantially shortens debug and bring-up time. Because the CCIX TLP is fully compliant with PCIe, after initial bring-up the engineers can quickly focus on CCIX protocol validation rather than debugging the transport itself. The ease of implementation should be quite evident: CCIX Rev 1.0 version 0.9 was only made available in June 2018, yet Huawei's teams and several other CCIX members started designs during the specification development phase and were able to get working chips in a very short time, achieving great results.
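As one small illustration of how PCIe-native layering keeps bring-up familiar, the sketch below (Linux-specific, and not Huawei tooling) walks a device's PCIe extended capability list from its sysfs config space and reports any DVSEC (Designated Vendor-Specific Extended Capability) entries, the generic PCIe mechanism that vendor-defined protocol discovery such as CCIX's builds on. The actual CCIX capability layout is defined by the CCIX specification and is not assumed here.

```c
/* Walk a PCIe device's extended capability list and print DVSEC entries. */
#include <stdint.h>
#include <stdio.h>

#define PCI_EXT_CAP_START    0x100
#define PCI_EXT_CAP_ID_DVSEC 0x0023   /* Designated Vendor-Specific Extended Capability */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /sys/bus/pci/devices/<bdf>/config\n", argv[0]);
        return 1;
    }

    uint8_t cfg[4096] = {0};
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    size_t n = fread(cfg, 1, sizeof cfg, f);    /* the full 4 KiB typically needs root */
    fclose(f);

    size_t off = PCI_EXT_CAP_START;
    while (off && off + 8 <= n) {
        /* Extended capability header: [15:0] cap ID, [19:16] version, [31:20] next offset. */
        uint32_t hdr = cfg[off] | cfg[off + 1] << 8 | cfg[off + 2] << 16 |
                       (uint32_t)cfg[off + 3] << 24;
        uint16_t cap_id = hdr & 0xffff;
        if (cap_id == 0 || cap_id == 0xffff)
            break;                              /* nothing readable at this offset */
        if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
            uint16_t vendor = cfg[off + 4] | cfg[off + 5] << 8;   /* DVSEC vendor ID */
            printf("DVSEC at 0x%03zx, vendor 0x%04x\n", off, vendor);
        }
        off = hdr >> 20;                        /* next capability offset, 0 terminates */
    }
    return 0;
}
```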

The Huawei Kunpeng 920 CPU is the first-to-market CCIX-enabled Arm64 server CPU. Can you discuss Huawei’s views on the growing CCIX ecosystem?

Like all new, and especially disruptive, technologies, it will take some time to grow the ecosystem. This is because the market has built working solutions around existing hardware and software architectures, and there is a strong infrastructure and knowledge base, with plenty of OS support, drivers, APIs and software tools available. On the plus side, CCIX also leverages some of these aspects.

We see CCIX ecosystem growth starting with widespread CCIX hardware availability, e.g., the Huawei TaiShan server and Xilinx ACAP FPGAs, together with the open-source enablement that the CCIX Software workgroup is driving. Then accelerator vendors, application developers and academic researchers will drive cache coherent workloads, algorithms and optimizations. With the Kunpeng and TaiShan family of products, we will play our role in driving CCIX and heterogeneous acceleration architectures forward.

How about your customers? How is CCIX support going to benefit them? 

Huawei customers have diverse application requirements, and their applications range across Big Data, Storage, Web, Data Management, AI, Edge Computing and many more. Some workloads will continue to work very well in the current hardware environment. Some may be enhanced with the CCIX CC-SVM model, and we believe the Kunpeng 920 and TaiShan servers offer our customers a flexible choice and a seamless migration path when they are ready to use CCIX for their accelerated workloads.