Home Is Where the Memory Is
In traditional homogeneous computing systems built on symmetric multiprocessing (SMP), CPU Home Agents (HAs) maintain coherent memory for a given address space and manage the data and ownership responses that each transaction requires. When these systems are extended with accelerators to form heterogeneous computing systems, the Home Agents remain confined to the CPU multiprocessing domain. This can streamline cache coherency, because the system has only one centralized HA to work with. However, it also means that all requests are funneled through the CPU’s limited resources.
CCIX technology, however, enables a peer-processing data-sharing model in which the HA may reside on an accelerator. By placing the HA on the CCIX device, close to the memory it manages, the CCIX protocol can reduce latency. While this may seem to demand more work from a software point of view, CCIX has ensured that it isn’t a burden for developers. (For more information on how CCIX makes things “boring” for software developers, see our previous blog.)
To understand why letting an HA reside on an accelerator is beneficial, we first need to define CCIX Home Agents and Request Agents and explore how a CCIX system operates.
CCIX devices consist of the following key components: the Acceleration Function (AF), which we covered in a previous blog; the Request Agent (RA); the Home Agent (HA); the Slave Agent (SA); and Ports and Links.
In a CCIX system, the RA is responsible for issuing coherent read and write requests to physical addresses. This enables the CCIX device to participate in the cache-coherent network and act as an accelerator peer of the CPU. Its counterpart, the HA, is responsible for managing coherency and memory access, and for sending snoop transactions to RAs as required. The SA also provides physical memory. However, that memory is an extension of the memory managed by an HA, so the SA must be homed by a parent HA. The SA allows memory associated with an HA in one device to be physically provided by a separate device, which makes it possible to build memory expansion devices. Memory controlled by an SA looks just like regular system memory and becomes part of the standard NUMA memory map.
To manage its physical memory, or memory pools, the HA must participate in the coherency protocol with the RAs. When the HA receives a memory request, it sends snoops to the other RAs to ensure it obtains the latest copy of the requested data, because that copy may reside in a cache belonging to a different RA. If it does, the HA updates its own memory with that data and then services the original request. This process keeps memory coherent across the system. Note that an HA may have many memory pools, and a memory pool may be managed by multiple HAs.
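The snoop-then-service flow described above can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not CCIX-specified behavior. The class names, the `(data, dirty)` cache representation and the choice to leave a clean copy in the snooped RA are all hypothetical. The model also ignores real protocol details such as snoop filters, cache states and message ordering.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RequestAgent:
    agent_id: int
    # Hypothetical cache model: address -> (data, dirty flag).
    cache: Dict[int, Tuple[int, bool]] = field(default_factory=dict)

    def snoop(self, addr: int) -> Optional[int]:
        """Return the data if this RA holds a modified (dirty) copy of addr."""
        entry = self.cache.get(addr)
        if entry is None or not entry[1]:
            return None
        data = entry[0]
        # Assumption: after the snoop, the RA keeps a clean copy of the line.
        self.cache[addr] = (data, False)
        return data


@dataclass
class HomeAgent:
    memory: Dict[int, int]  # the memory pool homed by this HA
    request_agents: List[RequestAgent]

    def read(self, requester_id: int, addr: int) -> int:
        # Snoop every RA except the requester for a newer copy of the data.
        for ra in self.request_agents:
            if ra.agent_id == requester_id:
                continue
            latest = ra.snoop(addr)
            if latest is not None:
                # Update home memory with the latest data first...
                self.memory[addr] = latest
        # ...then service the original request from home memory.
        return self.memory[addr]


# Usage: RA 1 holds a dirty copy (42) that is newer than home memory (7).
ra0 = RequestAgent(agent_id=0)
ra1 = RequestAgent(agent_id=1, cache={0x100: (42, True)})
ha = HomeAgent(memory={0x100: 7}, request_agents=[ra0, ra1])
print(ha.read(requester_id=0, addr=0x100))  # 42
```

The HA writes the snooped data back to its own memory before answering. This mirrors the order given in the text, so that later requests see the latest value even if the snooped RA drops its copy.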
The CCIX protocol uses Agent Identifiers, or Agent IDs, to keep track of the unique RAs, SAs and HAs in the system. Each CCIX protocol message exchanged between agents carries Agent IDs. These IDs allow both the devices that carry the message and its recipient to identify the transmitting agent and the intended recipient. As a result, every message can be attributed unambiguously to a globally unique sender and receiver.
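The role of Agent IDs can be illustrated with a minimal Python sketch. This is a toy registry, not the CCIX message format. The `Message` fields, the `kind` labels and the `AgentRegistry` class are hypothetical. The sketch shows two things only: IDs must be unique, and every message names both its sender and its intended recipient, so a reply can be addressed back by swapping the two IDs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    src_id: int  # Agent ID of the transmitting agent
    tgt_id: int  # Agent ID of the intended recipient
    kind: str    # hypothetical label such as "read" or "data"


class AgentRegistry:
    """Toy model: one inbox per globally unique Agent ID."""

    def __init__(self) -> None:
        self._inboxes: Dict[int, List[Message]] = {}

    def register(self, agent_id: int) -> None:
        # Agent IDs must be unique across the system.
        if agent_id in self._inboxes:
            raise ValueError(f"Agent ID {agent_id} is already in use")
        self._inboxes[agent_id] = []

    def send(self, msg: Message) -> None:
        # Both endpoints are identified by ID, so either can be checked.
        if msg.src_id not in self._inboxes:
            raise KeyError(f"unknown sender {msg.src_id}")
        if msg.tgt_id not in self._inboxes:
            raise KeyError(f"unknown recipient {msg.tgt_id}")
        self._inboxes[msg.tgt_id].append(msg)

    def inbox(self, agent_id: int) -> List[Message]:
        return list(self._inboxes[agent_id])


# Usage: an RA (ID 1) sends a request to an HA (ID 8), which replies.
reg = AgentRegistry()
reg.register(1)  # an RA
reg.register(8)  # an HA
reg.send(Message(src_id=1, tgt_id=8, kind="read"))
request = reg.inbox(8)[0]
# The reply swaps the IDs taken from the request.
reg.send(Message(src_id=request.tgt_id, tgt_id=request.src_id, kind="data"))
```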
In a CCIX system, an RA sends memory requests to an HA on a remote device using address-routed messages. These messages are directed to their destination based on the target address they carry.
To help manage all of this, CCIX implements a System Address Map (SAM) table (Figure 11), which makes address-routed messages possible. Depending on whether a CCIX device contains an RA or an HA, it must provide an RSAM (Request Agent SAM) or an HSAM (Home Agent SAM) table. These tables ensure that each address-routed message leaves through a Port on a distinct route between the local device and the destination device containing the targeted memory pool.
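The address-to-route lookup that a SAM performs can be sketched in Python. This is a simplified model under stated assumptions. The real RSAM and HSAM layouts are defined by the CCIX specification and may resolve an address to an agent ID before choosing a Port. Here, each hypothetical `SamEntry` maps a non-overlapping address range directly to a local Port number. The lookup uses a binary search over the sorted range bases.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SamEntry:
    base: int  # first address of the memory pool's range
    size: int  # size of the range in bytes
    port: int  # hypothetical: local Port on the route to the pool's device


class SystemAddressMap:
    """Toy SAM: maps non-overlapping address ranges to Ports."""

    def __init__(self, entries: List[SamEntry]) -> None:
        self._entries = sorted(entries, key=lambda e: e.base)
        # Each address must resolve to exactly one route.
        for prev, cur in zip(self._entries, self._entries[1:]):
            if prev.base + prev.size > cur.base:
                raise ValueError("SAM entries must not overlap")
        self._bases = [e.base for e in self._entries]

    def route(self, addr: int) -> int:
        # Find the last entry whose base is <= addr, then check its bounds.
        i = bisect.bisect_right(self._bases, addr) - 1
        if i >= 0:
            entry = self._entries[i]
            if addr < entry.base + entry.size:
                return entry.port
        raise LookupError(f"address {addr:#x} is not mapped")


# Usage: two memory pools reachable through different Ports.
sam = SystemAddressMap([
    SamEntry(base=0x0000_0000, size=0x4000_0000, port=0),
    SamEntry(base=0x1_0000_0000, size=0x1000_0000, port=1),
])
```

Rejecting overlapping entries at construction time keeps every lookup unambiguous. A given target address then always maps to a single route toward the device that holds the memory pool.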
This structure is what sets CCIX apart from the other computing protocols discussed earlier. The CCIX architecture permits the HA, known in this case as a CCIX Remote HA, to reside on the accelerator rather than on the CPU.
With the HA on the accelerator, the RA can perform near-memory processing on data local to the accelerator. The RA no longer has to reach into the host CPU’s system memory and can instead access data held locally on the accelerator. As a result, CCIX technology delivers significant improvements for low-latency applications.
Memory is at the heart of every application. By allowing the Home Agent to be placed closer to the memory, CCIX is able to reduce the amount of time needed for applications like 5G, machine learning and cloud computing to access latency-sensitive data.
With CCIX, home is where the memory is. And there’s no place like home.