One can argue that “processing or compute in or near memory” (I’ll refer to this as PIM going forward) is much like “emerging memory” in that it has been researched for years and there have been numerous attempts at productization. Success in both cases to date has really only occurred in niche applications. That said, I believe there is a strong argument to be made that the era of emerging memory and PIM is upon us.
The stage is set first by the relatively flat single-thread performance growth in general-purpose CPUs (cache size and special instructions aside), along with the overall slowing of litho scaling – e.g. Intel’s recent admissions. On the other side of the Moore’s Law shakeup is the never-ending hunger of hyperscalers and enterprise providers to improve data center compute density (which feeds directly into capacity, total cost of ownership, and cloud offering costs) and to differentiate OEM product lines. CPU performance is now largely constrained by the realities of the von Neumann architecture (good reading: Kanev et al., Profiling a warehouse-scale computer, in Proceedings ISCA’15 (2015)), so the scene is set for anything disruptive enough to break the memory-induced logjam.
A second part of the disruptive landscape is that we are arguably at the end of effective DRAM scaling, at least as we know it. Long gone are the days of big node-to-node shrinks; litho “improvements” are now on the order of ~2nm, trending to 1nm or less (made fuzzy by how each manufacturer now individually defines its 1X, 1Y, 1Z, etc. process nodes). Process advances from 1Y to 1Z, for example, bring on one end the higher production cost of ever more multi-patterning, or the amortization of EUV capital equipment expenditures, and on the other end design-level changes to support scaling, such as DDR5’s on-die ECC (~7% more bits plus logic overhead). Given this, it is not a stretch to hypothesize that the $/bit improvement across 1Y->1Z->? is quite flat. An emerging memory no longer has to chase the juggernaut that was DRAM scaling (and, in parallel, NAND) and the massive cost efficiencies driven by worldwide production volumes. This opens the door to DRAM replacement technologies that have the potential to provide a renewed scaling and $/bit improvement roadmap.
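As a rough sanity check on that ECC figure (assuming the widely reported (136, 128) single-error-correcting code for DDR5 on-die ECC, i.e. 8 check bits guarding every 128 data bits – the exact code is each manufacturer’s choice):

$$\text{bit overhead} = \frac{8}{128} = 6.25\%$$

which lands close to the ~7% cited once the extra cell array area and correction logic are accounted for.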
Adding other memory tier(s) composed of moderately fast (vs NAND) but relatively cheap (vs DRAM) memory is also a way to address the DRAM scaling wall at the system level. Generally described as Storage Class Memory (SCM), this memory could be used to provide large memory footprints at a moderate capital cost compared to DRAM, or perhaps even supplant a portion of the typical DRAM footprint. Intel/Micron’s 3D XPoint is an example of this class of memory.
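To make the tiering concrete, here is a minimal sketch of how an application might place data across DRAM and an SCM tier today, using the open-source memkind library (its MEMKIND_DAX_KMEM kind targets SCM exposed to Linux as system RAM; the hot/cold split and the sizes are illustrative assumptions on my part, not a recipe):

```c
/* Sketch: placing hot vs. cold data in different memory tiers with the
 * memkind library. MEMKIND_DAX_KMEM targets SCM exposed as system RAM
 * via kmem DAX (e.g. 3D XPoint DIMMs); MEMKIND_DEFAULT is ordinary
 * DRAM. Build with -lmemkind. */
#include <memkind.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Hot, latency-sensitive index stays in DRAM... */
    size_t idx_bytes = 64 * 1024;
    char *hot_index = memkind_malloc(MEMKIND_DEFAULT, idx_bytes);

    /* ...while the large, colder payload goes to the cheaper SCM tier. */
    size_t payload_bytes = 256 * 1024 * 1024;
    char *cold_payload = memkind_malloc(MEMKIND_DAX_KMEM, payload_bytes);

    if (!hot_index || !cold_payload) {
        fprintf(stderr, "tiered allocation failed (no SCM tier present?)\n");
        return EXIT_FAILURE;
    }

    /* Both tiers are plain load/store memory to the application. */
    memset(cold_payload, 0, payload_bytes);

    memkind_free(MEMKIND_DAX_KMEM, cold_payload);
    memkind_free(MEMKIND_DEFAULT, hot_index);
    return EXIT_SUCCESS;
}
```

The point is that the second tier is just another heap as far as the application is concerned; the capacity/cost trade-off is made at allocation time.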
Digressing for a moment, to some degree I was motivated to write this post after reading an article in Semiconductor Engineering – Moving Data and Computing Closer Together. The author does a nice job of laying out some of the potential advantages and pitfalls of processing in/near memory; however, I think it is worth expanding on this in order to introduce the third part of our disruptive landscape.
The drawings in the article (and implications in the text) show the PIM units attached via the DDR bus. This is natural, as efficiency of data movement, bandwidth, and the desire for true memory semantics make PCIe less than ideal. The article points out the extra difficulty programmers will have partitioning data and functionality to best match a combination of CPU plus PIM units, but the difficulty goes beyond just the programming model. Assuming a more general-purpose model where both the CPU and the PIM units need to access and process the data, we are presented with the question of how data is actually laid out in a typical server. The DDR subsystem is composed of multiple channels, each containing DIMMs built from multiple nibble- or byte-wide discrete memory/PIM devices. A good example of this is shown on UPMEM’s Technology Page – one can see the individual devices, each getting a slice of the DDR bus. The functional problem is that data is widely interleaved across the memory devices of a DIMM (channel width times burst length equating to a 64B CPU cache line), and then across channels. This means the typical way data is read/written by the CPU would leave each PIM unit with only partial slices of contiguous data objects to try and operate on (see the sketch below). Obviously this can be accounted for in the data organization, but it adds another axis of complexity and potential inefficiency: do all the DIMMs carry PIM units; what are the interleave capabilities of the CPU in question; how much has the aggregate bandwidth of the non-PIM memory channels been impacted, possibly starving CPU threads of data; do the PIM data objects also need to be operated on efficiently by CPU threads? Given publicized results from UPMEM (and miscellaneous academic works), there are use cases and data sets that can benefit from (in spite of?) this architecture.
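To see why, here is a toy model of the address-to-device mapping described above. The interleave parameters (four channels, cache-line-granularity channel interleave, eight x8 devices forming a 64-bit bus) are illustrative assumptions of mine, not any particular memory controller’s policy – real CPUs hash and interleave in more elaborate ways:

```c
/* Toy model of how a "contiguous" buffer scatters across a DDR
 * subsystem: 64 B cache lines round-robin across channels, and within
 * a line each byte lands on a different x8 device (one byte lane per
 * beat of the burst). Illustrative parameters only. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE    64  /* bytes per CPU cache line                */
#define CHANNELS       4  /* DDR channels interleaved per line       */
#define DEVS_PER_DIMM  8  /* x8 DRAM/PIM devices forming 64-bit bus  */

static void locate(uint64_t addr, unsigned *chan, unsigned *dev)
{
    uint64_t line = addr / CACHE_LINE;
    *chan = line % CHANNELS;       /* lines round-robin across channels */
    *dev  = addr % DEVS_PER_DIMM;  /* byte lane within the 64-bit bus   */
}

int main(void)
{
    /* Tally where each byte of a 256 B contiguous object lands. */
    unsigned bytes_at[CHANNELS][DEVS_PER_DIMM] = {0};
    for (uint64_t addr = 0; addr < 256; addr++) {
        unsigned chan, dev;
        locate(addr, &chan, &dev);
        bytes_at[chan][dev]++;
    }
    for (unsigned c = 0; c < CHANNELS; c++)
        for (unsigned d = 0; d < DEVS_PER_DIMM; d++)
            printf("channel %u, device %u: %u bytes\n",
                   c, d, bytes_at[c][d]);
    /* Result: 8 bytes on every one of the 32 devices, and no device
     * ever holds a contiguous run longer than a single byte. */
    return 0;
}
```

Every device ends up with exactly 8 of the object’s 256 bytes, never contiguous ones – precisely the “partial slices” problem a DDR-attached PIM unit has to contend with.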
The need for a much better way to interface with processing in/near memory sets the stage for the third part of the disruptive arena – CXL™, or Compute Express Link. CXL emerged (along with other technologies such as OpenCAPI and CCIX) to fill a widening gap between the DDR subsystem and PCIe, enabling coherent, low-latency memory and accelerator attach in a way neither existing interface could. CXL allows a block of system memory addresses to be mapped onto the interface so that the CXL memory is a true “first class” memory citizen, with load/store semantics. Given this, and its latency optimizations versus PCIe, it is an ideal attach point for emerging SCM. Using CXL for a second-tier memory (SCM) does not incur the DDR-bus interaction/interference issues created by putting SCM on the DDR bus itself (e.g. NVDIMM-P). Additionally, CXL is intended to provide a “closely attached”, coherent interface for accelerators. PIM is an embodiment of just such an accelerator, only with the memory “behind” the offload engine (the PIM processor) rather than the accelerator operating on data in local or system memory. The CXL controller can be designed to provide the “near memory processing” functionality itself, or to cooperate with the PIM units in how data blocks are allocated and accessed.
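From software’s point of view, the “first class citizen” claim is easy to illustrate. On Linux, CXL-attached memory is typically surfaced as a CPU-less NUMA node; the sketch below allocates from such a node with libnuma and then touches it with ordinary loads and stores. The node number is an assumption for this example (check `numactl --hardware` on a real system):

```c
/* Sketch: treating CXL-attached memory as plain load/store memory on
 * Linux, where a CXL.mem device typically appears as a CPU-less NUMA
 * node. CXL_NODE is an assumed node number for illustration. Build
 * with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

#define CXL_NODE 2  /* assumed NUMA node backed by CXL memory */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }

    size_t bytes = 16 * 1024 * 1024;
    double *buf = numa_alloc_onnode(bytes, CXL_NODE);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", CXL_NODE);
        return EXIT_FAILURE;
    }

    /* Ordinary loads and stores -- no driver calls, no DMA setup. */
    size_t n = bytes / sizeof(double);
    for (size_t i = 0; i < n; i++)
        buf[i] = (double)i;
    printf("buf[42] = %f\n", buf[42]);

    numa_free(buf, bytes);
    return EXIT_SUCCESS;
}
```

No special driver path, no descriptor rings: the CXL memory is simply part of the physical address space, which is exactly what makes it a natural home for SCM tiers and PIM-style accelerators alike.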
In summary, we have a situation where CPU improvements are bounded by architecture and process limitations, DRAM no longer enjoys the generation-to-generation litho advantages, bit growth, and cost-downs of the past, and a new paradigm for system interconnect has opened up with the apparently wide adoption of CXL. Set this backdrop against the needs of compute users, from the massive hyperscalers to consumers, and one can argue there has never been a better time for emerging disruptive memory technology, or for system-level shifts to compute in/near memory!