If you ask ChatGPT about “composable hardware infrastructure”, you get something like this: “Composable hardware infrastructure represents a significant evolution in data center architecture, providing a more flexible, efficient, and scalable approach to managing IT resources.” Fair enough, but what does it mean?
Traditionally, composable hardware calls for the pooling of various hardware resources into Resource Pools (efficiency, scalability), connecting the pools via some network, and provisioning a subset of resources from each pool as needed using the Software Defined Infrastructure approach (flexibility, elasticity).
We want to pool hardware into distinct resource types (CPU, memory, IO, GPUs, and various accelerators for e.g. networking, ML/DL like TPUs/DPUs, security, etc.). This is the opposite of slicing a single piece of hardware into multiple virtual instances (as we do with “server virtualization” via type-1/2 hypervisors, or as IBM does when partitioning real hardware with LPARs on mainframes). We DO NOT want to build a monster machine with huge resources in order to enable hardware pools (we learned that “dinosaurs” eventually die out, while “lizards” adapt and survive). We actually want the reverse: cluster “regular-size” (e.g. 1/2RU) commodity and domain-specific nodes (as mentioned) to create the resource pools. This allows for much more fluidity and efficiency in resource management: utilization goes up because we avoid the bin-packing problems typical of monolithic computer designs, resources can be shared, and a hardware failure no longer means fate-sharing across everything that happened to live in the same box.
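To make the composition idea concrete, here is a minimal, purely illustrative Python sketch (the pool sizes, resource names, and compose_server function are invented for this example, not taken from any real orchestrator): a “virtual server” is carved out of independent CPU, memory, and GPU pools rather than out of one monolithic box.

```python
# Toy model of composing a "virtual server" from disaggregated resource pools.
# All names and numbers are illustrative, not from any real orchestrator. Python 3.9+.

from dataclasses import dataclass

@dataclass
class ResourcePool:
    name: str
    capacity: int      # total units in the pool (cores, GiB, GPUs, ...)
    allocated: int = 0

    def claim(self, amount: int) -> None:
        if self.allocated + amount > self.capacity:
            raise RuntimeError(f"{self.name} pool exhausted")
        self.allocated += amount

@dataclass
class ComposedServer:
    cpu_cores: int
    mem_gib: int
    gpus: int

def compose_server(pools: dict[str, ResourcePool],
                   cpu_cores: int, mem_gib: int, gpus: int) -> ComposedServer:
    """Claim just what the workload needs from each independent pool."""
    pools["cpu"].claim(cpu_cores)
    pools["mem"].claim(mem_gib)
    pools["gpu"].claim(gpus)
    return ComposedServer(cpu_cores, mem_gib, gpus)

pools = {
    "cpu": ResourcePool("cpu", capacity=4096),   # cores spread over many 1-2RU nodes
    "mem": ResourcePool("mem", capacity=65536),  # GiB in a memory pool
    "gpu": ResourcePool("gpu", capacity=256),    # accelerators in a GPU pool
}

# Two very different workloads draw from the same pools, with no stranded capacity.
web_frontend = compose_server(pools, cpu_cores=16, mem_gib=64, gpus=0)
training_job = compose_server(pools, cpu_cores=32, mem_gib=512, gpus=8)
```

The point is that the web frontend and the training job no longer have to fit the fixed CPU:memory:GPU ratio of whatever server model happens to be sitting in the rack.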
Let’s imagine we already have these hardware building blocks. We need to interconnect the resource pools somehow, and for that we could use generic Ethernet, a specific high-performance (lossless, low-latency) InfiniBand fabric, or something in between like augmented Ethernet (lossless, low-latency; driven by the Ultra Ethernet Consortium, or something proprietary like a Broadcom Jericho3-based Ethernet fabric for AI/ML). You can even go crazy and, instead of electronic switching, throw in something like photonic/optical switching to get close to warp speed. 🙂 What is certain is that the network is extremely important here, since it has to replace a model where memory and every peripheral device are directly attached to the CPU. That is a tall order, because it demands extremely low latency (tens to hundreds of nanoseconds) and very high bandwidth (hundreds to thousands of Gbps).
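As a rough back-of-envelope illustration of why those numbers matter (the figures below are order-of-magnitude assumptions, not measurements of any specific fabric or product), compare a local DRAM access with one that has to cross a disaggregation fabric and back:

```python
# Order-of-magnitude latency budget for a remote memory access (illustrative numbers only).

local_dram_ns = 100          # typical local DDR access, roughly 80-120 ns

# Components of one fabric traversal (one way), all rough assumptions:
host_port_ns = 150           # host interface / controller overhead
switch_hop_ns = 400          # one cut-through switch hop
propagation_ns = 5 * 10      # ~5 ns per metre of fibre, assume 10 m in-rack/row

one_way_ns = host_port_ns + switch_hop_ns + propagation_ns
remote_access_ns = local_dram_ns + 2 * one_way_ns   # request out + response back

print(f"local DRAM : ~{local_dram_ns} ns")
print(f"remote mem : ~{remote_access_ns} ns "
      f"({remote_access_ns / local_dram_ns:.0f}x slower)")
# Shrinking that multiplier to something applications can tolerate is the whole
# game for disaggregation fabrics (CXL, lossless Ethernet, optics).
```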
Disaggregated hardware architecture (much like Hyper-Converged Infrastructure before it) requires a fundamental architectural change, and specifically moves IO further away from the compute instances (IO operations are no longer local). This usually creates resistance to adopting such architectures, due to a lack of trust/transparency that IO performance for mission-critical and IO-heavy workloads will not suffer. It further emphasizes the importance of the network, and may require additional mechanisms for ensuring network QoS (PFC, ECN, WRED/AFD, …).
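To give a feel for one of those mechanisms, here is a minimal sketch of the WRED/ECN-style marking logic a switch queue might apply (the thresholds and probability are invented example values, not any vendor’s defaults): as the queue grows between a minimum and a maximum threshold, packets get ECN-marked, or dropped, with increasing probability, so senders back off before buffers overflow.

```python
import random

# Illustrative WRED/ECN-style marking: example thresholds, not vendor defaults.
MIN_THRESHOLD = 20      # queue depth (packets) below which nothing is marked
MAX_THRESHOLD = 80      # queue depth at which marking probability peaks
MAX_MARK_PROB = 0.10    # marking probability just below MAX_THRESHOLD

def should_mark_ecn(queue_depth: int) -> bool:
    """Return True if this packet should carry an ECN congestion mark."""
    if queue_depth <= MIN_THRESHOLD:
        return False
    if queue_depth >= MAX_THRESHOLD:
        return True                      # beyond max: mark (or drop) everything
    # Probability ramps up linearly between the two thresholds.
    ramp = (queue_depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
    return random.random() < ramp * MAX_MARK_PROB

# Example: the deeper the queue, the more likely the mark.
for depth in (10, 40, 70, 90):
    print(depth, should_mark_ecn(depth))
```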
One more fundamental requirement: this new paradigm needs to be transparent from the user/application perspective, i.e. it must not require changes at the OS level. Remember how paravirtualization (powerful as it was) got run over by emulation and binary-translation based server virtualization techniques (later supported directly in hardware by CPU manufacturers) – a clear lesson for future designs.
The issue now becomes how we actually allow the creation of a (virtual) computer: how does a CPU talk to (now remote) RAM, or to IO (storage) memory, or GPU memory, or whatever… without breaking cache coherence (since multiple, physically separated processors – CPU, accelerators, etc. – are accessing and locally caching the same memory locations)? Within the safe and known boundaries of a single computer we figured this out decades ago with cache-snooping techniques. So the question is: how do we enable the abstraction of a “motherboard” in this brave new disaggregated hardware world?
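For intuition, here is a toy sketch of the write-invalidate snooping idea that single-machine coherence protocols (MESI and friends) are built on. It is deliberately simplified (one shared bus, write-through, only valid/invalid states) and does not model any particular protocol:

```python
# Toy write-invalidate snooping: every cache watches ("snoops") a shared bus,
# and drops its own copy when another cache writes the same address.
# Heavily simplified; real protocols (MESI, MOESI, ...) track many more states.

class SnoopingCache:
    def __init__(self, name: str, bus: "Bus"):
        self.name = name
        self.lines = {}          # address -> cached value
        self.bus = bus
        bus.attach(self)

    def read(self, addr: int, memory: dict) -> int:
        if addr not in self.lines:               # miss: fetch from memory
            self.lines[addr] = memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr: int, value: int, memory: dict) -> None:
        self.bus.broadcast_invalidate(addr, source=self)   # others drop their copies
        self.lines[addr] = value
        memory[addr] = value                     # write-through for simplicity

    def snoop_invalidate(self, addr: int) -> None:
        self.lines.pop(addr, None)

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache: SnoopingCache) -> None:
        self.caches.append(cache)
    def broadcast_invalidate(self, addr: int, source: SnoopingCache) -> None:
        for cache in self.caches:
            if cache is not source:
                cache.snoop_invalidate(addr)

memory = {}
bus = Bus()
cpu, gpu = SnoopingCache("cpu", bus), SnoopingCache("gpu", bus)

cpu.write(0x100, 42, memory)
print(gpu.read(0x100, memory))   # 42 - gpu misses and fetches the fresh value
gpu.write(0x100, 7, memory)      # invalidates cpu's copy via the bus
print(cpu.read(0x100, memory))   # 7 - cpu re-fetches, stale data is never observed
```

This cheap “everyone watches the same bus” trick stops working once the participants are whole racks apart, which is exactly the gap the new interconnect has to fill.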
Answer? CXL! OK, what is CXL? Let’s consult ChatGPT once more: “Compute Express Link (CXL) is an open standard interconnect technology designed to enhance the performance and efficiency of data center systems by enabling high-speed, low-latency communication between CPUs, memory, accelerators, and other peripherals. CXL aims to address the limitations of traditional interconnects by providing a coherent interface that improves resource sharing and scalability.” In practice, CXL is a set of three protocols (CXL.io, CXL.cache, and CXL.mem) layered on top of the PCIe physical layer to enable various disjoint CPU/accelerator/memory hardware scenarios. It basically lets a node (e.g. a regular compute node) access (mount, if you will) remote hardware resources like a GPU or a memory device and makes them look like locally attached devices (similar to what FC or iSCSI did for storage IO).
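As a rough mental model (a sketch only: the address ranges, names, and request shape below are invented for illustration, and real CXL.mem transactions involve far more machinery), a CPU load to an address that falls inside a CXL device’s host-managed memory window is turned into a CXL.mem request to the device instead of a local DRAM access, so software simply sees one flat physical address space:

```python
# Toy model of how a load either hits local DRAM or becomes a CXL.mem request.
# Address ranges, names, and the request/response shape are illustrative only.

LOCAL_DRAM = range(0x0000_0000, 0x4000_0000)   # 1 GiB of "local" memory
CXL_HDM    = range(0x4000_0000, 0x8000_0000)   # 1 GiB of host-managed device memory

local_dram = {}
cxl_device_memory = {}

def cxl_mem_read(addr: int) -> int:
    """Stand-in for a CXL.mem master-to-subordinate read and its data response."""
    return cxl_device_memory.get(addr, 0)

def cpu_load(addr: int) -> int:
    """The CPU issues one ordinary load; routing is decided by physical address."""
    if addr in LOCAL_DRAM:
        return local_dram.get(addr, 0)
    if addr in CXL_HDM:
        return cxl_mem_read(addr)        # goes out over the CXL link, transparently
    raise ValueError("address not mapped")

cxl_device_memory[0x4000_1000] = 123
print(cpu_load(0x4000_1000))             # 123 - same load instruction, remote memory
```

From the OS point of view, such device memory typically just shows up as more addressable memory (e.g. a CPU-less NUMA node), which is what keeps the whole thing transparent to applications, as required above.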
So, we can twist the famous Sun adage and state that “The network is the… computer MOTHERBOARD!” :) OK, we now have the virtual motherboard for a disaggregated, virtualized hardware world. We have smashed the traditional monolithic computer/server model to pieces, pooled the hardware resources per type in our data center, and recreated the (virtual) machine. Well, almost – one part we haven’t covered is that we also need a new kind of OS/hypervisor for it, something like LegoOS (academic research) or TidalScale (a commercial vendor).
Cool. But why are we doing this? I’ve covered “the why” already in the previous post “Why and how to smash the data center in pieces?”, so I won’t repeat it all, but here’s the summary:
- Isolate and minimize the impact of hardware failures (in a monolithic computer design, almost every hardware failure brings down the entire machine and all its resources);
- Use and scale resources more efficiently, since you can manage and scale each resource pool independently as needed;
- Load-balance resource utilization and avoid the resource congestion that would lead to performance degradation;
- Enable heterogeneous computing, i.e. make it easier to adopt new hardware innovations in an elegant way.
Let’s add one more “why” before we log off. The main motivation behind the composable/disaggregated data center infrastructure trend is that we are trying (as always) to match and right-size the hardware infrastructure to application resource requirements and deployment models. And application needs keep evolving: more distributed, memory-hungry, and cache-heavy apps; mixed workloads (from batch-like to real-time and everything in between), now including “AI workloads”; and shifting traffic patterns (east-west traffic, high network bandwidth).
So disaggregated/heterogeneous compute enables us to compose (build, expand, renew, …) the resource pools in an elegant and efficient way. It allows us to continuously evolve the same hardware clusters and completely avoid rip-and-replace cycles. Think, for example, about what Google has done: Gmail has been running for more than 20 years now, and who knows how many times Google has completely replaced the entire underlying infrastructure without any global Gmail service downtime. Of course, Gmail’s microservice-based application architecture and SRE deployment & operations practices help :), but disaggregated/heterogeneous computing is the most promising next-generation infrastructure design to look forward to!
Note: There are as many flavors of composable/disaggregated data center architectures as there are shades of gray. This is but one attempt to describe one of the approaches. Special shout-out to Yizhou Shan for his work (http://lastweek.io/) on a systems approach to disaggregated computing (he is also the author of some of the illustrations used).
If you wanna stay tuned for the myths, trends, and old news from the magical world of computing, subscribe to the newsletter!