
VAST Data's storage system architecture

VAST Data, Inc., an interesting new storage company, unveiled their new all-flash storage system today amidst a good amount of hype and fanfare.  There's no shortage of marketing material and trade press coverage about the company and the juiciest features of their storage architecture, so if you want to catch up on what all the talk has been about, that coverage is a good place to start.
The reviews so far are quite sensational in the literal sense since VAST is one of very few storage systems being brought to market that have been designed from top to bottom to use modern storage technologies (containers, NVMe over Fabrics, and byte-addressable non-volatile memory) and tackle the harder challenge of file-based (not block-based) access.

In the interests of grounding the hype in reality, I thought I would share various notes I've jotted down based on my understanding of the VAST architecture.  That said, I have to make a few disclaimers up front:
  1. I have no financial interests in VAST, I am not a VAST customer, I have never tested VAST, and everything I know about VAST has come from just a few conversations with a limited number of people in the company.  This essentially means I have no idea what I'm talking about.
  2. I do not have any NDAs with VAST and none of this material is confidential.  Much of it is from public sources now.  I am happy to provide references where possible.  If you are one of my sources and want to be cited or credited, please let me know.
  3. These views represent my own personal opinions and not those of my employer, sponsors, or anyone else.
With that in mind, what follows is a semi-coherent overview of the VAST storage system as I understand it.  If you read anything that is wrong or misguided, rest assured that it is not intentional.  Just let me know and I will be more than happy to issue corrections (and provide attribution if you so desire).

Relevant Technologies

A VAST storage system is composed of two flavors of building blocks:
  1. JBOFs (VAST calls them "d boxes" or "HA enclosures").  These enclosures contain the storage media itself.
  2. I/O servers (VAST calls them "cnodes," "servers," "gateways" or, confusingly, "compute nodes").  These are the servers that HPC cluster compute nodes talk to in order to perform I/O via NFS or S3.
Tying these two building blocks together is an RDMA fabric of some sort--either InfiniBand or RoCE.  Conceptually, it would look something like this:

Conceptual diagram of how VAST Data's storage system (I/O servers, storage fabric, and JBOFs) might fit into a canonical HPC system.  Interestingly, it strongly resembles old-school block-based SAN architectures.

For the sake of clarity, we'll refer to the HPC compute nodes that run applications and perform I/O through an NFS client as "clients" hereafter.  We'll also assume that all I/O to and from VAST occurs using NFS, but remember that VAST also supports S3.

JBOFs

JBOFs are dead simple, and their only job is to expose each NVMe device attached to them as an NVMe over Fabrics (NVMeoF) target.  They are not true JBOFs, though, because each enclosure does contain (from the VAST spec sheet):
  1. 2x embedded active/active servers, each with two Intel CPUs and the necessary hardware to support failover
  2. 4x 100 gigabit NICs, either operating using RoCE or InfiniBand
  3. 38x 15.36 TB U.2 SSD carriers.  These are actually carriers that take multiple M.2 SSDs.
  4. 18x 960 GB U.2 Intel Optane SSDs
However, they are not intelligent.  They are not RAID controllers, nor do they perform any data motion between the SSDs they host.  They literally serve each device out to the network and that's it.

I/O Servers

I/O servers are where the magic happens.  They are physically discrete servers that:
  1. share the same SAN fabric as the JBOFs and speak NVMeoF on one side, and
  2. share a network with client nodes and talk NFS on the other side.
These I/O servers are completely stateless; all the data stored by VAST is stored in the JBOFs.  The I/O servers have no caches; their job is to turn NFS requests from compute nodes into NVMeoF transfers to JBOFs.  Specifically, they perform the following functions:
  1. Determine which NVMeoF device(s) to talk to to serve an incoming I/O request from an NFS client.  This is done using a hashing function.
  2. Enforce file permissions, ACLs, and everything else that an NFS client would expect.
  3. Transfer data to/from SSDs, and transfer data to/from 3D XPoint drives.
  4. Transfer data between SSDs and 3D XPoint drives.  This happens as part of the regular write path, to be discussed later.
  5. Perform "global compression" (discussed later), rebuilds from parity, and other maintenance tasks.
It is also notable that I/O servers do not have an affinity to specific JBOFs as a result of the hash-based placement of data across NVMeoF targets.  They are all simply stateless worker bees that process I/O requests from clients and pass them along to the JBOFs.  As such, they do not need to communicate with each other or synchronize in any way.
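
As an entirely hypothetical illustration (VAST has not published the details of its hashing scheme), this kind of stateless, hash-based placement might look something like the following sketch, where the target names and replica count are made up:

    import hashlib

    # Hypothetical list of NVMeoF storage-class memory targets; a real I/O
    # server would discover these over the fabric rather than hard-code them.
    XPOINT_TARGETS = ["jbof%d:xpoint%d" % (j, d) for j in range(2) for d in range(18)]

    def place_extent(file_id, extent_offset, replicas=3):
        """Deterministically map a file extent to NVMeoF targets.

        Because every I/O server evaluates the same pure function, they all
        agree on where an extent lives without ever talking to each other.
        """
        key = ("%d:%d" % (file_id, extent_offset)).encode()
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        start = h % len(XPOINT_TARGETS)
        return [XPOINT_TARGETS[(start + i) % len(XPOINT_TARGETS)]
                for i in range(replicas)]

    print(place_extent(file_id=42, extent_offset=0))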

System Composition

Because I/O servers are stateless and operate independently, they can be dynamically added (and removed) from the system at any time to increase or decrease the I/O processing power available to clients.  VAST's position is that the peak I/O performance to the JBOFs is virtually always CPU limited since the data path between CPUs (in the I/O servers) and the storage devices (in JBOFs) uses NVMeoF.  This is a reasonable assertion since NVMeoF is extremely efficient at moving data as a result of its use of RDMA and simple block-level access semantics.

At the same time, this design requires that every I/O server be able to communicate with every SSD in the entire VAST system via NVMeoF.  This means that each I/O server mounts every SSD at the same time; in a relatively small two-JBOF system, this results in 112x NVMe targets on every I/O server.  This poses two distinct challenges:
  1. From an implementation standpoint, this is pushing the limits of how many NVMeoF targets a single Linux host can effectively manage in practice.  For example, a 10 PB VAST system will have over 900 NVMeoF targets mounted on every single I/O server.  There is no fundamental limitation here, but this scale will exercise pieces of the Linux kernel in ways it was never designed to be used.
  2. From a fundamental standpoint, this puts tremendous pressure on the storage network.  Every I/O server has to talk to every JBOF as a matter of course, resulting in a network dominated by all-to-all communication patterns.  This will make performance extremely sensitive to topology, and while I wouldn't expect any issues at smaller scales, high-diameter fat trees will likely see these sensitivities manifest.  The Lustre community turned to fine-grained routing to counter this exact issue on fat trees.  Fortunately, InfiniBand now has adaptive routing that I expect will bring much more forgiveness to this design.
This said, VAST has tested their architecture at impressively large scale and has an aggressive scale-out validation strategy.

Shared-everything consistency

Mounting every block device on every server may also sound like anathema to anyone familiar with block-based SANs, and generally speaking, it is.  NVMeoF (and every other block-level protocol) does not really have locking, so if a single device is mounted by two servers, it is up to those servers to communicate with each other to ensure they aren't attempting to modify the same blocks at the same time.  Typical shared-block configurations manage this by simply assigning exclusive ownership of each drive to a single server and relying on heartbeating or quorum (e.g., in HA enclosures or GPFS) to decide when to change a drive's owner.  StorNext (formerly CVFS) allows all clients to access all devices, but it uses a central metadata server to manage locks.

VAST can avoid a lot of these problems by simply not caching any I/Os on the I/O servers and instead passing NFS requests through as NVMeoF requests.  This is not unlike how parallel file systems like PVFS (now OrangeFS) avoided the lock contention problem; not using caches dramatically reduces the window of time during which two conflicting I/Os can collide.  VAST also claws back some of the latency penalties of doing this sort of direct I/O by issuing all writes to nonvolatile memory instead of flash; this will be discussed later.

For the rare cases where two I/O servers are asked to change the same piece of data at the same time, though, there is a mechanism by which an extent of a file (which is on the order of 4 KiB) can be locked.  Before issuing its update, an I/O server flips a lock bit for that extent in the JBOF's memory using an atomic RDMA operation, which serializes overlapping I/Os to the same byte range.
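
As a conceptual sketch of what that locking protocol might look like (the JBOF memory and the RDMA atomic are emulated here with an in-process mutex; none of this reflects VAST's actual implementation):

    import threading

    class JbofMemory:
        """Stand-in for a region of JBOF memory reachable via RDMA atomics."""
        def __init__(self):
            self._lock_bits = {}              # extent id -> 0 (free) or 1 (held)
            self._mutex = threading.Lock()    # emulates hardware atomicity

        def compare_and_swap(self, extent_id, expected, desired):
            """Return the old value; swap in `desired` only if it matched."""
            with self._mutex:
                current = self._lock_bits.get(extent_id, 0)
                if current == expected:
                    self._lock_bits[extent_id] = desired
                return current

    def locked_update(jbof, extent_id, issue_nvmeof_write):
        # Spin until the lock bit atomically flips from 0 to 1 for us.
        while jbof.compare_and_swap(extent_id, expected=0, desired=1) != 0:
            pass
        try:
            issue_nvmeof_write()              # the potentially overlapping update
        finally:
            jbof.compare_and_swap(extent_id, expected=1, desired=0)   # release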

VAST also uses redirect-on-write to ensure that writes are always consistent.  If a JBOF fails before an I/O is complete, presumably any outstanding locks evaporate since they are resident only in RAM.  Any changes that were in flight simply get lost because the metadata structure that describes the affected file's layout only points to updated extents after they have been successfully written.  Again, this redirect-on-complete is achieved using an atomic RDMA operation, so data is always consistent. VAST does not need to maintain a write journal as a result.
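
The redirect-on-write behavior might be modeled like this; again, this is a conceptual sketch rather than VAST's real metadata format:

    class ExtentMap:
        """Toy file layout: extent index -> location of its current version."""
        def __init__(self):
            self.locations = {}

    def redirect_on_write(extent_map, extent_idx, new_location, write_new_version):
        # 1. Write the new version of the extent to a previously unused location.
        write_new_version(new_location)
        # 2. Only after that write completes does the extent map get repointed
        #    (atomically, in the real system).  A crash before this step leaves
        #    the map referencing the old, consistent version -- no journal needed.
        extent_map.locations[extent_idx] = new_location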

It is not clear to me what happens to locks in the event that an I/O server fails while it has outstanding I/Os.  Since I/O servers do not talk to each other, there is no means by which they can revoke locks or probe each other for timeouts.  Similarly, JBOFs are dumb, so they cannot expire locks.

The VAST write path

I think the most meaningful way to demonstrate how VAST employs parity and compression while maintaining low latency is to walk through each step of the write path and show what happens between the time an application issues a write(2) call and the time that write call returns.

First, an application on a compute node issues a write(2) call on an open file that happens to reside on an NFS mount that points to a VAST server.  That write flows through the standard Linux NFS client stack and eventually results in an NFS RPC being sent over the wire to a VAST server.  Because VAST clients use the standard Linux NFS client, there are a few standard limitations.  For example,
  1. There is no parallel I/O from the client.  A single client cannot explicitly issue writes to multiple I/O servers.  Instead, some sort of load balancing technique must be inserted between the client and servers.
  2. VAST violates POSIX because it only ensures NFS close-to-open consistency.  If two compute nodes try to modify the same 4 KiB range of the same file at the same time, the result will be corrupt data.  VAST's server-side locking cannot prevent this because the conflict happens on the client side.  The best way around this is to force all I/O destined for a VAST file system to use direct I/O (e.g., open with O_DIRECT, as sketched below).
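
As an example of the O_DIRECT workaround, a minimal Linux-only sketch in Python might look like the following; the mount point is made up, and O_DIRECT requires block-aligned buffers, sizes, and offsets:

    import mmap, os

    path = "/mnt/vast/output.dat"         # hypothetical VAST-backed NFS mount
    buf = mmap.mmap(-1, 4096)             # anonymous mmap gives page-aligned memory
    buf.write(b"x" * 4096)

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)                 # bypasses the NFS client's page cache
    finally:
        os.close(fd)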
Pictorially, it might look something like this:

Step 1 of VAST write path: client issues a standard NFS RPC to a VAST I/O server
Then the VAST I/O server receives the write RPC and has to figure out to which NVMeoF device(s) the data should be written.  This is done by first determining on which NVMe device the appropriate file's metadata is located.  This metadata is stored in B-tree-like data structures with a very wide fan-out ratio whose roots are mapped to physical devices algorithmically.  Once an I/O server has algorithmically determined which B-tree holds a specific file's metadata, it traverses that tree to find the file and, from there, the locations of that file's extents.  The majority of these metadata trees live in 3D XPoint, but very large file systems may have their outermost levels stored in NAND.

A key aspect of VAST's architecture is that writes always land on 3D XPoint first; this narrows down the possible NVMeoF targets to those which are storage-class memory devices.

Pictorially, this second step may look something like this:

Step 2 of VAST write path: I/O server forwards write to 3D XPoint devices.  Data is actually triplicated at this point for reasons that will be explained later.
VAST uses 3D XPoint for two distinct roles:
  1. Temporarily store all incoming writes
  2. Store the metadata structures used to describe files and where the data for files reside across all of the NVMe devices
VAST divides the 3D XPoint used for role #1 into buckets.  Buckets group data based on how long that data is expected to persist before being erased: incoming writes that will be written once and never erased go into one bucket, while incoming writes that may be overwritten (erased) in a very short time go into another.  VAST is able to make educated guesses about this because it knows many user-facing attributes of the file to which incoming writes are directed (its parent directory, extension, owner, group, etc.), and it tracks file volatility over time.
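
VAST has not published its heuristics, but the bucketing decision might look conceptually like this; the suffix lists and thresholds below are entirely made up:

    from pathlib import PurePosixPath

    # Made-up heuristics; VAST has only said it uses file attributes and
    # observed history to predict how long data will live.
    SHORT_LIVED_SUFFIXES = {".tmp", ".lock", ".o"}
    LONG_LIVED_SUFFIXES = {".tar", ".bam", ".h5"}

    def choose_bucket(path, observed_overwrite_rate):
        """Group an incoming write with other data expected to die around the
        same time, so the eventual NAND erase block ages uniformly."""
        suffix = PurePosixPath(path).suffix
        if suffix in SHORT_LIVED_SUFFIXES or observed_overwrite_rate > 0.5:
            return "volatile-bucket"       # likely to be overwritten soon
        if suffix in LONG_LIVED_SUFFIXES or observed_overwrite_rate < 0.01:
            return "write-once-bucket"     # expected to persist untouched
        return "default-bucket"

    print(choose_bucket("/scratch/job123/restart.tmp", observed_overwrite_rate=0.8))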

Data remains in a 3D XPoint bucket until that bucket is full.  A bucket is considered full when its contents can be written out to the NAND SSDs as entire SSD erase blocks (which VAST claims can be on the order of a gigabyte in size).  Since JBOFs are dumb, this actually results in I/O servers reading the full bucket back out of 3D XPoint:

Step 3 of VAST write path: Once sufficient writes have been received to fill a bucket and create a full stripe, the I/O server must read it from 3D XPoint.  Note that this diagram may be misleading; it is unclear if a single bucket resides on a single 3D XPoint device, or if a bucket is somehow sharded.  My guess is the former (as shown).
The I/O server then bounces that bucket back out to NAND devices:

Step 4 of VAST write path: Once a full stripe has been formed in 3D XPoint and the I/O node has read it into DRAM, it actually writes that stripe down across many NAND devices.  Again, this diagram is probably inaccurate as a result of my own lack of understanding; the relationship between a bucket (which maps to a single SSD's erase block) and a stripe (which must touch N+M SSDs) is unclear to me.
By writing out an entire erase block at once, VAST avoids the need for the SSD to garbage collect and amplify writes, since erase blocks are never only partially written.  Erase blocks are also presumably rarely (or never?) only partially erased either; this is a result of
  1. the combined volatility-based bucketing of data (similarly volatile data tends to reside in the same erase block), and
  2. VAST's redirect-on-write nature (data is never overwritten; updated data is simply written elsewhere and the file's metadata is updated to point to the new data).
Because VAST relies on cheap consumer NAND SSDs, the data is not safe in the event of a power loss even after the NAND SSD claims the data is persisted.  As a result, VAST then forces each NAND SSD to flush its internal caches to physical NAND.  Once this flush command returns, the SSDs have guaranteed that the data is power fail-safe.  VAST then deletes the bucket contents from 3D XPoint:

Step 5 of the VAST write path: Once data is truly persisted and safe in the event of power loss, VAST purges the original copy of that bucket that resides on the 3D XPoint.
The metadata structures for all affected files are updated to point at the version of the data that now resides on NAND SSDs, and the bucket is free to be filled by the next generation of incoming writes.
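
Putting steps 3 through 5 together, the destage logic might be summarized as follows.  This is a sketch under my own assumptions, and the exact relationship between buckets and stripes is something I do not fully understand:

    ERASE_BLOCK_BYTES = 1 << 30            # VAST claims erase blocks of ~1 GiB

    class Bucket:
        """In-memory stand-in for a 3D XPoint write bucket."""
        def __init__(self):
            self.extents = []              # (extent_id, payload) pairs received so far
            self.bytes_used = 0

        def append(self, extent_id, payload):
            self.extents.append((extent_id, payload))
            self.bytes_used += len(payload)

    def destage_if_full(bucket, write_stripe_to_nand, flush_nand, repoint_metadata):
        """Steps 3-5 of the write path: read back, stripe to NAND, flush, purge."""
        if bucket.bytes_used < ERASE_BLOCK_BYTES:
            return False                   # keep accumulating writes in 3D XPoint
        stripe = [payload for _, payload in bucket.extents]   # step 3: read back
        write_stripe_to_nand(stripe)       # step 4: only ever whole erase blocks
        flush_nand()                       # force the SSDs to persist their caches
        repoint_metadata(bucket.extents)   # extent maps now reference the NAND copies
        bucket.extents.clear()             # step 5: purge the 3D XPoint copy
        bucket.bytes_used = 0
        return True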

Data Protection

These large buckets also allow VAST to use extremely wide striping for data protection.  As writes come in and fill buckets, large stripes are also being built with a minimum of 40+4 parity protection.  Unlike in a traditional RAID system where stripes are built in memory, VAST's use of nonvolatile memory (3D XPoint) to store partially full buckets allows very wide stripes to be built over larger windows of time without exposing the data to loss in the event of a power failure.  Partial stripe writes never happen because, by definition, a stripe is only written down to flash once it is full.

Bucket sizes (and by extension, stripe sizes) are variable and dynamic.  VAST will opportunistically write down a stripe as erase blocks become available.  As the number of NVMe devices in the VAST system increases (e.g., more JBOFs are installed), stripes can become wider.  This is advantageous when one considers the erasure coding scheme that VAST employs; rather than use a Reed-Solomon code, they have developed their own parity algorithm that allows blocks to be rebuilt from only a subset of the stripe.  An example stated by VAST is that a 150+4 stripe only requires 25% of the remaining data to be read to rebuild.  I don't fully understand how this works.
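
Whatever the details of the rebuild algorithm, the capacity benefit of wider stripes is straightforward arithmetic; a quick sketch:

    def parity_overhead(data_shards, parity_shards):
        """Fraction of raw capacity consumed by parity in an N+M stripe."""
        return parity_shards / (data_shards + parity_shards)

    for n, m in [(40, 4), (150, 4)]:
        print("%d+%d -> %.1f%% of capacity spent on parity" %
              (n, m, 100 * parity_overhead(n, m)))
    # 40+4 spends about 9.1% of capacity on parity; 150+4 only about 2.6%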

To summarize, parity-protected stripes are slowly built in storage-class memory over time from bits of data that are expected to be erased at roughly the same time.  Once a stripe is fully built in 3D XPoint, it is written down to the NAND devices.  As a reminder, I/O servers are responsible for moderating all of this data movement and parity generation; the JBOFs are dumb and simply offer up the 3D XPoint targets.

To protect data as stripes are being built, the contents of the 3D XPoint layer are simply triplicated.  This is to say that every partially built stripe's contents appear on three different 3D XPoint devices.

Performance Expectations

This likely has a profound effect on the write performance of VAST; if a single 1 MB write is issued by an NFS client, the I/O server must write 3 MB of data to three different 3D XPoint devices.  While this should not affect latency, since the I/O server can issue NVMeoF writes to multiple JBOFs concurrently, it does mean that the NICs facing the back-end InfiniBand fabric must be able to inject data three times as fast as data arrives from the front-end, client-facing network.  Otherwise, VAST will carry an intrinsic 3x performance penalty on writes versus reads.
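
To put rough (and entirely hypothetical) numbers on that:

    REPLICAS = 3                       # every partially built stripe is triplicated

    def backend_write_rate(front_end_write_rate_gbs):
        """NVMeoF write traffic an I/O server must inject per unit of NFS writes received."""
        return front_end_write_rate_gbs * REPLICAS

    front_end = 10                     # hypothetical 10 GB/s of incoming NFS writes
    print("%d GB/s of client writes -> %d GB/s of backend 3D XPoint writes"
          % (front_end, backend_write_rate(front_end)))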

There are several factors that will alter this in practice:
  • Both 3D XPoint SSDs and NAND SSDs have higher read bandwidth than write bandwidth as a result of the power consumption associated with writes.  This will further increase the 3:1 read:write performance penalty.
  • VAST always writes to 3D XPoint but may often read from NAND.  This closes the gap in theory, since 3D XPoint is significantly faster at both reads and writes than NAND is at reads in most cases.  However the current 3D XPoint products on the market are PCIe-attached and limited to PCIe Gen3 speeds, so there is not a significant bandwidth advantage to 3D XPoint writes vs. NAND reads.
It is also important to point out that VAST has yet to publicly disclose any performance numbers.  However, using replication to protect writes is perhaps the only viable strategy to deliver extremely high IOPS without sacrificing data protection.  WekaIO, which also aims to deliver extremely high IOPS, showed a similar 3:1 read:write performance skew in their IO-500 submission in November.  While WekaIO uses a very different approach to achieving low latency at scale, their benchmark numbers indicate that scalable file systems that optimize for IOPS are likely to sacrifice write throughput to achieve this.  VAST's architecture and choice to replicate writes is in line with this expectation, but until VAST publishes performance numbers, this is purely speculative.  I would like to be proven wrong.

Other Bells and Whistles

The notes presented above are only a small part of the full VAST architecture, and since I am no expert on VAST, I'm sure there's even more that I don't realize I don't know or fully understand.  That said, I'll highlight a few examples of which I am tenuously aware:

Because every I/O server sees every NVMe device, it can perform global compression.  Typical compression algorithms are designed only to compress adjacent data within a fixed block size, which means similar but physically disparate blocks cannot be reduced.  VAST tracks a similarity value for extents in its internal metadata and will group these similar extents before compressing them.  I envision this working something like a Burrows-Wheeler transform (though it is definitely not one), and it conceptually combines the best features of compression and deduplication.  I have to assume this compression happens somewhere in the write path (perhaps as stripes are written to NAND), but I don't understand this in any detail.
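
A crude way to see why grouping similar-but-distant extents before compression helps is to compare per-extent compression against compression of grouped extents.  This toy example uses ordinary zlib and made-up data; it is not VAST's algorithm:

    import zlib

    # Toy "extents": two nearly identical records and one unrelated one.
    a = b"temperature,pressure,humidity\n" * 100
    b = b"temperature,pressure,humidity\n" * 99 + b"temperature,pressure,dewpoint\n"
    c = bytes(range(256)) * 12

    # Block-local compression cannot exploit redundancy between a and b because
    # they are compressed independently (as if they lived in distant blocks).
    independent = sum(len(zlib.compress(x)) for x in (a, b, c))

    # Grouping the two similar extents first lets the compressor share their
    # redundancy, which is conceptually what similarity-based grouping buys.
    grouped = len(zlib.compress(a + b)) + len(zlib.compress(c))

    print("compressed independently:", independent, "bytes")
    print("similar extents grouped :", grouped, "bytes")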

The exact compression algorithm is one of VAST's own design, and it is not block-based since VAST does not have a fixed block size.  This means that decompression is also quite different from block-based compression; according to VAST, their algorithm can decompress a local subset of data, so reads do not require global decompression.  The net result is that the read performance of compressed data is not significantly compromised.  VAST has a very compelling example in which they compressed data that was already compressed and saw significant additional capacity savings as a result of the global nature of their algorithm.  While I normally discount claims of high compression ratios since they never hold up for scientific data, the conceptual underpinnings of VAST's approach to compression sound promising.

VAST is also very closely tied to byte-addressable nonvolatile storage from top to bottom, and much of this is a result of their B-tree-based file system metadata structure.  They refer to their underlying storage substrate as an "element store" (which I imagine to be similar to a key-value store), and it sounds like it is designed to store a substantial amount of metadata per file.  In addition to standard POSIX metadata and the pointers to data extents on various NVMe devices, VAST also stores user metadata (in support of their S3 interface) and internal metadata (such as heuristics about file volatility, versioning for continuous snapshots, etc).  This element store API is not exposed to customers, but it sounds like it is sufficiently extensible to support a variety of other access APIs beyond POSIX and S3.

Take-away Messages

VAST is an interesting new all-flash storage system that resulted from taking a green-field approach to storage architecture.  It uses a number of new technologies (storage-class memory/3D XPoint, NAND, NVMe over fabrics) in intellectually satisfying ways, and builds on them using a host of byte-granular algorithms.  It looks like it is optimized for both cost (in its intelligent optimization of flash endurance) and latency (landing I/Os on 3D XPoint and using triplication) which have been traditionally difficult to optimize together.

Its design does rely on an extremely robust backend RDMA fabric, and the way in which every I/O server must mount every storage device sounds like a path to scalability problems--both in terms of software support in the Linux NVMeoF stack and fundamental sensitivities to topology inherent in large, high-diameter RDMA fabrics.  The global all-to-all communication patterns and choice to triplicate writes make the back-end network critically important to the overall performance of this architecture.

That said, the all-to-all ("shared everything") design of VAST brings a few distinct advantages as well.  As the system is scaled to include more JBOFs, the global compression scales as well and can recover an increasing amount of capacity.  Similarly, data durability increases as stripes can be made wider and be placed across different failure domains.  In this sense, the efficiency of the system increases as it gets larger due to the global awareness of data.  VAST's choice to make the I/O servers stateless and independent also adds the benefit of being able to scale the front-end capability of the system independently of the back-end capacity.  Provided the practical and performance challenges of scaling out described in the previous paragraph do not manifest in reality, this bigger-is-better design is an interesting contrast to the mass storage systems of today which, at best, do not degrade as they scale out.  Unfortunately, VAST has not disclosed any performance or scaling numbers, so the proof will be in the pudding.

However, VAST has hinted that the costs are "one fifth to one eighth" of enterprise flash today; by their own estimates of today's cost of enterprise flash, this translates to between $0.075 and $0.12 per gigabyte of flash when deployed in a VAST system.  This remains 3x-5x more expensive than spinning disk today, but the cost of flash is dropping far faster than the cost of hard drives, so the near-term future may truly make VAST cost-comparable to disk.  As flash prices continue to plummet, though, the VAST cost advantage over datacenter flash may become less dramatic, but their performance architecture will remain compelling when compared to a traditional disk-oriented networked file system.
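
Working backwards from those figures as a quick sanity check (the disk price below is my own assumption, not something VAST has stated):

    vast_low, vast_high = 0.075, 0.12      # $/GB implied for a VAST deployment
    print(vast_low * 8, vast_high * 5)     # implied enterprise flash baseline: ~$0.60/GB

    hdd_cost = 0.025                       # assumed $/GB for nearline disk today
    print(vast_low / hdd_cost, vast_high / hdd_cost)   # roughly 3x to 5x the cost of disk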

As alluded above, VAST is not the first company to develop a file-based storage system designed specifically for flash, and they share many similar architectural design patterns with their competition.  This is creating gravity around a few key concepts:
  • Both flash and RDMA fabrics handle kilobyte-sized transfers with grace, so the days of requiring megabyte-sized I/Os to achieve high bandwidth are nearing an end.
  • The desire to deliver high IOPS makes replication an essential part of the data path, which will skew I/O bandwidth towards reads.  This maps well to read-intensive workloads such as those generated by AI, but it does not bode as well for the write-intensive workloads of traditional modeling and simulation.
  • Reserving CPU resources exclusively for driving I/O is emerging as a requirement to get low-latency and predictable I/O performance with kilobyte-sized transfers.  Although not discussed above, VAST uses containerized I/O servers to isolate performance-critical logic from other noise on the physical host.  This pattern maps well to the notion that in exascale, there will be an abundance of computing power relative to the memory bandwidth required to feed computations.
  • File-based I/O is not entirely at odds with very low-latency access, but this file-based access is simply one of many interfaces exposed atop a more flexible key-value type of data structure.  As such, as new I/O interfaces emerge to serve the needs of extremely latency-sensitive workloads, these flexible new all-flash storage systems can simply expose their underlying performance through other non-POSIX APIs.
Finally, if you've gotten this far, it is important to underscore that I am in no way speaking authoritatively about anything above.  If you are really interested in VAST or related technologies, don't take it from me; talk to the people and companies developing them directly.

ISC'19 Recap

I was fortunate enough to attend the ISC HPC conference this year, and it was a delightful experience from which I learned quite a lot.  For the benefit of anyone interested in what they have missed, I took the opportunity on the eleven-hour flight from Frankfurt to compile my notes and thoughts over the week.

I spent most of my time in and around the sessions, BOFs, and expo focusing on topics related to I/O and storage architecture, so that comprises the bulk of what I’ll talk about below.  Rather than detail the conference chronologically as I did for SC’18 though, I’ll only mention a few cross-cutting observations and trends here.

I’ll also not detail the magnificent HPC I/O in the Data Center workshop here, but anyone reading this who cares about storage or I/O should definitely flip through the slides on the HPC-IODC workshop website!  This year HPC-IODC and WOPSSS merged their programs, resulting in a healthy mix of papers (in both CS research and applied research), expert talks, and fruitful discussion.

High-level observations

As is often the case for ISC, there were a few big unveilings early in the week.  Perhaps the largest was the disclosure of several key architectural details surrounding the Aurora exascale system to be deployed at Argonne in 2021.  TACC’s Frontera system, a gigantic Dell cluster stuffed with Intel Cascade Lake Xeons, made its debut on the Top500 list as well.  In this sense, Intel was in good form this year.  And Intel has to be, since among the handful of publicly disclosed pre-exascale (Perlmutter and Fugaku) and exascale (Aurora and Frontier) systems, only Aurora will be using Intel parts.

The conference also had an anticipatory undertone as these pre-exascale and exascale systems begin coming into focus.  The promise of ARM as a viable HPC processor technology is becoming increasingly credible as Sandia’s Astra machine, an all-ARM cluster integrated by HPE, appeared throughout the ISC program.  These results are paving the way for Fugaku (the “post-K” machine), which will prove ARM and its SVE instruction set at extreme scale.

Also contributing to the anticipatory undertone was a lot of whispering that occurred outside of the formal program.  The recently announced acquisition of Cray by HPE was the subject of a lot of discussion and conjecture, but it was clear that the dust was far from settled and nobody purported to have a clear understanding of how this would change the HPC market.  There was also some whispering about a new monster Chinese system that was on the cusp of making this year’s ISC Top500.  Curiously, the Wuxi supercomputer center (home of the Sunway TaihuLight system) had a booth on the show floor, but it was completely vacant.

Also noticeably absent from the show floor was NVIDIA, although they certainly sent engineers to participate in the program.  By comparison, AMD was definitely present, although they were largely promoting the impending launch of Rome rather than their GPU lineup.  A number of HPC solutions providers were excited about Rome because of both high customer demand and promising early performance results, and there wasn’t a single storage integrator with whom I spoke that wasn’t interested in what doors will open with an x86 processor and a PCIe Gen4 host interface.

Intel disclosures about Aurora 2021

Perhaps the biggest news of the week was a “special event” presentation given by Intel’s Rajeeb Hazra which disclosed a number of significant architectural details around the Aurora exascale system being deployed at Argonne National Laboratory in 2021.

Nodes will be comprised of Intel Xeon CPUs and multiple Intel GPUs

Intel has confirmed that Aurora will be built on Intel-designed general-purpose GPUs based on the “Xe” architecture with multiple GPUs per node.  With this disclosure and the knowledge that nodes will be connected with Cray’s Slingshot interconnect, it is now possible to envision what a node might look like.  Furthermore, combining the disclosure of a high GPU:CPU ratio, the Aurora power budget, and some vague guessing at the throughput of a 2021 GPU narrows down the number of nodes that we may expect to see in Aurora.

Although no specific features of the Intel GPUs were disclosed, Intel was also promoting their new AVX512-VNNI instructions to position their latest top-bin Xeon cores as the best option for inference workloads.  Coupled with what we can assume will be highly capable GPUs for training acceleration, Intel is building a compelling story around their end-to-end AI portfolio.  Interestingly, news that NVIDIA is partnering with ARM dropped this past week, but NVIDIA’s noted absence from ISC prevented a comparable ARM-NVIDIA AI solution from shining through.

System will have over 10 PB of system memory

Aurora will have a significant amount of memory presumably comprised of a combination of HBM, DDR, and/or Optane persistent memory.  The memory capacity is markedly higher than that of the AMD-based Frontier system, suggesting that Intel may be leveraging Optane persistent memory (which has a lower cost per bit than DDR) to supplement the HBM that is required to feed such a GPU-heavy architecture.

The storage subsystem will deliver over 230 PB of capacity at over 25 TB/sec

Perhaps the most interesting part of Aurora is its I/O subsystem, which will use an object store and an all-solid-state storage architecture instead of the traditional parallel file system.  This will amount to 230 PB of usable flash capacity that can operate in excess of 25 TB/sec.  Although I’ll describe this storage architecture in more depth below, combining the performance point of 25 TB/sec with the aforementioned high GPU:CPU ratio suggests that each compute node will be able to inject a considerable amount of I/O traffic into the fabric.  This points to very capable Xeon cores and very capable NICs.

The programming model for the system will utilize SYCL

Intel has announced that its “oneAPI” effort relies on the Khronos Group’s SYCL standard for heterogeneous programming in C++ rather than the incumbent choices of OpenMP, OpenACC, or OpenCL.  This does not mean that OpenMP, OpenACC, and/or OpenCL won’t be supported, but it does reveal where Intel intends to put all of its efforts in enabling its own GPUs and FPGAs for HPC.  They further emphasized their desire to keep these efforts open, standards-based, and portable, undoubtedly demonstrating a stark contrast with the incumbent GPU vendors.  This is an interesting long-term differentiator, but time will tell whether SYCL is able to succeed where OpenCL has failed and gain a foothold in the HPC ecosystem.

DAOS will be HPC's gateway drug to object stores

DAOS (the “Distributed Asynchronous Object Store,” pronounced like it’s spelled) is an object store that Intel has been developing for the better part of a decade in collaboration with the US Department of Energy.  The DAOS name has become overloaded in recent years as a result of it changing scope, focus, and chief architects, and the current version is quite different from the original DAOS that was prototyped as a part of the DOE Fast Forward program (e.g., only one of three original DAOS components, DAOS-M, survives).  A few key features remain the same, though:
  • It remains an object store at its core, but various middleware layers will be provided to expose alternate access APIs and semantics
  • It is specifically designed to leverage Intel Optane persistent memory and NAND-based flash to deliver extremely high IOPS in addition to high streaming bandwidth
  • It relies on user-space I/O via Mercury and SPDK to enable its extreme I/O rates
  • Its storage architecture is still based on a hierarchy of servers, pools, containers, and objects (a toy illustration of this hierarchy follows below)
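
To make that hierarchy concrete, here is a toy model in plain Python; these classes are purely illustrative and have nothing to do with the real libdaos API:

    from dataclasses import dataclass, field

    @dataclass
    class DaosObject:
        oid: int
        records: dict = field(default_factory=dict)   # key/value or array data

    @dataclass
    class Container:
        label: str
        objects: dict = field(default_factory=dict)   # oid -> DaosObject

    @dataclass
    class Pool:
        label: str
        containers: dict = field(default_factory=dict)

    @dataclass
    class Server:
        pools: dict = field(default_factory=dict)

    srv = Server()
    srv.pools["tank"] = Pool("tank")
    srv.pools["tank"].containers["ckpt"] = Container("ckpt")
    srv.pools["tank"].containers["ckpt"].objects[1] = DaosObject(1, {"step": 42})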
Object stores have historically not found success in HPC due to HPC apps’ general dependence on POSIX-based file access for I/O, but the Aurora DAOS architecture cleverly bridges this gap.  I was lucky enough to run into Johann Lombardi, the DAOS chief architect, at the Intel booth, and he was kind enough to walk me through a lot of the details.

DAOS will provide seamless integration with a POSIX namespace by using Lustre’s new foreign layout feature which allows an entity in the Lustre namespace to be backed by something that is not managed by Lustre.  In practice, a user will be able to navigate a traditional file namespace that looks like any old Lustre file system using the same old ls and cd commands.  However, some of the files or directories in that namespace may be special DAOS objects, and navigating into a DAOS-based object transparently switches the data path from one that uses the traditional Lustre client stack to one that uses the DAOS client stack.  In particular,
  • Navigating into a directory that is backed by a DAOS container will cause the local DAOS agent to mount that DAOS container as a POSIX namespace using FUSE and junction it into the Lustre namespace.  Files and subdirectories contained therein will behave as regular POSIX files and subdirectories for the most part, but they will only honor a subset of the POSIX consistency semantics.
  • Accessing a file that is backed by a DAOS container (such as an HDF5 file) will cause the client to access the contents of that object through whatever API and semantics the DAOS adapter for that container format provides.
DAOS also includes a preloadable library which allows performance-sensitive applications to bypass the FUSE client entirely and map POSIX API calls to DAOS native API calls.  For applications that use middleware such as HDF5 or MPI-IO, I/O will be able to entirely bypass the POSIX emulation layer and get the highest performance through DAOS-optimized backends.  In the most extreme cases, applications can also write directly against the DAOS native object API to control I/O with the finest granularity, or use one of DAOS's addon APIs that encapsulate other non-file access methods such as key-value or array operations.

A significant amount of this functionality is already implemented, and Intel was showing DAOS performance demos at its booth that used both IOR (using the DAOS-native backend) and Apache Spark:



The test hardware was a single DAOS server with Intel Optane DIMMs and two Intel QLC NAND SSDs and demonstrated over 3 GB/sec on writes and over a million read IOPS on tiny (256-byte) transfers.  Johann indicated that their testbed hardware is being scaled up dramatically to match their extremely aggressive development schedule, and I fully expect to see performance scaling results at SC this November.

This is all a far cry from the original Fast Forward DAOS, and this demo and discussion on the show floor was the first time I felt confident that DAOS was not only a good idea, but it was a solution that can realistically move HPC beyond the parallel file system.  Its POSIX compatibility features and Lustre namespace integration provide enough familiarity and interoperability to make it something usable for the advanced HPC users who will be using the first exascale machines.

At the same time, it applies a number of new technologies in satisfying ways (Mercury for user-space network transport, GIGA+ for subtree sharding, Optane to coalesce tiny I/Os, ...) that, in most ways, puts it at technological parity with other high-performance all-flash parallel storage systems like WekaIO and VAST.  It is also resourced at similar levels, with DOE and Intel investing money and people in DAOS at levels comparable to the venture capital that has funded the aforementioned competitors.  Unlike its competitors though, it is completely open-source and relies on standard interfaces into hardware (libfabric, SPDK) which gives it significant flexibility in deployment.

As with everything exascale, only time will tell how DAOS works in practice.  There are plenty of considerations peripheral to performance (data management policies, system administration, and the like) that will also factor into the overall viability of DAOS as a production, high-performance storage system.  But so far DAOS seems to have made incredible progress in the last few years, and it is positioned to shake up the HPC I/O discussion come 2021.

The Cloud is coming for us

This ISC also marked the first time I felt that the major cloud providers were converging on a complete HPC solution that could begin eroding campus-level and mid-range HPC.  Although application performance in the cloud has historically been the focus of most HPC-vs-cloud debate, compute performance is largely a solved problem in the general sense.  Rather, data—its accessibility, performance, and manageability—has been the single largest barrier between most mid-range HPC users and the cloud.  The convenience of a high-capacity, persistent, shared namespace is a requirement in all HPC environments, but there have historically been no painless ways to produce this environment in the cloud.

AWS was the first to the table with a solution in Amazon FSx, which is a managed Lustre-as-a-service that makes it much easier to orchestrate an HPC workflow that relies on a high-performance, high-capacity, shared file system.  This has prompted the other two cloud vendors to come up with competing solutions:  Microsoft Azure’s partnership with Cray is resulting in a ClusterStor Lustre appliance in the cloud, and Google Cloud will be offering DDN's EXAScaler Lustre appliances as a service.  And Whamcloud, the company behind Lustre, offers its own Lustre Cloud Edition on all three major cloud platforms.

In addition to the big three finally closing this gap, a startup called Kmesh burst on to the I/O scene at ISC this year and is offering a cloud-agnostic solution to providing higher-touch parallel file system integration and management in the cloud for HPC.  Vinay Gaonkar, VP of Products at Kmesh, gave insightful presentations at several big I/O events during the week that spoke to the unique challenges of designing Lustre file systems in a cloud ecosystem.  While architects of on-prem storage for HPC are used to optimizing for price-performance on the basis of purchasing assets, optimizing price-performance from ephemeral instance types often defies conventional wisdom; he showed that instance types that may be considered slow on a computational basis may deliver peak I/O performance at a lower cost than the beefiest instance available:


Vinay's slides are available online and offer a great set of performance data for high-performance storage in the public clouds.

The fact that there is now sufficient market opportunity to drive these issues to the forefront of I/O discussion at ISC is an indicator that the cloud is becoming increasingly attractive to users who need more than simple high-throughput computing resources.

Even with these sorts of parallel file systems-as-a-service offerings though, there are still non-trivial data management challenges when moving on-premise HPC workloads into the cloud that result from the impedance mismatch between scientific workflows and the ephemeral workloads for which cloud infrastructure is generally designed.  At present, the cost of keeping active datasets on a persistent parallel file system in the cloud is prohibitive, so data must continually be staged between an ephemeral file-based working space and long-term object storage.  This is approximately analogous to moving datasets to tape after each step of a workflow, which is unduly burdensome to the majority of mid-scale HPC users.

However, such staging and data management issues are no longer unique to the cloud; as I will discuss in the next section, executing workflows across multiple storage tiers is no longer a problem unique to the biggest HPC centers.  The solutions that address the burdens of data orchestration for on-premise HPC are likely to also ease the burden of moving modest-scale HPC workflows entirely into the cloud.

Tiering is no longer only a problem of the rich and famous

Intel started shipping Optane persistent memory DIMMs earlier this year, and the rubber is now hitting the road as far as figuring out what I/O problems it can solve at the extreme cutting edge of HPC.  At the other end of the spectrum, flash prices have now reached a point where meat-and-potatoes HPC can afford to buy it in quantities that can be aggregated into a useful tier.  These two factors resulted in a number of practical discussions about how tiering can be delivered to the masses in a way that balances performance with practicality.

The SAGE2 project featured prominently at the high-end of this discussion.  Sai Narasimhamurthy from Seagate presented the Mero software stack, which is the Seagate object store that is being developed to leverage persistent memory along with other storage media.  At a distance, its goals are similar to those of the original DAOS in that it provides an integrated system that manages data down to a disk tier.  Unlike the DAOS of today though, it takes on the much more ambitious goal of providing a PGAS-style memory access model into persistent storage.

On the other end of the spectrum, a number of new Lustre features are rapidly coalescing into the foundation for a capable, tiered storage system.  At the Lustre/EOFS BOF, erasure coded files were shown on the roadmap for the Lustre 2.14 release in 2Q2020.  While the performance of erasure coding probably makes it prohibitive as the default option for new files on a Lustre file system, erasure coding in conjunction with Lustre’s file-level replication will allow a Lustre file system to store, for example, hot data in an all-flash pool that uses striped mirrors to enable high IOPS and then tier down cooler data to a more cost-effective disk-based pool of erasure-coded files.

In a similar vein, Andreas Dilger also discussed future prospects for Lustre at the HPC I/O in the Data Center workshop and showed a long-term vision for Lustre that is able to interact with both tiers within a data center and tiers across data centers:



Many of these features already exist and serve as robust building blocks from which a powerful tiering engine could be crafted.

Finally, tiering took center stage at the Virtual Institute for I/O and IO-500 BOF at ISC with the Data Accelerator at Cambridge beating out OLCF Summit as the new #1 system.  A key aspect of Data Accelerator’s top score arose from the fact that it is an ephemeral burst buffer system; like Cray DataWarp, it dynamically provisions parallel file systems for short-term use.  As a result of this ephemeral nature, it could be provisioned with no parity protection and deliver a staggering amount of IOPS.

Impressions of the industry

As I’ve described before, I often learn the most by speaking one-on-one with engineers on the expo floor.  I had a few substantive discussions and caught on to a few interesting trends.

No winners in EDSFF vs. NF.1

It’s been over a year since Samsung’s NF.1 (formerly M.3 and NGSFF) and Intel’s EDSFF (ruler) SSD form factors were introduced, and most integrators and third-party SSD manufacturers remain completely uncommitted to building hardware around one or the other.  Both form factors have their pros and cons, but the stalemate persists by all accounts so far.  Whatever happens to break this tie, it is unlikely to involve the HPC market, and it seems like U.2 and M.2 remain the safest bets for the future.

Memory Landscape and Competition

The HBM standard has put HMC (hybrid memory cube) in the ground, and I learned that Micron is committed to manufacturing HBM starting at the 2e generation.  Given that SK Hynix is also now manufacturing HBM, Samsung may start to face competition in the HBM market as production ramps up.  Ideally this brings down the cost of HBM components in the coming years, but the ramp seems to be slow, and Samsung continues to dominate the market.

Perhaps more interestingly, 3D XPoint may be diversifying soon.  Although the split between Intel and Micron has been well publicized, I failed to realize that Intel will also have to start manufacturing 3D XPoint in its own fabs rather than the shared facility in Utah.  Micron has also announced its commitment to the NVDIMM-P standard, which could feasibly blow open the doors on persistent memory and allow non-Intel processor vendors to support it.  However, Micron has not committed to an explicit combination of 3D XPoint and NVDIMM-P.

Realistically, the proliferation of persistent memory based on 3DXPoint may be very slow.  I hadn’t realized it, but not all Cascade Lake Xeons can even support Optane DIMMs; there are separate SKUs with the requisite memory controller, suggesting that persistent memory won’t be ubiquitous, even across the Intel portfolio, until the next generation of Xeon at minimum.  Relatedly, none of the other promising persistent memory technology companies (Crossbar, Everspin, Nantero) had a presence at ISC.

China

The US tariffs on Chinese goods are on a lot of manufacturers’ minds.  Multiple vendors remarked that they are either

  • thinking about moving more manufacturing from China into Taiwan or North America,
  • already migrating manufacturing out of China into Taiwan or North America, or
  • under pressure to make shorter-term changes to their supply chains (such as stockpiling in the US) in anticipation of deteriorating conditions.

I was not expecting to have this conversation with as many big companies as I did, but it was hard to avoid.

Beyond worrying about the country of origin for their components, though, none of the vendors with whom I spoke were very concerned about competition from the burgeoning Chinese HPC industry.  Several commented that even though some of the major Chinese integrators have very solid packaging, they are not well positioned as solutions providers.  At the same time, customers are now requiring longer presales engagements due to the wide variety of new technologies on the market.  As a result, North American companies playing in the HPC vertical are finding themselves transitioning into higher-touch sales, complex custom engineering, and long-term customer partnerships.

Concluding thoughts

This year's ISC was largely one of anticipation of things to come rather than demonstrations that the future has arrived.  Exascale (and the pre-exascale road leading to it) dominated most of the discussion during the week.  Much of the biggest hype surrounding exascale has settled down, and gone are the days of pundits claiming that the sky will fall when exascale arrives due to constant failures, impossible programming models, and impossible technologies.  Instead, exascale is beginning to look very achievable and not unduly burdensome: we know how to program GPUs and manycore CPUs already, and POSIX file-based access will remain available for everyone.  Instead, the challenges are similar to what they've always been--continuing to push the limits of scalability in every part of the HPC stack.

I owe my sincerest thanks to the organizers of ISC, its sessions, and the HPC-IODC workshop for putting together the programs that spurred all of the interesting discourse over the week.  I also appreciate the technical staff at many of the vendor booths with whom I spoke.  I didn't name every person with whom I drew insights on the expo floor, but if you recognize a comment that you made to me in this post and want credit, please do let me know--I'd be more than happy to.  I also apologize to all the people with whom I spoke and sessions I attended but did not include here; not everything I learned last week fit here.

SC'19 Recap

Last week was the annual Supercomputing conference, held this year in Denver, and it was its usual whirlwind of big product announcements, research presentations, vendor meetings, and catching up with old colleagues.  As is the case every year, SC was both too short and too long; there is a long list of colleagues and vendors with whom I did not get a chance to meet, yet at the same time I left Denver on Friday feeling like I had been put through a meat grinder.

All in all it was a great conference, but it felt like it had the same anticipatory undertone I felt at ISC 2019.  There were no major changes to the Top 500 list (strangely, that mysterious 300+ PF Sugon machine that was supposed to debut at ISC did not make an appearance in Denver).  AMD Rome and memory-channel Optane are beginning to ship, but it seems like everyone's got their nose to the grindstone in pursuit of achieving capable exascale by 2021.

As with every major HPC conference, I approached SC this year with the following broad objectives:
  1. Sharing knowledge and ideas by contributing to the technical program and its workshops, tutorials, and BOFs with the goal of getting more momentum behind good ideas and steering research and roadmaps in a direction best aligned with where I think the HPC industry needs to go
  2. Gathering intelligence across different technologies and market verticals to stay ahead of where technology and the community may be driving as a result of other parallel industries
  3. Contributing to community development amongst storage and I/O researchers and practitioners with the goal of broadening the community and bringing more people and ideas to the table
  4. Building and maintaining relationships with individual vendor representatives and peers so that I know to whom I can turn when new opportunities or challenges come up
The things I took away from the conference are colored by these goals and the fact that I mostly work in high-performance storage systems design.  If I missed any major themes or topics in this recap post, it was likely a reflection of the above goals and perspective.

Before the conference

SC'19 started back in the early spring for me since I served on the technical papers committee and co-chaired the Parallel Data Systems Workshop this year.  That all amounted to a predictable amount of work throughout the year, but there were two surprises that came up in October with respect to SC that are worth mentioning before we dive into the technical contents of the conference.

The "I am HPC Guru" campaign

Jim Cownie had the brilliant idea in early October to launch a covert campaign to create "I am HPC Guru" pins for SC, and he enlisted a group of willing members of the HPC Twitter community to pitch in.  I was fortunate enough to be invited to participate in the fun, and judging by the reach of the #IAmHPCGuru tag on Twitter during the conference, it was a wild success.

An allotment of "I am HPC Guru" pins.  People who pitched in also got a commemorative larger-sized pin (shown outside the bag above) which was a calling card for members of the secret society.

Hats off to Jim for conceiving this great idea, seeing through the design and shipment of the pins, and being so inclusive with the whole idea.  There are now hundreds of HPC_Guru pins all over the world thanks to Jim's efforts (and a couple dozen still with me here in California...), and I think it was a really positive way to build the Twitter-HPC community.

The new job

Life also threw me a bit of a curve ball in late October when I took on a new set of responsibilities at NERSC and changed from contributing to an R&D group to leading an operational storage team.  This meant that, in addition to all the pre-conference commitments I had made with an eye towards longer-term storage technology strategy, I suddenly had to contextualize my goals with respect to a completely new role in tactical planning and deployment.

Whereas I’ve historically written off sales-oriented meetings at SC, having good relationships with vendor sales teams in addition to their engineers and product managers is now an essential component of my new position.  As a result of wearing these two hats instead of one, the number of hard commitments I had over the course of the conference roughly doubled over what it usually had been.  About half of these meetings were private (and not things about which I could write), and they also reduced the time I could've otherwise spent getting into the weeds about upcoming technologies.

Because the conference was so broken up into private and public meetings for me this year, a chronological recounting of the conference (as I did for my SC'18 recap) would be full of odd gaps and not make a whole lot of sense.  Instead, I will focus around a few of the juiciest topics I took away from the conference:
  1. High-level trends that seemed to pop up repeatedly over the week
  2. Intel's disclosures around the Aurora/A21 system
  3. Outcomes from the 2019 Parallel Data Systems Workshop (PDSW 2019)
  4. The Perlmutter all-NVMe storage node architecture
  5. DAOS and the 2019 DAOS User Group meeting
  6. Everything else

It's difficult to group together all of the disparate things I heard and learned over the week into crisp bundles that I would consider emerging trends, but there were a few broad topics that kept popping up that suggested the following:

#1 - Memory-channel 3D XPoint is now out in the wild at sufficient scale that a picture is beginning to form around where it fits in the I/O stack.  The NEXTGenIO project and Intel DAOS both demonstrated the performance achievable when 3D XPoint is integrated into larger systems this year, and the acceleration it offers can be staggering when a sensible software framework is built around persistent memory to bridge it with other media (like flash) and higher-level functionality (like parallel storage).  Michèle Weiland and Adrian Jackson presented their successes with the NEXTGenIO project throughout the week, most notably in the technical papers track (see "An early evaluation of Intel's Optane DC persistent memory module and its impact on high-performance scientific applications") and across several smaller events (e.g., Adrian presented performance results, detailed in his EPCC blog post, at the Multi-Level Memory BOF).  DAOS also made a splash on IO-500; more on this below.

#2 - The I/O ecosystem developed in preparation for the manycore era is making the transition from pure research to practical engineering effort.  As the first generation of 7nm CPUs hit the market with KNL-like core counts and massive scale-up GPU node architectures are being announced by every major HPC silicon provider, latency-hiding techniques for I/O are becoming a hot topic.  Asynchronous I/O—that is, techniques that allow an application to continue computing while a write I/O operation is still happening—came up a few times, and this technique is also moving up in the software stack from system software (such as DAOS, WekaIO, and VAST) into middleware (MPI-IO and HDF5).  I touch on this in the PDSW section below.

#3 - Innovation in HPC storage is moving away from the data plane and towards full data life cycle.  Whereas focus in HPC I/O has traditionally revolved around making I/O systems as fast as possible, research and product announcements this year seemed to gravitate towards data management—that is, how to manage the placement of data before, during, and after I/O.  Proprietary frameworks for data migration, policy management, tiering, and system-level analytics and intelligence (backed by serious vendor investment; see Cray ClusterStor Data Services and DDN STRATAGEM) are popping up across the storage appliance market as a differentiator atop open-source software like Lustre, and research around applying AI to optimize data placement is maturing from novel research into product engineering.

#4 - Scientific workflows—and the parallels they have with enterprise and hyperscale markets—are starting to be taken seriously by technology providers.  Vendors have begun to take ownership of the data movement challenges that exist between bursts of compute-intensive jobs. Advances aimed at edge computing are becoming surprisingly relevant to HPC since decentralized data that is far away from compute is, in a sense, how HPC has done storage for decades.  Whether they be sensors distributed across billions of cell phones, thousands of non-volatile storage media distributed across an exascale computing system, or detectors deployed at giant telescopes relying on a supercomputer for image processing, there are a common set of data management, movement, and remote processing challenges whose solutions can be applied across the board.

Intel's big splash

Following on their big system-level disclosures at ISC'19, Intel's disclosure of the ALCF exascale system node architecture and the unveiling of their software strategy seemed to be the biggest splash of SC'19.  I was not actually at the Intel DevCon keynote where Raja Koduri made the announcements, but his slides on Xe and oneAPI are available online.

The node architecture is, at a glance, very similar to the Summit node architecture today:
From the slide and accompanying discussion on Twitter, there was quite a lot unveiled about the node architecture.  Each node will have:
  • Two Sapphire Rapids Xeons (which appear to have 8 channels of DDR in the aforementioned slide) and six Ponte Vecchio Intel GPUs
  • A CXL-based "Xe Link" router provides all-to-all connectivity between the GPUs, presumably comparable to (but more standards-based than) NVLink/NVSwitch, for a unified memory space
  • Eight Slingshot NIC ports per node, which is 1.6 Tbit/sec of injection bandwidth
  • A "Rambo Cache" that sits between HBM, GPU, and CPU that presumably reduces NUMA effects for hot data that is being touched by many computing elements
  • A "matrix engine" (which sounds an awful lot like NVIDIA's tensor cores) in each GPU
This was an extremely daring release of information, as Intel has now publicly committed to a 7nm GPU part (comparable to TSMC's 5nm process), along with a high-yield EMIB process (their chiplet interconnect for HBM integration) and Foveros (their 3D die stacking for Rambo integration), in 2021.

Intel also released the beta version of their Intel oneAPI which appears to be a mixture of re-branded Intel developer products (Fortran and C++ compilers, TBB, MKL, DAL, MPI, VTune, etc) with their new SYCL-based Data Parallel C++ compiler.  The novelty here is that Intel is committing to supporting this entire stack for CPUs, GPUs, FPGAs, and matrix accelerators so that, for example, you could feasibly write a single application with a single set of tools that runs across all accelerator types.

There was a lot of interest in SYCL at the Performance Portability and Productivity workshop, P3HPC, on Friday.  There were two talks of particular interest in the parts I attended; the first, given by Balint Joo of Jefferson Lab, compared the performance of a quantum chromodynamics kernel when implemented using Kokkos, accelerator-specific libraries, and SYCL:

SYCL vs. Kokkos vs. native on NVIDIA and Intel architectures

These early results are encouraging, and with the exception of KNL, the SYCL ecosystem is already showing promise as a performance-portable framework.  The same is generally true for more complex computational kernels as well, as presented by Istvan Reguly from Pázmány Péter Catholic University:

Performance portability figure of merit for a complex kernel using different performance-portable parallel runtimes.

Intel's choice to back an open standard rather than develop its own proprietary APIs for each accelerator type was a very smart decision, as it looks like they are already making up lost ground against NVIDIA in building a robust software ecosystem around their accelerator technologies.  The fact that these presentations were given by application scientists, not Intel engineers, really underscores this.

Strangely, AMD kept a low profile at SC by comparison despite the fact that Rome is beginning to enter the market and, by all accounts I heard on the show floor, selling like gangbusters.  One major procurement I heard about switched from an Intel CPU-based plan of record to AMD processors as a result of a schedule slip by Intel; this wound up resulting in the system obtaining 50% more cores at the same cost (plus the added benefit of PCIe Gen4), which is a testament to the advantage that AMD currently has in the near term.

By comparison, very few large HPC centers seem to be biting on Intel's Cascade Lake-AP despite Intel's very aggressive marketing against Rome.  Combined with the above observation that the Aurora architecture's Sapphire Rapids processors will only have eight memory channels per socket, this suggests that Cascade Lake-AP's 12-channel socket was likely released as a stopgap to have an answer to Rome while 10nm Xeon part production is scaling up.

PDSW 2019

This year I had the great honor of co-chairing the Parallel Data Systems Workshop, the premiere data and storage workshop at SC, along with the esteemed Phil Carns (creator of Darshan and PVFS2/OrangeFS, among other things).  We tried to broaden the scope of the workshop to be more inclusive of "cloudy" storage and data topics, and we also explicitly tried to build the program to include discussion about data management that ran tangential to traditional HPC-focused storage and I/O.

The proceedings are already online in an interim location hosted by ACM, and the full proceedings will be published by IEEE TCHPC.  Slides are available on the PDSW website, and I tried to tag my realtime thoughts using #pdsw19 on Twitter.

Alluxio Keynote

Our keynote speaker was Haoyuan Li, founder of Alluxio, who gave a brilliant talk about the data orchestration framework he developed at AMPLab and went on to commercialize.  It is an abstraction that stitches together different storage resources (file systems, object stores, etc) into a single namespace that applications can use to read and write data in a way that hides the complexity of tiered storage.  It was designed towards the beginning of the "Big Data revolution" with a specific eye towards providing a common interface for data accessibility; by writing an application against the Alluxio API, it would be made future-proof if the HDFS or S3 APIs fizzled since Alluxio normalizes the specific API and semantics of a native storage interface from user applications.

Had something like this existed in the early days of HPC, there's a good chance that we would not be stuck using POSIX I/O as the least common denominator for data access.  That said, Alluxio does solve a slightly easier problem in that it targets analytics workloads that are read-intensive—for example, it does not support random writes, and so it offers only a subset of the semantics that more general-purpose I/O interfaces (such as file access) provide.  In making this trade-off though, it is able to aggressively cache data from any storage backend in a distributed memory space, and Alluxio has a configurable cache eviction policy for predictable workflows.

In describing the motivation for the Alluxio design, Haoyuan had some interesting insights.  In particular, he pointed out that there is a growing movement away from the hyperconverged hardware architecture that motivated Hadoop and HDFS:


The whole "move compute to where the data is!" model for Hadoop has always struck me as rather fanciful in practice; it only works in single-tenant environments where there's no chance of someone else's compute already existing where your data is, and it imposes a strict coupling between how you scale data and analytics.  As it turns out, the data analytics industry is also waking up to that, and as Haoyuan's slide above shows, separating storage from compute gives much more flexibility in how you scale compute with respect to data, but at the cost of increased complexity in data management.  The whole point of Alluxio is to minimize that cost of complexity by making data look and feel local by (1) providing a single namespace and API, and (2) using distributed memory caching to make data access perform as well as if compute and memory were colocated.

This is a bit ironic since HPC has been disaggregating storage from compute for decades; HPC systems have tended to scale compute capability far faster than storage.  However, the HPC community has yet to address the added complexity of doing this, and we are still struggling to simplify storage tiering for our users.  This is only getting worse as some centers slide back into hyperconverged node designs by incorporating SSDs into each compute node, which spreads data across multiple namespaces and further complicates data access since the semantics of those namespaces differ.  For example, it's not sufficient to know that
  • /local is the fastest tier
  • /scratch is less fast
  • /home is slow
since
  • /local is only coherent with other processes sharing the same physical compute node
  • /scratch is globally coherent
  • /home is globally coherent
Alluxio is not the solution to this problem at present because it is optimized for write-once, read-many workloads whereas HPC does have to support random writes.  That said, HPC storage systems that incorporate the same design goals as Alluxio (connecting many types of storage under a single namespace, providing a restricted set of semantics, and applying aggressive caching to deliver local-like performance) hold a lot of promise.  Perhaps it's no surprise that every serious parallel file system on the market is beginning to implement features like this—think Lustre File-Level Redundancy (FLR) and Persistent Client Caching (LPCC), Spectrum Scale AFM, and the core two-tier design of WekaIO.

Haoyuan also presented a few case studies that showcased the ability of Alluxio to ease the transition from on-premise infrastructure (like Hadoop with HDFS) to hybrid cloud (e.g., run Presto across datasets both in older on-prem HDFS and newer S3 buckets).  It seems to be very fashionable to run analytics directly against data in object stores in industry, and Alluxio essentially gives such data more dynamism by being the place where active data can be staged for processing on demand.  Because it is a stateless orchestration layer rather than a storage system itself, Alluxio also seems nicely compatible with dynamic provisioning of compute resources.  In this sense, it may be an interesting internship project to see if Alluxio could be deployed on an HPC system to bridge a large data analytics job with an off-system object store.  Get in touch with me if you know a student who may want to try this!

Asynchronous I/O

Middleware for asynchronous I/O came up in two different papers this year.  The first, "Enabling Transparent Asynchronous I/O using Background Threads" by Tang et al., described a new pluggable runtime for HDF5 that processes standard HDF5 I/O requests asynchronously.  It does this by copying I/O requests and their metadata into a special buffer, putting those requests on a queue that is managed by the asynchronous runtime, building a directed graph of all requests' dependencies, and dispatching I/Os alongside regular application execution using a lightweight (Argobots-based) asynchronous worker pool.

What this amounts to is that a standard HDF5 write call wouldn't block until the I/O has been committed to disk somewhere; instead, it returns immediately after the async runtime makes a copy of the data to be written into its own private memory buffer.  The application is then free to continue computing, while an Argobots thread begins buffering and dispatching outstanding asynchronous I/O calls.  The performance that results from being able to overlap I/O with computation is remarkable:

I/O speedup at scale as a result of the asynchronous runtime backend for HDF5 presented by Tang et al.

What's more impressive, though, is that this backend is almost entirely transparent to the user application; in its simplest form, it can be enabled by setting a single environment variable.
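
For a mental model of what such a runtime is doing under the hood, here is a minimal sketch of the general idea.  This is my own illustration in Python (a background thread draining a queue), not the Argobots-based implementation from the paper, and the class and method names are made up for the example:

# Minimal sketch of a transparent asynchronous write runtime (illustrative only;
# the HDF5 backend described above uses Argobots and tracks request dependencies).
import queue
import threading

class AsyncWriteRuntime:
    def __init__(self):
        self._queue = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, fileobj, offset, data):
        # Copy the caller's buffer so the application can reuse it immediately,
        # then return without waiting for the data to reach storage.
        self._queue.put((fileobj, offset, bytes(data)))

    def _drain(self):
        # Background worker dispatches queued writes while the application computes.
        while True:
            fileobj, offset, buf = self._queue.get()
            fileobj.seek(offset)
            fileobj.write(buf)
            self._queue.task_done()

    def flush(self):
        # Block until every outstanding asynchronous write has been dispatched.
        self._queue.join()

runtime = AsyncWriteRuntime()
with open("checkpoint.bin", "wb") as f:
    runtime.write(f, 0, b"timestep data")  # returns almost immediately
    # ... the application keeps computing here ...
    runtime.flush()                        # synchronize, e.g., at the end of a timestep

The key property is that the write call returns as soon as the buffer has been copied, so the caller can keep computing while the worker issues the actual I/O; a flush (or an explicit wait at file close) provides the synchronization point.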

Later in the day, Lucho Ionkov presented a much more novel (research-y?) asynchronous I/O runtime in his paper, "A Foundation for Automated Placement of Data" which glued together DRepl (an abstraction layer between scientific applications and storage architectures, vaguely similar to what Alluxio aims to do), TCASM (a Linux kernel modification that allows processes to share memory), and Hop (an expressive key-value store with tunable performance/resilience requirements).  The resulting runtime provides a high-level interface for applications to express I/O and data placement as a series of attach, publish, and re-attach operations to logical regions of memory.  The runtime then manages the actual data movement (whether it be between nodes or to persistent storage) asynchronously.

Again, the net result in speedup as the problem size scales up is impressive:

I/O speedup at scale using the asynchronous I/O runtime presented by Ionkov in Otstott et al.
As with the asynchronous HDF5 paper, performance gets better with scale as the increasing costs of doing I/O at scale are amortized by overlapping it with computation.  In contrast to HDF5 though, this runtime comes with a completely new application API, so one would need to convert an application's critical I/O routines to use this framework instead of POSIX I/O.  The runtime is also pretty heavyweight in that it requires a separate global data placement "nameserver," a custom Linux kernel, and buy-in to the new memory model.  In that sense, this is a much more research-oriented framework, but the ideas it validates may someday appear in the design of a fully integrated framework that incorporates both an application runtime and a storage system.

Why is this important? These asynchronous I/O runtimes are making a lot more sense in the era of heterogeneous computing where accelerators (think GPUs) really aren't good at driving a full kernel-based I/O pipeline.  Instead of running a full I/O stack and enforcing strict consistency (i.e., serializing I/O) on a lightweight accelerator core, having an asynchronous runtime running on a fat core that simply copies an I/O buffer from accelerator memory to slower memory before releasing program control back to the accelerator allows the accelerator to spend less time doing what it's terrible at doing (ordering I/O operations) and more time computing.  At the same time, the fat core that is running the asynchronous I/O runtime can then operate on that copied I/O buffer on its own time, reorder and serialize operations to ensure consistency, and jump into and out of the kernel to enforce file permissions without interrupting the accelerator:

Sketch of how an asynchronous I/O runtime might map to a heterogeneous node architecture

Ron Oldfield did raise a really great consideration during PDSW about this though: at the end of the day, the asynchronous I/O runtime still has to share network resources with the application's message passing runtime (e.g., MPI).  He alluded to work done a decade ago that found that asynchronous I/O was often stomping on MPI traffic since both MPI and I/O could happen at the same time.  Without some kind of awareness or coordination between the asynchronous I/O runtime and the application communication runtime, this sort of scheme is prone to self-interference when running a real application.

Given this, the right place to integrate an asynchronous I/O runtime might be inside the message passing runtime itself (e.g., MPI-IO).  This way the asynchronous I/O scheduler could consider outstanding asynchronous messages it must pass as well and be smart about dispatching too many competing network transfers at the same time.  Unfortunately this then places a complex burden of serialization and synchronization on the runtime, and this starts to look a lot like just throwing messages at the NIC and letting it figure out the correct ordering.  The principal advantage here would be that the runtime has a lot more visibility into user intent (and may have more spare processing capacity if most of the application time is spent on an accelerator), so it could afford to be smarter about how it builds its dependency graph.

Analytics for Runtime and Operations

No computing-related workshop would be complete without a smattering of artificial intelligence and machine learning, and PDSW was no different this year.  Two papers were presented that attempted to use machine learning to predict parallel I/O performance in slightly different ways.

Suren Byna presented "Active Learning-based Automatic Tuning and Prediction of Parallel I/O Performance" where the authors developed an approach for autotuning parallel I/O (specifically using MPI-IO hints and Lustre striping parameters) using active learning to predict the optimal values for their tuning parameters.  They used two different approaches, and the faster one uses predicted performance to infer optimal tuning values.  Given how many factors actually come into play in parallel I/O performance on production systems, their model was able to predict I/O performance quite well under a range of I/O patterns:


Bing Xie et al presented "Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems" which pursued a similar line of work—using machine learning to predict I/O performance—but with a slightly different goal.  Xie's goal was to identify the factors which most strongly affect predicted I/O performance, and she found that write performance was most adversely affected by metadata load and load imbalance on Blue Gene/Q and GPFS, whereas Cray XK7 and Lustre were more affected by aggregate file system load and load imbalance.  This system-centric work laid out a more sophisticated blueprint for identifying causal relationships between poor I/O performance and system-level health events, and I think applying these approaches to the dataset I published last year with my Year in the Life of a Parallel File System paper might identify some interesting emergent relationships between bad performance and the subtle factors to which they can be attributed.
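
To give a flavor of what this kind of prediction-driven tuning looks like in its simplest form, here is a toy sketch of my own (not either paper's actual method); it fits a regression model to a handful of observed (tuning parameters, bandwidth) samples and then picks the candidate tuning with the highest predicted bandwidth:

# Toy sketch of prediction-driven I/O tuning (illustrative only).  The feature
# names (stripe count, stripe size, collective buffering nodes) are stand-ins
# for the MPI-IO hints and Lustre striping parameters tuned in the real papers,
# and the sample numbers below are made up.
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Pretend these are measured runs: (stripe_count, stripe_size_MB, cb_nodes) -> GB/s
observed_params = np.array([[4, 1, 8], [16, 4, 16], [64, 8, 32], [8, 2, 8]])
observed_bw = np.array([2.1, 6.5, 11.0, 3.4])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(observed_params, observed_bw)

# Sweep a grid of candidate tunings and pick the one with the best predicted bandwidth.
candidates = np.array(list(itertools.product([4, 8, 16, 32, 64],  # stripe count
                                             [1, 2, 4, 8, 16],    # stripe size (MB)
                                             [8, 16, 32])))       # cb_nodes
best = candidates[np.argmax(model.predict(candidates))]
print("predicted-best tuning (stripe_count, stripe_size_MB, cb_nodes):", best)

An active learning loop would then actually run the predicted-best configuration, add the new measurement to the training set, and repeat.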

Why is this important?  Industry is beginning to take notice that it is no longer sufficient to just report the here-and-now of how parallel file systems are behaving, and more sophisticated analytics engines are being co-deployed with very large systems.  For example, the Summit system at Oak Ridge made a splash in October by announcing the real-time analytics engine that was implemented on top of it, and Cray View is a similar analytics-capable engine built atop Lustre that Cray offers as a part of its ClusterStor lineup.  I'm not sure if DDN has something comparable, but their recent purchase of Tintri and its robust, enterprise-focused analytics engine means that they hold IP that can undoubtedly be applied to their HPC-focused storage product portfolio.

Being able to predict performance (and the conditions that cause it to degrade!) is the holy grail of parallel I/O systems management, and it's a sure bet that all the HPC storage vendors are watching research in this area very closely to see what ideas they can pluck from the community to add value to their proprietary analytics engines.  The fact that AI is being applied to production system data and yielding useful and actionable outcomes gives legs to this general idea of AI for self-driving systems.  The talks at PDSW this year were only demonstrations, not hardened products, but these ad-hoc or small-scale demonstrations are moving us in the right direction.

My Talk on Data Motion

I also coauthored and presented a paper at PDSW this year that was an exploratory study of how we can understand data movement throughout an entire data center.  The goal of the entire paper, "Understanding Data Motion in the Modern HPC Data Center," was to generate this diagram that shows how much data flows between different systems at NERSC:



I won't recount the technical content of the talk here, but the paper is open access for those interested.  The essence of the study is that we showed that it is possible to examine data motion beyond the context of individual jobs and begin tying together entire workflows, but there's a lot of supporting work required to shore up the tools and telemetry from which this analysis draws.  The paper was very much a long-form work in progress, and I'd be interested in hearing from anyone who is interested in pursuing this work further.

Scale-up highly available NVMe hardware

Although it didn't make many headlines (as storage rarely does), Cray announced its new ClusterStor E1000 platform shortly before SC and had some of their E1000-F all-NVMe enclosures on display at a few booths.  I normally don't care too much about storage enclosures (it's all just sheet metal, right?), but this announcement was special to me because it is the hardware platform that is going into NERSC's Perlmutter system in 2020, and I've been involved with the different iterations of this hardware design for over a year now.

It's very gratifying to see something start out as a CAD drawing and a block diagram and grow up into actual hardware:

The E1000-F all-NVMe enclosure

Torben Kling Petersen gave a talk at the Exhibitor Forum disclosing the details of the hardware design on behalf of Cray, and it looks like they've made just about everything surrounding the E1000 public:


The foundation for this platform is the E1000-F high-availability enclosure as shown in the above slide.  It has two separate Rome-based servers ("controllers") and 24 U.2 NVMe slots capable of PCIe Gen4.  Each Rome controller has slots for up to three 200 Gbit NICs; doing the math, this gives a very nicely balanced design that is implemented entirely without PCIe switches:

Cartoon block diagram for one half of the E1000-F chassis.  Note that the NVMe read rates (violet text) are assumed based on Samsung PM1733 specs and performance projections that Petersen presented.  Also note that each NVMe drive is 2x2 PCIe Gen4 with multipath to the other Rome controller (not shown).
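
For the curious, here is the back-of-the-envelope math behind that balance claim.  These are my own rough numbers, not Cray's, and they lean on the same assumptions as the figure caption above (PM1733-class drives, roughly 2 GB/s of usable bandwidth per PCIe Gen4 lane):

# Back-of-the-envelope balance check for one E1000-F controller.  All of these
# figures are assumptions for illustration, not vendor specifications.
GEN4_GBPS_PER_LANE = 2.0   # ~2 GB/s usable per PCIe Gen4 lane
DRIVE_READ_GBPS = 7.0      # assumed PM1733-class sequential read per drive
DRIVES = 24                # every U.2 drive is visible to each controller
LANES_PER_DRIVE = 2        # each dual-ported drive presents x2 Gen4 to each controller
NICS = 3                   # up to three 200 Gbit NICs per controller
NIC_GBPS = 200 / 8         # 200 Gbit/s = 25 GB/s per NIC

pcie_to_drives = DRIVES * LANES_PER_DRIVE * GEN4_GBPS_PER_LANE  # 96 GB/s of host PCIe
drive_media = DRIVES * DRIVE_READ_GBPS / 2   # ~84 GB/s if reads split across both controllers
network = NICS * NIC_GBPS                    # 75 GB/s of injection bandwidth

print(f"PCIe to drives: {pcie_to_drives:.0f} GB/s")
print(f"Drive media:    ~{drive_media:.0f} GB/s")
print(f"Network:        {network:.0f} GB/s")
# All three land in the same 75-100 GB/s ballpark, which is what makes the
# design "nicely balanced" without needing any PCIe switches.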
I visited the booth of the ODM with whom Cray worked to develop this node design and was fortunate enough to meet the node architects from both sides who gave me a really helpful breakdown of the design.  Physically, the 2U chassis is laid out something like this:


Just about everything is both hot-swappable and fully redundant.  The entire system can be powered and cooled off of a single 1.2 kW(?) power supply, and all the fans are hot-swappable and configured in a 5+1:

Fans are all individually replaceable and configured in 5+1.  You can also see the NVMe backplanes, attached to an active midplane (not shown), through the open fan slot.

All the fans are on the same pulse-width modulator (PWM), so they all operate at the same speed and provide even airflow as long as they are properly powered.  My recollection from what the architect told me is that the PWM signal is provided by an FPGA on the midplane which also handles drive power-up.  Because there is only a single midplane and this power/cooling controller lives on it, this power/cooling FPGA is also configured redundantly as 1+1.  Thus, while the midplane itself is not redundant or field-replaceable, the active components on it are, and it would take physical damage (e.g., someone punching a hole through it and breaking the PCB traces) to knock the whole chassis offline.

Each chassis has two independent node boards that are hot-pluggable and self-contained:

One of the E1000-F node sleds with its cover popped off at the Cray booth
Each node board is wrapped in a sheet metal sled and has a screwed-on lid.  The whole node sled was designed by the ODM to be a field-replaceable unit (FRU), so doing something like a DIMM swap does require a screwdriver to remove the top cover.  However it's ultimately up to OEMs to decide how to break down FRUs.

The ODM had a bare controller board at its booth which looks like this:

E1000-F bare controller board
There are two M.2 PCIe Gen4 slots for mirrored boot drives and a pair of big hot-plug block connectors in the front of the board for redundant power and 48 lanes of PCIe Gen4 for the 24x U.2 drives hanging off the midplane.  There's a single riser slot for two standard HHHL PCIe add-in cards where two NICs plug in, and a third OCP-form factor slot where the third NIC can slot in.  The rear of the controller sled shows this arrangement:

Rear view of a single Rome controller
It looks like there's a single RJ45 port (for LOM?), a power and reset button, a single USB-3, and a mini DisplayPort for crash carting.

When Cray announced the E1000-F, HPCwire ran a block diagram of the complete chassis design that suggested that heartbeating would be done through a non-transparent bridge (NTB) implemented on the AMD Rome host interface.  This was a little worrisome since AMD has yet to release the proper drivers to enable this NTB for Linux in a functional way; this simple fact is leading other ODMs towards a more conservative node design where a third-party nonblocking PCIe switch is added simply to provide a functioning NTB.  When I asked the architect about this, though, he revealed that the E1000-F also has an internal gigabit Ethernet loop between both controllers for heartbeating which completely obviates the need to rely on any NTB for failover.

Another interesting thing I learned while talking to the E1000-F designers is that the power supply configuration gives a lot of runway for the overall system design:

One of the two power supply sleds for the E1000-F chassis.  Lots of free real estate remains and is currently occupied by bus bars.
The current power supply is (I believe) ~1200 W, and the carrier sled on which it is mounted is mostly empty space taken up by two fat bus bars that reach all the way to the front of it.  In leaving all of this space in the sled, it will be fully possible to build a physically compatible PSU sled that delivers significantly more power to the U.2 NVMe drives and host controllers if the power consumption of the controllers or the NVMe drives increases in the future.  The ODM confirmed that the cooling fans have similar headroom and should allow the whole enclosure to support a higher power and thermal load by just upgrading the power and controller FRUs.

This point is important because the performance of PCIe Gen4 SSDs is actually capped by their power consumption—if you look at product sheets for ruler SSDs (M.2, NF1, and E1.S), you will find that their performance is universally lower than that of their U.2 and HHHL variants due to the fact that the ruler standards limit power to 8-12W compared to U.2/HHHL's ~25W.  This E1000-F chassis is designed as-is for 25W U.2 drives, but there are already proposals to push individual SSD power up to 40W and beyond.  Given this trend and the high bandwidth available over a PCIe Gen4 x4 connector, it's entirely possible that there will be a demand for higher-power NVMe enclosures as Gen4 matures and people want to drive Gen4 NVMe at line rate.

DAOS User Group

The 2019 DAOS User Group was held on Wednesday in a hotel adjacent to the main convention center. Unlike previous years in which I attended, this meeting felt like a real user group; there were presenters from several different organizations, none of whom directly contribute to or are contractual customers of DAOS.  There was also real performance data, which largely centered around the insanely high IO-500 benchmark score that DAOS posted earlier in the week:

Bandwidth spread on the IO-500's IOR test suite
These numbers are using a pretty modest server environment and client count (24 DAOS servers, 26 client nodes, 28 ranks per client, dual-rail OPA100) and use the native DAOS API.  What I didn't snap a photo of are the crazy metadata rates which posted a geometric mean of 4.7 million IOPS; by comparison, the 250 PB Alpine file system attached to the Summit supercomputer at Oak Ridge posted 1.2 million IOPS using more than 500 clients.  To the extent that it was meant to address the IOPS limitations intrinsic to traditional parallel file systems, the DAOS design is looking like a resounding success.

According to the speaker, the metadata performance of this IO-500 run was not limited by any server-side resources, so adding more clients (like WekaIO's top-scoring run with 345 clients) could have pushed this number higher.  It was also stated that the staggering IOR read performance was limited by the aggregate Optane DIMM bandwidth which is a testament to how highly optimized the data path is.

Actually using DAOS

This is all using the DAOS native API though, and unless you intend to rewrite all your open()s and write()s as daos_pool_connect() + daos_cont_open() + daos_array_open()s and daos_array_write()s, it's hard to tell what this really means in terms of real-world performance.  Fortunately there was a great set of talks about the DAOS POSIX compatibility layer and related middleware.  I described the POSIX middleware a little in my recap of ISC'19, but it's much clearer now exactly how a POSIX application may be adapted to use DAOS.  Ultimately, there are three options that DAOS provides natively:

  • libdfs, which is a DAOS library that provides a POSIX-like (but not POSIX-compatible) API into DAOS.  You still have to connect to a pool and open a container, but instead of reading and writing to arrays, you read and write arbitrary buffers to byte offsets within file-like objects.  These objects exist in a hierarchical namespace, and there are functions provided by libdfs that map directly to POSIX operations like mkdir, rmdir, statfs, etc.  Using libdfs, you would still have to rewrite your POSIX I/O calls, but there would be a much smaller semantic gap since POSIX files and directories resemble the files and directories provided by libdfs.  A great example of what libdfs looks like can be found in the IOR DFS backend code.
  • dfuse, which is a FUSE client written on top of libdfs.  With this, you literally get a file system mount point which POSIX applications can interact with natively.  Because this uses FUSE though, such accesses are still generating system calls and memory copies which come with steep latency penalties.
  • libioil, which is a POSIX interception library.  This is what you'd LD_PRELOAD in front of a standard application, and it does the remapping of genuine POSIX API calls into libdfs-native calls without ever going through the kernel.

Cedric Milesi from HPE presented benchmark slides that showed that using the DFS (file-based) API over the native (array-based) API has no effect on performance:

Performance scaling of the native DAOS API (which encodes array objects) compared to the DAOS DFS API (which encodes file and directory objects).  No discernible performance difference.

Thus, there is no performance difference whether you treat DAOS like an array store (its original design) or a file/directory store (through the libdfs API) as far as bandwidth is concerned.  This is excellent news, as even though libdfs isn't a drop-in replacement for POSIX I/O, it implements the POSIX data model (data is stored as streams of bits) which is a more comfortable look and feel for a storage system than storing typed arrays.  And since libioil is a shim atop libdfs, the above performance data suggests that POSIX applications won't pay significant bandwidth overheads by preloading the POSIX intercept library to get DAOS compatibility out of the box.

What's less clear is what the metadata overheads of libdfs are.  Because the whole metadata model of DFS (files and directories) is very different from native DAOS (arrays), it's impossible to do a head-to-head comparison of metadata performance.  That said, DFS metadata is only a subset of the full POSIX metadata so it should be faster even on identical hardware.  For example, DAOS only enforces permissions when opening a container, so I would not expect DFS to have any notion of file-level or directory-level ownership or permissions bits.  As such, DFS would not incur the cost of doing an expensive recursive permission check on dfs_open(), and the open rate should be much higher than something that adheres to POSIX.

Kevin Harms from ALCF also presented a really enlightening slide containing very early performance tests from their internal DAOS testbed using dfuse and libioil:


This slide is a treasure trove of interesting information:
  1. It implicitly confirms that the verbs provider for libfabric not only works, but works well.  Recall that the Intel testbed from which IO-500 was run used Intel OmniPath 100, whereas the Argonne testbed uses a competitor's fabric, InfiniBand.
  2. Single-stream performance of DAOS using the dfuse interface is 450 MB/sec which isn't terrible.  For comparison, single-stream performance of Lustre on Cray Aries + FDR InfiniBand is about the same.
  3. Using the libioil POSIX interface dramatically increases the single-stream performance which shines a light on how costly using the Linux VFS kernel interface (with FUSE on top) really is.  Not using FUSE, avoiding an expensive context switch into kernel mode, and avoiding a memcpy from a user buffer into a kernel buffer gives a 3x performance boost.
Again, in the sense that DAOS was meant to address the performance impacts of using a kernel-based storage system for I/O, it looks like DAOS is meeting expectations.

Finally, Mohamad Chaarawi also spent some time talking about the Lustre/DAOS integration which uses DAOS dfuse to stitch together a Lustre namespace with DAOS DFS namespaces.  I mentioned this in my ISC recap, but there's now a pretty detailed slide about how this will look in practice:


This Lustre integration won't be quite as rosy as I described earlier since DFS namespaces don't seamlessly merge into the Lustre namespace.  Instead, it looks like DFS namespaces will be mounted in a separate directory hierarchy governed by their pool UUID ("PUUID" in above slide) and container UUID ("CUUID"), and the Lustre namespace will contain symlinks to the DFS mounts.  What exactly creates and destroys these symlinks is unclear; in July it had sounded like Lustre foreign layouts would dynamically stitch DAOS objects into Lustre using the Lustre control plane, but now it sounds like DAOS will behave more like autofs on top of Lustre.

The burgeoning DAOS community

Although the progress and increasing tangibility of DAOS is impressive, I was most struck by the diversity of stakeholders represented at the DAOS User Group meeting.  In particular, the participation of HPE (the non-Cray part, no less!) and Lenovo was a surprise to me since neither has an immediate interest in the Argonne exascale system which has been the biggest driver for DAOS development.  Lenovo in particular made the bold statement that they want to sell a DAOS appliance in 4Q2020/1Q2021 called the "DSS-D Integrated Solution with DAOS."

Oddly enough, the Cray part of HPE was not obviously present at the DAOS User Group despite their involvement in Argonne's Aurora system and activity on the DAOS mailing lists.  This may just be a reflection of Cray's historic reluctance to send engineering staff to SC, but their absence was quite notable in contrast to Lenovo's head-first dive into announcing a DAOS appliance.  There were also no loud voices supporting all of the work that DAOS has put into integrating with Apache Spark, nor were there any vocal supporters of Intel's newly stated ambition to create a native SEG-Y interface (a format used by oil and gas) for DAOS.

Everything else

There were some interesting tidbits that I picked up at SC this year that don't fit neatly anywhere else in this post but are worth writing down.

Technical tidbits - the Cray Shasta cabinet

Much like the Cray E1000-F storage enclosure, I have also watched the Cray Shasta cabinet design evolve from a set of CAD diagrams into a living, breathing behemoth of sheet metal and coolant tubing.  SC'19 was the debut of a finished Cray Shasta compute cabinet, and it's a sight to behold:

The front end of the new Cray Shasta compute cabinet
These new cabinets are all direct liquid cooled, and the water tubing to each blade from the center manifold is all done up in the above photo.  Compute blades slot in vertically, and each cabinet has French doors that open in directions opposite to each other.  The back end is a little less neat at a glance:

The back end of the new Cray Shasta compute cabinet
As with the front end, it opens up with French doors, and interestingly, the rear doors look identical to the front doors.  Although I didn't ask explicitly, my guess is that this means that both the front and rear of the cabinets could feature giant cabinet graphics if so desired.

The rear cabling is almost all copper 200 Gb/s:

Cray Slingshot switch blade and Cray chassis management module
And, in a departure from the XC and XT/XE lines, all of this copper cabling uses standard QSFP-DD connectors to carry 2x200 Gb.  In the above photo, you can see a genuine Cray Slingshot switch blade slotted in horizontally (cf. the vertically slotted compute blades) and the water coupling for the liquid-cooled switch blade and management module.  There are no fancy coolant waterfalls with Shasta, but that's probably not a bad thing.  As I've heard it told, the Cray-2 waterfall was a case of making lemonade from lemons; apparently fluorinert reacts corrosively with curved plastic surfaces.

Less-technical tidbits

SC isn't purely about the technology, and truth be told, the personalities and community are the principal reason I attend every year.  It follows that a number of personal highlights for me weren't directly related to HPC at all but were nevertheless very valuable bits of information that I took away from Denver.

For example, I met two of the big marketing minds behind a major HPC company who really floored me by attributing value to my support of the HPC industry and community through social media.  Social media is really how I got my start in this industry (I started as a hobbyist), so it's gratifying to hear that I might be contributing in a way that is meaningful to kindred spirits who also got into the HPC field from unconventional paths.  It was also a reminder that there are always real people behind every corporate Twitter account, and you very well may meet them at a conference like SC.  When that happens, it can be a really positive experience ("Great to meet the person behind the handle!") or an embarrassing one ("I really did say that three years ago, didn't I?").  This year was the first time it became clear that, in trying to avoid the latter case as a matter of course, the former becomes more prevalent without a whole lot of added effort.

I also met what may have been the world's slickest corporate sales team, whose brilliantly staged choreography of chance encounters over drinks only became apparent to me as I was walking back to my hotel.  I know that plenty of people dislike interacting with sales, but being a great salesperson is really a craft in and of itself, and I respect people who are masters of their trade regardless of what it is.  And now if I ever find myself in a situation where I need to win someone over cold, I know from whom I can draw inspiration to unleash my inner "customer success manager." It's a careful balance of drawing out concerns, driving open-ended complaints towards something actionable, and knowing where to cut through red tape and just get the right people talking.

Another non-technical area in which I was looking for information this year was management philosophy.  I've had the pleasure of working with and for some very talented managers who recognize management as a distinct vocation in and of itself, and I made it a point to get time with a few such people who've consistently built me up over the years.  One of the more pithy philosophies I took away from one colleague is that there are times when neither "asking for permission" nor "asking for forgiveness" is the right approach—rather, sometimes you have to "radiate intent." I'd never heard this before, but it makes sense in that it allows others the opportunity to say "no" and take explicit ownership of inaction, but it doesn't require the inverse of saying "yes" and taking responsibility for the outcomes.

Staying organized

Finally, I am always trying to figure out the optimal "workflow" for keeping organized at SC, and this year was no different.  A few years ago I fully committed to simply not bringing my laptop to the conference venue every day in lieu of bringing a much lighter and more versatile iPad Pro, and this worked fine with two exceptions:
  • For the Parallel I/O in Practice tutorial I co-presented, I brought my laptop so that all four presenters could project from it and I could use my iPad for keeping realtime notes.
  • For PDSW, I brought my laptop just in case, knowing that I would be in the same room all day.  I wound up presenting from it simply because it provided a better viewing angle from the podium; the room arrangements in Denver were such that it was impossible for a speaker at the podium to see the slides being projected, so he or she would have to rely on the device driving the projector to tell what content was actually being projected.
I did have to use the laptop at the hotel on Saturday night to make some final modifications to my PDSW talk (there are a few obscure features in PowerPoint that simply aren't exposed in the iOS version), but the rest of the conference (including a couple of BOF talks) was iPad-only.

For notetaking, I started storing all of my notes in Agenda, and where appropriate, used Agenda's feature to create a single note for each calendar entry corresponding to a formal meeting.  For unstructured conversations on the expo floor or between sessions, I kept one catch-all note per day in which I typed everything I could remember as soon as the conversation ended.  For example, the conversation I had with the designers of the E1000-F enclosure was saved as a combination of obscure written details I took as soon as I left the booth and photos I snapped during the conversation.

In places where typing on an iPad was not possible (e.g., in most technical sessions, where there were no tables), I used Nebo and an Apple Pencil to take handwritten notes.  As it turns out, hand-writing on an iPad sitting on your knee is far more productive than either trying to type text letter-by-letter into the on-screen iPad keyboard or awkwardly balancing the folded-out iPad Pro keyboard on a lap or bag.  Nebo is really good at converting handwriting into ASCII, and that ASCII easily copies out and into an Agenda note.

This workflow supplanted my approach last year which relied exclusively on using Notability and hand-written notes with OCR.  In meetings where a table was available (i.e., vendor briefings), being able to type rather than handwrite was far more effective in capturing every nuance in spoken word.  I've found that I rarely ever get a copy of the slides shown at SC briefings, so being able to quickly capture exact hardware specs or release dates as someone is trying to gloss over some unflattering details is really not possible when writing everything by hand.

For tracking action items, I've started using Things 3, which is admittedly crazy expensive but is really good at capturing to-do items in under five seconds so that they can be more formally sorted, assigned a start/complete date, etc. at the end of the day or after the conference.

This all mostly worked, but I did run into a major issue with Agenda where all my ad-hoc notes vanished when I got home from Denver and my home computer decided to sync.  The good news is that Agenda uses internal versioning so the notes' contents weren't truly lost, and their support team was extremely responsive in both recovering my lost notes and releasing a fix within a week.  Not a great first experience with the app, but I'm not sure that'll stop me from using it.

Concluding thoughts

As always seems to be the case, the week of SC was over before I knew it.  There's a lot I know that I didn't get to see in terms of colleagues, exhibitors, and technical program sessions.  Of everything I did get to see, there's plenty that I wasn't sure I'd be allowed to write up.  So if you happened to get this far and are wondering why I didn't write about the most interesting thing that you got out of the conference this year, odds are that I didn't see it, or if I did, I wasn't sure I was allowed to write about it.  And if I did write about you and you won't get in trouble for being attributed by name, please let me know and I'd be happy to update this post to give you credit.

Denver was the city of the first SC I ever attended, so I was glad to be back.  I was also happy to get to see snow at least once this year:


and the convention center did an excellent job of providing space, AV support, catering, and gigantic coffee urns:


I got less sleep on average this year than any SC prior (around 6 hours a night), and yet I feel like I accomplished less of what was on my list than ever before.  I suppose that's just a sign that the conference (or perhaps my ambition!) continues to grow, and I should expect SC'20 to be even bigger, better, and more exhausting.

Understanding random read performance along the RAIDZ data path

Although I've known a lot of the parameters and features surrounding ZFS since its relative early days, I never really understood why ZFS had the quirks that it had.  ZFS is coming to the forefront of HPC these days though--for example, the first exabyte file system will use ZFS--so a few years ago I spent two days at the OpenZFS Developer Summit in San Francisco learning how ZFS works under the hood.

Two of the biggest mysteries to me at the time were
  1. What exactly does a "variable stripe size" mean in the context of a RAID volume?
  2. Why does ZFS have famously poor random read performance?
It turns out that the answers to these questions are interrelated, and what follows are notes that I took in 2018 as I was working through this.  I hope it's all accurate and of value to some budding storage architect out there.

If this stuff is interesting to you, I strongly recommend getting involved with the OpenZFS community.  It's remarkably open, welcoming, and inclusive.

The ZFS RAIDZ Write Penalty

Writing Data

When you issue a write operation to a file system on a RAIDZ volume, the size of that write determines the size of the file system block.  That block is divided into sectors whose size is fixed and governed by the physical device sector size (e.g., 512b or 4K).  Parity is calculated across sectors, and the data sectors + parity sectors are what get written down as a stripe.  If the number of data sectors in the block is not an even multiple of D (the data width of your RAIDZ group), you may wind up with a stripe at the end of the block that has P parity sectors but fewer than D data sectors.  For example, if you have a 4+1 but write down six sectors, you get two stripes that are comprised of six data sectors and two parity sectors:

ZFS RAIDZ variable stripe for a six-sector block

These definitions of stripes, blocks, and sectors are mostly standardized in ZFS parlance and I will try my best to use them consistently in the following discussion.  Whether a block is comprised of stripes, or if a block is a stripe (or perhaps a block is just comprised of the data sectors of stripes?) remains a little unclear to me.  It also doesn't help that ZFS has a notion of records (as in the recordsize parameter) which determine the maximum size of blocks.  Maybe someone can help completely disentangle these terms for me.
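
To make the layout a bit more concrete, here is a tiny sketch of my own that counts how many data and parity sectors a block of a given size turns into under the rules described above (it ignores the extra padding real RAIDZ adds to keep allocations a multiple of P+1 sectors):

# Sketch of RAIDZ variable-stripe accounting (illustrative only).
import math

def raidz_layout(block_bytes, sector_bytes=4096, d=4, p=1):
    """Return (data_sectors, parity_sectors) for one block on a D+P RAIDZ vdev."""
    data_sectors = math.ceil(block_bytes / sector_bytes)
    stripes = math.ceil(data_sectors / d)  # the last stripe may hold fewer than D data sectors
    return data_sectors, stripes * p       # every stripe carries P parity sectors

# The six-sector example from above: a 24 KiB block on a 4+1 with 4 KiB sectors
data, parity = raidz_layout(24 * 1024)
print(data, "data sectors +", parity, "parity sectors")  # 6 data + 2 parity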

Rewriting Data

The ZFS read-modify-write penalty only happens when you try to modify part of a block; that happens because the block is the smallest unit of copy-on-write, so to modify part of a block, you need to read-modify-write all of the block's data sectors.  The way this works looks something like:

Read-modify-write in RAIDZ

where
  1. The data sectors of the whole block are read into memory, and its checksum is verified.  The parity sectors are NOT read or verified at this point since (1) the data integrity was just checked via the block's checksum and (2) parity has to be recalculated on the modified block's data sectors anyway. 
  2. The block is modified in-memory and a new checksum is calculated, and new parity sectors are calculated.
  3. The entire block (data and parity) is written to newly allocated space across drives, and the block's new location and checksum are written out to the parent indirect block.

This read-modify-write penalty only happens when modifying part of an existing block; the first time you write a block, it is always a full-stripe write.

This read-modify-write penalty is why IOPS on ZFS are awful if you do sub-block modifications; every single write op is limited by the slowest device in the RAIDZ array since you're reading the whole stripe (so you can copy-on-write it).  This is different from traditional RAID, where you only need to read the data chunk(s) you're modifying and the parity chunks, not the full stripe, since you aren't required to copy-on-write the full stripe.
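
To put rough numbers on that contrast, here is a simple sector-count sketch (my own accounting, ignoring caching, write aggregation, and the intent log) of what it costs to modify a single sector of an existing 4+1 block or stripe:

# Cost of modifying ONE sector in a 4+1 layout (illustrative accounting only).
D, P = 4, 1

# RAIDZ: copy-on-write of the whole block means reading every data sector and
# then writing the entire new block (data plus freshly calculated parity).
raidz_reads = D
raidz_writes = D + P

# Traditional RAID-5 style partial-stripe update: read the old data sector and
# old parity, write the new data sector and new parity (the classic small-write penalty).
raid5_reads = 1 + P
raid5_writes = 1 + P

print("RAIDZ :", raidz_reads, "sector reads,", raidz_writes, "sector writes")
print("RAID-5:", raid5_reads, "sector reads,", raid5_writes, "sector writes")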

Implications of RAIDZ on Performance and Design

This has some interesting implications on the way you design a RAIDZ system:

  1. The write pattern of your application dictates the layout of your data across drives, so your read performance is somewhat a function of how your data was written.  This contrasts with traditional RAID, where your read performance is not affected by how your data was originally written since it's all laid out in fixed-width stripes.
  2. You can get higher IOPS in RAIDZ by using smaller stripe widths.  For example, a RAIDZ 4+2 would result in higher overall IOPS than a RAIDZ 8+2 since the 4+2 is half as likely to touch a slow drive as the 8+2.  This contrasts with traditional RAID, where a sub-stripe write doesn't have to read all 4 or 8 data chunks to modify just one of them.

How DRAID changes things

An entirely new RAID scheme, DRAID, has been developed for ZFS which upends a lot of what I described above.  Rather than using variable-width stripes to optimize write performance, DRAID always issues full-stripe writes regardless of the I/O size being issued by an application.  In the example above when writing six sectors worth of data to a 4+1, DRAID would write down:

Fixed-width stripes with skip sectors as implemented by ZFS DRAID

where skip sectors (denoted by X boxes in the above figure) are used to pad out the partially populated stripe.  As you can imagine, this can waste a lot of capacity.  Unlike traditional RAID, ZFS is still employing copy-on-write so you cannot fill DRAID's skip sectors after the block has been written.  Any attempt to append to a half-populated block will result in a copy-on-write of the whole block to a new location.
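
Extending the little accounting sketch from the RAIDZ section, the fixed-width DRAID equivalent might look like this (again my own illustration, not the actual DRAID allocator):

# Sketch of DRAID fixed-width stripe accounting (illustrative only).
import math

def draid_layout(data_sectors, d=4, p=1):
    """Return (data, parity, skip) sector counts when every stripe is written full-width."""
    stripes = math.ceil(data_sectors / d)
    parity_sectors = stripes * p
    skip_sectors = stripes * d - data_sectors  # padding to fill out the last stripe
    return data_sectors, parity_sectors, skip_sectors

print(draid_layout(6))  # (6, 2, 2): the six-sector example needs two skip sectors
print(draid_layout(1))  # (1, 1, 3): a single-sector write would need three skip sectors at full width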

Because we're still doing copy-on-write of whole blocks, the write IOPS of DRAID is still limited by the speed of the slowest drive.  In this sense, it is no better than RAIDZ for random write performance.  However, DRAID does do something clever to avoid the worst-case scenario of a single-sector write. In our example of DRAID 4+1, instead of wasting a lot of space by writing three skip sectors to pad out the full stripe:



DRAID doesn't bother storing this as 4+1; instead, it redirects this write to a different section of the media that stores data as mirrored blocks (a mirrored metaslab), and the data gets stored as



This also means that the achievable IOPS for single-sector read operations on data that was written as single-sector writes is really good since all that data will be living as mirrored pairs rather than 4+1 stripes.  And since the data is stored as mirrored sectors, either sector can be used to serve the data, and the random read performance is governed by the speed of the fastest drive over which the data is mirrored.  Again though, this IOPS-optimal path is only used when data is being written a single sector at a time, or the data being read was written in this way.

Exascale's long shadow and the HPC being left behind

The delivery of Japan's all-CPU Fugaku machine and the disclosure of the UK's all-CPU ARCHER 2 system, both solidly "pre-exascale" machines with pre-exascale budgets, are opening old wounds around the merits of deploying all-CPU systems in the context of leadership HPC.  Whether a supercomputer can truly be "leadership" if it is addressing the needs of today using power-inefficient, low-throughput technologies (rather than the needs of tomorrow, optimized for efficiency) is a very fair question to ask, and Filippo took this head-on:



Of course, the real answer depends on your definition of "leadership HPC." Does a supercomputer qualify as "leadership" by definition if its budget is leadership-level?  Or does it need to enable science at a scale that was previously unavailable?  And does that science necessarily have to require dense floating point operations, as the Gordon Bell Prize has historically incentivized?  Does simulation size even have anything to do with the actual impact of the scientific output?

While I do genuinely believe that the global exascale effort has brought nearly immeasurable good to the HPC industry, it's now casting a very stark shadow that brings contrast to the growing divide between energy-efficient, accelerated computing (and the science that can make use of it) and all the applications and science domains that do not neatly map to dense linear algebra.  This growing divide causes me to lose sleep at night because it's splitting the industry into two parts with unequal share of capital.  The future is not bright for infrastructure for long-tail HPC funded by the public, especially since the cloud is aggressively eating up this market.

Because this causes a lot of personal anxiety about the future of the industry in which I am employed, I submitted the following whitepaper in response to an NSCI RFI issued in 2019 titled "Request for Information on Update to Strategic Computing Objectives." To be clear, I wrote this entirely on my personal time and without the permission or knowledge of anyone who pays me--to that extent, I did not write this as a GPU- or DOE-apologist company man, and I did not use this as a springboard to advance my own research agenda as often happens with these things.  I just care about my own future and am continually trying to figure out how much runway I've got.

The TL;DR is that I am very supportive of efforts such as Fugaku and Crossroads (contrary to accusations otherwise), which are looking to do the hard thing and advance the state of the art in HPC technology without leaving wide swaths of traditional HPC users and science domains behind. Whether or not efforts like Fugaku or Crossroads are enough to keep the non-Exascale HPC industry afloat remains unclear.  For what it's worth, I never heard of any follow-up to my response to this RFI and expect it fell on deaf ears.

Response to “Request for Information on Update to Strategic Computing Objectives”

G. K. Lockwood
August 17, 2019

Preface

This document was written as a direct response to the Request for Information on Update to Strategic Computing Objectives (Document Number 2019-12866) published on June 18, 2019.  All views expressed within are the personal opinion of its author and do not represent the views or opinions of any individuals or organizations with whom the author may or may not be associated in any professional or personal capacities.  This document was authored without the support, knowledge, or input of any such individuals or organizations, and any similarity between the opinions expressed here and any other individuals or organizations is purely coincidental.

Question 1. What are emerging and future scientific and technical challenges and opportunities that are central to ensuring American leadership in Strategic Computing (SC), and what are effective mechanisms for addressing these challenges?


While the NSCI Strategic Plan identified four overarching principles which are undeniably required to maintain continued American leadership, its five strategic objectives are, in many ways, mutually incompatible with each other.

In the three years following the initial NSCI plan towards delivering capable exascale, the outcomes of the Aurora and CORAL-2 procurements within DOE have made undeniably clear that the definition of “capable exascale” necessarily requires the use of GPU technologies.  Because GPUs are, in many ways, accelerators specifically suited for scientific problems that can be reduced to dense linear algebra, this has effectively signaled that scientific challenges which are not reducible to dense linear algebra (and therefore incompatible with GPU technologies) are, by definition, no longer of strategic significance.

By bifurcating science domains based on whether they are or are not compatible with GPU-based acceleration, we are now at a crossroads where entire classes of domain science research that have historically run at-scale on CPU-based leadership computing systems will be left behind.  To be clear, this is not simply a matter of engineering—many important classes of scientific challenges are fundamentally incompatible with the GPU accelerator model of computation, and no amount of code modernization will change this fact.  Yet these same science domains, which rely on complex multiphysics applications that are core to strategic areas such as stockpile stewardship and climate science, are of undeniably critical importance to both national security and society at large.

Thus, there is now a clear and growing gap between NSCI’s ambition to deliver capable exascale and the larger mission to maintain leadership in the entirety of the nation's strategically important computing.  There are technical challenges intrinsic to this growing gap, including research in hardware and software technologies that approach strategic computing more holistically rather than exclusively from a FLOPS perspective.  The community has long acknowledged that the scope of HPC has surpassed simply performing floating point operations, and the definition of capability computing now includes enabling science that, for example, may require tremendous data analysis capabilities (e.g., moving, transforming, and traversing massive data sets) but has relatively low floating point requirements.  The DOE Crossroads procurement and the Japanese leadership program and its Fugaku system embody this more balanced approach, and there is little doubt that both Crossroads and Fugaku will demonstrate a number of world’s-firsts and, by definition, demonstrate leadership in strategic computing without making all of the sacrifices required to meet today's definition of capable exascale.

Both Crossroads and Fugaku have required significant R&D investment to enable these dimensions of capability, and the NSCI would do well to explicitly call out the need for continued investment in such directions that are orthogonal to exaflop-level capability.

Question 2. What are appropriate models for partnerships between government, academia and industry in SC, and how can these partnerships be effectively leveraged to advance the objectives of SC?


The most impactful models for industry-government partnership in HPC have come in the form of close collaboration between the HPC facilities that deploy extreme-scale systems and the technology providers in industry that create and support the required hardware and software solutions.  Strategy necessarily involves taking input from user requirements, workload characterization, and technology trends to inform future directions, and HPC facilities are uniquely qualified to speak to both user requirements (by virtue of the fact that they directly interact with users in support of HPC systems) and workload characterization (by virtue of the fact that they manage HPC systems).  Complementarily, industry technology providers (vendors) are uniquely qualified to speak to technology directions, marketability, and sustainability in the larger technology market.

This effective collaboration can take the form of non-recurring engineering such as the contracts associated with large system procurements (often to address more tactical challenges towards strategic computing) or standalone programs such as DOE PathForward (which addresses longer-term technology development towards strategic computing).  In both cases though, industry (not HPC facilities or academic researchers) proposes the initial scope of work based on its own understanding of both (1) HPC-specific requirements and (2) larger market and profit prospects.  This latter point is critical because the HPC market alone is simply not large enough to sustain purpose-built technologies, and sustaining new technologies and their peripheral enabling ecosystems requires buy-in from multiple markets.

The role of academia in research is more complex, as academic research in HPC can be either basic or applied in nature.  Basic research (such as in applied mathematics and algorithm development) has stood on its own historically since such work results in a larger base of knowledge from which specific technology solutions (whether developed by industry or HPC facilities) can be composed both today and in the future.  The federal agencies participating in NSCI can claim credit for funding the basic research outcomes that have been incorporated into innumerable software and hardware technologies in use today.

On the other hand, applied research (such as developing new software systems that may implement the outcomes of basic research) has had very mixed outcomes.  It is often the case that applied researchers who have no direct relationship with either HPC facilities or technology providers formulate research projects based on second-hand HPC requirements and technology trends.  It follows that their interpretation of such requirements is incomplete, and their research outcomes are misaligned with the actual needs of HPC facilities and industry.  Barring cases where academic applied research outcomes are so valuable that they stand on their own (of which there are many examples including OpenMPI and Tau), applied research in the absence of such a sustainability path results in a tremendous amount of software that has virtually no long-term (i.e., strategic) value to SC.

This speaks to a gap between applied research in academia and those who apply research in practice that must be closed.  This gap has been perpetuated by a lack of HPC practitioners (domain scientists and applied researchers directly attached to HPC facilities or technology providers) on the committees that evaluate the merit of research.  Thus, a more effective engagement model would involve coupling the academic research pipeline to HPC facilities and industry more closely.  This may range from something as informal as increasing the diversity of review panels and program committees to include representatives from facilities and industry, to a formal requirement that successful research proposals have a clearly defined connection to a specific industry or facility partner.  Regardless of the solution though, funding applied research that will be "thrown over the wall" to HPC facilities and vendors without their input is not compatible with SC.

Question 3. How do we develop and nurture the capable workforce with the necessary skill and competencies to ensure American leadership in SC? What are effective nontraditional approaches to lowering the barriers to knowledge transfer?


Although virtually every report discussing strategic directions and future requirements of HPC calls for knowledge transfer and building a larger workforce through training and outreach (e.g., see the complete set of DOE Exascale Requirements Reviews), such reports generally neglect two critical realities of employing and retaining a talented workforce at production HPC facilities and in industry.

The first reality is that the problems intrinsic to modern HPC (solving problems at extreme scales) are no longer exclusive to HPC.  The ubiquity of technology in modern life now means that the entire technology industry must deal with problems at scale as a matter of course.  As such, the HPC community is now competing with well-capitalized commercial entities that have increased the absolute value of a skilled engineer to levels that the scientific research community simply cannot afford.

Thus, the perceived lack of a skilled workforce in HPC is not a failing of the workforce development strategy in place; in fact, it may be a great indicator of its success, as it has created a workforce whose skills have a value that far outstrips the investment put into workforce development.  However, this also means that the talented individuals who eschew the higher pay and amenities of working in the larger technology industry do so for non-monetary reasons (work-life balance, attraction to the science mission, geographic locality).  It is therefore critically important that strategic computing identify these motivators and build upon them to the greatest possible degree to maintain an edge in an extremely competitive hiring landscape.

The second reality is that the key to an exceptional workforce is not simply a matter of technical knowledge.  There is no shortage of individuals who understand parallel programming in the world, and it is of little strategic value to pursue workforce development strategies that prioritize knowledge transfer as the principal outcome.  Rather, strategic computing requires a workforce that is capable of critical thinking and has a natural drive to solve problems that have never been solved before.  These traits should be emphasized to a far greater degree than the current pedagogical emphasis on material that can be learned from a manual by anyone with a curious mind.

By definition, very few people in the world have prior experience in world-class HPC.  There are very limited opportunities to build a credible work history in extreme-scale HPC for individuals who are ineligible for student internships or postdoctoral appointments.  As a result, world-class HPC facilities rarely see qualified applicants for open positions when “qualified” is defined on the basis of relevant work experience; a mid-career developer or systems engineer working in a campus-scale HPC organization simply has no opportunities to demonstrate his or her intellectual capability in a way that stands out to the facilities that deliver strategic computing resources.

Thus, an integrative approach to workforce development that (1) emphasizes problem-based learning rather than rote reiteration of manuals and standards documents in an environment where (2) representatives from NSCI constituent agencies can engage with trainees (i.e., potential employees) in a fashion with less formality and pretense than a typical "CV-phone screen-interview" pipeline may reveal a much broader potential workforce whose strengths more closely align with strategic computing.  Such an approach may manifest in the form of intensive boot camps such as the DOE ATPESC program, grants for mid-career retraining in partnership with a leadership computing facility, or sabbatical support for technical staff at the nation’s mid-scale computing facilities.

Question 4. How can technical advances in SC and other large government and private initiatives, including infrastructure advances, provide new knowledge and mechanisms for executing next generation research?


No response.

Question 5. What are the future national-level use cases that will drive new computing paradigms, and how will new computing paradigms yield new use cases?

It is easy to claim that artificial intelligence will be the most important future national use case to drive new computing paradigms.  However, this is a very dangerous statement to make without qualification, as the actual level of readiness for applying AI to solve scientific problems is very low, and the actual scales, aggregate demand, and algorithmic motifs required by such workloads for scientific discovery are poorly defined.  More generally, the requirements of AI workloads at large remain uncertain; for example, Facebook uses a variety of AI techniques in production and has found that each application area requires different computational, storage, and network resources (see Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective).  Outside of the large hyperscale datacenters, industry consensus suggests that production AI workloads remain largely at single-server scales.  As such, it is difficult to confidently assert what the rate of scale-out AI will be for strategic computing.

The current leading technique for AI at scale is deep learning, yet scientific discovery is at odds with the black-box nature of this method.  Alternative methods such as decision trees offer much more insight into why a trained model behaves as it does and are more compatible with applying physical constraints to the physical systems being modeled (e.g., see Iterative random forests to discover predictive and stable high-order interactions).  However, the relative importance of such non-black-box learning techniques in HPC is completely unknown, as are the general optimization points for such techniques in the context of scientific computing.  There is a danger that the similarities between deep learning and many HPC problems (GEMM-heavy workloads) place an artificially high importance on the role of deep learning in SC.  It may be the case that deep learning is the most effective method for applying AI to address problems in scientific computing, but caution must be taken to ensure that major challenges in SC not all look like deep-learning nails simply because GPUs are a very effective hammer.

From a domain science perspective, there are very few domain sciences where AI can replace traditional simulation-driven workflows wholesale.  As such, the role of AI in SC will be largely supplementary; scientific workflows may integrate an AI component to generate starting conditions, replace humans in the loop during steering, or identify areas of interest in the results of a primary simulation.  However, it is very unlikely that AI will grow to be of greater significance to scientific computing than modeling and simulation.  Instead, it will be the source of new computational resource requirements that simply did not exist in the past because those tasks were carried out by humans.  The road towards integrating AI into scientific workflows will also be a long and tortuous one, as the field is evolving far more rapidly in industry than scientific computing traditionally has.  Care must be taken that SC not tie itself too closely to a method (and its associated hardware configurations) that may be deprecated in short order.

Question 6. What areas of research or topics of the 2016 NSCI Strategic Plan should continue to be a priority for federally funded research and require continued Federal R&D investments? What areas of research or topics of the 2016 Strategic Plan no longer need to be prioritized for federally funded research?


The five objectives outlined in the 2016 NSCI Strategic Plan all gravitate around elements of topics that require continued federal R&D investments, but they do require realignment with the technological, scientific, and economic landscape as it exists now.

Objective 1: accelerating the development of capable exascale by the mid-2020s

The 2016 NSCI report correctly stated that capable exascale technologies would not be available until the mid-2020s, but DOE pulled its exascale system deliveries into the early 2020s.  As a result, the delivery of exascale had to be accelerated at significantly higher costs: there have been significant capital costs (the first US exascale systems will cost between 2x and 10x their immediate predecessors, either setting a new bar for the cost of future leadership HPC systems or resulting in a bubble in funding for all post-exascale machines), operational costs (the power budgets may exceed the original 20 MW goal by 50%), and opportunity cost (only two of the three CORAL labs actually deployed a CORAL-1 machine).

Notably absent here is a commensurate increase (2x-10x, 1.5x, or 1.3x as above) in R&D efforts towards making these exascale systems widely accessible to applications that do not fall under the umbrella of ECP funding.  As such, NSCI must continue to emphasize the importance of funding R&D to enable the “capable” component of this objective through the mid-2020s at minimum.

Objective 2: Developing a coherent platform for modeling, simulation, and data analytics

The convergence of HPC and Big Data was a popular point of discussion when the 2016 report was written, but there has yet to be a compelling, quantitative analysis that demonstrates the difference between a “Big Data” system and an “HPC” system despite the best efforts of several leadership-scale HPC facilities.  The challenge is not one of technology and system architecture; rather, the principal design point for “Big Data” systems outside of the HPC world has simply been one of cost (e.g., scaling out cheap hardware over a cheap network for a very well-defined bulk data access pattern) over performance.  There is absolutely nothing that stops the typical “Big Data” application stacks, both old (e.g., Hadoop and Spark; see this paper) and new (e.g., TensorFlow; see this paper) from running at scale on any modern HPC systems, and both have been demonstrated at scale on systems that were sensibly designed.

As such, this objective need not be emphasized in the future.  Rather, engineering work is required to enable the “Big Data” stacks in use outside of HPC to work efficiently on the HPC systems of tomorrow.  This remains a software, not architectural, problem, and very much an engineering, not research, challenge.

Objective 3: R&D towards post-CMOS technologies and new paradigms

It is not the role of NSCI constituent agencies to fund the development of new materials systems explicitly for post-CMOS computing, because these agencies, their review committees, and the academic researchers they fund do not have the insight into the realities of logistics, material costs, and manufacturing required to predict what combination of materials and microarchitectures could actually be turned into a marketable product that can be sustained by the larger technology industry.  In the absence of this insight, R&D towards post-CMOS technologies is likely to produce interesting demonstrations that are impractical for the purposes of actually developing leadership-scale computing systems.  Instead, such research should be funded using facility-industry partnerships as discussed previously in Question 2.

Investing in R&D towards new paradigms in computing should also be considered not with respect to enabling new scientific applications, but rather accelerating existing scientific workloads that are incompatible with exascale technologies (GPUs).  As discussed in response to Question 1, there is a very real risk of leaving entire domains of computational science behind as the definition of leadership computing (when equated to exascale) becomes increasingly narrow in scope.  Developing new accelerator technologies that benefit complex application workflows (e.g., multiphysics simulations) is of critical importance in the coming years, lest missions such as stockpile stewardship and climate science fall by the wayside.

Objective 4: Improving application development and workforce development

The DOE Exascale Computing Project (ECP) has demonstrated a highly effective way of integrating researchers, application code teams, and facilities towards improving application development.  Providing a coherent ecosystem of recommended methods (such as its IDEAS project; e.g., see ECP-IDEAS), development tools (funded under its Software Technologies area), algorithm-application partnerships (through its co-design centers), and application integration efforts (funded under Hardware and Integration area) are an excellent blueprint for improving application development.  Developing a more generic model for establishing and supporting this style of development beyond the timeline of the ECP funding should be pursued.

Improving workforce development should involve less focus on basic technical training and more on cultivating critical thinking, as described in the response to Question 3 above.

Objective 5: Broadening public-private partnership

As described in the response to Question 2 above, public-private partnership is absolutely critical to sustain SC in the coming years.  The financial incentives driving technology development from the world outside of HPC have come to outstrip the resources available to HPC to exist independently.  SC efforts must engage with both technology providers and the primary market forces (the enterprise and hyperscale computing industries) to better understand where technologies, solutions, and opportunities can be pursued in partnership rather than in parallel.

Question 7. What challenges or objectives not included in the 2016 NSCI Strategic Plan should be strategic priorities for the federally funded SC R&D? Discuss what new capabilities would be desired, what objectives should guide such research, and why those capabilities and objective should be strategic priorities?

The mission of providing capable exascale as described in the 2016 NSCI Strategic Plan is proving not to be a sustainable long-term path.  As described in the response to Question 1 above, the first exascale machines stand to accelerate scientific problems that can be cast as dense matrix-matrix multiplication problems, but there are large swaths of scientific problems to which this does not apply.  If one considers the Graph500 BFS list, three of the top five systems are over seven years old and will be retired in 2019.  While graph problems are not prolific in SC, the fact that so little progress has been made in accelerating extreme-scale graph traversal during the seven years that exascale has been aggressively pursued is indicative of some classes of HPC problems being abjectly left behind.

Thus, a primary objective towards capable exascale must be examining the opportunity costs of the current strategic direction.  If it is determined that there is simply no way to bring forward those types of computational problems that are incompatible with GPU-based acceleration, then a clearer strategy must be formulated to ensure that the scientific challenges being solved by those computational problems do not stagnate.  As it stands, the public discourse surrounding the first-generation US exascale architectures is not universally positive because of this perceived scientific exclusivity of the chosen architectures, and such exclusivity is at odds with both capable computing and computing leadership.

The joy of buying a standing desk during the pandemic


When my employer announced that we were all going to work remotely back in March, I had no semblance of a home office and had to scramble to figure out how to set up a space in my small urban apartment that would be suitable for days dominated by videoconferencing.  Thinking it'd be just a few weeks, I thought I could ride it out using my guest bedroom's writing desk, but by the time June rolled around and it was clear we were not returning before 2021, it was time to give up the fantasy and set up a real home office.  Priority #1 was to get a real desk.

Real desks are expensive, and if I was going to spend a few hundred dollars on one, I wanted it to be at the right ergonomic height.  I type at a height of 28 inches from the ground, which turns out to be a non-standard desktop height, so my attention quickly turned to adjustable-height desks.  It wasn't long before I was looking at standing desks, which cost quite a bit more than a couple hundred dollars, and being stingy, I spent weeks doing my homework and agonizing over exactly which model to order to get the most out of the $900+ investment.  For the benefit of anyone else facing a similar situation and wanting to agonize over the details of standing desks, I decided to document my adventure.

Note that I have no financial interests in any of the companies mentioned below.  I am just writing this in the hopes that someone finds this perspective useful.

Choosing a Desk Supplier

Because these standing desks are quite expensive, I spent a lot of time struggling to decide which company and model were the best.  Anyone who Googles around for information about reputable standing desks will probably discover

  1. Uplift's V2 and Fully's Jarvis desks are the most celebrated
  2. A website called BTOD.com has a bunch of really interesting teardowns of many standing desk models written by a guy named Greg Knighton
I had a bunch of criteria in mind.  In no particular order,
  • I did not want a cross member that I would bang my legs against all the time
  • I wanted a solid wood desk surface, not a laminate on medium-density fibreboard (MDF)
  • I wanted a dark finish on the wood and a dark or industrial finish to the metal, if possible
  • I wanted something with a lot of accessories that I could bundle
  • I wanted something relatively lightweight since I move quite a bit
  • I wanted a 60x30 desk--no longer, no shallower
  • I did not want a cheap desk made in China
The last bullet caused me a fair amount of angst, because both Jarvis and Uplift desk frames are manufactured in China.  The BTOD desks appeared to be the only ones manufactured in the USA, but they failed all of my other criteria in that they have a leg-smashing cross member, focus primarily on laminated MDF desktops, have a very limited selection of accessories, and use a cheaper mechanical design around the lifting mechanism.

Further Googling about BTOD desks also revealed that there's a bizarre story to be told about BTOD.com and Greg Knighton, and a little searching will give you a taste of that mess.
The fact that neither Fully nor Uplift engaged in this kind of mud slinging, combined with the facts that they met all my other criteria and marketed towards professionals rather than gamers, really narrowed the field down to just those two players.  I ultimately chose Uplift for the following reasons:
  • Pre-sales support: I contacted both Uplift and Fully on the same day with questions about their desktop thickness.  Uplift got back to me the same day, while Fully took four days to respond.  Not a deal breaker by far, but I took it as an indicator of their level of support.
  • Desktop: Fully's hardwood desktops are significantly more expensive than Uplift's.  Uplift also offered rubberwood desktops with a nice, eco-friendly story about where they come from.  The reality is that rubberwood is cheap and plentiful since it's sourced in Asia, but the rubberwood pitch makes me feel like I know where my desk came from.
  • Cable management: Uplift offers a magnetic metal cable channel that matches the finish of the desk frame.  This is a really nice way to run thick cables down one leg of the desk without disrupting aesthetics.  Fully had nothing comparable.
  • Minor costs:  Fully charges $20 extra for a frame that goes below 29" and I needed to go down to 28".  They also charge $20 extra for the industrial finish for some reason.  This would have been more palatable to me if they just charged $40 on top of every desk.
  • Assembly: I am a measure once, cut twice kind of guy.  I know this about myself, so the thought of having to drill my own desktop was not attractive.  Some of Fully's basic accessories (such as the cable management tray) require drilling, whereas Uplift's did not.  In addition, Uplift had nice assembly videos that made me feel better about how easy the assembly would be.
That all said, there were some places where Fully had an advantage:
  • USB power: Fully's clamp-mounted surge protector has USB-C while Uplift's does not.  Unfortunately, Fully's USB-C port does not supply enough amperage to power a MacBook.
  • Desktop: Fully offers a dark bamboo finish which is the best of both worlds--offers the lightest-weight desktop option in the dark finish I wanted.  Uplift did stock a dark finish bamboo according to their print catalog, but apparently they sold out of it very fast and couldn't source more.
  • Lead time: Due to COVID-19 and supply chain issues, the Uplift rubberwood desktop I wanted (sourced from Vietnam) was back-ordered by two months.  Fully could have shipped a comparable desktop within a week or two by comparison.
That last factor--lead time--probably gave me the most heartburn since I was buying a desk for both my ergonomic and mental well-being, but my wife convinced me that August would be here before we knew it.  She was right, and it turned out that having something to look forward to for two months added a surprising amount of positive focus to my life.

I ultimately ordered from Uplift on June 4 through Cary, one of their sales associates who had been answering all my ultra-specific questions about product dimensions in the days prior.  The configuration on which I decided was:
  • Uplift V2 with a two-leg C-frame in the industrial-style metallic finish
  • 60x30 solid rubberwood desktop in the dark finish
  • two standard wire grommets
  • basic wire management kit
  • magnetic cable organizing channel in industrial-style metallic finish
  • clamp-on power
  • 8-outlet mountable surge protector
  • the bamboo balance board and writing desk pad (free promo items)
The total was just under $900 before taxes--not cheap--but the ten-year warranty on the desk frame helped me justify the cost as being amortized over a decade.   I also realized near-term value in the desk as something I could use to take my mind off of the stressors of the pandemic during the forthcoming months of planning, anticipating, and enjoying the desk.

Desk Assembly

After waiting two months and ten days, my desk finally arrived.  The desk came in three boxes under a single FedEx shipment:

  1. The desktop itself
  2. The desk frame, legs, and control box
  3. The desk frame base and any added accessories
Three-box FedEx shipment of my desk.

I was most worried about the desktop itself since it was the item with the longest lead time, the most bulk, and the most at risk of being damaged during shipping. It ultimately weighed in at under sixty pounds though, and it arrived with no visible damage.

Box in which the solid-wood desktop was shipped.

The packaging around the desktop was quite good. After removing the external straps and some packing tape, the inside of the shipping box had hard cardboard corners, corrugated cardboard sheets to protect the top and bottom faces, hard cardboard framing on all four edges of the desk, fitted foam framing beneath that, and a thin foam sheet covering the entire desktop to avoid scuffing. The packaging was clearly designed to protect against drop damage on all edges and corners; it would take dropping this box on something sharp like stairs or a railing to do damage.

Packaging of the desktop box.  At this point I had only removed the external cardboard lid of the shipping box and a cardboard sheet that laid over the foam sheet pictured.

Although I ordered a dark stain on my desktop, I was surprised to see that even the bottom of the desk was stained (albeit to a much lighter degree). I was expecting that the rubberwood would look cheap and uninteresting since it is the cheapest solid wood option, but the wood had quite a bit of grain showing through.

Underside of the dark-finished rubberwood desk.  Note that the underside is stained but to a far lighter degree than the top surface.

I was also surprised that the underside of the desktop was completely pre-drilled. Since I move a lot, the notion of disassembling and reassembling a desk held together with wood screws every few years was not appealing. However, the abundance of nut inserts and machine screws means this desk can come apart many times without concern for stripping the wood.

Wood nut inserts on the underside of the desk.  These are where the frame attaches.

Pre-drilled pilot holes were provided for mounting the motor control pad and the cable management tray, but everything else was fitted with nut inserts. The entire frame attaches to the desk with machine screws.

Three-inch diameter grommet holes were also pre-cut into the desktop and roughly finished. There was an uneven and thick coating of stain along the inside edges, but nowhere near enough to cause any concern.

Three-inch grommet hole for cable pass-through.

The second box contained most of the frame, the cable management tray, the motor control box, and assembly instructions. Coming packaged straight from the OEM just as the desktop had, this box of frame components was solid and used double-walled corrugated cardboard with all parts encased in form-cut foam.

Packaging in the box containing the majority of the desk frame itself along with the assembly manual.

The third box in the shipment was a catchall that contained all of the accessories I ordered and the base of the desk. Unfortunately this box was not packed nearly as well as the others since it was a box of boxes; in the photo below, there was only wadded-up packing paper filling out the gaps when I opened it. The grommets and control pad were banging around loose, and the white accessory boxes contained accessories wrapped with only thin bubble-wrap sleeves.  As I detail later, one of my accessories did sustain damage that may have been related to this minimal packaging.

Inside the accessory box.  A length of crumpled packing paper was also included but is not shown.

Like the desk frame box though, the box within this box that contained the frame base was OEM-packaged and had form-cut foam and double-walled corrugated cardboard.

Below is the box containing the base (top) and the box containing the rest of the frame with the first layer of frame components already removed (bottom). The shiny black piece in the middle of the bottom box is the plastic cable management tray that comes standard with the Uplift V2 desk frames now. The inclusion of this cable management tray obviates the need to buy the Advanced Cable Management Kit over the Basic Cable Management Kit as my Uplift sales rep, Cary, pointed out; this saved me a couple bucks in the end.

The box containing the desk base (top) and the rest of the desk frame (bottom).

Another nice touch about the Uplift desk is that all assembly components (screws, nuts, etc.) come in numbered pouches not unlike IKEA furniture. And, like IKEA furniture, the necessary Allen wrenches were also included, allowing you to genuinely assemble this desk with nothing other than a manual Phillips screwdriver for the wood screws.

All screws, cable ties, bolts, Allen wrenches, and other loose items.  Plenty of extras were included.

The desk also ships with excess and nice-to-have parts; for example, it includes both machine screws (if you use an Uplift pre-drilled desktop surface) and heavy wood screws (if you use your own desktop surface), and it comes with enough self-adhesive reusable cable mounts to tie down all of the cords and cables associated with the desk frame’s lifting mechanism.

The entirety of the desk frame components are shown below. Far fewer parts were involved than I thought, and the entire frame is held together using machine screws.

Entire desk frame prior to assembly.

The frame components themselves look and feel solid. For example, the base is reinforced steel that is definitely a cut above your typical flat-pack furniture.

Close-up of one desk foot.

The welding that connects the legs, frame top, and triangular stability brace looks robust. Also shown below are four of the Uplift’s unique accessory mounting points, which are threaded holes through a reinforced steel plate that’s welded to the structural part of the frame.

Close-up of the joint between a desk leg and the top frame.  The plate with holes provides the accessory mounting points.  Also shown is the triangular stability brace, which reduces wobble.

Just as the frame is assembled entirely with machine screws, the frame itself also attaches to the desktop using machine screws. Every attachment point between the frame and desktop has thick rubber grommets above and below the frame, allowing the frame to firmly attach to wood desktops that may vary in thickness by a few millimeters. Again, all screws and washers came with the desk frame itself, and all of the pre-drilled holes lined up with the frame perfectly without needing to bend or stretch anything as one sometimes has to do with IKEA furniture.

Attachment point between desk frame and desktop.  Rubber grommets are affixed to the frame at all points, and all attachment bolts come with washers.

It’s also notable that the assembly instructions for the desk are written by a native English speaker, and they contain unexpectedly helpful pointers and details specific to this desk’s assembly. For example, you have a choice of which side to mount the keypad that controls the desk’s elevation, and the manual reminds you that you are looking at the desk upside-down as you're doing this.  As a result, you have to install the keypad opposite of where you want it to be when the desk is right-side up; this sort of thing is a mistake I'd make and then get mad about.

Excerpt from the assembly manual.  Very well written.

In addition, Uplift e-mails you links to assembly videos a few days before your desk arrives, so if you aren’t big on reading all the instructions before you start (like me), you can still quickly scope out what things to be careful of during assembly ahead of time. I found the video particularly helpful for showing some good ways to bundle and tie down excess power/control cables for the frame.

Finally, although the English instructions are fantastic and clear, English is the only set of instructions that ships with the desk. Non-English speakers may be in some trouble, but I have to assume that Uplift knows its customer base and made a decision to ship less paper in favor of saving more trees. I’ll also note that I watched the assembly video with no sound, and it’s quite easy to understand regardless of language.

Speaking of mounting the keypad though, this is done using wood screws instead of machine screws. There are two sets of pre-drilled pilot holes under the desktop, and driving the wood screws into them is completely possible by hand using a Phillips-head screwdriver. Note that the unused pair of pilot holes shown below is for the other keypad options you can choose when buying the desk.

Height control keypad attached to the underside of the desk using two wood screws.

The plastic cable management tray included with the frame also mounts using pre-drilled pilot holes and provided wood screws. However, I found that the pre-drilled holes in the desktop did not line up with the pre-drilled holes in the tray itself; they were misaligned by a few millimeters. Since I know that I am a measure-once/cut-twice kind of person, I chose to drill new holes in the cable tray to match the holes in the desk since plastic is a lot more forgiving than wood. And sure enough, I did need to drill twice since I only measured once.

Slight misalignment between the pre-drilled pilot holes in the desk and the included cable management tray.

After screwing everything together, it’s a small matter to install the control box since it just slides into a bracket in the frame.  Running cables from it to the motors in each desk leg and connecting the cable from the control keypad are similarly straightforward. As noted earlier, the desk frame comes with a handful of self-adhesive reusable cable ties which adhere to the metal frame very well. The written instructions provide guidance on where to best stick these along the frame to keep the wires out of sight, and the video includes additional recommendations on how to tie down the longer stretches of cable slack. This frame is suitable for desks up to 80 inches wide but I bought a 60-inch desk, so I had a good two feet of excess cable to tie down and tuck away.

Fully assembled desk, ready to be flipped upright.  At this point all the cables had already been tied down with the included self-adhesive reusable zip ties.

Flipping the desk upright was a two-person job because it's about a hundred pounds, and the Advanced Comfort Keypad sticks out in a way that makes resting the desk on its long edge inadvisable. However, the whole job took a lot less labor than I anticipated; clearing packing material out of my small home office as I went probably doubled the time it took to assemble the desk.

Fully assembled desk standing upright.  Notice how much darker the top finish is compared to the bottom.


Accessories

One of the big draws of ordering an Uplift is the degree of customizability in accessories; spending days agonizing about exactly which accessories to include was part of the retail therapy for me. Uplift also includes a couple of free accessories with each desk order which somehow makes it easier to rationalize taking a risk on buying accessories that may prove to be frivolous or unusable.

Most of the accessories that I ordered with my desk.

I ordered the following accessories at the same time as my desk:

Two Wire Grommets - I was tempted to get at least one Power Grommet to simplify plugging in transient stuff like my desk fan, but I also needed at least two accessible USB plugs for my phone and headphones in addition to standard 120V outlets.  Given the steep cost of Power Grommets ($69 each!), I opted to solve both problems with the $45 Clamp-on Power accessory instead and stick with the cheap plastic pass-through grommets.  And make no mistake, these grommets look and feel pretty cheap!  But they are a standard 3" diameter, so you could get any third-party grommets you want if you want to dress up your desk.

Clamp-on Power with USB - This accessory provides two standard 120V outlets and two 5V USB-A outlets with enough amperage to charge both my iPad Pro and wireless headphones.  I was worried that this would feel cheap, but it does not; the body feels like steel, and the clamp is solid.  The only issue I have with it is that the power cord is very thick and comes straight down out of the bottom of the enclosure.  There's no way to avoid having the power cord awkwardly bend against your desktop, in plain sight, and be routed off the desktop either off the back or all the way to the nearest grommet hole.  It'd have been preferable if the power cord came out the back of the enclosure, or at least behind the clamp, so that it's not visible while seated.

Advanced Comfort Keypad - I wanted a programmable keypad, and a couple of reviews online said it was worth the extra $10.  It's easier to view and control since it comes out from under the desk at an angle, but as a result, it also cannot sit flush with the bottom surface of the desk.  I worry that this will subject it to damage if someone carelessly tips the desk on its front edge while trying to invert it, but one just has to be careful.

Bamboo Motion-X Board - This was a free promo item that I thought was going to be a cheap gimmick, but I'm genuinely glad I wound up getting it!  I spend a lot of time on Zoom calls, and being able to rock around is more fun than tapping my leg to keep my blood flowing.  For an extra $20, you can also slap an adhesive foam standing mat on top of it so it doubles as a more comfortable standing surface but I've yet to feel the need to get this.  I will say that the bare bamboo board is hard on my feet after ten minutes, but I've found myself able to rock around for 60-90 minutes on it while wearing socks and/or slippers.

Writing Desk Pad - This was the other free promo item I got, and again, I wouldn't have considered it if it wasn't free.  I got it in navy blue which turned out to be a beautiful color that complements my dark desk finish, and it does add both contrast and wrist comfort to the desk.  Although I don't use a mouse, I would expect that it obviates the need for a mousepad as well.  Contrary to its leather-like appearance, it is urethane-based, waterproof, and very pliable.  Again, nicer and more useful than I expected.

Basic Wire Management Kit - I ordered this because I wanted the cable coil and zipper, and I figured that the extra zip ties and power strip would come in handy.  I also initially requested the Advanced Wire Kit (which is just this basic kit plus the wire tray), but Cary at Uplift pointed out that Uplift v2 desks now include a wire tray and there's literally no sense in spending the extra $10.  It also turned out that the desk frame ships with enough adhesive reusable zip ties so that I didn't need those, the magnetic cable organizing channel largely obviated the need for the cable coil, and the 8-outlet mountable surge protector meant I didn't need the power strip.  I'm sure the extra ties and hook from this will find a use sometime in the future though, and I was appreciative that my Uplift sales associate actually down-sold me on something that he knew was superfluous--never had that happen before!

Magnetic Cable Organizing Channel - This is a neat metal tube that allows you to run a small bundle of cables down a desk leg discreetly.  As simple as it is, I quite like it since (1) it is large enough to support both the 14-gauge power cord from my desk's power strip and an Ethernet cable, (2) it matches my desk's finish so it's aesthetically invisible, and (3) it's easy to pop off to add or remove additional cables to the bundle without having to undo zip ties.  This was one of the nice touches that Uplift had that Fully lacked, and the aesthetics and convenience are totally worth the $25.

8-Outlet Mountable Surge Protector - This is a really cool power strip that mounts directly to the accessory points unique to the Uplift desk.  It has 900 J of surge protection, eight outlets configured so that you can stick wall warts to all of them, and a mounting position that makes it easy to bundle excess cable in the adjacent cable tray.  A nice bonus of this surge protector is that it actually fits a standard 19", 1U form factor too.  Shown below is the 3-hole ear from the power strip's mounting plate laying on top of a standard 1U server rack ear.

The 3-hole ear from the power strip's mounting plate lines up perfectly with a standard 19" server's rack ear.

You can't mount any old 1U PDU to an Uplift V2 desk without this kit because you'd still need the metal adapters that connect 1U ears to the Uplift V2 accessory mounts, but if you ever do need to replace the actual surge suppressor part of this kit, you could probably buy a third-party 1U PDU to replace it.

Unfortunately, I had a lot of problems with this accessory because of its low manufacturing quality.  When my initial desk shipment arrived, this surge protector came out of the box with a lot of plastic bits rattling around inside, and some of the outlets did not grip a plug as well as others.  I reported the symptoms to Cary at Uplift, and he immediately offered to send a replacement and said I could just toss the broken one.  This no-hassle replacement really dulled the initial disappointment of receiving a broken part.

When the replacement kit came, it too came out of the box with the sound of rattling plastic inside.  Since I had a broken spare that was destined for the trash, I decided to take it apart to see if I could repair it and avoid having to wait another week for a second replacement to arrive.  As soon as I unscrewed one end, indeed, a bunch of black plastic shards came pouring out of the mostly hollow interior.  It was also clear that something was detaching from the aluminum case that should've been holding the eight plugs in place:

Viewing inside of the 8-outlet power strip with an end removed.  Something at the far end is clearly out of alignment.

Unscrewing the other end of the surge protector allowed the aluminum housing to slide right off, and that's when the root problem became very apparent--the entire mechanical interface for each pair of outlets is housed in a clamshell-like plastic enclosure that is held together with three small screws.  In both my original shipment and the replacement, something impacted the power strip so hard that it shattered the cheap plastic of this housing, causing the back half to detach from the front-facing half that anchors to the aluminum housing:

Broken plastic attachment points, still holding little steel screws.

These plastic housings each support two plugs, and in my originally shipped part, three out of the four housings (six out of eight plugs) were completely destroyed.  I can envision a case where trying to plug something into such a damaged strip could pose a fire hazard, so I am surprised that this wasn't identified as an issue during its UL certification.

Dissected power strip with three of four housings being completely broken.

Fortunately, the gap between the aluminum housing and the back of these housings is exactly 0.75".  I was able to safely repair both damaged power strips by going to my local hobby store and buying pieces of wood that were 0.75" x 0.5", cutting them to the length of the power strip, and then re-assembling the housings such that the backings were wedged into place between their front half and this piece of wood instead of relying on the (broken) plastic attachment points.

This was the only dissatisfying part of the entire process, and the fact that Uplift sent a replacement without giving me a hard time made it much easier to cope with.  I'll add that this power strip, despite being $59.00, felt like the cheapest accessory of the lot, and nothing else I ordered was damaged.  The fact that I got two broken ones in a row suggests that either these parts are not being packaged appropriately for shipment, or there is a damaged lot of these at the Uplift warehouse.  Until Uplift makes a more robust revision of this part though, I can't recommend ordering one unless you can pick it up in-person in Austin.  It's a really convenient accessory, but it's not very sturdy for its price tag.

Afterward

Aside from the inconvenience around the 8-outlet surge protector, I really have no regrets about the time and money I put into buying this desk.  The benefits to me were manifold.

Having an ergonomic desk setup makes work more comfortable.  I was fortunate not to have any ergonomic pain with my older IKEA desk, but the fact that it had drawers meant I could not type at the correct height without smashing my knees into the bottom of the desk.  Now that I have a bona fide computer desk, I can keep my keyboard at a good height while my feet are flat on the floor and, with the help of a monitor stand, position my display at the correct eye level.  Because it is a real computer desk, I can also install a clamp-on monitor stand, keyboard tray, or other accessories later on.

Having a nice desk has improved my mood while working.  Some people are content to work in spartan cubicles, but I am not one of them.  I'm one of those guys who has plants, photos, tchotchkes, Christmas lights, and even an SGI O2 at my desk.

My desk at work last December, complete with pizza-shaped lights.

I also have a bunch of eccentric "nice" stuff from which I derive joy throughout the work day, like my favorite work bag or my wacky ostrich skin boots.

My therapeutic full-quill ostrich boots can't give me joy so long as I am working from home, but a similarly extravagant and frivolous desk can.

I quickly realized that losing my commute and moving into my neutral and inoffensive guest room took a lot of those little joys out of my day, resulting in just a higher general baseline level of stress and unease while working.  Surrounding myself with junk that I like--whether it be nicely finished wood desktop, a sweet Cray 3 poster, or a big bushy ficus--greatly improved my overall mood.

Having something fun to plan and look forward to is important to break up the monotony during the pandemic.  I've realized that all the travel I used to do for work helped to keep me motivated throughout the year; whether it be CUG in New Zealand, ISC in Frankfurt, or even a CORAL-2 review in Tennessee, there was always something new to look forward to.  Getting excited about customizing the perfect desk is much like planning the perfect ISC presentation, and waiting for the desk to arrive is like waiting to have your first ultra-jetlagged Flammenkuchen and Apfelwein on the bank of the River Main.  Having a goal has proven to be critical to getting me through the bad weeks of working through the pandemic.

Having said all this, I realize that I am very fortunate to be employed throughout this pandemic and able to afford buying such an expensive desk.  For those who share similar fortune, though, this desk was well worth the cost given the enjoyment I've gotten out of this process.

PDSW'20 Recap


This year was the first all-virtual Parallel Data Systems Workshop, and despite the challenging constraints imposed by the pandemic, it was remarkably engaging.  The program itself was contracted relative to past years and only had time for three Work-In-Progress (WIP) presentations, so it was a little difficult to pluck out high-level research trends and themes.  However, this year's program did seem more pragmatic, with talks covering very practical topics that had a clear connection to production storage and I/O. The program also focused heavily on the HPC side of the community, and the keynote address was perhaps the only talk that focused squarely on the data-intensive data analysis side of what used to be PDSW-DISCS.  Whether this is the result of PDSW's return to the short paper format this year, shifting priorities from funding agencies, or some knock-on effect of the pandemic is impossible to say.

Although there weren't any strong themes that jumped out at me, last year's theme of using AI to optimize I/O performance was much more muted this year.  Eliakin del Rosario presented a paper describing a clustering and visual analysis tool he developed that underpins a study applying machine learning to develop an I/O performance model presented in the main SC technical program, but there was no work in the direction of applying AI to directly optimize I/O.  Does this mean that these ideas have climbed over the hype curve and are now being distilled down into useful techniques that may appear in production technologies in the coming years?  Or was the promise of AI to accelerate I/O just a flash in the pan?

In the absence of common themes to frame my recap, what follows are just my notes and thoughts about some of the talks and presentations that left an impression.  I wasn't able to attend the WIP session or cocktail hour due to non-SC work obligations (it's harder to signal to coworkers that you're "on travel to a conference" when you're stuck at home just like any other workday) so I undoubtedly missed things, but all slides and papers are available on the PDSW website, and anyone with an SC workshop pass can re-watch the recorded proceedings on the SC20 digital platform.

Keynote - Nitin Agrawal

This year’s keynote by Nitin Agrawal was a long-form research presentation on SummaryStore, an “approximate storage system” that doesn't store the data you put in it so much as it stores the data you will probably want to get back out of it at a later date.  This notion of a storage system that doesn't actually store things sounds like an affront at a glance, but when contextualized properly, it makes quite a lot of sense:

There are cases where the data being stored doesn't have high value.  For example, data may become less valuable as it ages, or data may only be used to produce very rough guesses (e.g., garbage out) so inputting rough data (garbage in) is acceptable.  In these cases, the data may not be worth the cost of the media on which it is being stored, or its access latency may be more important than its precision; these are the cases where an approximate storage system may make sense.

The specific case presented by Dr. Agrawal, SummaryStore, strongly resembled a time series database to feed a recommendation engine that naturally weighs recent data more heavily than older data.  The high-level concept seemed a lot like existing time series telemetry storage systems where high-frequency time series data are successively aggregated as they age so that new data may be sampled every few seconds while old data may be sampled once an hour.
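
To make the idea concrete, here is a minimal sketch of what age-based downsampling might look like; the window widths and the decision to keep a simple average (rather than, say, min/max/sum) are entirely my own inventions for illustration and are not taken from SummaryStore or any particular telemetry tool:

    import time
    from statistics import mean

    # Hypothetical aging policy: keep every sample for an hour, collapse to
    # 5-minute averages for a day, and to 1-hour averages beyond that.
    AGING_POLICY = [
        (3600,           1),     # age < 1 hour  -> full resolution
        (86400,        300),     # age < 1 day   -> 5-minute bins
        (float("inf"), 3600),    # older         -> 1-hour bins
    ]

    def bin_width(age_seconds):
        """Pick the aggregation bin width appropriate for a sample's age."""
        for max_age, width in AGING_POLICY:
            if age_seconds < max_age:
                return width

    def downsample(samples, now=None):
        """Collapse (timestamp, value) samples into coarser bins as they age."""
        now = now or time.time()
        bins = {}
        for ts, value in samples:
            width = bin_width(now - ts)
            key = (int(ts // width) * width, width)     # (bin start, bin width)
            bins.setdefault(key, []).append(value)
        # Old data loses detail, but coarse questions stay cheap to answer.
        return sorted((start, width, mean(values))
                      for (start, width), values in bins.items())

    if __name__ == "__main__":
        now = time.time()
        raw = [(now - i * 60, float(i % 7)) for i in range(3000)]  # ~2 days of 1-minute samples
        print(len(raw), "raw samples ->", len(downsample(raw, now)), "bins after aging")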

For example, LMT and mmperfmon are time series data collection tools for measuring the load on Lustre and Spectrum Scale file systems, respectively.  The most common questions I ask of these tools are things like

  • What was the sum of all write bytes between January 2018 and January 2019?
  • How many IOPS was the file system serving between 5:05 and 5:10 this morning?
By comparison, it's very rare to ask "How many IOPS was the file system serving between 5:05 and 5:10 two years ago?" It follows that the storage system underneath LMT and mmperfmon can be "approximate" to save space and/or improve query performance.  Dr. Agrawal's presentation included a pictorial representation of this idea:

Because these approximate storage systems are specifically designed with an anticipated set of queries in mind, much of Agrawal's presentation really spoke to implementation-specific challenges he faced while implementing SummaryStore--things like how SummaryStore augmented bloom filter buckets with additional metadata to allow approximations of sub-bucket ranges to be calculated.  More of the specifics can be found in the presentation slides and references therein.
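
I don't have visibility into SummaryStore's actual data structures, but the general trick of hanging extra metadata off of time-bucketed filters, so that queries spanning only part of a bucket can be approximated, might look something like this crude sketch; the bucket width, filter sizing, and uniform-spread interpolation are all assumptions on my part:

    import hashlib

    class ApproxBucket:
        """One time window's worth of events: a small bloom filter for
        membership tests plus extra metadata (count and sum) so that queries
        covering only part of the window can be approximated."""

        def __init__(self, start, width, nbits=1024, nhashes=3):
            self.start, self.width = start, width
            self.nbits, self.nhashes = nbits, nhashes
            self.bits = 0          # the bloom filter, stored as one big int
            self.count = 0         # extra metadata: number of events
            self.total = 0.0       # extra metadata: sum of event values

        def _positions(self, key):
            for i in range(self.nhashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.nbits

        def add(self, key, value):
            for pos in self._positions(key):
                self.bits |= 1 << pos
            self.count += 1
            self.total += value

        def maybe_contains(self, key):
            """False means definitely absent; True means probably present."""
            return all(self.bits >> pos & 1 for pos in self._positions(key))

        def approx_sum(self, qstart, qend):
            """Estimate the sum over [qstart, qend) by assuming events are
            spread uniformly across the bucket (the sub-bucket approximation)."""
            overlap = max(0.0, min(qend, self.start + self.width) - max(qstart, self.start))
            return self.total * overlap / self.width

    if __name__ == "__main__":
        bucket = ApproxBucket(start=0, width=3600)
        for minute in range(60):
            bucket.add(f"event-{minute}", value=10.0)
        print(bucket.maybe_contains("event-5"), bucket.maybe_contains("nope"))
        print(bucket.approx_sum(0, 1800))    # about half of the bucket's 600.0 total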

This notion of approximate storage is not new; it is preceded by years of research into semantic file systems, where the way you store data is driven by the way in which you intend to access the data.  By definition, these are data management systems that are tailor-made for specific, high-duty cycle I/O workloads such as web service backends.

What I took away from this presentation is that semantic file systems (and approximate storage systems by extension) aren't intrinsically difficult to build for these specific workloads.  Rather, making such a system sufficiently generic in practice to be useful beyond the scope of such a narrow workload is where the real challenge lies.  Tying this back to the world of HPC, it's hard to see where an approximate storage system could be useful in most HPC facilities since their typical workloads are so diverse.  However, two thoughts did occur to me:

  1. If the latency and capacity characteristics of an approximate storage system are so much better than generic file-based I/O when implemented on the same storage hardware (DRAM and flash drives), an approximate storage system could help solve problems that traditionally were limited by memory capacity.  DNA sequence pattern matching (think BLAST) or de novo assembly could feasibly be boosted by an approximate index.
  2. Since approximate storage systems are purpose-built for specific workloads, the only way they fit into a general-purpose HPC environment is using purpose-built composable data services.  Projects like Mochi or BespoKV provide the building blocks to craft and instantiate such purpose-built storage systems, and software-defined storage orchestration in the spirit of DataWarp or the Cambridge Data Accelerator would be needed to spin up an approximate storage service in conjunction with an application that would use it.  

I'm a big believer in #2, but #1 would require a forcing function coming from the science community to justify the effort of adapting an application to use approximate storage.

Keeping It Real: Why HPC Data Services Don't Achieve I/O Microbenchmark Performance

Phil Carns (Argonne) presented a lovely paper full of practical gotchas and realities surrounding the idea of establishing a roofline performance model for I/O.  The goal is simple: measure the performance of each component in an I/O subsystem's data path (application, file system client, network, file system server, storage media), identify the bottleneck, and see how close you can get to hitting the theoretical maximum of that bottleneck:
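
As a toy illustration of the procedure (the component names and numbers below are made up, not taken from the paper), the roofline boils down to taking the minimum of the measured per-component ceilings and asking what fraction of it the full stack actually achieves:

    # Hypothetical per-component ceilings along one I/O data path, in GB/s;
    # real values would come from microbenchmarking each layer in isolation.
    component_ceilings = {
        "application memory bandwidth": 12.0,
        "file system client":            9.0,
        "network link":                 11.5,
        "file system server":           10.0,
        "storage media":                 6.5,
    }

    measured_end_to_end = 4.8   # GB/s achieved by the whole stack together

    bottleneck = min(component_ceilings, key=component_ceilings.get)
    roofline = component_ceilings[bottleneck]

    print(f"bottleneck: {bottleneck} at {roofline:.1f} GB/s")
    print(f"end-to-end: {measured_end_to_end:.1f} GB/s "
          f"({measured_end_to_end / roofline:.0%} of the roofline)")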


The thesis of the paper was that even though this sounds simple, there's a lot more than meets the eye.  I won't recite the presentation (see the paper and slides--they're great), but I thought some of the more interesting findings included:

  1. There's a 40% performance difference between the standard OSU MPI bandwidth benchmark and what happens when you make the send buffer too large to fit into cache.  It turns out that actually writing data over the network from DRAM (as a real application would) is demonstrably slower than writing data from a tiny cacheable memory buffer.
  2. Binding MPI processes to cores is good for MPI latency but can be bad for I/O bandwidth.  Highly localized process placement is great if those processes talk to each other, but if they have to talk to something off-chip (like network adapters), the more spread out they are, the greater the path diversity and aggregate bandwidth they may have to get out of the chip.
  3. O_DIRECT bypasses the page cache but not the device cache, while O_SYNC does not bypass the page cache but flushes both the page and device caches.  This causes O_DIRECT to reduce performance for smaller I/Os (which would otherwise benefit from write-back caching) when used by itself, but to increase performance when used with O_SYNC, since one less cache (the page cache) has to be synchronized on each write. Confusing and wild--and also completely nonstandard, since these are Linux-specific flags (see the sketch after this list).
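
Since these flags are easy to mix up, here is a minimal Linux-only sketch of how they are requested; this is my own example rather than anything from the paper, and note that O_DIRECT additionally requires block-aligned buffers, which is why an anonymous mmap (which is page-aligned) is used as the write buffer:

    import mmap
    import os

    ALIGNMENT = 4096              # typical logical block size requirement
    PATH = "odirect_demo.bin"     # must live on a file system that supports O_DIRECT (not tmpfs)

    # O_DIRECT: bypass the page cache (the device's own cache still applies).
    # O_SYNC:   don't return from write() until the data has been flushed through.
    flags = os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_SYNC   # Linux-specific
    fd = os.open(PATH, flags, 0o644)

    # O_DIRECT requires block-aligned buffers, offsets, and lengths; an
    # anonymous mmap is page-aligned, which satisfies that here.
    buf = mmap.mmap(-1, ALIGNMENT)
    buf.write(b"x" * ALIGNMENT)

    os.write(fd, buf)             # one aligned, uncached, synchronous write
    os.close(fd)
    buf.close()
    os.remove(PATH)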

Towards On-Demand I/O Forwarding in HPC Platforms

Jean Luca Bez (UFRGS) presented a neat userspace I/O forwarding service, FORGE, that got me excited since the field of I/O forwarding has been pretty stagnant since IOFSL came out ten years ago.

The high-level concept is simple: take the intelligence of collective I/O operations implemented in ROMIO and, instead of running them inside the same MPI application performing I/O, offload that functionality to discrete nodes:


This FORGE service is ephemeral in that it is spun up at the same time as your MPI application and persists for the duration of the job.  However, unlike traditional MPI-IO-based collectives, it runs on dedicated nodes, and it relies on a priori knowledge of the application's I/O pattern to decide what sorts of I/O reordering would benefit the application.
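
To illustrate the kind of work being offloaded, here's a toy sketch of the sort of reordering and coalescing an I/O forwarder might do on behalf of its clients; this is a from-scratch illustration of two-phase-style aggregation, not FORGE's actual code:

    def coalesce_writes(fragments):
        """Sort (offset, data) fragments by offset and merge contiguous ones
        into the fewest possible large writes--a toy version of the
        aggregation an I/O forwarding layer can do for its clients."""
        merged = []
        for offset, data in sorted(fragments, key=lambda frag: frag[0]):
            if merged and merged[-1][0] + len(merged[-1][1]) == offset:
                prev_offset, prev_data = merged.pop()
                merged.append((prev_offset, prev_data + data))
            else:
                merged.append((offset, data))
        return merged

    if __name__ == "__main__":
        # out-of-order 4 KiB fragments, as if they came from several clients
        incoming = [(8192, b"b" * 4096), (0, b"a" * 4096),
                    (4096, b"c" * 4096), (65536, b"d" * 4096)]
        for offset, data in coalesce_writes(incoming):
            print(f"write {len(data):>6} bytes at offset {offset}")
        # -> one 12 KiB write at offset 0 and one 4 KiB write at offset 65536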

This is perhaps a bit wasteful since nodes are being held idle until I/O happens, but the promise of this idea is much larger.  Many large HPC systems have dedicated I/O forwarding nodes because they have to--for example, LNet routers or DVS servers exist in Cray-based HPC systems to do the network protocol conversion to allow InfiniBand-based Lustre and Spectrum Scale file systems to be mounted on Aries-based compute nodes.  There's no reason these same nodes couldn't also be used to run FORGE-like services on-demand to buffer and reorder I/Os in transit.  And if you stick some NVMe into these protocol conversion nodes, you suddenly have something that looks an awful lot like a transparent burst buffer akin to DDN Infinite Memory Engine.

Taking this a step further, this idea also further motivates having reconfigurable storage infrastructure within an HPC system; with a little bit of knowledge about your I/O workload, one could reconfigure the parallelism and compute power available along the I/O data path itself to optimally balance the limited resources of nodes and the performance benefit.  A couple examples:
  • Have a very IOPS-heavy, many-file workload?  Since these tend to be CPU-limited, it would make sense to allocate a lot of FORGE nodes to this job so that you have a lot of extra CPU capacity to receive these small transactions, aggregate them, and drive them out to the file system.
  • Have a bandwidth-heavy shared-file workload?  Driving bandwidth doesn't require a lot of FORGE nodes, and fewer nodes means fewer potential lock conflicts when accessing the shared file.
This intelligent I/O forwarding naturally maps to file system architectures that incorporate I/O forwarding and stateless components--like VAST--where more network and computational parallelism can be sloshed into a compute node's data path to deal with more complex or adversarial I/O patterns.

Fractional-Overlap Declustered Parity

Huan Ke (U Chicago) presented a paper that tried to bridge the gap between RAID implementations that use declustered parity, which has really fast rebuild but a huge failure domain, and traditional (clustered) parity which has very slow rebuilds but a very small failure domain.

The special sauce proposed by Ke is being judicious about how stripes are laid out across a declustered group.  Using Latin squares to map RAID blocks to physical drives, one can control how many unique stripes would be affected by a failure (termed the overlap fraction):


This is usually where I stop being able to keep up in these sorts of parity scheme talks; however, I quickly realized that this parity scheme relies on the same principle that engineers use to design cost-efficient parameter sweep experiments.  In fact, I made a webpage about this exact topic in the context of optimizing a hypothetical chemical vapor deposition experiment when I was an undergraduate in materials science, and it's really not as complicated as I thought.  

What it boils down to is defining a set of experiments (or mappings between RAID blocks and drives) where you vary all the parameters (temperature, pressure, etc.--or which RAID block maps to which drive) but ensure that the same combination of settings is never repeated (e.g., don't run two experiments with temperature held at 30C, and don't use two layouts that both place block #2 on drive #3).  Orthogonal arrays (which are composed of Latin squares) provide an analytical method for coming up with these unique combinations.

In the engineering context, you essentially never repeat an experiment if you can infer the result of varying one parameter using a combination of other experiments.  In the parity placement scheme, you never use a block mapping if a combination of drive failures would break all your RAID stripes.  The neat idea behind what Ke presented is a method to vary this constraint so that you can find layout schemes that trade blast radius (how many stripes are lost on an unrecoverable failure) against rebuild time.
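
For the curious, the machinery isn't exotic: a cyclic Latin square is a one-liner to generate, and the "overlap" being tuned is just how often two drives end up holding blocks of the same stripe.  The toy sketch below (the drive count and stripe width are arbitrary, and this is not Ke's actual scheme) shows that knob in action:

    from itertools import combinations

    def latin_square_stripes(ndrives, stripe_width):
        """Lay out each stripe's blocks using rows of a cyclic Latin square,
        truncated to stripe_width blocks, so no stripe ever puts two blocks
        on the same drive."""
        return [{(stripe + block) % ndrives for block in range(stripe_width)}
                for stripe in range(ndrives)]

    def overlap(stripes, drive_a, drive_b):
        """How many stripes have blocks on BOTH drives--the number of stripes
        whose redundancy is consumed if both drives fail at once."""
        return sum(1 for stripe in stripes if drive_a in stripe and drive_b in stripe)

    if __name__ == "__main__":
        stripes = latin_square_stripes(ndrives=10, stripe_width=4)
        for a, b in list(combinations(range(10), 2))[:5]:
            print(f"drives ({a},{b}) share {overlap(stripes, a, b)} stripes")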

NVIDIA GPUDirect Storage Support in HDF5

John Ravi presented his work implementing support for NVIDIA's brand new GPUDirect Storage (which allows data transfer between GPU memory and an NVMe device without ever touching host memory using peer-to-peer PCIe) in HDF5.  Much of the talk focused on the implementation details specific to HDF5, but he did present some performance results which I found quite interesting:


In the above diagram, "SEC2" refers to the default POSIX interface, "DIRECT" is POSIX using O_DIRECT, and "GDS" is GPUDirect Storage.  What surprised me here is that all of the performance benefits were expressed in terms of bandwidth, not latency--I naively would have guessed that not having to bounce through host DRAM would enable much higher IOPS.  These results made me internalize that the performance benefits of GDS lie in not having to gum up the limited bandwidth between the host CPU and host DRAM.  Instead, I/O can enjoy the bandwidth of HBM or GDDR to the extent that the NVMe buffers can serve and absorb data.  I would hazard that in the case of IOPS, the amount of control-plane traffic that has to be moderated by the host CPU undercuts the fast data-plane path enabled by GDS.  This is consistent with literature from DDN and VAST about their performance boosts from GDS.

Fingerprinting the Checker Policies of Parallel File Systems

The final PDSW talk that struck a chord was by Runzhou Han who presented a methodology for exercising parallel file systems' fsck tools using targeted fault injection.  He intentionally corrupted different parts of the data structures used by BeeGFS and Lustre to store metadata, then ran fsck to see how well those mistakes were caught.  I think the biggest intellectual contribution of the work was formalizing a taxonomy of different types of corruption events (junk data, zeros written, duplicate data, and out-of-sync data) and ways in which fsck does or does not cope with them:


The practical outcome of this work is that it identified a couple of data structures and corruption patterns that are particularly fragile on Lustre and BeeGFS.  Alarmingly, two cases triggered kernel panics in lfsck, which raises the question: why isn't simple fault injection like this part of the regular regression testing performed on Lustre?  As someone who's been adjacent to several major parallel file system outages that resulted from fsck not doing a good job, hardening the recovery process seems like a worthwhile investment since anyone who's having to fsck in the first place is already having a bad day.
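
The general recipe is also easy to prototype, which makes its absence from routine regression testing all the more surprising.  Here is a toy sketch of applying the paper's four corruption patterns to an arbitrary metadata image before running a checker against it; the file name, offsets, and sizes are made up, and this is not Han's actual harness:

    import os

    def corrupt(path, offset, length, mode, stale_source=None):
        """Overwrite `length` bytes at `offset` using one of the four
        corruption patterns from the taxonomy: junk, zeros, duplicate,
        or out-of-sync (stale) data."""
        with open(path, "r+b") as f:
            if mode == "junk":
                payload = os.urandom(length)
            elif mode == "zeros":
                payload = b"\x00" * length
            elif mode == "duplicate":
                f.seek(0)
                payload = f.read(length)      # copy another region on top
            elif mode == "stale":
                payload = stale_source        # replay an older copy of this region
            else:
                raise ValueError(mode)
            f.seek(offset)
            f.write(payload)

    if __name__ == "__main__":
        target = "metadata.img"               # hypothetical dump of on-disk metadata
        with open(target, "wb") as f:
            f.write(os.urandom(64 * 1024))
        with open(target, "rb") as f:         # snapshot the region to be corrupted so
            f.seek(4096)                      # it can later be replayed as stale data
            stale = f.read(512)
        for pattern in ("junk", "zeros", "duplicate", "stale"):
            corrupt(target, offset=4096, length=512, mode=pattern, stale_source=stale)
            # ...here one would run the file system's checker and record whether
            #    the corruption was detected, repaired, or silently missed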

That said, this paper seemed much more practical than foundational and it was unclear where this goes once the immediate issues discovered are addressed.  To that end, I could see why hardening fsck isn't getting a lot of research attention.

SC'20 Recap


The HPC industry's biggest conference, SC, was held virtually over the last two weeks. Although the original plan to hold it in Atlanta was supplanted by an all-virtual format, it still managed to be a whirlwind show full of product showcases, research presentations, and interesting talks, panels, and workshops. The virtual format certainly wasn't the same as attending in person, but some of the conference buzz and tone could still be sensed by following the #SC20 tag on Twitter.

As with SC'19, the conference seemed subdued, in part because many attendees were still being pulled away by their daily lives while attending and in part because the HPC community is still waiting for exascale to finally get here. The community's conversion to remote work has also smeared a lot of the usual vendor briefings and big announcements out over the entire five-month period since ISC'20, causing most of the hot news at SC this year to seem incremental over years past.

Still, I picked up on a few themes that I thought were noteworthy, and what follows is a recap of some of the highlights from the conference as I saw them.

All the standard disclaimers apply to the remainder of this post: these are just my personal opinion and do not represent the viewpoint of anyone other than me. I'm not an expert on many (most?) of these topics, so my observations may be misinformed or downright wrong--feel free to get in touch if I stand to be corrected. Also bear in mind that what I find interesting is colored by my day job as a storage architect; I don't pay close attention to the scientific or application spaces in HPC and instead focus on hardware, architecture, systems design, integration, and I/O. As such, I'm sure I missed all sorts of topics that others find exciting.

Table of Contents

  1. Big Splashes
    1. What's new
    2. What's missing
  2. High-level Themes
    1. Computing Technologies Futures
    2. Storage Technologies Futures
  3. Actual Future Directions
    1. The Relationship of HPC and AI
    2. Disaggregation in Practice
  4. Spectrum Scale User Group vs. Lustre BOF
    1. Enterprisey features that organizations may care about
    2. Manageability features that administrators may care about
    3. Performance, scalability, and reliability features that end users may care about
    4. Interface features that platform developers may care about
    5. Overall Impressions
  5. IO-500 BOF
  6. Concluding Thoughts

Big Splashes

Although there weren't any earth-shattering announcements this year, there were a few newsworthy developments that received a healthy amount of press attention.

What's new

RIKEN's Fugaku machine made its debut at ISC'20 in June this year, but I felt a lot of its deserved fanfare was muted by the newness of the pandemic and the late-binding decision to convert ISC'20 to an all-remote format. SC'20 was when Fugaku got to really shine; it improved benchmark results for HPL, HPCG, and Graph500 relative to its ISC'20 numbers:

Fugaku performance improvements since July 2020, from Prof. Matsuoka's FLATS keynote

Beyond benchmarks, RIKEN and Fujitsu also had a number of early science success stories to showcase, including studies that used the machine to better understand COVID-19.

Intel also announced the Ice Lake Xeon architecture and put a lot of marketing behind it. On its own, Ice Lake is a major advancement: it's Intel's first server part that uses their 10 nm process and provides a PCIe Gen4 host interface, and it adds support for 2nd-generation 3D XPoint DIMMs (Barlow Pass) and eight DDR4 memory channels.

Unfortunately, Ice Lake is late to the party in the context of its competition; Intel's benchmark results position Ice Lake as a competitor to AMD Rome which matches Ice Lake's 8-channel/PCIe Gen4-based platform despite being over a year old at this point. For reference:

                 Intel Ice Lake [1]    AMD Rome [2]
Shipping         4Q2020                3Q2019
Cores            up to 32              up to 64
Memory           8x DDR4-3200          8x DDR4-3200
Host Interface   ?x PCIe Gen4          128x PCIe Gen4

By the time Ice Lake starts shipping, AMD will be launching its next-generation Milan server processors, so it's difficult to get excited about Ice Lake if one isn't married to the Intel software ecosystem or doesn't have specific use for the new AVX512 instructions being introduced.

The Intel software ecosystem is not nothing though, and Intel does seem to remain ahead on that front. Intel had its inaugural oneAPI Dev Summit during SC'20, and although I don't follow the application developer space very closely, my perception of the event is that it focused on showcasing the community momentum building around oneAPI rather than delivering splashy announcements. That said, this oneAPI Dev Summit seems to have sucked the air out of the room for other Intel software-centric events; IXPUG had no discernible presence at SC'20 despite IXPUG changing its name from "Intel Xeon Phi User Group" to "Intel eXtreme Performance User Group" when Xeon Phi was sunset. However, one dev event is better than none; I did not hear of any equivalent events hosted by AMD at SC'20.

NVIDIA also announced a new SKU of its Ampere A100 data center GPU with a whopping 80 GB of HBM2e. This was surprising to me since the A100 with 40 GB of HBM2 was only first unveiled two quarters ago. The A100 chip itself is the same so there's no uptick in flops; they just moved to HBM2e stacks, which allowed them to double the capacity and get an incremental increase in memory bandwidth.

So, who's this part for? Doubling the HBM capacity won't double the price of the GPU, but the A100-80G part will undoubtedly be more expensive despite there being no additional FLOPS. My guess is that this part was released for

  1. People who just want to fit bigger working sets entirely in GPU memory. Larger deep learning models are the first thing that come to my mind.
  2. People whose applications can't fully utilize A100's flops due to suboptimal memory access patterns; higher HBM2e bandwidth may allow such apps to move a little higher along the roofline (see the sketch after this list).
  3. People who may want to purchase AMD's next-generation data center GPU (which will undoubtedly also use HBM2e and will probably be released before the follow-on to Ampere is ready).
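
To put a rough number on the second case above: under a simple roofline model, a memory-bound kernel's attainable throughput scales with memory bandwidth until it hits the compute ceiling. The sketch below uses approximate public figures (about 1.6 TB/s of HBM2 bandwidth for the 40 GB part, about 2.0 TB/s of HBM2e for the 80 GB part, and the 19.5 TFLOP/s FP64 tensor peak) plus arbitrary arithmetic intensities, so treat it as illustrative rather than authoritative:

    def attainable_tflops(peak_tflops, mem_bw_tbps, flops_per_byte):
        """Simple roofline: performance is capped by either the compute peak
        or by how many flops the memory system can feed per second."""
        return min(peak_tflops, mem_bw_tbps * flops_per_byte)

    PEAK_FP64_TENSOR = 19.5   # TFLOP/s; the silicon is the same on both SKUs
    A100_40G_BW = 1.6         # TB/s of HBM2 (approximate)
    A100_80G_BW = 2.0         # TB/s of HBM2e (approximate)

    for ai in (2.0, 5.0, 20.0):   # arithmetic intensity in flops per byte moved
        before = attainable_tflops(PEAK_FP64_TENSOR, A100_40G_BW, ai)
        after = attainable_tflops(PEAK_FP64_TENSOR, A100_80G_BW, ai)
        print(f"AI = {ai:>4} flop/byte: {before:5.1f} -> {after:5.1f} TFLOP/s")

The memory-bound cases pick up roughly 25% more throughput from the bandwidth bump, while the compute-bound case stays pinned at the same peak.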

NVIDIA also upgraded its Selene supercomputer to include these A100-80G parts, moving its Top500 position to #5 and demonstrating that these parts exist and deliver as advertised.

What's missing

HPE/Cray was pretty quiet on announcements, especially after two SCs in a row with Shasta (now "Cray EX") news. HPE undoubtedly has its head down readying its first large Shasta installations, and given the fact that the primary manufacturing facilities for Cray Shasta are located in a COVID hotspot in the US, maybe this was to be expected--this autumn has not been the time to rush anything.

That said, we know that Cray EX systems have been shipping since July 2020.

So it is a little surprising that HPE was not promoting any early customer or science success stories yet, and the only Cray EX/Shasta system to appear on Top500 was Alps, a modest 4.6 PF Rome-based system at CSCS. Next year--either at the all-virtual ISC'21 or SC'21--will likely be the year of Cray EX.

Intel was also pretty quiet about Aurora, perhaps for the same reason as HPE/Cray. The fact that Intel's biggest hardware news was around Ice Lake suggests that Intel's focus is on fulfilling the promises of disclosures they made at SC'19 rather than paving new roads ahead. There was a healthy amount of broad-stroke painting about exascale, but aside from the oneAPI buzz I mentioned above, I didn't see anything technically substantive.

Sadly, IBM was the quietest of all; perhaps its most prominent appearance in this year's official program was in winning the Test of Time Award for the Blue Gene/L architecture. It felt almost like a eulogy for IBM's once-dominant position at the forefront of cutting-edge HPC research and development, a feeling perhaps underscored by the absence of perhaps the most noteworthy IBMer involved in the creation of Blue Gene. This isn't to say IBM had no presence at SC'20 this year; it's just clear that their focus is on being at the forefront of hybrid cloud and cognitive computing rather than supercomputing for supercomputing's sake.

High-level Themes

The most prevalent theme that I kept running into was not the technology on the horizon, but rather the technology further off. There were a few sessions devoted to things like "Post Moore's Law Devices" and "Exotic Technology" in 2035, and rather than being steeped in deep technical insight, they leaned more towards either recitations of similar talks given in years past (one speaker presented slides that were literally five years old) or outlandish claims that hinged on, in my opinion, incomplete views of how technology evolves.

I found the latter talks a bit disturbing to see in the SC program since they contained very little technical insight and seemed more focused on entertainment value--the sort of thing usually relegated to post-conference hotel bar conversation. So rather than repeat their predictions as gospel, I'll present my critical take on them. I realize that it's far easier for me to throw stones at people at the top of the hill than to climb there myself, and I'm perfectly willing to accept that my opinions below are completely wrong. And, if you'd like to throw stones at me yourself, I contributed my position to a panel on tiered storage this year against which all are welcome to argue.

Computing Technologies Futures

This year's focus on far-flung technologies at SC made me wonder--are these sorts of talks filling out the program because there's no clear path beyond exascale? Is it possible that the HPC community's current focus on climbing the exascale mountain is taking our minds off of the possibility that there's nothing past that mountain except desert?

For example, Shekhar Borkar gave his five-year outlook on memory technologies:

SRAM and DRAM are decades-old staples in the HPC industry, and even NAND has been used in production HPC for a decade now. The statement that PCM may be useful in the next five years is quite striking since PCM products have been shipping in volume since 2017--from this, I take it that the future is going to look an awful lot like the present on the memory and storage front. The biggest change, if any, will likely be the economics of NAND and 3D integration evolving to a point where we can afford more all-flash and all-HBM systems in the coming years.

On the computational front, many of the soothsayers leaned heavily on using cryogenics for post-Moore's Law chip designs. Ultra-low-temperature CMOS and superconductors for supercomputers are low-hanging fruit to pick when conjecturing about the future since (1) their physics are well understood, and (2) they have clear and nonlinear benefits over the CMOS technologies baked into chips today, as shown by Borkar:

The benefits of low-temperature computing according to Shekhar Borkar

The problem, of course, is that you won't ever be able to buy a cryogenic supercomputer unless a company can make enough money selling a cryogenic supercomputer to (1) pay down the non-recurring engineering costs, (2) recoup the costs of productizing the design, and (3) make enough profit to make the shareholders or venture capitalists underwriting (1) and (2) happy.

Realize that cryogenics at scale are dangerous and messy--compared to water cooling, there is no municipal supply of liquid helium, and the market for building pumps and piping for cryogenic plumbing is virtually zero compared to water-based plumbing. When you add the fact that the vast majority of data centers--including the hyperscalers who drive much of the data center market--don't want to touch water-cooled infrastructure, the HPC market would have to bear the cost of cryogenic computing at-scale entirely on its own for the foreseeable future.

That all said, remember that this is just my own personal opinion. For a helpful and mostly objective perspective, @HPC_Guru posted a thread that captures the general sentiment of these sessions.

For the sake of entertainment, I'll include some of the more outlandish slides that I saw on this topic.

In 2006, Erik DeBenedictis made the following predictions for what HPC would look like in 2020:

The future of yesterday - a 2006 prediction of what HPC will look like in 2020 by Erik DeBenedictis

DeBenedictis' primary oversight in this prediction was failing to see the end of Dennard scaling due to physics. Had power consumption continued to drop with node size, we could very well be at 20 GHz today, and the fact that his core counts, flops/socket, and system peak were reasonable is a testament to good forecasting. However, the end of Dennard scaling is what forced CPUs towards longer vectors (which is how a 40-core socket can still get 1.6 TF without running at 20 GHz) and what motivated the development of the more power-efficient architecture of GPGPUs. DeBenedictis' predictions for the future, though, don't look as reasonable to me:

The future of HPC is hybrid quantum/classical systems according to DeBenedictis

While quantum/classical hybrid machines may very well exist in 2035, they aren't exactly solving the same problems that today's supercomputers can. In a sense, he chose to make a meta-prediction that science will change to fit the technology available--or perhaps he chose to redefine supercomputing to mean something even more niche than it does today.

Thomas Sterling also gave his 200 GHz yottaflop prediction:

Thomas Sterling's gonzo predictions of HPC in 2035

which hasn't changed since he predicted a superconducting yottaflop at ISC'18. Unlike DeBenedictis, Sterling chose not to redefine HPC to fit the available technology but instead to predict a physical, economic, and practical fantasy about the future. Not that there's anything wrong with that. Everyone's got to have a goal.

Kathy Yelick offered the most pragmatic 15-year prediction:

Kathy Yelick's predictions of HPC in 2035

and I can't poke holes in any of these predictions because there is a clear path from today to this vision for the future. That said, if you actually attach flops and hertz to these predictions, the future does not look nearly as exciting as superconducting yottaflops do.

As dissatisfying as it may be, Shekhar Borkar had a slide that is probably the pathway into the future of HPC:

Moore's Law will survive as long as we change what it means, according to Borkar

The only way the future of HPC will be predictable is if you're willing to define what HPC is to fit whatever the available technologies are. Yelick expressed the same sentiment with her "Not sure, but it will be called OpenMP" bullet, and to his credit, Sterling himself did this with his Beowulf cluster. If the market just gives you a pile of parts, strap it together and call it HPC. And if transistor scaling has no more steam, find something that still has legs and call it Moore's Law.

Storage Technologies Futures

On the storage front, the predictions from 2006 for 2020 storage technology were pretty reasonable as well. Dr. Mark Kryder (of Kryder's Law fame) predicted that Kryder's Law would hold:

Mark Kryder's vision for HDDs in 2020 as told in 2006

However he mispredicted how it would hold--his assumption was that surface bit density would keep skyrocketing, and this is why his bandwidth number was so far off. Packing magnetic bits ever more closely together turns out to be a very difficult problem, so the hard disk drive industry chose to increase capacities by solving the easier problem of packing more platters into a single 3.5" half-height form factor.

The flash predictions of Richard Freitas (who passed away in 2016) were also very reasonable:

Predictions for solid-state storage in 2020 from Rich Freitas in 2006

His biggest miscalculation was not realizing that solid-state storage would bifurcate into the two tiers we now call RAM and flash. He predicted "storage class memory" based on the assumption that it would be block-based (like flash) but use a simple and low-latency bus (like RAM). We enjoy higher bandwidth and capacity than his prediction due to the increased parallelism and lower cost of NAND SSDs, but relying on PCIe instead of a memory bus and the low endurance of NAND (and therefore significant back-end data management and garbage collection) drove up the latency.

Predictions for the future were more outlandish. Kryder's prediction for 2035 was a bit too much for me:

Kryder's 15-year outlook for HDDs with a heaping serving of "oof"

Extrapolating Kryder's Law another 15 years puts us at 1.8 petabytes per hard drive, but this rests on the pretty shaky foundation that there's something holy about hard disk drive technology that will prevent people from pursuing different media. Realizing this requires two things to be true:

  1. The HDD industry remains as profitable in the next fifteen years as it is today. Seeing as how parts of the HDD industry are already going extinct due to flash (remember when personal computers had hard drives?) and hyperscale is taking more ownership of drive controller functionality and eating into manufacturers' margins, I just don't see this as being likely.
  2. Developing the required recording techniques (two-dimensional magnetic recording and bit-patterned media) proves to be as fast and as cheap as developing HAMR was. If it's not, see #1 above--there won't be the money or patience to sustain the HDD market.

This doesn't even consider the appeal of dealing with 1.8 PB drives as a system architect; at Kryder's forecasted numbers, it would take eleven days to fill, rebuild, or scrub one of these drives. As a system designer, why would I want this? Surely there are better ways to assemble spindles, motors, actuators, and sheet metal to increase my bandwidth and reduce my blast radius than cramming all these platters into a 3.5" form factor.
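
For reference, the arithmetic behind that figure: at the roughly 1.9 GB/s of sustained bandwidth implied by the forecast, 1.8 PB divided by 1.9 GB/s is about 950,000 seconds, or about eleven days of doing nothing but streaming data on or off the drive.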

My bet (and note--I was not invited to contribute it, as I am not an expert!) is that the HDD market will continue to slow down as it falls off the Kryder Law curve due to scaling limitations. This will result in a slow but downward spiral where R&D slows because it is starved of funding, and funding is starved because HDDs fall further and further off of the economics curve. HDDs won't be gone by 2035, but they will fit in the small gap that exists between low-cost write-once-read-many media (like ultra-dense trash flash) and low-cost write-once-read-never media (like tape).

Kryder essentially acknowledged that his projection relies on something intrinsically special about HDDs; he commented that the technological advancements required to reach 1.8 PB HDDs will happen because HDD engineers don't want to lose their jobs to the flash industry. Personally, I'd take a new job with an exciting future over a gold watch any day of the week. Maybe that's the millennial in me.

I found this general theme of wildly projecting into the future rather yucky this SC, and I won't miss it if it's gone for another fifteen years.  By their very nature, these panels are exclusive, not inclusive--someone literally has to die in order for a new perspective to be brought on board.  There was an element to this in the Top500 BOF as well, and one slide in particular made me cringe at how such a prominent good-ol-boys club was being held up before the entire SC community.  These sorts of events are looking increasingly dated and misrepresentative of the HPC community amidst the backdrop of SC putting diversity front and center.

Actual Future Directions

Although wild projections of the future felt like fashionable hot topics of the year, a couple of previous hot topics seemed to be cooling down and transitioning from hype to reality. Two notable trends popped out at me: the long-term relationship between HPC and AI and what disaggregation may really look like.

The Relationship of HPC and AI

As has been the norm for a few years now, deep learning (now more broadly "AI") was peppered across the SC program this year. Unlike previous years, though, the AI buzz seemed to be tempered by a little more pragmatism as if it were coming down the hype curve. Perhaps the best talk that captured this was an invited talk by Cliff Young of Google about the possibility of a Virtuous Cycle of HPC and AI.

The "convergence of HPC and AI" has been talked about in the supercomputing community since HPC-focused GPUs were reinvented as an AI accelerator. If you look at who's been selling this line, though, you may realize that the conversation is almost entirely one-way; the HPC industry pines for this convergence. The AI industry, frankly, doesn't seem to care what the HPC industry does because they're too busy monetizing AI and bankrolling the development of the N+1th generation of techniques and hardware to suit their needs, not those of the HPC industry.

Dr. Young's talk closed this loop by examining what the AI industry can learn from HPC; the so-called "Cambrian explosion" of accelerators is somewhere near its peak which has resulted in a huge architectural design space to explore:

How ML can learn from HPC according to Cliff Young

When cast this way, HPC actually has a lot of experience in driving progress in these areas; the 4x4 systolic array design point has its genesis in the HPC-specific MIC architecture, and the HPC industry drove the productization of the DRAM-backed HBM memory hierarchy implemented by IBM for the Summit and Sierra systems. These HPC-led efforts presumably contributed to Google's ability to bet on much larger array sizes starting with its first-generation TPU.

In addition, it sounds like training has begun to reach some fundamental limits of data-parallel scalability:

Limitations being faced by machine learning

HPC has long dealt with the scalability limitations of allreduce by developing technologies like complex low- and high-radix fabric topologies and hardware offloading of collective operations. Whether the AI industry simply borrows ideas from HPC and implements its own solutions or contributes to existing standards remains to be seen, but standards-based interfaces into custom interconnects like AWS Elastic Fabric Adapter are a promising sign.

Another "hard problem" area in which HPC is ahead is in sparse matrices:

Impending challenges brought by moving to sparse methods in ML

Young's position is that, although "sparse" means different things to AI (50-90% sparse) than it does to HPC (>95% sparse), HPC has shown that there are algorithms that can achieve very high fractions of peak performance on sparse datasets.

His concluding slide was uplifting in its suggestion that the HPC-AI relationship may not be strictly one-way forever:

How HPC and ML can work together to advance technology

He specifically called out promise in the use of mixed precision; AI already relies on judicious use of higher-precision floating point to stabilize its heavy use of 16-bit arithmetic, and scientific computing is finding algorithms in which 16-bit precision can be tolerated.

Being more hardware- and infrastructure-minded myself, I was particularly surprised to see this nod to liquid cooling early on:

Liquid cooling in hyperscale - one of few areas in which HPC is ahead

Google's TPU v3 was its first foray into direct liquid cooling, a data center technology that HPC has been using for decades (think: Cray-2's waterfall). While this may not seem spectacular to any PC enthusiast who's done liquid cooling, the difficulty of scaling these systems up to rack, row, and data center scale does not always grow linearly. Young explicitly acknowledged HPC's expertise in dealing with liquid-cooled infrastructure, and if hyperscale is driven further in this direction, HPC will definitely benefit from the advances that will be enabled by a new and massive market driver.

Disaggregation in Practice

The promise of disaggregation--having pools of CPU, persistent memory, GPUs, and flash that you can strap together into a single node--has been around for a long time and has steadily gained attention as a potential candidate for an exascale technology. However, I don't think there was a realistic hope for this until IBM's AC922 node--the one that comprises the Summit and Sierra systems--hit the market and demonstrated a unified, hardware-enabled coherent memory space across CPUs and GPUs.

The actual story there wasn't great though; coherence between CPU and GPU was enabled using NVIDIA's proprietary NVLink protocol while the CPU and NIC were connected via a different coherence protocol, OpenCAPI, over the same physical interface. CCIX and GenZ also emerged as high-speed protocols for cache coherence and disaggregation, and the story only got worse when Intel put forth CXL as its standard for coherence and disaggregation.

Fortunately, the dust is now settling and it appears that CXL and GenZ are emerging at the front of the pack. There was an amicable panel session where members of these two consortia presented a unified vision for CXL and GenZ which almost appeared credible: CXL would be the preferred protocol for inside a chassis or rack, and GenZ would be the preferred protocol between chassis and racks. Key features of the finalized CXL 2.0 standard were unveiled which largely revolved around support for CXL switches:

Roles played by CXL 2.0's switch capability, from Debendra Das Sharma

These switches function not only as port expanders to allow many devices to plug into a single host, but also as true switches that enable multi-root complexes that pool hosts and devices to dynamically map devices to hosts using CXL's managed hotplug capability. There's also support for a CXL Fabric Manager that moderates something that looks a lot like SR-IOV; a single physical device can be diced up and mapped to up to sixteen different hosts. At its surface, this looks like a direct, open-standard competitor to NVLink, NVSwitch, and MIG.

What these new CXL switches do not support is inter-switch linking; all CXL devices must share a single switch to maintain the low latency for which CXL was designed. This is where GenZ fits in since it is a true switched fabric, and this is why the CXL and GenZ consortia have signed a memorandum of understanding (MOU) to design their protocols towards mutual compatibility and interoperability so that the future of disaggregated systems will be composed of pooled CXL devices bridged by a GenZ fabric. A direct parallel was drawn to PCIe and Ethernet, where a future disaggregated system may see CXL assume the role of PCIe, and GenZ may assume the role currently filled by Ethernet.

When it came time for Q&A, the panel got more interesting.

A lot of the audience questions revolved around what standards CXL is planning to wipe off the face of the planet. The Intel (and CXL) panelist, Debendra Das Sharma, fielded the bulk of these questions and made it clear that

(1) CXL will not replace DDR as a local memory interface; it is a complementary technology. This sounded a little disingenuous given the following slide was also shown to highlight CXL 1.0's latency being on par with DRAM latency:

Latency of CXL in the context of storage devices

(2) CXL will not replace PCIe as a host I/O interface; it is a superset of PCIe, and many devices will remain happy with PCIe's load/store semantics. Of course, this is what I would say too if I had effective control over both the CXL standard and the PCIe SIG.

When asked directly if Intel had joined the GenZ consortium though, Sharma gave a terse "no" followed by "no comment" as to why. He then immediately followed that with a very carefully crafted statement:

"While we have not joined the GenZ consortium, we are fully supportive of making the CXL enhancements that will help GenZ."

The panelists also commented that the MOU was designed to make transitioning from CXL to GenZ protocols smooth, but when asked exactly how the CXL-to-GenZ bridge would be exposed, Tim Symons (representing Microchip and GenZ) could not offer an answer since this bridging function is still being defined. These sorts of answers left me with the impression that CXL is in the driver's seat and GenZ has been allowed to come along for the ride.

Reading between the lines further, there was a striking absence of HPE people on the panel given the fact that GenZ originated within HPE's "The Machine" project. It remains unclear where GenZ fits now that HPE owns Slingshot, a different high-performance scale-out switched fabric technology. What would be the benefit of having a three-tier Slingshot-GenZ-CXL fabric? If CXL 2.0 adopted a single-hop switch and fabric manager, what's to stop CXL 3.0 from expanding its scope to a higher radix or multi-hop switch that can sensibly interface directly with Slingshot?

Given that CXL has already eaten a part of GenZ's lunch by obviating the need for GenZ host interfaces, I wouldn't be surprised if GenZ eventually meets the same fate as The Machine and gets cannibalized for parts that get split between future versions of Slingshot and CXL. CXL has already effectively killed CCIX, and IBM's decision to join CXL suggests that it may be positioning to merge OpenCAPI's differentiators into CXL after Power10. This is pure speculation on my part though.

Spectrum Scale User Group vs. Lustre BOF

Because SC'20 was smeared over two weeks instead of one, I got to attend both the Lustre BOF and one of the Spectrum Scale User Group (SSUG) sessions. I also came equipped with a much more meaningful technical understanding of Spectrum Scale this year (I've spent the last year managing a group responsible for Spectrum Scale at work), and it was quite fascinating to contrast the two events and their communities' respective priorities and interests.

The Spectrum Scale User Group featured a presentation on "What is new in Spectrum Scale 5.1.0" and the Lustre BOF had its analogous Feature Discussion. I broadly bucketize the new features presented at both events into four categories:

1. Enterprisey features that organizations may care about

For Spectrum Scale, this included support for newer releases of RHEL, SLES, Ubuntu, AIX(!), and Windows (!!). IBM also noted that Spectrum Scale now supports the zEDC hardware compression unit on the z15 mainframe processor:

Spectrum Scale 5.1 platform updates (slides: https://www.spectrumscaleug.org/wp-content/uploads/2020/11/episode-11-what-is-new-in-5-1.pdf)

The Lustre discussion presented their equivalent OS support slide with a similar set of supported enterprise Linux distributions (RHEL, SLES, Ubuntu). No support for AIX or Z (s390x) though:

Lustre 2.14 platform updates

If nothing else, this was a reminder to me that the market for Spectrum Scale is a bit broader than Lustre's HPC-centric one. I have to assume they have enough AIX, Windows, and Z customers to justify their support for those platforms. That said, wacky features like hardware-assisted compression aren't unique to Spectrum Scale on Z; Lustre picked up hardware-assisted compression back in 2017 thanks to Intel.

New improvements to Spectrum Scale's security posture were also presented that were a little alarming to me. For example, one no longer has to add scp and echo to the sudoers file for Spectrum Scale to work (yikes!). There was also a very harsh question from the audience to the effect of "why are there suddenly so many security fixes being issued by IBM?" and the answer was similarly frightening; Spectrum Scale is now entering markets with stringent security demands which has increased IBM's internal security audit requirements, and a lot of new vulnerabilities are being discovered because of this.

It's ultimately a good thing that Spectrum Scale is finding and fixing a bunch of security problems, since the prior state of the practice was just not performing stringent audits. I assume that Lustre's approach to security audits is closer to where Spectrum Scale was in years past, and should Lustre ever enter these "new markets" to compete with Spectrum Scale, I expect a similarly uncomfortable quantity of security notices would come to light. This is all speculative though; the only definite is that IBM is moving GPFS towards role-based access control, which is a positive direction.

Overall, Spectrum Scale seemed considerably more focused on developing these enterprisey features than Lustre.

2. Manageability features that administrators may care about

Spectrum Scale also revealed a bunch of smaller features that are nice to have for administrators including

  • Faster failing of hung RDMA requests - you can now set a maximum time that an RDMA request can hang (e.g., if an endpoint fails) before its thread is killed by Spectrum Scale itself. This avoids having to wait for lower-level timeouts and seems like a nice-to-have knob for a file system that supports a lot of path and endpoint diversity. Lustre may be ahead on this front with its lnet_transaction_timeout parameter, but it's unclear exactly how these two settings differ.
  • Safeguards against administrator error - Spectrum Scale added features that warn the administrator about doing something that may be a mistake, such as accidentally breaking quorum by downing a node or mapping incorrect drive slots to RAID groups. There's no real equivalent functionality in Lustre; these are the places where Lustre solution providers (think HPE/Cray ClusterStor) get to value-add management software on top of open-source Lustre (think cscli).
  • GUI and REST API changes - you can do an increasing amount of management operations using the Spectrum Scale GUI or its underlying control-plane REST API. Lustre has the IML GUI, but it isn't treated as a first-class citizen in the same way that Spectrum Scale does and it was not mentioned at the Lustre BOF at all. Again, this is an area where vendors usually value-add their own management on top of community Lustre.
  • Improved monitoring, reporting, and phone-home - a framework called "MAPS" was recently introduced to essentially do what Nagios does in most DIY environments--raise alarms for crashes, resource exhaustion, misconfiguration, and the like. It also does performance monitoring and historical data aggregation. As with the other manageability features mentioned, Lustre relies on third-party tools for these features.

For resilience, Spectrum Scale announced new tunable parameters to improve parallel journal recovery:

Spectrum Scale's latest advancements in improving recovery performance

whereas Lustre announced parallel fsck with major performance improvements to speed up recovery:

Lustre's latest advancements in improving recovery performance

Finally, Spectrum Scale showcased its vision to allow Spectrum Scale to be mounted inside containerized environments:

The Spectrum Scale vision for containerized application access

This is actually somewhere that Lustre is quite a bit ahead in some regards because it has long had features like UID/GID mapping and subdirectory mounts that allow for a greater degree of isolation that maps well to untrusted containers.

That all said, Lustre's focus is not on taking on more of these nice-to-have manageability features. When asked about adding basic manageability features like supporting easy addition/removal of Lustre OSTs and OSSes to enable evergreen Lustre systems analogous to Spectrum Scale's mmrestripefs command, the answer was effectively "no." The reasons given were that (1) Lustre clients are where files get stitched together, so migration will always have to involve client access, and (2) lfs find and lfs migrate already provide the tools necessary, in theory, to move data files. From this, I take away that stitching those two lfs commands together into a tool that actually does what mmrestripefs does is an exercise left to the reader--or a company who can value-add such a tool on top of their Lustre offering.

3. Performance, scalability, and reliability features that end users may care about

Spectrum Scale didn't have a huge amount to offer in the user-facing performance/scalability/reliability features this year. They improved their support for QOS (which is admittedly fantastic when compared to Lustre's Token Bucket Filter QOS, which cannot limit IOPS like Spectrum Scale can) from an administrator standpoint, and they have begun to think about how to incorporate TRIMming into flash-based Spectrum Scale deployments to offer reliable performance.
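
For those unfamiliar with the mechanism, Lustre's TBF is, as the name suggests, built on the textbook token-bucket throttle; a generic sketch of that mechanism (my own illustration, with arbitrary rates and burst sizes, not either file system's implementation) looks like this:

    import time

    class TokenBucket:
        """Textbook token bucket: allow at most `rate` operations per second,
        with bursts of up to `burst` operations."""
        def __init__(self, rate, burst):
            self.rate, self.burst = rate, burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def admit(self):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True       # let this I/O through
            return False          # caller must queue or retry the I/O

    if __name__ == "__main__":
        limiter = TokenBucket(rate=100, burst=10)   # e.g., cap one tenant at 100 IOPS
        admitted = sum(limiter.admit() for _ in range(1000))
        print(f"admitted {admitted} of 1000 back-to-back requests")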

By comparison, Lustre's new features really shine in this department. Andreas Dilger presented this slide near the beginning of his talk:

Some of Lustre's many upcoming performance improvements

which reflects significant attention being paid to improving the performance of emerging noncontiguous and otherwise adversarial I/O patterns--perhaps motivated by the storage-hungry AI and genomics markets.

Lustre is also introducing features aimed at both scale-up and scale-out, with a 30x speedup in the time it takes to mount petabyte OSTs (likely in preparation for the exascale Lustre installations coming in the next year or two) and automated directory metadata sharding, shrinking, and balancing. From this, it's clear that the primary focus of Lustre continues to be extreme scale and performance above all else, but it's unclear how much of this effort is putting Lustre ahead of Spectrum Scale as much as it is catching up to all the effort that went into making Spectrum Scale scale out to 250 PB for the Summit system.

4. Interface features that platform developers may care about

The newest release of Spectrum Scale introduces improvements to NFS (by adding v4.1 support), CSI (incremental improvements), SMB (incremental improvements), and most surprising to me, HDFS. By comparison, I don't think Lustre directly supports any of these interfaces--you have to use third-party software to expose these protocols--and if they are supported, they aren't under active development.

Overall Impressions

These two presentations pointed to a sharp contrast between how Spectrum Scale and Lustre position themselves as storage systems; IBM's vision for Spectrum Scale is as a high-capacity data lake tier against which a diversity of apps (HPC, containerized services, map-reduce-style analytics) can consume and produce data. They even said as much while talking about their HDFS support:

Spectrum Scale's vision as a hub for all data in the enterprise

Spectrum Scale AFM improvements were also touted at the user group presentation as a means to enable workflows that span on-premise and public cloud for workloads involving HPC, containerized services, file, and object--no matter where you operate, Spectrum Scale will be there. They showed this logo soup diagram which spoke to this:

Spectrum Scale logo soup supporting complex workflows and hybrid cloud

and it's clearly aligned with IBM's hybrid cloud corporate strategy. I can see how this vision could be useful based on my experience in industry, but at the same time, this looks like a Rube Goldberg machine with a lot of IBM-specific linchpins that concentrates risk on IBM product support (and licensing costs!) progressing predictably.

Lustre, by comparison, appears to be focused squarely on performance and scale. There was no logo soup or architectural vision presented at the Lustre BOF itself. This is likely a deliberate effort by the Lustre community to focus on being an open-source piece of a larger puzzle that can be packaged up by anyone with the need or business acumen to do so. Just as Linux itself is a community effort around which companies like Red Hat (IBM) or SUSE build and market a solution, Lustre should be just one part of an organization's overall data management strategy, whereas Spectrum Scale is trying to be the entire answer.

This isn't a value judgment for or against either; Lustre offers more architectural flexibility at the cost of having to do a lot of day-to-day lifting and large-scale architectural design oneself, while Spectrum Scale is a one-stop shop that likely requires fewer FTEs and engineering effort to build infrastructure for complex workflows. The tradeoff, of course, is that Spectrum Scale and its surrounding ecosystem is priced for enterprises, and absent a new pricing scheme that economically scales cost with capacity (hypothetically referred to as "data lake pricing" at the SSUG), the choice of whether to buy into Spectrum Scale or Lustre as a part of a larger data strategy may come down to how expensive your FTEs are.

On a non-technical note, the Lustre BOF certainly felt more community-oriented than the Spectrum Scale UG; the dialog was more collegial and there were no undertones of "customers" demanding answers from "vendors." This is not to say that the SSUG wasn't distinctly more friendly than a traditional briefing; it just felt a bit more IBM-controlled since it was on an IBM WebEx whose registration was moderated by IBM and where all the speakers and question answerers were IBM employees. Perhaps there's no other way in a proprietary product since the vendor ultimately holds the keys to the kingdom.

IO-500 BOF

The IO-500 BOF is one of my favorite events at both ISC and SC each year, but as with the rest of SC'20, this year's IO-500 BOF felt like a quiet affair. I noticed two noteworthy themes:

  1. I/O performance is being awarded in dimensions beyond just peak I/O bandwidth. There are six awards now being given for first place: 10-node bandwidth, 10-node metadata, 10-node overall, total bandwidth, total metadata, and total overall. This contrasts with Top500 which treats performance in a single dimension (peak HPL) and implicitly perpetuates the position that HPL performance is the only aspect of performance that defines "#1." I quite like the IO-500 approach because it makes it easier to see a multidimensional picture of I/O performance and apply your own value system to the list to decide what combination of hardware and storage system software qualifies as #1.
  2. The importance of system configuration is rising in the IO-500 community--defining a system hardware schema, presenting the data uniformly, and establishing standard tools and techniques for collecting this data from the systems running the IO-500 benchmark are all on the roadmap. Again, this makes the list much more valuable for the purposes of learning something, since a properly annotated set of submissions would allow you to understand the effects of, for example, choosing NVMe over SAS SSDs or declustered parity over RAID6 on nonvolatile media.

The final IO-500 list for SC'20 itself didn't change much this time; experimental and proof-of-concept file systems remain dominant in the top 10 positions, and DAOS, WekaFS, and IME carry most of the weight. However, the #1 position was a surprise:

The overall winner for the IO-500 full list was Pengcheng Laboratory's MadFS

A new file system called "MadFS" took the top spot with some ridiculous performance numbers, and frustratingly, there have been no public disclosures about what this file system is or how it works. The IO-500 committee said that they spoke privately with the submitters and felt comfortable that the entry was legitimate, but they were not at liberty to disclose many details since Pengcheng Laboratory is preparing to present MadFS at another venue. They did hint that MadFS drew inspiration from DAOS, but they didn't offer much more.

Peeling the MadFS submission apart does reveal a few things:

  • It is a file system attached to Pengcheng Laboratory's Cloudbrain-II system, which is a Huawei Atlas 900 supercomputer packed with Huawei Kunpeng 920 ARM CPUs and Huawei Ascend 910 coprocessors. Cloudbrain-II is a huge system with a huge budget, so it should have a very capable storage subsystem.
  • 72 processes were run on each of the 255 client nodes, reaching a peak of 2,209,496 MiB/second. This translates to 73 Gbit/sec out of each 100 Gb/s node--pretty darned efficient.
  • The MadFS file system used is 9.6 PB in size, and the fastest-running tests (ior-easy-*) ran for a little over six minutes. This corresponds to 863 TB read and written in the best case, which is reasonable (this and the per-node bandwidth above are sanity-checked in the sketch after this list).
  • The ior-easy tests were run using a transfer size of 2,350,400 bytes, which is a really weird optimization point. Thus, it's unlikely that MadFS is block-based; it probably runs entirely in DRAM or HBM, is log-structured, and/or relies on persistent memory to buffer byte-granular I/O from any underlying block devices.
  • The submission indicates that 254 metadata nodes were used, and each node had six storage devices. The submission also says that data servers (of an undefined quantity) had 2 TB NVMe drives.
    • Since 255 clients and 254 metadata servers were used, this may suggest that metadata is federated out to the client nodes. This would explain why the metadata rates are so astonishing.
    • If the 9.6 PB of NVMe for data was located entirely on the 255 clients, this means each compute node would've had to have over 37 TB of NVMe after parity. This seems unlikely.
    • From this, we might guess that MadFS stores metadata locally but data remotely. This would be a very fragile architecture for important data, but a reasonable one for ephemeral storage akin to UnifyFS.
  • MadFS is not ready for prime time, as its statfs(2) returns nonsense data. For example, the MadFS ior-easy-* runs report the file system has zero inodes, while the ior-hard-* runs reported 268 trillion inodes, all of which are used.
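
Since none of the MadFS details are official, it's worth sanity-checking the per-node bandwidth and total-volume figures above.  Every input in the sketch below comes from the public IO-500 submission; only the arithmetic is mine.

```python
# Back-of-the-envelope check of the MadFS ior-easy numbers quoted above.
nodes = 255
agg_mib_per_sec = 2_209_496                          # aggregate ior-easy rate, MiB/s

per_node_gbit = agg_mib_per_sec / nodes * 2**20 * 8 / 1e9
print(f"{per_node_gbit:.0f} Gbit/s per node")        # ~73 Gbit/s out of a 100 Gb/s NIC

runtime_sec = 6.2 * 60                               # "a little over six minutes"
total_tb = agg_mib_per_sec * 2**20 * runtime_sec / 1e12
print(f"~{total_tb:.0f} TB moved in the best case")  # ~860 TB
```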

Until more disclosures are made about MadFS and the Cloudbrain-II system though, there's little intellectual value in this IO-500 submission. However, the waters are definitely chummed, and I for one will be keeping an eye out for news about this Chinese system.

Finally, although not part of the IO-500 BOF, Microsoft Azure released some benchmark results shortly after the conference about their successful demonstration of over 1 TB/sec using BeeGFS in Azure. This wasn't run to the IO-500 spec so it wouldn't have been a valid submission, but it is the single fastest IOR run in the cloud of which I am aware. This bodes well for the future of parallel file systems in the cloud, as a blessed BeeGFS/Azure configuration would compete directly with Amazon FSx for Lustre.

Concluding Thoughts

Virtual SC this year turned out to be far more exhausting than I had anticipated despite the fact that I never had to leave my chair. On the upside, I got to attend SC with my cat for the first time:

Harriet dialing into the Women in HPC Workshop with me

and I didn't find myself getting as sweaty running between sessions. On the downside, the whole conference was just weird. The only conference buzz I felt was through the Twitter community due to the total lack of chance encounters, late nights out, early morning briefings, and copious free coffee. The content felt solid though, and I admit that I made heavy use of pause, rewind, and 2x replay to watch things that I would have otherwise missed in-person.

In my past SC recaps I remarked that I get the most out of attending the expo and accosting engineers on the floor, and the complete absence of that made SC feel a lot less whole. As a speaker, the lack of engagement with the audience was very challenging too. The 45-second delay between live video and Q&A made dialog challenging, and there was no way to follow up on questions or comments using the virtual platform. I suppose that is the price to be paid for having an otherwise robust virtual event platform.

Although COVID forced us all into a sub-optimal SC venue this year, I think it also took away a lot of advancements, discussions, and dialog that would've fed a richer SC experience as well. With any luck SC can be in-person again next year and the community will have bounced back and made up for the time lost this year. When SC'21 rolls around, we should have at least one exascale system hitting the floor in the US (and perhaps another in China) to talk about, and the Aurora system should be very well defined. We'll have a few monster all-flash file systems on the I/O front to boot (including one in which I had a hand!), and the world will be opening up again--both in the technological sense and the literal sense. The future looks bright.

As always, I owe my sincerest thanks to the organizers of SC this year for putting together the programs that spurred this internal monologue and the dialogues in which I engaged online these past two weeks. I didn't name every person from whom I drew insight, but if you recognize a comment that you made and would like attribution, please do let me know.

Finally, if you'd like to read more, see my recap of the PDSW'20 workshop and forthcoming recap of my Tiered Storage BOF and the DAOS User Group.


IOPS are dumb

This post is a long-form dump of some thoughts I've had while testing all-flash file systems this past year, and I'll be harping on many of these themes at SC'21 this year.  Bits of this appear in a presentation and paper I'm presenting at PDSW'21 about new benchmarking techniques for testing all-flash file systems, and I'll be discussing whether IOPS really factor into performance for AI workloads at a Fireside Chat with Jeff Denworth.  I'd love to hear if you disagree with my perspective at either of the above events or through the usual social media channels!

"How many IOPS do you need?"

I'm often asked this by storage vendors, and the question drives me a little bonkers.  I assume they ask it because their other customers bring them black-and-white IOPS requirements, but I argue that anyone would be hard-pressed to explain the scientific value of one I/O operation (versus one gigabyte) if ever called on it.  And yet, IOPS are undeniably important; the illustrious Rob Ross devoted a whole slide to this at a recent ASCAC meeting:

Rob Ross' perspective on why IOPS are now important for HPC I/O

I agree with all of Rob's bullets and yet I disagree with the title of his slide; IOPS are dumb, and yet ignoring them when designing a performance-optimized parallel file system is even more dumb in contemporary times.  So let's talk about the grey area in between that creates this dichotomy.

First, bandwidth is pretty dumb

If there's one constant in HPC, it's that everyone hates I/O.  And there's a good reason: it's a waste of time because every second you wait for I/O to complete is a second you aren't doing the math that led you to use a supercomputer in the first place.  I/O is the time you are doing zero computing amidst a field called "high performance computing."

That said, everyone appreciates the product of I/O--data.  I/O is a necessary part of preserving the results of your calculation, so nobody ever says they wish there was no I/O.  Instead, infinitely fast I/O is what people want since it implies that 100% of a scientist's time using an HPC is spent actually performing computations while still preserving the results of that computation after the job has completed.

Peeling back another layer of that onion, the saved results of that computation--data--have intrinsic value.  In a typical simulation or data analysis, every byte of input or output is the hard-earned product of a lot of work performed by a person or machine, and it follows that if you want to save a lot of bytes while spending as little time as possible performing I/O, the true value of a parallel storage system's performance is in how many bytes per second it can read or write.  At a fundamental level, this is why I/O performance has long been gauged in terms of megabytes per second, gigabytes per second, and now terabytes per second.  To the casual observer, a file system that can deliver 100 GB/s is more valuable than a file system that can deliver only 50 GB/s assuming all things are equal for this very reason.  Easy.

This singular metric of storage system "goodness" quickly breaks down once you start trying to set expectations around it though.  For example, let's say your HPC job generates 21 TB of valuable data that must be stored, and it must be stored so frequently that we really can't tolerate more than 30 seconds writing that data out before we start feeling like "too much time" is being spent on I/O instead of computation.  This turns out to be 700 GB/s--a rather arbitrary choice since that 30 seconds is a matter of subjectivity, but one that reflects the value of your 21 TB and the value of your time.  It should follow that any file system that claims 700 GB/s of write capability should meet your requirements, and any vendor who can deliver such a system should get your business, right?

Of course not.  It's no secret that obtaining those hero bandwidths, much like obtaining Linpack-level FLOPS, requires you (the end-user) to perform I/O in exactly the right way.  In the case of the aforementioned 700 GB/s file system, this means all of the following (sketched in code after the list):

  1. Having each MPI process write to its own file (a single shared file will get slowed down by file system lock traffic)
  2. Writing 4 MiB at a time (to exactly match the size of the network transmission buffers, remote memory buffers, RAID alignment, ...)
  3. Using 4 processes per node (enough parallelism to drive the NIC, but not too much to choke the node)
  4. Using 960 nodes (enough parallelism to drive all the file system drives, but not too much to choke the servers)
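
To make that pattern concrete, here's a minimal mpi4py sketch of the kind of benchmark code that produces hero numbers; the file name, transfer count, and launch line are hypothetical choices of mine, not anything a vendor actually prescribes.

```python
# Hypothetical "hero run" write pattern: one file per process, 4 MiB transfers,
# launched with something like: mpiexec -n 3840 -ppn 4 python hero_write.py
from mpi4py import MPI

XFER_SIZE = 4 * 1024 * 1024              # 4 MiB, matched to the system's sweet spot
XFERS_PER_RANK = 1024                    # 4 GiB per process, an arbitrary choice

rank = MPI.COMM_WORLD.Get_rank()
payload = bytearray(XFER_SIZE)           # stand-in for real checkpoint data

# One file per MPI process sidesteps shared-file lock traffic entirely.
with open(f"checkpoint.{rank:06d}", "wb", buffering=0) as f:
    for _ in range(XFERS_PER_RANK):
        f.write(payload)                 # large, contiguous, aligned writes
```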

I've never seen a scientific application perform this exact pattern, and consequently, I don't expect that any scientific application has ever gotten that 700 GB/s of performance from a "700 GB/s file system" in practice.  In that sense, this 700 GB/s bandwidth metric is pretty dumb since nobody actually achieves its rated performance. Of course, that hasn't prevented me from saying these same dumb things when I stump for file systems.  The one saving grace of using bandwidth as a meaningful metric of I/O performance, though, is that I/O patterns are a synthetic construct and can be squished, stretched, and reshaped without affecting the underlying scientific data being transmitted.

The value of data is in its contents, not the way it is arranged or accessed.  There's no intrinsic scientific reason why someone should or shouldn't read their data 4 MiB at a time as long as the bits eventually get to the CPU that will perform calculations on it in the correct order.  The only reason HPC users perform nice, 1 MiB-aligned reads and writes is because they learn (either in training or on the streets) that randomly reading a few thousand bytes at a time is very slow and works against their own interests of minimizing I/O time.   This contrasts sharply with the computing side of HPC where the laws of physics generally dictate the equations that must be computed, and the order in which those computations happen dictates whether the final results accurately model some physical process or just spit out a bunch of unphysical garbage results.

Because I/O patterns are not intrinsically valuable, we are free to rearrange them to best suit the strengths and weaknesses of a storage system to maximize the GB/s we can get out of it.  This is the entire foundation of MPI-IO, which receives I/O patterns that are convenient for the physics being simulated and reorders them into patterns that are convenient for the storage system.  So while saying a file system can deliver 700 GB/s is a bit disingenuous on an absolute scale, it does indicate what is possible if you are willing to twist your I/O pattern to exactly match the design optimum.
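
As a contrived illustration, here's a minimal mpi4py sketch of handing MPI-IO an imbalanced, physics-convenient decomposition and letting its collective buffering machinery do the reordering; the array sizes and file name are made up.

```python
# Each rank owns an irregularly sized slice of a global array; the collective
# write call lets MPI-IO aggregate and reorder the pieces into large, aligned
# requests before they ever hit the file system.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1000 + 7 * rank, rank, dtype=np.float64)  # imbalanced on purpose
counts = comm.allgather(local.size)
offset = sum(counts[:rank]) * local.itemsize              # byte offset of this rank's slice

fh = MPI.File.Open(comm, "output.dat", MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(offset, local)   # collective: two-phase I/O does the reordering
fh.Close()
```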

But IOPS are particularly dumb

IOPS are what happen when you take the value out of a value-based performance metric like bandwidth.  Rather than expressing how many valuable bytes a file system can move per second, IOPS express how many arbitrary I/O operations a file system can service per second.  And since the notion of an "I/O operation" is completely synthetic and can be twisted without compromising the value of the underlying data, you might already see why IOPS are a dumb metric of performance.  They measure how quickly a file system can do something meaningless, where that meaningless thing (an I/O operation) is itself a function of the file system.  It's like saying you can run a marathon at five steps per second--it doesn't actually indicate how long it will take you to cover the twenty-six miles.

IOPS as a performance measure was relatively unknown to HPC for most of history.  Until 2012, HPC storage was dominated by hard drives, which only delivered high-value performance for large, sequential reads and writes; the notion of an "IOP" was antithetical to performance.  The advent of flash introduced a new dimension of performance in its ability to read and write a lot of data at discontiguous (or even random) positions within files or across entire file systems.  Make no mistake: you still read and write more bytes per second (i.e., get more value) from flash with a contiguous I/O pattern.  Flash just raised the bottom end of performance in the event that you are unable or unwilling to contort your application to perform I/O in a way that is convenient for your storage media.

To that end, when a vendor advertises how many IOPS they can deliver, they really are advertising how many discontiguous 4 KiB reads or writes they can deliver under the worst-case I/O pattern (fully random offsets).  You can convert a vendor's IOPS performance back into a meaningful value metric simply by multiplying it by 4 KiB; for example, I've been presenting a slide that claims I measured 29,000 write IOPS and 1,400,000 read IOPS from a single ClusterStor E1000 OST array:

Performance measurements of a single ClusterStor E1000 NVMe Lustre OST

In reality, I was able to write data at 0.12 GB/s and read data at 5.7 GB/s, and stating these performance metrics as IOPS makes it clear that these data rates reflect the worst-case scenario of tiny I/Os happening at random locations rather than the best-case scenario of sequential I/Os which can happen at 27 GB/s and 41 GB/s, respectively.
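
The conversion itself is just multiplication; here's the arithmetic behind those two numbers, assuming the conventional 4 KiB operation size.

```python
# Converting 4 KiB IOPS into bandwidth (decimal GB/s).
OP_SIZE = 4 * 1024                         # bytes per "I/O operation"

def iops_to_gbs(iops):
    return iops * OP_SIZE / 1e9

print(iops_to_gbs(29_000))                 # ~0.12 GB/s of random writes
print(iops_to_gbs(1_400_000))              # ~5.7 GB/s of random reads
```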

Where IOPS get particularly stupid is when we try to cast them as some sort of hero number analogous to the 700 GB/s bandwidth metric discussed above.  Because IOPS reflect a worst-case performance scenario, no user should ever be asking "how can I get the highest IOPS" because they'd really be asking "how can I get the best, worst-case performance?"  Relatedly, trying to measure the IOPS capability of a storage system gets very convoluted because it often requires twisting your I/O pattern in such unrealistic ways that heroic effort is required to get such terrible performance.  At some point, every I/O performance engineer should find themselves questioning why they are putting so much time into defeating every optimization the file system implements to avoid this worst-case scenario.

To make this a little more concrete, let's look at this slide I made in 2019 to discuss the IOPS projections of this exact same ClusterStor E1000 array:

Projected performance of a ClusterStor E1000 NVMe Lustre OST based on a PCIe Gen3 platform

Somehow the random read rate went from a projected 600,000 to an astonishing 1,400,000 read IOPS--which one is the correct measure of read IOPS?

It turns out that they're both correct; the huge difference in measured read IOPS is the result of the 600 kIOPS estimate coming from a measurement that

  1. ran for a much longer sustained period (180 seconds vs. 69 seconds)
  2. used fewer client nodes (21 nodes vs. 32 nodes)
  3. wrote larger files (1,008× 8 GiB files vs. 1,024× 384 GiB files)

Unlike the IOPS measurements on individual SSDs which are measured using a standard tool (fio with libaio from a single node), there is no standard method for measuring the IOPS of a parallel file system.  And just as the hero bandwidth number we discussed above is unattainable by real applications, any standardized IOPS test for a parallel file system would result in a relatively meaningless number.  And yes, this includes IO-500; its numbers have little quantitative value if you want to design a parallel file system the right way.

So who's to say whether a ClusterStor E1000 OST is capable of 600 kIOPS or 1,400 kIOPS?  I argue that 1,400 kIOPS is more accurate since I/O is bursty and a three-minute-long burst of completely random reads is less likely than a one-minute long one on a production system.  If I worked for a vendor though, I'm sure this would be taken to be a dishonest marketing number since it doesn't reflect an indefinitely sustainable level of performance.  And perhaps courageously, the official Cray ClusterStor E1000 data sheet doesn't even wade into these waters and avoids quoting any kind of IOPS performance expectation.  Ultimately, the true value of the random read capability is the bandwidth achievable by all of the most random workloads that will realistically be run at the same time on a file system.  Good luck figuring that out.

Write IOPS are really dumb

As I said at the outset, I cannot disagree with any of the bullets in the slide Rob presented at ASCAC.  That first one is particularly salient--there is a new class of HPC workloads, particularly in AI, whose primary purpose is to randomly sample large datasets to train statistical models.  If these datasets are too large to fit into memory, you cannot avoid some degree of random read I/O without introducing biases into your weights.  For this reason, there is legitimate need for HPC to demand high random read performance from their file systems.  Casting this requirement in terms of 4 KiB random read rates to have a neat answer to the "how many IOPS do you need" question is dubious, but whatever.  There's little room for intellectual purity in HPC.

The same can't be said for random write rates.  Write IOPS are a completely worthless and misleading performance metric in parallel file systems.

In most cases, HPC applications approximate some aspect of the physical world, and mathematics and physics were created to describe this physical world in a structured way.  Whether you're computing over atoms, meshes, or matrices, there is structure to the data you are writing out and the way your application traverses memory to write everything out.  You may not write data out in a perfectly ordered way; you may have more atoms on one MPI process than another, or you may be traversing an imbalanced graph.  But there is almost always enough structure to scientific data to squish it into a non-random I/O pattern using middleware like MPI-IO.

Granted, there are a few workloads where this is not true.  Out-of-core sorting of short-read DNA sequences and in-place updates of telescope mosaics are two workloads that come to mind where you don't know where to write a small bit of data until you've computed on that small bit of data.  In both these cases though, the files are never read and written at the same time, meaning that these random-ish writes can be cached in memory, reordered to be less random, and written out to the file asynchronously.  And the effect of write-back caching on random write workloads is staggering.

To illustrate this, consider three different ways in which IOR can be run against an all-NVMe file system to measure random 4 KiB writes (each is sketched in code after the list):

  • In the naïve case, we just write 4 KiB pages at random locations within a bunch of files (one file per MPI process) and report what IOR tells us the write IOPS were at the end.  This includes only the time spent in write(2) calls.
  • In the case where we include fsync, we call fsync(2) at the end of all the writes and include the time it takes to return along with all the time spent in write(2).
  • In the O_DIRECT case, we open the file with direct I/O to completely bypass the client write-back cache and ensure that write(2) doesn't return until the data has been written to the file system servers.
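
To make the three cases concrete, here's a single-node Python stand-in for those IOR runs; the file name, operation count, and file size are arbitrary choices of mine, and it assumes Linux (for O_DIRECT).

```python
# Sketch of the three ways to time 4 KiB random writes described above.
import mmap, os, random, time

NUM_OPS, OP_SIZE, FILE_SIZE = 100_000, 4096, 1 << 30

def random_write_iops(extra_flags=0, include_fsync=False):
    fd = os.open("testfile", os.O_CREAT | os.O_WRONLY | extra_flags, 0o644)
    buf = mmap.mmap(-1, OP_SIZE)              # page-aligned buffer (required for O_DIRECT)
    buf.write(b"x" * OP_SIZE)
    start = time.time()
    for _ in range(NUM_OPS):
        offset = random.randrange(FILE_SIZE // OP_SIZE) * OP_SIZE  # 4 KiB-aligned random offset
        os.pwrite(fd, buf, offset)
    if include_fsync:
        os.fsync(fd)                          # charge the cache flush to the clock too
    elapsed = time.time() - start
    os.close(fd)
    return NUM_OPS / elapsed

print("naive   :", random_write_iops(), "write IOPS")                         # write(2) time only
print("fsync   :", random_write_iops(include_fsync=True), "write IOPS")       # include the flush
print("O_DIRECT:", random_write_iops(extra_flags=os.O_DIRECT), "write IOPS")  # bypass the cache
```
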
These seemingly minor changes result in write IOPS rates that differ by over 30x:

Random write IOPS measured using IOR on an all-NVMe parallel file system

Again we ask: which one is the right value for the file system's write IOPS performance?

If we split apart the time spent in each phase of this I/O performance test, we immediately see that the naïve case is wildly deceptive:

Breakdown of time spent in I/O calls for 4K random write IOR workload

The reason IOR reported a 2.6 million write IOPS rate is because all those random writes actually got cached in each compute node's memory, and I/O didn't actually happen until the file was closed and all cached dirty pages were flushed.  At the point this happens, the cache flushing process doesn't result in random writes anymore; the client reordered all of those cached writes into large, 1 MiB network requests and converted our random write workload into a sequential write workload.

The same thing happens in the case where we include fsync; the only difference is that we're including the time required to flush caches in the denominator of our IOPS measurement.  Rather frustratingly, we actually stopped issuing write(2) calls after 45 seconds, but so many writes were cached in memory during those 45 seconds that it took almost 15 minutes to reorder and write them all out during that final fsync and file close.  What should've been 45 seconds of random writes to the file system turned into 45 seconds of random writes to memory and 850 seconds of sequential writes to the file system.

The O_DIRECT case is the most straightforward since we don't cache any writes, and every one of our random writes from the application turns into a random write out to the file system.  This cuts our measured IOPS almost in half, but otherwise leaves no surprises when we expect to only write for 45 seconds.  Of course, we wrote far fewer bytes overall in this case since the effective bytes/sec during this 45 seconds was so low.

Based on all this, it's tempting to say that the O_DIRECT case is the correct way to measure random write IOPS since it avoids write-back caches--but is it really?  In the rare case where an application intentionally does random writes (e.g., out-of-core sort or in-place updates), what are the odds that two MPI processes on different nodes will try to write to the same part of the same file at the same time and therefore trigger cache flushing?  Perhaps more directly, what are the odds that a scientific application would be using O_DIRECT and random writes at the same time?  Only the most masochistic HPC user would ever purposely do something like this since it results in worst-case I/O performance; it doesn't take long for a user to realize that this I/O pattern is terrible and that reformulating it would let them make more productive use of their supercomputer.

So if no user in their right mind does truly unbuffered random writes, what's the point in measuring it in the first place?  There is none.  Measuring write IOPS is dumb.  Using O_DIRECT to measure random write performance is dumb, and measuring write IOPS through write-back cache, while representative of most users' actual workloads, isn't actually doing 4K random I/Os and therefore isn't even measuring IOPS.

Not all IOPS are always dumb

This all being said, measuring IOPS can be valuable in contexts outside of parallel file systems.  Two cases come to mind where measuring IOPS can be a rational yard stick.

1. Serving up LUNs to containers and VMs

By definition, infrastructure providers shouldn't be responsible for the applications that run inside black-box containers and VMs because they are providing storage infrastructure (block devices) and not storage services (file systems).  Blocks in and blocks out are measured in IOPS, so the fit is natural.  That said, HPC users care about file systems (that is, scientific applications do not perform I/O using SCSI commands directly!), so worrying about LUN performance isn't meaningful in the HPC context.

2. Measuring the effect of many users doing many things

While individual HPC workloads rarely perform random I/Os on purpose, if you have enough users doing many small tasks all at once, the file system itself sees a workload that approaches something random.  The more small, independent tasks running in parallel and the farther back you stand from the overall I/O load timeline, the more random it looks.  So, I argue that it is fair to measure the IOPS of a parallel file system for the purposes of measuring how much abuse a file system can take before it begins to impact everybody.

Take, for example, this IOPS scaling test I ran on a small all-flash file system using IOR:

Scale-up IOPS benchmarking to demonstrate the saturation point of an all-flash file system

It looks like it takes about 4,096 concurrent random readers or writers to max out the file system.  This alone isn't meaningful until you consider what this means in the context of the whole compute and storage platform.

What fraction of the cluster's compute nodes corresponds to 4096 cores?  If you've got, say, 728 dual-socket nodes with 64-core AMD Epyc processors, it would only take 32 of those compute nodes to max out this file system.  And if another user wanted to use any of the remaining 696 compute nodes to, say, run a Python script that needed to read in random packages scattered across the file system, there would be no IOPS capacity left, and everyone would experience perceptible lag.
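
The arithmetic behind that claim, assuming one I/O-generating process per core:

```python
# How many dual-socket, 64-core Epyc nodes does it take to field 4,096 random
# readers or writers, and how much of a 728-node cluster is left over?
cores_per_node = 2 * 64
saturating_nodes = 4096 // cores_per_node
print(saturating_nodes)         # 32 nodes are enough to max out the file system
print(728 - saturating_nodes)   # 696 nodes left with no IOPS headroom
```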

Of course, this is the most extreme case--purely random IOPS--but you can measure the IOPS that a real workload does generate on the server side when, say, sampling a deep learning training dataset. With this, you can then figure out how much headroom that application leaves for every other random-ish workload that needs to run on the same system.

Once you realize that a lot of the unglamorous parts of scientific computing--reading dotfiles when you log in, loading shared objects when you launch a dynamically linked executable, or even just editing source code--are full of random-like reads, you can establish a quantitative basis for figuring out how badly an IOPS-intensive data analysis application may affect everyone else's interactive accesses on the same file system.

This is not to say that we can easily answer the question of "How many IOPS do you need?" though.  How many IOPS a workload can drive is not how many IOPS that workload needs--it's really how fast it can compute before it has run out of data to process and needs to read more in.  The faster your compute nodes, generally, the more data they can consume.  They still want all the IOPS you can give them so they can spend as much time computing (and not waiting for I/O) as possible, and how many IOPS your application can drive is a function of how quickly it runs given the full stack between it and the storage, including CPU, memory, and networking.

If everything is dumb, now what?

Give up trying to reduce I/O performance down to a single IOPS number, because it's two degrees away from being useful.  Bandwidth is a better metric in that it's only one degree away from what actually matters, but at the end of the day, the real metric of I/O performance is how much time an application has to wait on I/O before it can resume performing meaningful computations.  Granted, most storage vendors will give you a blank stare if you take this angle to them; telling them that your application spends 50% of its time waiting on I/O isn't going to get you a better file system from a storage company alone, so think about what the real problem could be.

Is the application doing I/O in a pattern (random or otherwise) that prevents the storage system from delivering as many bytes/second as possible?  If so, ask your vendor for a storage system that delivers more bandwidth to a wider range of I/O patterns than just perfectly aligned 1 MiB reads and writes.

Is the storage system already running as well as it can, but it only takes a few compute nodes to max it out?  If so, your storage system is too small relative to your compute system, and you should ask your vendor for more servers and drives to scale out.

Is the storage system running at 100% CPU even though it's not delivering full bandwidth?  Servicing a small I/O requires a lot more CPU than a large I/O since there are fixed computations that have to happen on every read or write regardless of how big it is.  Ask your vendor for a better file system that doesn't eat up so much CPU, or ask for more capable servers.

Alternatively, if you have a lot of users all doing different things and the file system is giving poor performance to everyone, ask your vendor for a file system with better quality of service.  This will ensure that one big job doesn't starve out all the small ones.

Is the storage system slow but you don't have the time to figure out why?  If so, it sounds like you work for an organization that doesn't actually value data because it's not appropriately staffed.  This isn't a storage problem!

Ultimately, if solving I/O problems were as easy as answering how many IOPS you need, storage wouldn't be the perpetual pain point in HPC that it has been.  As with all things in computing, there is no shortcut; the proper way to approach this is by rolling up your sleeves and ruling out problems one at a time.  You can (and should!) ask for a lot from your storage vendors--flexibility in delivering bandwidth, CPU-efficient file systems, and quality of service controls are all valid requests when buying storage.  But IOPS are not.

Life and leaving NERSC

When word started to spread that I was leaving my job at NERSC for Microsoft, a lot of people either directly or indirectly attributed my decision to being one motivated by money.  Rationalizing my decision to leave is certainly a lot easier with this "Glenn was lured away with bags of cash" narrative, but that wasn't really a factor when I chose to move on.  Rather, my decision is a reflection of where I see the world of HPC going in the coming decade and where I personally wanted to position myself.  For my own therapeutic reasons (and perhaps the benefit of anyone interested in what it's like to work within, and subsequently leave, the DOE HPC complex), I'll try to write it all out here.

Working at NERSC

First things first: NERSC has been a wonderful place to work.

A typical view from outside NERSC's facility in Berkeley after work during the winter months.  Yes, it really does look like this.

When I started in mid-2015, I came in with about three years of prior work experience (two at SDSC doing user support and one at a biotech startup) and knew a little bit about a lot of things in HPC.  But I didn't really know the basics of I/O or storage--I couldn't tell you what "POSIX I/O" really meant or how GPFS worked.  The fact that I got to help author NERSC's ten-year strategy around storage in just two years, was invited to present my view on how to bridge the gap between HPC and enterprise storage at Samsung's North American headquarters a year later, and was trusted to oversee the design and execution of the world's first 35 petabyte all-flash Lustre file system through my first four years is a testament to how much opportunity is available to learn and grow at NERSC.

There are a couple of reasons for this.

Stable funding

Perhaps foremost, NERSC (and DOE's Leadership Computing Facilities, ALCF and OLCF) enjoy healthy budgets and financial stability since worldwide leadership in scientific advancement is generally a national priority for both major political parties in the US.  This means that, regardless of who is president and which party holds majorities in Congress, the DOE HPC facilities can pay their employees and deploy new supercomputers.  This solid funding makes it much easier to invest in staff development and long-term planning; I was able to become a resident I/O expert at NERSC because I was never forced to chase after the funding du jour to make ends meet.  Congress trusts NERSC to allocate its funding responsibly, and NERSC prioritized letting me learn as much as I could without distraction.

Instant credibility and access

Second, having a NERSC affiliation gives you instant credibility and access in many cases.  It's not necessarily fair, but it's definitely true.  Within my first year at NERSC, I was invited to give a presentation about I/O performance monitoring in Paris because the organizer wanted a lineup of speakers from all the big players in HPC.  I had never been to Europe at that point in my life, but being the I/O guy from NERSC (and being able to present well!) was enough to get me there.  And it was during that trip to Paris that I got to meet--and literally have conversation over dinner with--more industry bigshots than I can remember.  And that trip to Paris was not an outlier; pandemic aside, NERSC let me go to Europe at least once or twice every year I've worked there.

The first photo I ever took of Notre Dame on the first day I'd ever set foot in Europe.  NERSC sent me there less than a year after I started.

Of course, this is not to say that every employee at a DOE HPC facility is wining and dining in Paris every summer.  Many of these opportunities are earned by showing the value of the work you're doing, just like at any job.  But owing to healthy budgets, travel expenses are rarely the limiting factor in chasing after these opportunities.  In addition, going out into the world and talking about what you do is part of the job at a DOE facility; being a leader in the field of HPC is part of the mission of NERSC, ALCF, and OLCF, so doing high-risk, first-of-a-kind work and telling the world about it is uniquely valued within DOE in a way that it is not in industry.

Smart people

A product of these two factors (stable budget and instant credibility) results in coworkers and colleagues who are generally very experienced and capable.  There's an interesting mix of laissez-faire management and rigorous process-driven management as a result.

Staff are generally given the freedom to choose their own destiny and focus on work that they enjoy, much like in any academic environment; it's not hard to pick up passion projects or even move between groups if things get stale on a day-to-day basis.  Since everyone is working on their own slices of HPC, there's also easy access to world experts in different areas of technology if you need one.  For example, I recall once reviewing a storage system that appeared to rely on multiplexing two 12G SAS links over a single 24G SAS link.  After one email and a few hours, a coworker confirmed, complete with a citation to the SCSI standards, that this was totally possible.  Even if someone in-house didn't know the answer, I had direct access to an engineering manager at a leading storage vendor who owed me a favor and definitely would've known the answer.  It's really, really hard to find as many smart people within arm's reach in most other HPC centers.

At the same time, there is rigorous federal oversight on major projects and procurements to ensure that taxpayer dollars are responsibly spent.  This is a double-edged sword because all of the reporting and reviews that go into massive capital projects make forward progress very slow at times.  All DOE HPC facilities review and re-review everything about these giant supercomputers before making a decision, so by the time the public sees a press release about a new supercomputer, lab staff have spent literal years going over every detail and risk.  It sometimes may not seem that way (how many problems has Aurora had?), but rest assured that every schedule slip or technology change the public hears was preceded by countless hours of meetings about risk and cost minimization.  On the flip-side though, you have the opportunity to learn every gory detail about the system directly from the people who designed it.

Pay

In true millennial fashion, I think it's important to have an open discussion about the pay.  DOE labs pay more than any other HPC facility in the world as far as I am aware, and even in the San Francisco Bay Area, salary at NERSC is comparable to the base salaries offered by all the big tech companies.  You can get an idea of what entry-level salaries (think: first job after a postdoc or a few years out of undergrad) look like by searching H1B visa postings, and anecdotally, I'd wager that a typical HPC job at NERSC pays about 2x that of the same job at a typical US university and 3x-4x that of the same job at a British or European university.  All the labs pay about the same to boot, so an HPC job at somewhere like Oak Ridge can afford you a relatively luxurious lifestyle.

Don't get me wrong though; affording to buy a Bay Area house on a single NERSC salary alone would be tough in the same way that buying a Bay Area house on any single salary would be.  And while NERSC's compensation is comparable to the base salary of the big tech companies, that base is about all you can get since DOE labs cannot offer equity or substantial bonuses.  This is less of a gap if you're just starting out, but anyone who's looked at compensation structures in tech knows that stock-based compensation, not base salary, dominates total compensation as you move up.

So, if money wasn't an issue for me and NERSC is such a great place to work, why would I ever leave?

The road ahead for HPC

On one hand, HPC's future has never been brighter thanks to how much life (and money!) the AI industry is bringing to the development of HPC technologies.  We have new all-flash file systems, gigantic GPUs, awesome CPU memory technologies, and mixed-precision techniques in the HPC space that were all directly driven by developments primarily intended for AI workloads.  On the other hand, leadership HPC appears to be engaging in unsustainable brinkmanship while midrange HPC is having its value completely undercut by cloud vendors.  I've not been shy about my overall anxiety about where HPC is going because of this, but I'll elaborate now that the exascale race has been won.

The future of leadership HPC

Without some monumental breakthrough in transistor technology, there is only one path forward in continuing to build faster and faster supercomputers in the next decade: pour more and more energy (and dissipate more and more heat) into larger and larger (and more and more) GPUs.

The goal post for exascale power keeps moving because that's been the easiest way to hit the mythical exaflop milestone; while the original goal was 20 MW, Frontier is coming in at 29 MW and Aurora at "under 60 MW."  Not only is this just a lot of power to feed into a single room, but the cost and effort of actually building this infrastructure is newsworthy in and of itself these days.  At the current trajectory, the cost of building a new data center and extensive power and cooling infrastructure for every new leadership supercomputer is going to become prohibitive very soon.

HPC data centers situated in places where the cost of electricity and real estate (stacked atop the risk of earthquake or wildfire) further skew the economics of just adding more power are going to run up against this first.  It used to be easy to dismiss these practicality concerns by arguing that colocating scientists with supercomputers created immeasurable synergy and exchange of ideas, but the fact that science never stopped during the work-from-home days of the pandemic has taken a lot of air out of that argument.

My guess is that all the 50-60 MW data centers being built for the exascale supercomputers will be the last of their kind, and that there will be no public appetite to keep doubling down.

Given this, DOE's leadership computing facilities are facing an existential threat: how do you define leadership computing after exascale if you can't just add another 50% more power into your facility?  How do you justify spending another $600 million for a supercomputer that uses the same power but only delivers 15% more performance?  You can pour similarly huge amounts of money into application modernization to accelerate science, but at the end of the day, you'd still be buying a lot of hardware that's not a lot faster.

The future of places like NERSC

NERSC is probably a little better off since its lack of an exascale machine today gives it at least one more turn of the crank before it hits a hard power limit in its data center.  That gives it the ability to deploy at least one more system after Perlmutter that is significantly (at least 2x) more capable but draws significantly more power.  However, compared to Frontier and Aurora, such a system may still look rather silly when it lands in the same way that Perlmutter looks a bit silly compared to Summit, which was funded by the same agency but deployed years earlier.

And therein lies the dilemma of centers like NERSC--how do you position yourself now so that by the time you deploy an HPC system that is close to maxing out on power, it is sufficiently different from a pure-FLOPS leadership system that it can solve problems that the leadership systems cannot?

The easy go-to solution is to craft a story around "data-centric" supercomputing.  We did this when I was at the San Diego Supercomputer Center when we were budget-limited and had to differentiate our $12 million Comet supercomputer from TACC's $30 million Stampede.  You invest more in the file system than you would for a pure-FLOPS play, you provide low-cost but high-value onramps like Jupyter and science gateways to enable new science communities that have modest computing needs, and you fiddle with policies like allocations and queue priority to better suit interactive and urgent computing workloads.  From a productivity standpoint, this can be a great story since users will always respond well to lower queue wait times and fewer frustrations with the file system.  From a system architect's standpoint, though, this is really boring.  The innovation happens in policies and software, not clever hardware or design, so there's very little that's new for a system designer to think about in this case.

A more innovative approach is to start thinking about how to build a system that does more than just run batch jobs.  Perhaps it gives you a private, fast file system where you can store all your data in a way indistinguishable from your personal laptop.  Perhaps it gives you a convenient place to run a Jupyter notebook that has immediate access to a powerful GPU.  Or perhaps it gives you all the tools to set up an automated process where all you have to do is upload a file to trigger an automatic data analysis and reduction pipeline that returns its output to a shiny HTTP interface.  Such a system may not be able to crank out an exaflop using HPL, but does that matter if it's the only system in the country that supports such automation?

There are interesting system architecture questions in the latter case, so as a system designer, I much prefer it over the "data-centric" angle to non-exaflop supercomputing strategies.  But there remains a problem.

The problem: cloud

Such a "more than just batch jobs" supercomputer actually already exists.  It's called the cloud, and it's far, far ahead of where state-of-the-art large-scale HPC is today--it pioneered the idea of providing an integrated platform where you can twist the infrastructure and its services to exactly fit what you want to get done.  Triggering data analysis based on the arrival of new data has been around for the better part of a decade in the form of serverless computing frameworks like Azure Functions.  If you need to run a Jupyter notebook on a server that has a beefy GPU on it, just pop a few quarters into your favorite cloud provider.  And if you don't even want to worry about what infrastructure you need to make your Jupyter-based machine learning workload go fast, the cloud providers all have integrated machine learning development environments that hide all of the underlying infrastructure.

And therein lies the problem: the definition of "innovation" as non-exaflop HPC runs up against this power wall might actually mean "catching up to the cloud."

This is not to say that NERSC-like HPC centers are entirely behind the cloud; all the DOE HPC facilities have bigger, faster, and more convenient parallel file systems that are generally always on and where data is always somewhere "fast."  They also provide familiar, managed software environments and more egalitarian support to small- to mid-scale science projects.  DOE HPC also takes the most risk in deploying unproven technologies to shake them out before they become available to the wide market.

However, those gaps are beginning to close.  You can stick a full Cray EX system, identical to what you might find at NERSC or OLCF, inside Azure nowadays and avoid that whole burdensome mess of building out a 50 MW data center.  You can also integrate such a system with all the rich infrastructure features the cloud has to offer like triggered functions.  And when it comes to being first to market for risky HPC hardware, the cloud has already caught up in many ways--Microsoft deployed AMD Milan-X CPUs in their data centers before any HPC shop did, and more recently, Microsoft invested in AMD MI-200 GPUs before Frontier had a chance to shake them out.

Given this steep trajectory, I see only two scenarios for large-scale, non-exaflop HPC facilities in the 10+ year horizon:

  1. They develop, adopt, steal, or squish cloud technologies into their supercomputers to make them functionally equivalent to cloud HPC deployments.  They may be a little friendlier to scientific users since cloud functionality wasn't designed for scientific computing alone, but they also may not be as stable, mature, or feature-rich as their cloud cousins.
  2. They find better overall economics in eventually moving to massive, long-term, billion-dollar deals where flagship HPC systems and their "more than just batch jobs" features are colocated inside cloud datacenters sited at economically advantageous (that is, cheap power, cooling, and labor) locations in the country.

There's also grey area in between where national HPC facilities consolidate their physical infrastructure in cheap areas to manage costs but still self-manage their infrastructure rather than fully outsource to a commercial cloud.  CSCS has hinted at this model as their future plan since they cannot build 100 MW datacenters in Switzerland, and this is proof that leading HPC facilities around the world see the writing on the wall and need to maneuver now to ensure they remain relevant beyond the next decade.  Unfortunately, the politics of consolidating the physical infrastructure across the DOE HPC sites would likely be mired in Congressional politics and take at least a decade to work out.  Since serious work towards this hasn't started yet, I don't envision such a grey-area solution emerging before all the DOE facilities hit their power limit.

Hopefully I've painted a picture of how I perceive the road ahead for large-scale HPC facilities and you can guess which one I think will win out.

Final thoughts

I have every confidence that there will still be DOE HPC facilities in ten years and that they will still be staffed by some of the brightest minds in HPC.  And even if a cloud-based HPC facility ultimately consumes centers like NERSC, I don't think many people would be out of work.  The vast majority of what DOE's HPC people do is think carefully about technology trends, maintain a deep understanding of user requirements, provide excellent support to its thousands of users, and keep complex supercomputers running well.  Those jobs don't go away if the supercomputer is in the cloud; it's just the physical location, the hands doing physical hardware swaps, and the breadth of vendor interactions that may change.

For me as a system architect though, it's become too hard for me to catch up to all the new technologies and techniques HPC needs for the future while also building up other staff to be masters of today's I/O challenges.  I found myself at a fork in the road.  One path would mean catching up on a technical level and then getting in front of where the future of HPC lies before it gets there.  The other path would mean trying to steer the entire DOE HPC ship in the right direction, as long as that may take, and have faith that the people I bring along can race far enough ahead to tell me if we're still going where we need to go.  Perhaps a bit selfishly, I chose the former.  I'm just not ready to give up on racing ahead myself yet, and the only way I could hope to catch up was to make it a full-time job.

I don't claim to know the future, and a lot of what I've laid out is all speculative at best.  NERSC, ALCF, or OLCF very well may build another round of data centers to keep the DOE HPC party going for another decade.  However, there's no denying that the stakes keep getting higher with every passing year.

That all said, DOE has pulled off stranger things in the past, and it still has a bunch of talented people to make the best of whatever the future holds.

SC'22 Recap

The biggest annual conference in HPC, the SC conference, was recently held in Dallas, Texas in its second hybrid incarnation since being all-remote for the pandemic. This year attracted over 11,000 attendees, which is much closer to the pre-pandemic high of 14,000 than last year's 7,000, and judging from the crushed conference rooms and busy expo floor, it looks like SC is not that much worse for wear.

This year's conference was quite different for me since I attended for the first time as a vendor, not a researcher or practitioner, and I spent most of my days behind closed doors talking to customers. I didn't get to attend any of the keynotes, BOFs, or panels to which I wasn't invited as a result, so I'm not really qualified to give an erudite summary of the conference or expo this year.

So instead, I'm just writing down what I remember in order that I remember it and not necessarily in a coherent narrative form. I'm sure I missed a lot (for example, mixed precision seemed big this year, and I heard Jack Dongarra gave a fantastic Turing Award talk) so I encourage others to write their own recaps and share with the community!

High-level themes

I actually started writing an SC'21 recap last year which I never posted, and re-reading the intro was funny--you'd think nothing has changed in the last year.

The underwhelming

The biggest deal appears to be that exascale is here, and it turns out that it's not that big of a deal. China let the air out of the tires by debuting their exascale systems at SC'21, and not only did they thumb their nose at Top500 by not submitting, they debuted by winning a Gordon Bell prize instead. The first US exascale system, Frontier, was debuted at ISC this year leaving its showing at SC a bit deflated too. Frontier was featured in the Gordon Bell prize-winning paper this year, but that work required the use of four Top-10 systems, not just Frontier, painting the reality that one giant computer rarely stands on its own when it comes to advancing science.

This isn't to say that deploying exascale systems isn't a noteworthy feat and worth commendation, but I felt like the hype over the last five years treated the achievement like an end state instead of a milestone. And now that we've passed the milestone, the community is grasping to figure out what comes next. So what is next?

Quantum had a strong and growing presence at SC, as it has for the last few years. But the conclusion of the panel "Quantum Computing: A Future for HPC Acceleration" was that no, it's not close to being ready.

Disaggregation and composability was another theme with growing momentum. And like quantum, there was a panel asking the same question: "Does HPC need composability now?" The answer, again, was no, not yet. More on that below.

What about RISC-V? Surely that will revolutionize the field. As it turns out, the answer there is also that RISC-V is not ready to do anything useful for HPC yet.

The list goes on of technologies and trends that people are trying to boost now that exascale is "solved." The reality, I think, is that "exascale" will take years to actually mature since it appears to have a ton of technical debt that accumulated during the race to be first. US Exascale rests on the shoulders of AMD and Intel, two companies whose software stacks have not caught up to the market leader, so there will be a lot of thrashing around as development practices and optimization settle out around these systems.

Struggling with code porting is not very exciting to computer science Ph.D.s, so I expect future SCs to mirror this one and bifurcate into two distinct tracks: those struggling to identify the next big thing in the research space, and those struggling to use the systems that were rushed to deployment.

The unexpected

My SC experience was very biased since I didn't get out much, but two related themes kept popping up across different meetings and the sessions I did attend.

Power efficiency is serious business now. It used to seem like people talked about the need for energy-efficient HPC in an abstract sense while continuing to jam more power into every rack without changing their approach to system design, facilities, and deployment models. That has hit a hard wall with energy prices soaring in Europe, though. The financial impacts of power-inefficient supercomputing have gone from a one-time capex cost to an ongoing opex cost that is putting many HPC facilities on an unsustainable cost trajectory. Even sites that aren't doing new deployments are facing sudden, sharp increases in their costs, and nobody has good answers about how they will keep the lights on.

Cloud HPC is confusing. With only 15% of total HPC dollars winding up in the cloud, it's little surprise that most HPC folks are only peripherally aware of what HPC in the cloud really means. Worse yet, a subset of those folks are actively hostile towards the idea of running HPC workloads in the cloud. I spoke with my colleagues from all three major cloud service providers as well as my colleagues in DOE, NSF, and education throughout the week, and everyone painted this same general picture.

There seems to be a mismatch between the expectations of on-prem HPC folks and cloud HPC folks. For example, I was asked why Windows doesn't support OpenMP very well, and after a bit of digging, I realized that the question really wasn't about using OpenMP on Windows as much as it was about using OpenMP in the cloud. There was a latent assumption that "HPC in Microsoft's cloud" must mean "HPC on Windows" which, for the record, is false--I don't even know how to use Windows anymore. Similarly, people decried the performance impacts of sharing HPC nodes with others in the cloud (they are not shared), overheads of virtualizing InfiniBand or GPUs (everyone uses PCIe passthrough or SR-IOV for HPC nodes), and other misconceptions.

This isn't to say that cloud people aren't confused too; I heard stories about conversations that went sideways because cloud folks (not from my employer, thankfully!) didn't realize that the requirements of a traditional gov/edu HPC facility couldn't be neatly wrapped up into a single workload with a single solution, contrary to the case across many commercial AI shops. And both sides are struggling to find models for partnership and engagement that mirror the traditional relationship between places like a DOE or NSF facility and a company like Cray. HPC departments are used to buying supercomputers and parallel file systems, while cloud providers sell computing and storage as a service. The distinction may seem trivial on the surface, but there's a large divide that becomes evident once both sides start trying to drill into the details of what a partnership would look like.

Parallel I/O in Practice Tutorial

This was my fifth year contributing to the Parallel I/O in Practice Tutorial with my colleagues at Argonne and Google, and it was our first time doing it in person since 2019. It felt really good to be back in front of people to opine about the perils of POSIX and the greatness of the Darshan I/O profiling tool, and this year I retired the material I used to present on burst buffers (since DataWarp and Infinite Memory Engine have lost relevance in HPC) and the TOKIO holistic I/O analysis framework (since it is no longer funded or maintained). In their stead, I presented material on benchmarking with IOR and mdtest that I debuted at LUG 2022 earlier this year.

I haven't gotten feedback yet on whether this change was a net positive one, but I think it went over well. Benchmarking I/O is really challenging if you don't understand how things like page cache really work in distributed systems, and walking through some benchmark examples concretizes a lot of abstract parallel file system concepts like locking and striping. And since benchmarking is a rabbit hole of arbitrary complexity, ending the tutorial with advanced benchmarking topics turned out to be a nice way to add buffer to the end of an eight-hour stretch of carefully timed presentations. It's very easy to skip over the nuances of analyzing mdtest outputs if attendees have a lot of questions about more important things at the end of the day.

The most surprising observation of the tutorial is how many attendees aren't using MPI anymore. We got a lot of questions last year about task-oriented I/O, and this year had some great questions about trying to understand or tune the I/O performed by Python-based analytics frameworks. We decided to add support for Darshan to profile non-MPI applications back in 2019 which is now paying dividends by ensuring it is a relevant tool for these new analytics and AI workloads, and we'll probably have to give more attention to optimizing these workloads' I/O in the future.

DAOS User Group

Monday morning was cold and rainy--a perfect day to attend the 2022 DAOS User Group which was held off-site at the Fairmont Hotel.

Whether you particularly care about DAOS or not, the cross-community HPC I/O brain trust is guaranteed to be in attendance, and this year did not disappoint. In addition to the expected stakeholders from Intel and DOE, representatives from all three big CSPs were there, and Google Cloud, Seagate, and HPE/Cray were all on the agenda--a sign of both the diversifying set of large HPC companies investing time in DAOS and the willingness of the DAOS team to partner with all comers.

Life after Optane

The question that opened up the meeting, of course, was "what is the future of DAOS since Intel cancelled Optane?" Kelsey Prantis had the official statement:

Official announcement about DAOS support after Optane was cancelled

The high-level project answer is that DAOS isn't going anywhere. Aurora, by virtue of still having Optane DIMMs, will not be affected, and DAOS will maintain support for Optane until Intel drops its last Optane DIMMs (Crow Pass for Sapphire Rapids) from support life sometime towards the end of this decade.

For new customers who aren't going to use Optane, the answer is "Metadata on NVMe," a feature being co-developed by Intel, HPE, and Google to implement a write-ahead log (WAL) and allow DAOS to use volatile DRAM instead of Optane. It will work like a file system journal: a compact representation of each write will be committed to NVMe immediately after landing in DRAM, and DAOS will asynchronously write back the properly serialized representation of that transaction after it is acknowledged. Johann Lombardi had a helpful cartoon that showed how this WAL will fit into DAOS:

WAL implementation diagram as it relates to DAOS metadata in DRAM and on NVMe
WAL implementation diagram as it relates to DAOS metadata in DRAM and on NVMe. Slides available on the DUG22 website.

A key benefit of DAOS's implementation of this WAL is that it will still be able to service incoming writes while flushing old ones; although I don't fully grasp how this works, it is something enabled by the sophisticated I/O scheduler already implemented in DAOS.
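To make the flow concrete, here's a minimal toy sketch of the general log-then-write-back pattern described above. This is my own illustration, not DAOS code; the class and method names are hypothetical, and a real implementation would be far more careful about batching, alignment, and recovery.

```python
# My own toy illustration of the WAL pattern, not DAOS code.
import os, pickle, queue, threading

class ToyWAL:
    def __init__(self, wal_path, store_dir="store"):
        self.store_dir = store_dir
        os.makedirs(store_dir, exist_ok=True)
        self.wal = open(wal_path, "ab", buffering=0)   # compact log on fast NVMe
        self.dram_state = {}                           # authoritative copy in DRAM
        self.flush_q = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, key, value):
        record = pickle.dumps((key, value))
        # 1. commit a compact record of the transaction to the log
        self.wal.write(len(record).to_bytes(4, "little") + record)
        os.fsync(self.wal.fileno())                    # make it durable before acking
        self.dram_state[key] = value                   # 2. update the in-DRAM metadata
        self.flush_q.put((key, value))                 # 3. queue the async write-back
        return "ack"                                   # caller is acknowledged here

    def _flusher(self):
        # serialize the full representation to bulk storage off the critical path
        while True:
            key, value = self.flush_q.get()
            with open(os.path.join(self.store_dir, str(key)), "wb") as f:
                pickle.dump(value, f)
            self.flush_q.task_done()
```

The important property is that the synchronous path only pays for the small log append, while the expensive, properly serialized write-back happens in the background.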

The complete implementation isn't expected to be released until Spring 2024, but it appears to touch only a few components of DAOS and doesn't affect anything above the VOS layer of the DAOS server.

There was also mention of developing interoperability with new CXL-attached memory-semantic SSDs to keep the persistent memory capability of DAOS alive beyond Optane. I'm not sure if this would offer a performance benefit over the metadata-on-NVMe feature; early results show that metadata-on-NVMe actually delivers higher IOPS than Optane since the synchronous write path is much simpler when it doesn't have to account for memory persistence. That said, I didn't really follow the full extent of options on the table for how DAOS metadata may work across different types of memory.

DAOS in the flesh at Argonne

Kevin Harms presented an update on Aurora's massive 220 PB DAOS installation and laid out its configuration. There are 1,024 DAOS servers based on the Intel Coyote Pass server design, each sporting

  • 2x Intel Xeon 5320 (Ice Lake) sockets
  • 2x DAOS engines (one per socket)
  • 16x 32GB DDR4 DIMMs
  • 16x 512GB Optane DIMMs (Persistent Memory 200)
  • 16x 15.36 TB Samsung PM1733 NVMe SSDs
  • 2x 200 Gb/s Slingshot NICs

The total configuration is quoted at 220 PB usable, but Kevin pointed out that this assumes that every object is erasure coded at 16+2. Unlike virtually every other storage system out there, though, users can choose the data protection for their individual objects when they create them, meaning this 220 PB capacity is an upper limit to what users can do. Users with very hot, read-only objects may choose to replicate instead of erasure code, while others who are capacity-constrained may choose to erasure code everything at 16+2 at the cost of latency and IOPS. This flexibility is really powerful for users since they can tailor their object layout ("object class" in DAOS parlance) to match the needs of their workload.
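For a rough sanity check on that 220 PB figure, here's my own back-of-the-envelope arithmetic; it ignores spares, metadata, and formatting overhead, so treat the numbers as approximate:

```python
# Back-of-the-envelope usable capacity for Aurora's DAOS, ignoring spares and overhead.
servers = 1024
ssds_per_server = 16
ssd_tb = 15.36

raw_pb = servers * ssds_per_server * ssd_tb / 1000   # ~252 PB of raw NVMe
ec_16_2_pb = raw_pb * 16 / (16 + 2)                  # ~224 PB if every object is 16+2
rep_3x_pb = raw_pb / 3                               # ~84 PB if every object is 3-way replicated

print(f"raw {raw_pb:.0f} PB | all 16+2 {ec_16_2_pb:.0f} PB | all 3x replicas {rep_3x_pb:.0f} PB")
```

The 16+2 number lands close to the quoted 220 PB upper bound, and the replication number shows how quickly usable capacity shrinks if users pick hotter, more resilient object classes.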

Argonne will be slicing up this DAOS system by giving each scientific project its own DAOS pool, and each pool will be assigned to only 80% of the available DAOS servers by default. This seems like a nice way of providing most of the storage system's performance to every user while leaving freedom to work around bad hardware, bad users, and other performance problems that plague file systems like Lustre, which distribute everything across every single server equally.

Finally, I noticed that Aurora will be using Samsung SSDs, not the Intel (now Solidigm) QLC NAND that appeared in all the DAOS slides floating around two years ago. I'm not sure what happened there, but the move from Solidigm QLC to Samsung TLC couldn't have been cheap.

New features and contributions

DAOS is starting to pick up some truly valuable features that are being developed and contributed by third parties. Of note, croit has contributed a feature which allows DAOS to serve up NVMe over Fabrics targets, and Seagate contributed an S3 gateway for DAOS. Along with the DFS file system interface, DAOS now offers the trifecta of standard object, block, and file services just like Ceph. Unlike Ceph though, performance on DAOS is a first-class citizen. While croit made it clear that the NVMeoF support still has a ways to go to improve the way it does thread pooling and provides resilience, they showed 1.4 million IOPS from a single storage client using TCP over Ethernet with minimal client-side overhead.

Intel is also developing multitenant support for DFUSE, allowing a single compute node to share a DAOS mount and let permissions be enforced through UID/GID just like a regular file system. Before this update, the FUSE-based nature of DAOS allowed any unprivileged user to mount their container (good), but only one FUSE agent could be alive on a single node at a time (not good) which prevented multiple users sharing a node from both mounting their own containers.

DAOS also has some longer-term enhancements that I thought were interesting:

  • expanding the range of POSIX calls supported by DAOS's intercept library to include metadata calls and memory-mapped I/O using userfaultfd
  • implementing collaborative caching - essentially reimplementing the Linux kernel page cache in userspace so that multiple processes can share cached DAOS pages
  • supporting a computational storage paradigm by enabling offload of userspace eBPF scripts to DAOS servers

DAOS in a larger data center ecosystem

Dean Hildebrand from Google Cloud then gave an overview of Google's efforts in bringing DAOS into the cloud. He had some nice performance graphs and I'll link the full presentation here once it's uploaded (it's worth a watch), but the part I found the most insightful was how they are trying to decide where a technology like DAOS fits in the larger cloud storage ecosystem. He outlined two different ways DAOS could work in GCP:

  1. Caching: Google Cloud Storage (GCS) is the point of truth and DAOS is a cache
  2. Tiering: DAOS is a point of truth, and GCS is an archive


Two modes of integrating DAOS in GCP
Two modes of integrating DAOS in GCP. Slides available on the DUG22 website.

He said they were leaning towards the caching model where data only lives ephemerally in DAOS, and personally, I think this is the right move since DAOS in the cloud is not resilient without Optane. However, this choice reflects a much larger tension in cloud storage for HPC:

  1. The centerpiece of every cloud's data story is a scalable, low-cost, low-performance object store which is analogous to what on-prem HPC would call campaign, community, or project storage.
  2. HPC demands higher performance than what these object stores can generally deliver though.

To bridge the gap between these two truths, auxiliary services must bolt on to the object layer and provide higher performance, at a higher cost, for the duration of I/O-intensive HPC jobs. Some choose to provide true tiering from object into a resilient layer of flash (like FSx for Lustre and Weka do), while others project the contents of the object store through a high-performance caching layer (like HPC Cache and File Cache) that is never meant to hold data persistently.

This isn't rocket science, but I never thought deeply about the two models since campaign/community/project storage in on-prem HPC is usually fast enough to avoid needing caches or fine-grained tiering capabilities.

John Bent also had a thought-provoking presentation about how Seagate's now-"deprioritized" CORTX object store, which once competed with DAOS as Mero, contains ideas that can complement DAOS:

DAOS+CORTX is a match made in heaven
DAOS+CORTX is a match made in heaven. Video available online.

Whereas DAOS delivers high performance using NVMe, CORTX delivers great economics using HDDs, and their strengths are complementary. While I don't fully grasp how a tiered (or caching!) system comprised of DAOS and CORTX could be implemented, John rightly pointed out that the same level of space efficiency can deliver higher data protection if multi-level erasure coding is used to stripe across durable block storage. His specific example was erasure coding at 8+1 across servers and 10+1 within servers to deliver both high efficiency and high durability. This could map to something like running DAOS atop CORVAULT, but I don't think all the necessary pieces are in place to realize such a harmonious coexistence yet.
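To illustrate John's space-efficiency point, here's my own quick arithmetic on his 8+1 x 10+1 example; durability depends on failure models and rebuild times, so only the efficiencies are shown:

```python
# Space efficiency of multi-level erasure coding vs. a flat code (my arithmetic, not John's slides).
inner_k, inner_p = 10, 1   # within a server or enclosure
outer_k, outer_p = 8, 1    # across servers

multilevel = (inner_k / (inner_k + inner_p)) * (outer_k / (outer_k + outer_p))
flat_8_2 = 8 / (8 + 2)     # a single-level code with two parities, for comparison

print(f"8+1 x 10+1 efficiency: {multilevel:.1%}")  # ~80.8%, with parity at both levels
print(f"flat 8+2 efficiency:   {flat_8_2:.1%}")    # 80.0%, tolerating only two failures anywhere
```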

Of course, completely tossing Reed-Solomon for something more sophisticated (like VAST does with its locally decodable 150+4 scheme) obviates the need for multilevel erasure entirely. But DAOS has not gone down that route yet.

And as with every talk John gives, there were lots of other interesting nuggets scattered throughout his presentation. Two of my favorites were:

  • A slide that pointed out that, when you buy something like Ceph as an appliance, you may be spending only 25% of the total cost on storage media and the rest is infrastructure, service, and support. This struck me as a bit on the low end, but some enterprisey NAS and midrange parallel file system appliances can go this low. Spending 60% to 90% on media is a lot nicer for the buyer (and companies like Seagate) if you can buy at scale or eschew the white-glove support, and John suggested that it's up to companies like Seagate to fix the software issues that require customers to pay for white-glove support in the first place.  After all, the less someone spends on support and licenses, the more they can spend on Seagate hard drives.
  • John's final slide pointed out that object stores were originally designed to get around the limitations of POSIX file systems, but as they've evolved over the last decade, they're starting to look a lot like file systems anyway since they require strong consistency, hierarchical namespaces, and familiar file semantics. Has all the work put into developing super-fast object stores like DAOS over the last ten years really just brought us back full circle to parallel file systems?  Companies like VAST and Weka have shown that maybe POSIX isn't as bad as the research community (myself included!) have claimed it to be; it was really just low-performance implementations that nobody wanted.

John's talk was recorded and is now online. Like Dean Hildebrand's talk, it is well worth watching (but for wildly different reasons!).

PDSW 2022

I had to duck out of the DAOS User Group early to run (through the rain) to the 7th International Parallel Data Systems Workshop (PDSW 2022) on Monday afternoon.


Much to everyone's surprise, PDSW was only given a half day this year and everything felt a little compressed as a result. The organizers kept the work-in-progress (WIP) sessions, which can often be an interesting peek into what students are pursuing, but minor A/V problems and the unforgiving schedule probably did a disservice to the up-and-comers who use the WIP track to lay the groundwork for future full-length papers. Hopefully SC'23 restores PDSW to its original full-day status.

Splinters keynote from Arif Merchant at Google

The keynote presentation was given by Arif Merchant from Google about Splinters, the framework that Google Cloud uses to sample I/Os in a scalable way. The challenge they face is that it's impossible to trace and store every single I/O that hits Google's storage servers (D servers), but having an understanding of I/O patterns is essential for characterizing workload I/O behavior and planning for future infrastructure. In fact, this problem is so important that Google isn't the only cloud that's solved it!

A lot of what Arif talked about is very similar to how Azure does its I/O tracing under the hood. I suppose it should come as no surprise that there are only so many ways to solve the challenge of sampling individual IOPS in a way that fairly represents the aggregate workload of a huge distributed storage system. One really smart thing Splinters does that I liked was to sample along two different dimensions: not only do they evenly sample across all IOPS at a fixed rate (the obvious thing), but they also sample across files at a fixed rate. In this latter case of per-file sampling, they take a tiny fraction of files and capture every I/O for those files to get a complete picture of how individual files are being accessed.

This file sampling fills the huge gap that exists when randomly sampling IOPS alone. Because different I/Os have different "costs" (for example, reading a 1 MiB file using a single 1 MiB read op or 256x 4 KiB read ops is functionally equivalent to an application), randomly sampling ops introduces systematic biases that can be difficult to back out after the data has been sampled, subsampled, aggregated, and reduced. Splinters' approach lets you see the workload from two different angles (and biases) and answer a much larger range of questions about what's really happening across thousands of storage servers.
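Here's a minimal sketch of how I understood the two-dimensional sampling to work; this is my own reconstruction rather than Google's implementation, and the sampling rates are made up:

```python
# Two-dimensional I/O sampling: a fixed fraction of all ops, plus every op for a fixed fraction of files.
import hashlib
import random

OP_SAMPLE_RATE = 0.001          # keep 0.1% of all I/O ops, uniformly at random
FILES_ONE_IN = 10_000           # keep every op for 1 in 10,000 files

def file_is_sampled(file_id: str) -> bool:
    # stable hash so every storage server makes the same per-file decision
    digest = hashlib.md5(file_id.encode()).digest()
    return int.from_bytes(digest[:4], "little") % FILES_ONE_IN == 0

def maybe_record(op: dict, trace: list):
    if random.random() < OP_SAMPLE_RATE:
        trace.append({**op, "why": "op-sampled"})     # unbiased view of the overall op mix
    if file_is_sampled(op["file_id"]):
        trace.append({**op, "why": "file-sampled"})   # complete access history for a few files
```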

That said, it was interesting to hear Arif describe how Splinters evolved out of a different internal Google project but wound up outliving it. Splinters is also similar to, but slightly different from, their Dapper infrastructure, which also does scalable distributed system tracing. And he alluded to F1, a scalable SQL database that is similar to (but not the same as) the SQL-like query interface that Splinters uses. I got the impression that new technologies come and go pretty quickly at Google, and there's a large appetite for creating new software systems outright rather than shoehorning an existing system into solving a new problem. I can't say one way is better than the other; I was just surprised at the contrast with my own experiences.

Practical papers

PDSW had a healthy combination of both very-researchy papers and applied research papers this year. I could only stick around for the applied papers, and two left an impression.

In the first, Jean Luca Bez presented Drishti, a tool that lives downstream of the Darshan I/O profiling library and finally does what the Darshan community has danced around for years--turning a Darshan log into an actionable set of recommendations on how to improve I/O performance. It does this by cataloguing a bunch of heuristics and using Darshan's new Python integrations to pore through a log and identify known-problematic I/O patterns. Like Jean Luca's DXT Explorer tool, Drishti has a slick user interface and greatly extends the usability and insights that can be pulled out of a Darshan log file. It probably won't win a Turing Award, but this sort of work is probably going to benefit scores of HPC end-users by making Darshan (and troubleshooting I/O problems) much more accessible to mere mortals for years to come.

Adrian Jackson also presented a very tidy apples-to-apples comparison of DAOS and Lustre on the same hardware using both a systems-level benchmark and an application-inspired, object-oriented data model benchmark. The specific bake-off of a new curiosity (DAOS) and the decades-old incumbent (Lustre) is probably interesting to storage nerds, but I think the real novelty of the work is in its exploration of some uncomfortable realities that the HPC I/O community will have to face in the coming years:

  • Does "slow memory" (nonvolatile Optane or CXL-attached memory SSDs) give actual benefit to existing file systems (like Lustre), or is rethinking the entire storage stack (like DAOS did) really necessary to unlock the performance of new hardware?
  • Do applications need to rethink their approach to I/O to make use of post-POSIX storage systems like DAOS, or is performing I/O as you would on a file system (Lustre) on a post-POSIX storage system (DAOS) good enough?

My take from the work is that, for simple I/O patterns like checkpoint/restart, you can get pretty far by just treating something like DAOS the same as you would a parallel file system.

But if you want your data at rest to have the same data model as how it's handled within the application, you really ought to use a storage system that supports data models that are more expressive than a stream of bytes (which is what POSIX files are).

The authors didn't do a perfect job of giving Lustre its fair shake since they chose to use (abuse) directories and files to represent their application's data model on-disk instead of developing an object-file model that file systems like Lustre handle a little better. But let's be real--HPC is full of applications that do the exact same thing and represent datasets on-disk using complex hierarchies of directories and files simply because that's the easiest way to map the application's representation of data into the standard file system model. In that sense, storage systems that represent rich data models in a high-performance way should be really valuable to naive applications that map in-memory data structures directly to files and directories.

Going back to John Bent's closing slide from his DAOS User Group talk, though, does any of this even matter since all answers lead back to parallel file systems? Maybe there's something to be learned about adding better back-door APIs that support more diverse data models than what POSIX file interfaces give us.

The SC22 Expo

The expo is my favorite part of SC because it's when I get to talk to people one-on-one and learn about corners of the HPC industry that I would've never otherwise sought out. Much to my dismay, though, I had very little time to walk the floor this year--so little that I didn't get any swag. If you want to read up on what interesting technology was being showcased, I strongly recommend reading all the great content that Patrick Kennedy and his team at STH created covering the expo.

That said, I did notice some curious trends about the show floor overall.

The NVIDIA booth was notably absent this year (though they shared booth space with partners), and many of the usual top vendors had a significantly smaller presence on the expo floor. Just for fun, I compiled the top ten(ish) vendors by booth size:

  1. Weka.io (3,200 sqft)
  2. VAST Data, Department of Energy, Penguin Computing, HPE, and Microsoft (2,500 sqft)
  3. AWS (2,000 sqft)
  4. Google and TACC (1,600 sqft)
  5. Supermicro, AMD, Intel, Dell, NASA, and Indiana University (1,500 sqft)

I think it's amazing to see all-flash storage companies at the top of the list alongside all of the Big 3 cloud service providers. I may be reading too much into this, but this may mean that the money behind SC is shifting towards companies playing in the cloud-based AI space instead of traditional big iron for simulation. Or perhaps it's a sign that most of the traditional HPC players are taking a hard look at the return they get on a big booth given the current economic climate and pulled back this year.

I did chat with a couple colleagues who completely opted out of a booth this year (for reference, SC'21 had 10% fewer exhibitor booths than SC'19), and the reasoning was consistent: they found more value in having staff meet with customers privately or attend the technical sessions and engage with people organically. Combined with a bit of bad taste left over from SC's high cost of hosting pandemic-era "digital booths" despite low return (did anyone visit digital booths at SC'20 or SC'21?), I can see why some vendors may have chosen to skip the expo this year.

Whatever the reasons may be, I was a bit sad to see such a small presence from some of my favorites like IBM, Fujitsu, Atos, and NEC. Hopefully the SC Exhibits Committee (and the economy!) can find ways to bring back the pre-pandemic glory of the show floor.

The expo wasn't all doom and gloom though! Even though I couldn't make my complete rounds this year, there were a couple of highlights for me.

VAST's masterful marketing

Perhaps the splashiest vendor at SC was VAST Data who had a brilliant marketing presence. First was the giant Vastronaut mascot that was the centerpiece of their booth:

A quick search of Twitter shows just how many people seized the opportunity to take a selfie at their booth. I would love to know how they transported that thing to and from the conference, but whatever the cost, I'll bet it was worth it.

At the Grand Opening Gala on Monday, they also gave out delightfully tacky light-up cowboy hats that everyone seemed to be wearing:

The subtle genius of this was that not only did people wear them during the gala and the Flop Gun-themed Beowulf Bash 2022 party later that night, but they had to wear them on their plane rides home since they were so inconveniently bulky. Case in point: my wife (who doesn't work in tech) sent me this text message to confirm that she was waiting for me at the right luggage carousel at San Francisco Airport:

I wonder how many innocent bystanders, traveling home for Thanksgiving on Thursday or Friday, saw the shiny cowboy hats at airports around the country and wondered what VAST was.

The icing on the cake was VAST's CEO, Renen Hallak, parading around in an unmissable Chuck McGill-style space suit all week, clearly not taking himself too seriously and painting VAST as a work hard/play hard kind of company. Now, do flashy space suits and blinking cowboy hats alone mean VAST has a great product? I can't say**. But marketing is an art that I appreciate, and VAST hit some great notes this year.

** (Seriously, I'm not sure I wouldn't get in trouble for opining about another company here.)

The Microsoft hardware bar

The only booth where I spent any appreciable time this year was my own employer's. I personally love booth duty and accosting strangers on the show floor, especially if there's something interesting at the booth to jumpstart a conversation. When I worked at SDSC it was a Raspberry Pi cluster, and at the Microsoft booth this year it was the "hardware bar."

In addition to the customary booth presentations with giveaways, swag desk, seating area, and a fun caricature artist, the physical servers that underpin the HPC nodes in Azure were on display. Microsoft contributes its hardware platform designs to the Open Compute Project so the physical hardware that runs in Azure data centers isn't entirely mysterious. Still, every cloud has its hardware secrets, so I was surprised to see these servers laid bare.

The newest HPC node type (dubbed HBv4) on display was a node powered by AMD's Genoa processors just announced a few days earlier:

This wasn't a display model, either; it had real DDR5 DRAM, a real NDR InfiniBand HCA, real PCIe Gen5, and real big OCP mezzanine card with real big aluminum heat sinks and a big Microsoft sticker on top. A couple visitors commented on the way the heat piping for those Genoa CPUs was done which I guess is unusual; rather than have a giant copper block on top of each socket, heat pipes connect the socket to massive aluminum heat sinks that are closer to the chassis inlets. In retrospect it makes sense; Genoa has a whopping twelve DDR5 DIMMs per socket which leaves little extra room for heat sinks, and these 88+ core sockets have a staggering thermal design power.

Another exotic piece of hardware on display was an "ND MI200 v4" server:

It's logically similar to Azure's "ND A100 v4" server platform with two CPU sockets, eight SXM4 GPU sockets, eight 200G HDR InfiniBand HCAs, and a bunch of M.2 NVMes. But this specific server has eight MI200 GPUs on a common OAM baseboard and uses Infinity Fabric for GPU-to-GPU communication. I've never seen an OAM-socketed anything in real life before, much less eight of them on a baseboard, so I thought this was pretty great to see in the flesh.

The ND A100 v4 platform was also on display and looked very similar-but-different with its eight A100 GPUs and HGX baseboard:

And unlike the MI200 variant, the general public can run on these nodes.

I'm not sure what more I'm allowed to say, but my colleague Karl made a nice, quick video that runs through the entire Microsoft booth that's worth a watch, and more details can be had by contacting me or your favorite Microsoft account team privately.

Of course, the hardware bar was just a way to lure people into the booth so I could achieve my real goal: meeting new folks. As I wrote before, one of my biggest realizations at SC this year is how generally confused people are about what HPC in the cloud really means--both people who come from traditional on-prem HPC and people who come from traditional enterprisey cloud. I found myself surprising many of the people with whom I spoke on the show floor with factoids that I have taken for granted. For example,

  • Linux is the most common OS on these HPC node types. While you probably(?) can run Windows if you want on this stuff, I think only a few niche markets do this.
  • The usage model for an HPC cluster in the cloud can be the same as on-prem. You can have login nodes, Slurm, home directories, parallel file systems, and all that. Jobs don't have to be containerized or turned into a VM image.
  • The InfiniBand coming out of these nodes is real InfiniBand with real OFED that supports real mpich/mvapich/OpenMPI. It's the same stuff as in on-prem supercomputers. And nodes are assembled into full-bisection fat tree InfiniBand clusters just like normal.
  • There's no noisy neighbor problem on compute nodes because HPC node types aren't shared between users. When you run a VM on an HPC node, you get the whole thing. Just like on large supercomputers.
  • There's no horrible loss of performance due to running in a VM. Virtualization extensions, PCIe passthrough, and SR-IOV bypass the hypervisor for most things. Inside your VM, you see real Zen cores and real Mellanox HCAs, not virtualized devices.

My takeaway impression is that a lot of traditional HPC folks looked at the cloud five or ten years ago, had a sour experience, and haven't paid attention since. In those last five years, though, AI has changed the game. Massive demand for the latest CPUs and accelerators, funded by live-fast-die-young venture capital, has given cloud vendors tremendous financial incentive to catch up to on-prem levels of performance efficiency for AI workloads. And it just so happens that infrastructure that's good for AI is also good for traditional modeling and simulation.

SCinet!

One of the unexpected highlights of my SC this year arose from a chance encounter with a former coworker from NERSC, Ron Kumar, who gave me a whirlwind tour of SCinet.

I have to confess great ignorance around SCinet in general; I always saw it as a weird technological proof of concept that the strange networking people at work would go off and do in the weeks leading up to the actual conference. I knew they did some impressive wide-area transfer demos (like the petabyte-in-a-day demo at SC'16), but I didn't really get the significance.

So what is SCinet? It's this yellow bundle of cables dangling from the ceiling.

SCinet's cable

The yellow cables are 144-core fiber trunks that bring over a terabit per second of bandwidth into the convention center from the Internet via the national research backbones like ESnet and Internet2 and distribute many terabits per second of capacity throughout the SC conference venue. For comparison, most HPC centers in the US only have a tenth of SCinet's wide-area bandwidth at best since 400G infrastructure is still rolling out.

Most attendees may be familiar with the row of expensive-looking networking racks behind a glass wall towards the back of the expo which is where those yellow cables dangling from the ceiling end. Here's a photo from inside that glass wall:

Inside the SCinet glass bubble

What I didn't realize is that if you go around to the back of the giant walled area behind this glass display, there's a security checkpoint that gates entry into a massive network operations center (NOC) full of laptops, spools of fiber, meeting rooms, and busily working teams in charge of all the lower layers of the networking stack.

The process to get into the NOC involves an escort and being tagged in with a tamper-proof wristband, and I learned on the tour that there's millions upon millions of dollars worth of high-end networking equipment in the racks shown above. If you look closely, you can see a security camera at the end of the aisle that speaks to this; that camera was one of many.

Behind the pretty public-facing side of the SCinet racks is a mess of fiber and cables:

Business end of SCinet at SC22

I guess if you have to tear all this down after just a few weeks, there's no point in investing days in dressing it all up nicely! I particularly enjoyed the fiber panels in the third rack that appear to be affixed to the rack post with shoe laces.

This year, SCinet did do a neat proof-of-concept where they demonstrated three 400G routers from three vendors (Juniper, Arista, and Cisco?) all talking the same protocol to handle what I assume is the core routing for everything in the convention center:

I wish I remembered exactly what was going on here, but I know enough about networking to know that, despite there being standard protocols for coordinating between networking gear, each vendor's implementation rarely interoperates cleanly with the others'. If anyone out there knows the details of this achievement, please let me know so I can explain this a little better!

In addition to networking nerd-level demonstrations, SCinet also serves up all the wifi across the convention center. That is why there were tripods with access points scattered around, and why astute attendees may have noticed janky networking equipment scattered around that looked like this:

Again, I get it: for a network infrastructure that's only going to last a week, I don't think it's a good use of anyone's time or money to nicely dress all the networking.

One last factoid I didn't know until this year was that exhibitors can request 100 Gb/s network drops into their individual booths for demos (or downloading the latest version of a PowerPoint presentation really fast). The end result of supporting both a vast wifi network and 100G fiber across the show floor is that there was a lot of fiber going into the single row of SCinet equipment:

SCinet fiber trunks being terminated

Finally, when I posted some of these photos online during the conference, my colleague Bilel was kind enough to post a slide from the SC22 opening presentation that had the speeds and feeds of what I had toured:

If you know anyone involved with SCinet, I highly recommend seeing if you can get a tour at the next SC. Even as a relative networking novice, I walked away with a much greater appreciation for the annual achievement of building SCinet. And who knows? Once I get bored of this whole storage thing, maybe I'll try getting into high-performance networking.

Composability panel

This year I was invited to participate in a panel titled "Smackdown! Does HPC Need Composability Now?" moderated by Addison Snell and Dan Olds from Intersect360 Research. This panel was...different. Unlike the traditional SC panel where panelists take turns presenting slides and saying erudite things, this panel had two teams of panelists. And my team only had one slide to present:

Smackdown team con slide

The ground rules included "personal attacks are allowed," and needless to say, the panel was about equal parts entertainment and technical discourse. That's not a bad thing, though.

Addison and Dan did a phenomenal job of pulling their respective teams together and leading discussion in a format that brought forward the key pros and cons of composability in HPC while poking fun at the thinly veiled, ego-driven personalities that often make up these sorts of panels. Rather than politely dancing around issues like sacrificing memory bandwidth by putting accelerators at the far end of a PCIe bus or gaining higher utilization by allowing users to mix and match CPUs, NICs, and GPUs, we panelists were free to shoot straight (or perhaps a bit hyperbolically) and call each other out on our hidden agendas.

I hope it goes without saying that all of us panelists were in on the format and don't actually think people on the other side are dumb. By wrapping technical arguments in snarky comments, we could keep the level of discussion accessible to a wide audience, drive home the key points from both sides, and ensure that we weren't losing audience members who don't care about the PhD-level details as much as they want to hear what their peers are thinking about this exciting new space. I got some feedback afterwards that I didn't seem to hold back, so if anyone did take anything I said seriously, I am very sorry!

On a technical level, what was the outcome?

It turns out that, after both sides argued their case, there was about a 60/40 split between people who felt composability wasn't required yet and those who felt it was. Even among panelists, many of us were a lot less convinced about our respective positions than we let on during the panel itself. I got a chuckle when I realized that I wasn't the only one who, when invited to be on the panel, asked "what side do you want me to argue?" I honestly could have gone either way because the dust has not yet settled. Dan Stanzione, director of TACC, gave the truest answer to the question of "will composability help HPC" up front--"it depends." Maybe this is a growth opportunity, or maybe it's a lukewarm reception.

Either way, composable technologies are hitting the market regardless of whether you think they'll be useful or not.  AMD Genoa supports CXL 1.1 with extensions for memory pooling, Samsung has memory-semantic SSDs, and everyone and their mother is working on photonics to get higher bandwidths and lower latencies over longer distances. This makes it easier for people to dip their toes in the water to see if composability makes sense, and I think that's what a lot of people will wind up doing in the coming years.

Customer meetings

Unlike in years past, my SC experience this year was dominated by customer meetings. I've been on the customer side of the table plenty of times, but I was surprised to find that it was actually more fun to be on the vendor side for a change. I'm part salesman at heart, so I found it personally gratifying to end a meeting with people nodding along rather than scratching their heads. I learned as a customer that it's very easy for vendors to go way off the rails and waste everyone's time, so I was grateful to have avoided the awkward confusion that punctuates those kinds of meetings.

I also went into the week worrying that I'd be sitting in the same room, hearing the same pitch and the same jokes, and answering the same questions all week. Thankfully, I work with some great field, business, and product teams who set up interesting conversations rather than rote recitations of boring roadmap slides. Approaching the same topics from different angles helped me figure out how all the pieces of what I'm working on fit together to make a complete picture too; there weren't nearly as many opportunities to do this in the DOE world since the end-users of the HPC systems on which I worked aren't told anything until all the design decisions have already been made.

A few personal notes

This SC was significant to me at a variety of levels; it was the first time I'd gotten on an airplane since February 2020, the first time I'd traveled since starting a new job at a new company, and the first time I'd met any of my new coworkers outside of the structure of a Teams call. During the pandemic I realized that getting out into the world and talking to people from all corners of HPC was my favorite part of my job. Not being able to go to events like SC and maintain that sense of community involvement dramatically impacted my level of professional satisfaction over the last two years, so I'm glad I was able to finally go this year.

Though customer meetings were a lot more fun than I expected them to be, I still felt bummed that I could spend so little time walking the expo, talking to folks, and attending all the BOFs normally on my must-attend list. Compounding this was my personal choice not to dine indoors, which meant I missed out on almost all other chances to catch up with old friends and colleagues. I also decided to leave SC a day earlier than I usually do to reduce my risk of getting sick, which didn't help either. There's never enough time at SC, but this year was particularly pressed.

I say all this not to complain, but to say how much I appreciated the people who went out of their way to come accost me during the precious few hours I actually had on the exhibit floor. Some I'd not seen since SC'19, and some I'd never actually met since we only started working together mid-pandemic. The conference is busy for everyone, so giving me a slice of your time was very meaningful. That sense of community membership is why I go to SC, it's why I still work in this business, and it's why I try to contribute whatever I can to whomever wants it whether it be a student, engineer, salesperson, or marketer.

SC'23 Recap


The largest high-performance computing industry conference of the year, SC23, was held in Denver last week. This year's conference attracted over 14,000 attendees and 438 exhibitors, finally breaking pre-pandemic records, and it solidly felt like the old days of the conference in terms of breadth of attendees, the technical program, and overall engagement and interaction across the community.

This was the second time I've attended the conference as a vendor instead of a customer, and this meant I spent a fair amount of time running to and from meetings instead of walking the show floor or attending technical sessions. I'm sure I missed some major announcements and themes as a result, but I thought it still might be valuable to contribute my observations based on this narrow lens of an AI-minded storage product manager for a major cloud service provider. If you're interested in a more well-rounded perspective, check out the HPC Social Supercomputing 2023 Summary and contribute your own thoughts!

I don't know the best way to organize the notes that I took, so I grouped them into a few broad categories:

  1. Big news on the Top500
  2. What's new in storage for HPC and AI
  3. The emergence of pure-play GPU clouds
  4. Other technological dribs and drabs
  5. Personal thoughts and reflections on the conference and community

I must also disclose that I am employed by Microsoft and I attended SC23 in that capacity. However, everything in this post is my own personal viewpoint, and my employer had no say in what I did or didn't write here. Everything below is written from my perspective as an enthusiast, not an employee, although my day job probably colors my outlook on the HPC industry.

With all that being said, let's dive into the big news of the week!

Big news on the Top500

Unveiling the new Top500 list is the tentpole event of SC every year regardless of how much people (including myself!) deride HPL, and unlike the lists over the past year, this newest listing had two big surprises. Many of us went into the SC23 season wondering whether the Aurora system, whose hardware was delivered this past June, would be far enough along in installation and shakeout to unseat Frontier as the second listed exascale system. At the same time, nobody had expected another >500 PF supercomputer to appear on the list, much less one operated privately and for-profit. But both systems made big debuts in the top 5, carrying with them interesting implications.

The new #2: Argonne's Aurora

The Aurora exascale system has a storied history going back to 2015; first conceived of as a 180 PF supercomputer to be delivered in 2018, it evolved into a GPU-based exascale supercomputer that was supposed to land in 2021. Now two years late and a few executives short, Intel and Argonne were stuck between a rock and a hard place in choosing whether to list their HPL results at SC23:

  1. If Aurora wasn't listed on SC23's Top500 list, it risked going up against El Capitan at ISC'24 and being completely overshadowed by the simultaneous launch of a newer, bigger exascale system.
  2. If Aurora was listed at SC23's Top500 list but in an incomplete form, it would fall short of its long-awaited debut as the #1 system and would require a careful narrative to avoid being seen as a failed system.

Intel and Argonne ultimately chose option #2 and listed an HPL run that used only 5,439 of Aurora's 10,624 nodes (51.1% of the total machine), and as expected, people generally understood that this sub-exaflop score was not an indictment of the whole system underdelivering, but more a reflection that the system was still not stable at its full scale. Still, headlines in trade press were dour, and there was general confusion about how to extrapolate Aurora's HPL submission to the full system.  Does the half-system listing of 585.34 PF Rmax at 24.7 MW power mean that the full system will require 50 MW to achieve an Rmax that's still lower than Frontier? Why is the efficiency (Rmax/Rpeak = 55%) so low?
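For what it's worth, here is the naive linear extrapolation that fuels those questions. This is my own napkin math, and HPL efficiency and power rarely scale perfectly linearly, so treat it as a rough bound rather than a prediction:

```python
# Naive linear extrapolation of Aurora's half-system HPL submission.
nodes_listed, nodes_total = 5_439, 10_624
rmax_listed_pf, power_listed_mw = 585.34, 24.7

scale = nodes_total / nodes_listed
print(f"extrapolated Rmax:  {rmax_listed_pf * scale:,.0f} PF")   # ~1,143 PF, still shy of Frontier's ~1,194 PF
print(f"extrapolated power: {power_listed_mw * scale:.1f} MW")   # ~48 MW, hence the 50 MW question
```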

Interestingly, about half the people I talked to thought that Argonne should've waited until ISC'24 to list the full system, and the other half agreed that listing half of Aurora at SC'23 was the better option. Clearly there was no clearly right answer here, and I don't think anyone can fault Argonne for doing the best they could given the Top500 submission deadline and the state of the supercomputer. In talking to a couple folks from ALCF, I got the impression that there's still plenty of room to improve the score since their HPL run was performed under a time crunch, and there were known issues affecting performance that couldn't have been repaired in time. With any luck, Aurora will be ready to go at full scale for ISC'24 and have its moment in the sun in Hamburg.

The new #3: Microsoft's Eagle

The other new Top500 entry near the top of the list was Eagle, Microsoft's surprise 561 PF supercomputer. Like Aurora, it is composed of GPU-heavy nodes, and like Aurora, the HPL run utilized only part (1,800 nodes) of the full system. Unlike Aurora though, the full size of Eagle is not publicly disclosed by Microsoft, and its GPU-heavy node architecture was designed for one specific workload: training large language models for generative AI.

At the Top500 BOF, Prabhat Ram gave a brief talk about Eagle where he emphasized that the system wasn't a custom-built, one-off stunt machine. Rather, it was built from publicly available ND H100 v5 virtual machines on a single 400G NDR InfiniBand fat tree fabric, and Microsoft had one of the physical ND H100 v5 nodes at its booth.  Here's the back side of it:

From top to bottom, you can see it has eight E1.S NVMe drives, 4x OSFP ports which support 2x 400G NDR InfiniBand each, a Microsoft SmartNIC, and a ton of power.  A view from the top shows the HGX baseboard and fans:


Logically, this node (and the ND H100 v5 VM that runs on it) looks a lot like the NVIDIA DGX reference architecture. Physically, it is an air-cooled, Microsoft-designed OCP server, and Eagle's Top500 run used 1,800 of these servers.

Big HPL number aside, the appearance of Eagle towards the top of Top500 has powerful implications on the supercomputing industry at large.  Consider the following.

Microsoft is a for-profit, public enterprise whose success is ultimately determined by how much money it makes for its shareholders. Unlike government agencies who have historically dominated the top of the list to show their supremacy in advancing science, the Eagle submission shows that there is now a huge financial incentive to build giant supercomputers to train large language models. This is a major milestone in supercomputing; up to this point, the largest systems built by private industry have come from the oil & gas industry, and they have typically deployed at scales below the top 10.

Eagle is also built on the latest and greatest technology--NVIDIA's H100 and NDR InfiniBand--rather than previous-generation technology that's already been proven out by the national labs.  SC23 was the first time Hopper GPUs have appeared anywhere on the Top500 list, and Eagle is likely the single largest installation of both H100 and NDR InfiniBand on the planet. Not only does this signal that it's financially viable to stand up a leadership supercomputer for profit-generating R&D, but industry is now willing to take on the high risk of deploying systems using untested technology if it can give them a first-mover advantage.

Eagle also shows us that the potential upside of bringing a massive new AI model to market is worth both buying all the infrastructure required to build a half-exaflop system and hiring the talent required to shake out what is literally a world-class supercomputer. And while the US government can always obtain a DPAS rating to ensure it gets dibs on GPUs before AI companies can, there is no DPAS rating for hiring skilled individuals to stand up gigantic systems. This all makes me wonder: if Aurora was a machine sitting in some cloud data center instead of Argonne, and its commissioning was blocking the development of the next GPT model, would it have been able to take the #1 spot from Frontier this year?

The appearance of such a gigantic system on Top500, motivated by and paid for as part of the AI land grab, also raises some existential questions for the US government. What role should the government have in the supercomputing industry if private industry now has a strong financial driver to invest in the development of leadership supercomputing technologies? Historically, government has always incubated cutting-edge HPC technologies so that they could stabilize enough to be palatable to commercial buyers. Today's leadership supercomputers in the national labs have always wound up as tomorrow's midrange clusters that would be deployed for profit-generating activities like seismic imaging or computer-aided engineering. If the AI industry is now taking on that mantle of incubating and de-risking new HPC technologies, perhaps government now needs to focus on ensuring that the technologies developed and matured for AI can still be used to solve scientific problems.

What's new in storage for HPC and AI?

Since I spent much of my career working in HPC storage, and I now focus largely on AI, it should be no surprise that I heard a lot about the intersection of AI and storage. AI remains high in the hype cycle, so it's natural that just about every storage vendor and discussion had some talk of AI forced into it regardless of whether it was really relevant. However, there were a few places where AI and storage topics intersect that I found noteworthy.

The AI-storage echo chamber

I was asked a lot of questions about storage from journalists, VCs, and even trusted colleagues that followed a common theme: What storage technologies for AI excite me the most? What's the future of storage for AI?

I don't fault people for asking such a broad question because the HPC/AI storage industry is full of bombastic claims. For example, two prominent storage vendors emblazoned their booths with claims of what their products could do for AI:

These photos illustrate the reality that, although there is general agreement that good storage is needed for GPUs and AI, what constitutes "good storage" is muddy and confusing. Assuming the above approach to marketing (10x faster! 20x faster!) is effective for someone out there, there appears to be a market opportunity in just capitalizing on this general confusion by (1) asserting what the I/O problem that's jamming up all AI workloads is, and (2) showing that your storage product does a great job at solving that specific problem.

For example, the MLPerf Storage working group recently announced the first MLPerf Storage benchmark, and Huihuo Zheng from Argonne (co-author of the underlying DLIO tool on which MLPerf Storage was built) described how the MLPerf Storage benchmark reproduces the I/O characteristics of model training at the Workshop on Software and Hardware Co-Design of Deep Learning Systems in Accelerators:

When I saw this premise, I was scratching my head--my day job is to develop new storage products to meet the demands of large-scale AI model training and inferencing, and I have never had a customer come to me claiming that they need support for small and sparse I/O or random access. In fact, write-intensive checkpointing and fine-tuning, not read-intensive data loading, is the biggest challenge faced by those training large language models in my experience. It wasn't until a few slides later that I realized where these requirements may be coming from:

Storage and accelerator vendors are both defining and solving the I/O problems of the AI community which seems counterproductive--shouldn't a benchmark be set by the practitioners and not the solution providers?

What I learned from talking to attendees, visiting storage vendor booths, and viewing talks like Dr. Zheng's underscores a reality that I've faced on my own work with production AI workloads: AI doesn't actually have an I/O performance problem, so storage vendors are struggling to define ways in which they're relevant in the AI market.

I outlined the ways in which LLM training uses storage in my HDF5 BOF talk, and their needs are easy to meet with some local storage and basic programming. So easy, in fact, that a reasonably sophisticated AI practitioner can duct tape their way around I/O problems very quickly and move on to harder problems. There's no reason for them to buy into a sophisticated Rube Goldberg storage system, because it still won't fundamentally get them away from having to resort to local disk to achieve the scalability needed to train massive LLMs.
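To be concrete, the kind of duct tape I'm describing looks something like the sketch below: stage checkpoints on fast local NVMe and drain them to durable shared storage in the background. This is my own illustration with made-up paths, not anyone's production code.

```python
# Duct-tape checkpointing: fast local write on the critical path, slow durable copy off it.
import os
import queue
import shutil
import threading

LOCAL_SCRATCH = "/local_nvme/ckpt"   # hypothetical fast local path
SHARED_STORE = "/shared/ckpt"        # hypothetical slower durable path
os.makedirs(SHARED_STORE, exist_ok=True)

drain_q: "queue.Queue[str]" = queue.Queue()

def _drainer():
    while True:
        path = drain_q.get()
        shutil.copy(path, SHARED_STORE)   # the slow copy happens in the background
        drain_q.task_done()

threading.Thread(target=_drainer, daemon=True).start()

def save_checkpoint(step: int, blob: bytes):
    os.makedirs(LOCAL_SCRATCH, exist_ok=True)
    path = os.path.join(LOCAL_SCRATCH, f"step_{step:08d}.ckpt")
    with open(path, "wb") as f:           # fast write; training resumes as soon as this returns
        f.write(blob)
    drain_q.put(path)                     # persist to the durable tier asynchronously
```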

So yes, I've got no doubt that there are storage products that can deliver 10x or 20x higher performance for some specific AI workload. And MLPerf Storage is probably an excellent way to measure that 20x performance boost. But the reality I've experienced is that half a day of coding will deliver 19x higher performance compared to the most naive approach, and every AI practitioner knows and does this already. That's why there are a lot of storage vendors fishing in this AI storage pond, but none of them seem to be reeling in any whoppers.

This isn't to say that there's nothing interesting going on in high-performance storage though. If the most common question I was asked was "what's the future of storage for AI," the second most common question was "what do you think about VAST and WEKA?"

VAST & WEKA

Both companies seem to be doing something right since they were top of mind for a lot of conference attendees, and it probably grinds their respective gears that the field still groups them together in the same bucket of "interesting parallel storage systems that we should try out." Rather than throw my own opinion in the pot though (I work with and value both companies and their technologies!), I'll note the general sentiments I observed.

WEKA came into the week riding high on their big win as U2's official technology partner in September. Their big booth attraction was a popular Guitar Hero game and leaderboard, and an oversized Bono, presumably rocking out to how much he loves WEKA, presided over one of their seating areas:

Much of their marketing centered around accelerating AI and other GPU workloads, and the feedback I heard from the WEKA customers I bumped into during the week backed this up. One person shared that the WEKA client does a great job with otherwise difficult small-file workloads, which are particularly common in the life sciences, and this anecdote is supported by the appearance of a very fast WEKA cluster owned by MSK Cancer Center on the IO500 Production list. People also remarked on WEKA's need for dedicated CPU cores and local storage to deliver the highest performance; this, combined with its client scalability, lends itself well to smaller clusters of fat GPU nodes. I didn't run into anyone using WEKA in the cloud though, so I assume the feedback I gathered had a bias towards more conventional, on-prem styles of architecting storage for traditional HPC.

Whereas WEKA leaned into its rock 'n' roll theme this year, VAST doubled down on handing out the irresistibly tacky light-up cowboy hats they introduced last year (which I'm sure their neighbors at the DDN booth absolutely loved). They were all-in on promoting their new identity as a "data platform" this year, and although I didn't hear anyone refer to VAST as anything but a file system, I couldn't throw a rock without hitting someone who either recently bought a VAST system or tried one out.

Unlike last year though, customer sentiment around VAST wasn't all sunshine and rainbows, and I ran into a few customers who described their presales engagements as more formulaic than the white-glove treatment everyone seemed to be getting a year ago. This isn't surprising; there's no way to give all customers the same royal treatment as a business scales. But it does mean that the honeymoon period between VAST and the HPC industry is probably at an end, and they will have to spend the time between now and SC24 focusing on consistent execution to maintain the momentum they've gotten from the light-up cowboy hats.

The good news for VAST is that they've landed some major deals this past year, and they came to SC with customers and partners in-hand. They co-hosted a standing-room-only party with CoreWeave early in the week and shared a stage with Lambda at a customer breakfast, but they also highlighted two traditional, on-prem HPC customers (TACC and NREL) at the latter event.

VAST clearly isn't letting go of the on-prem HPC market as it also pursues partnerships with emerging GPU cloud service providers; this contrasted with WEKA's apparent focus on AI, GPUs, and the cloud. Time will tell which strategy (if either, or both) proves to be the better approach.

DAOS

Though commercial buyers were definitely most interested in VAST and WEKA, folks from the more sophisticated HPC shops around the world also tossed a few questions about DAOS my way this year.

I usually make it a point to attend the annual DAOS User Group meeting since it is always attended by all the top minds in high-performance I/O research, but I had to miss it this year on account of it running at the same time as my I/O tutorial. Fortunately, DAOS was pervasive throughout the conference, and there was no shortage of opportunity to find out what the latest news in the DAOS world was. For example, check out the lineup for PDSW 2023 this year:

Three out of thirteen talks were about DAOS, which is more than for any other single storage product or project. DAOS also won big at this year's IO500, taking the top two spots in the production storage system list:


In fact, DAOS underpinned every single new awardee this year, and DAOS is now the second most represented storage system on the list behind Lustre:

Why is DAOS at the top of so many people's minds this year? Well, DAOS reached a few major milestones in the past few months which have thrust it into the public eye.

First, Aurora is finally online and running jobs, and while the compute system is only running at half its capability, the full DAOS system (all 220 petabytes of it, all of which is TLC NVMe) is up and running--a testament to the scalability of DAOS that many parallel storage systems--including VAST and WEKA--have not publicly demonstrated. Because DAOS is open-source software and Aurora is an open-science system, all of DAOS' at-scale warts are also on full display to the community in a way that no competitive storage system besides Lustre is.

Second, Google Cloud cast a bold vote of confidence in DAOS by launching Parallelstore, its high-performance parallel file service based on DAOS, in August. Whereas AWS and Azure have bet on Lustre to fill the high-performance file gap (via FSx Lustre and Azure Managed Lustre), GCP has planted a stake in the ground by betting that DAOS will be the better foundation for a high-performance file service for HPC and AI workloads.

Parallelstore is still in private preview and details are scant, but GCP had DAOS and Parallelstore dignitaries at all the major storage sessions in the technical program to fill in the gaps. From what I gathered, Parallelstore is still in its early stages and is intended to be a fast scratch tier; it's using DRAM for metadata which means it relies on erasure coding across servers to avoid data loss on a single server reboot, and there's no way to recover data if the whole cluster goes down at once. This lack of durability makes it ineligible for the IO500 list right now, but the upcoming metadata-on-NVMe feature (which previews in upstream DAOS in 1H2024) will be the long-term solution to that limitation.

Finally, the third major bit of DAOS news was about the formation of the DAOS Foundation. First announced earlier this month, this initiative lives under the umbrella of the Linux Foundation and is led by its five founding members:

  • Argonne National Laboratory, who has a vested interest in seeing DAOS endure given its massive investment in it,
  • Enakta Labs, a company spun out of Croit, a German storage services company that was contributing feature development to DAOS,
  • Google Cloud, who has made a big bet on DAOS as the underpinnings for its Parallelstore service,
  • HPE, who has a shared fate with the DAOS installation at Argonne and who has also been contributing feature development, and
  • Intel, whose engineers largely developed DAOS as part of the Aurora program.

I see this handoff of DAOS from Intel to this new foundation as a positive change that makes DAOS a more stable long-term bet; should Intel choose to divest itself of DAOS once its obligations to the Aurora program end, DAOS now can live on without the community having to fork it. The DAOS Foundation is somewhat analogous to OpenSFS (one of the nonprofits backing Lustre) in that it is a vendor-neutral organization around which the DAOS community can gather.

But unlike OpenSFS, the DAOS Foundation will also assume the responsibility of releasing new versions of DAOS after Intel releases its final version (2.6) in March 2024. The DAOS Foundation will also steer feature prioritization, but seeing as how the DAOS Foundation doesn't fund developers directly, it's not clear that contributors like Intel or GCP are actually at the mercy of the foundation's decisions. It's more likely that the DAOS Foundation will just have authority to decide what features will roll up into the next formal DAOS release, and developers contributing code to DAOS will still prioritize whatever features their employers tell them to.

So, DAOS was the talk of the town at SC23. Does this all mean that DAOS is ready for prime time?

While Intel and Argonne may say yes, the community seems to have mixed feelings.  Consider this slide presented by László Szűcs from LRZ at the DAOS Storage Community BOF:

DAOS is clearly crazy fast and scales to hundreds of petabytes in production--Aurora's IO500 listing proves that. However, that performance comes with a lot of complexity that is currently being foisted on application developers, end-users, and system administrators. The "opportunities" listed in László's slide are choices that people running at leadership HPC scale may be comfortable making, but the average HPC user is not equipped to make many of these decisions or to make thoughtful choices about container types and library interfaces.

The fact that DAOS was featured so prominently at PDSW--a research workshop--probably underscores this as well. This slide from Adrian Jackson's lightning talk sums up the complexity along two different dimensions:

His results showed that your choice of DAOS object class and I/O library atop the DAOS POSIX interface can result in wildly different checkpoint bandwidth. It's hard enough to teach HPC users about getting optimal performance out of a parallel file system like Lustre; I can't imagine those same users will embrace the idea that they should be mindful of which object class they use as they generate data.

The other DAOS-related research talk, presented by Greg Eisenhauer, was a full-length paper that caught me by surprise and exposed how much performance varies when using different APIs into DAOS. This slide is one of many that highlighted this:

I naively thought that the choice of native userspace API (key-value or array) would have negligible effects on performance, but Eisenhauer's talk showed that this isn't true. The reality appears to be that, although DAOS is capable of handling unaligned writes better than Lustre, aligning arrays on large, power-of-two boundaries still has a significant performance benefit.

Based on these sorts of technical talks about DAOS presented this year, the original question--is DAOS ready for prime time--can't be answered with a simple yes or no yet.  The performance it offers is truly best in class, but achieving that performance doesn't come easily right now. Teams who are already putting heroic effort into solving high-value problems will probably leap at the opportunity to realize the I/O performance that DAOS can deliver. Such high-value problems include things like training the next generation of foundational LLMs, and GCP's bet on DAOS probably adds differentiated value to their platform as a place to train such models as efficiently as possible. But the complexity of DAOS at present probably limits its appeal to the highest echelons of leadership HPC and AI, and I think it'll be a while before DAOS is in a place where a typical summer intern will be able to appreciate its full value.

Infinia

It would be unfair of me to give all this regard to WEKA, VAST, and DAOS without also mentioning DDN's brand new Infinia product, launched right before SC23. Those in the HPC storage industry have been awaiting its launch for years now, but despite the anticipation, it really didn't come up in any conversations in which I was involved. I did learn that the engineering team developing Infinia inside DDN is completely separate from the Whamcloud team who is developing Lustre, but this could be a double-edged sword. On the good side, it means that open-source Lustre development effort isn't competing with DDN's proprietary product in engineering priorities on a day-to-day basis. On the bad side though, I still struggle to see how Infinia and Lustre can avoid eventually competing for the same business.

For the time being, Infinia does seem to prioritize more enterprisey features like multitenancy and hands-free operation while Lustre is squarely aimed at delivering maximum performance to a broadening range of workloads. Their paths may eventually cross, but that day is probably a long way off, and Lustre has the benefit of being deeply entrenched across the HPC industry.

The emergence of pure-play GPU clouds

In addition to chatting with people about what's new in storage, I also went into SC23 wanting to understand how other cloud service providers are structuring end-to-end solutions for large-scale AI workloads. What I didn't anticipate was how many smaller cloud service providers (CSPs) showed up to SC for the first time this year, all waving the banner of offering NVIDIA H100 GPUs. These are predominantly companies that either didn't exist a few years ago or have historically focused on commodity cloud services like virtual private servers and managed WordPress sites, so it was jarring to suddenly see them at an HPC conference. How did so many of these smaller CSPs suddenly become experts in deploying GPU-based supercomputers in the time between SC22 and SC23? 

I got to talking to a few folks at these smaller CSPs to figure out exactly what they were offering to customers, and their approach is quite different from how AWS, Azure, and GCP operate. Rather than defining a standard cluster architecture and deploying copies of it all over to be consumed by whoever is willing to pay, these smaller CSPs deploy clusters of whitebox GPU nodes to customer specification and sell them as dedicated resources for fixed terms. If a customer wants a bunch of HGX H100s interconnected with InfiniBand, that's what they get. If they want RoCE, the CSP will deploy that instead. And the same is true with storage: if a customer wants EXAScaler or Weka, they'll deploy that too.

While this is much closer to a traditional on-prem cluster deployment than a typical elastic, pay-as-you-go infrastructure-as-a-service offering, this is different from being a fancy colo. The end customer still consumes those GPUs as a cloud resource and never has to worry about the infrastructure that has to be deployed behind the curtain, and when the customer's contract term is up, their cluster is still owned by the CSP. As a result, the CSP can either resell that same infrastructure via pay-as-you-go or repurpose it for another dedicated customer. By owning the GPUs and selling them as a service, these CSPs can also do weird stuff like take out giant loans to build more data centers using GPUs as collateral. Meanwhile, NVIDIA can sell GPUs wholesale to these CSPs, book the revenue en masse, and let the CSPs deal with making sure they're maintained in production and well utilized.

It also seems like the services that customers of these smaller CSPs get are often more barebones than what they'd get from a Big 3 CSP (AWS, Azure, and GCP). They get big GPU nodes and an RDMA fabric, but managed services beyond that are hit and miss.

For example, one of these smaller CSPs told me that most of their storage is built on hundreds of petabytes of open-source Ceph. Ceph fulfills the minimum required storage services that any cloud must provide (object, block, and file), but it's generally insufficient for large-scale model training. As a result, all the smaller CSPs with whom I spoke said they are also actively exploring VAST and Weka as options for their growing GPU-based workloads. Since both VAST and Weka offer solid S3 and file interfaces, either could conceivably act as the underpinnings of these GPU clouds' first-party storage services as well.

As I said above though, it seems like the predominant model is for these CSPs to just ship whatever dedicated parallel storage the customer wants if something like Ceph isn't good enough. This, and the growing interest in storage from companies like VAST and Weka, suggest a few things:

  • Some of these CSPs have been obtaining and deploying GPUs faster than they've had time to think about the end-to-end experience, and customers have so much pent-up demand for GPUs that they're willing to either work with whatever third-party storage vendor is brought to the table or take on the responsibility of choosing their preferred storage vendor themselves.
  • Having giant piles of GPUs is necessary, but not sufficient, to have a competitive offering in the GPU cloud services landscape. A credible platform for AI training must also have an integrated high-performance storage service.
  • It is looking like many pure-play GPU clouds are finding it more cost-effective to buy their way out of high-performance storage problems through partnerships than build and manage their own services atop open-source software like Lustre or DAOS.

None of these observations are terribly surprising; at the price these smaller CSPs are offering GPUs compared to the Big 3 CSPs, their gross margin (and therefore their ability to invest in developing services on top of their IaaS offerings) has got to be pretty low. In the short term, it's cheaper and easier to deploy one-off high-performance storage systems alongside dedicated GPU clusters based on customer demand than develop and support a standard solution across all customers.

Of course, building a low-cost GPU service opens the doors for other companies to develop their own AI services on top of inexpensive GPU IaaS that is cost-competitive with the Big 3's native AI platforms (AWS SageMaker, Azure Machine Learning, and Google AI Platform). For example, I chatted with some folks at together.ai, a startup whose booth caught my eye with its bold claim of being "the fastest cloud for [generative] AI:"

Contrary to their banner, they aren't a cloud; rather, they provide AI services--think inferencing and fine-tuning--that are accessible through an API much like OpenAI's API. They've engineered their backend stack to be rapidly deployable on any cloud that provides basic IaaS like GPU-equipped VMs, and this allows them to actually run their computational backend on whatever cloud can offer the lowest-cost, no-frills GPU VMs. In a sense, companies like together.ai develop and sell the frills that these new GPU CSPs lack, establishing a symbiotic alternative to the vertically integrated AI platforms on bigger clouds.

I did ask a few of these smaller CSPs what their overall pitch was. Why would I choose GPU cloud X over its direct competitor, GPU cloud Y? The answers went in two directions:

  1. They offer lower cost per GPU hour than their competition
  2. They are faster to get GPUs off a truck and into production than their competition

There's a big caveat here: I didn't talk to many representatives at these CSPs, so my sample size was small and not authoritative. However, taking these value propositions at face value struck me as being quite precarious since their value is really a byproduct of severe GPU shortages driven by the hyped-up AI industry. What happens to these CSPs (and the symbionts whose businesses depend on them) when AMD GPUs appear on the market in volume? What happens if NVIDIA changes course and, instead of peanut-buttering its GPUs across CSPs of all sizes, it focuses its attention on prioritizing deliveries to just a few blessed CSPs?

There is no moat around generative AI, and I left SC23 feeling like there's a dearth of long-term value being generated by some of these smaller GPU CSPs. Not every CSP whose primary focus is buying and deploying as many GPUs as possible, as quickly as possible, can survive. They'll either come out of this GPU shortage having lost a lot of money building data centers that will go unused, or they'll be sold for parts.

More importantly to me though, I learned that I should give less credence to the splashy press events of hot AI-adjacent startups if their successes lie exclusively with smaller GPU CSPs. Some of these CSPs are paying to make their problems go away in an effort to keep their focus on racking and stacking GPUs in the short term, and I worry that there's a lack of long-term vision and strong opinions in some of these companies. Some of these smaller CSPs seem much more like coin-operated GPU cluster vending machines than platform providers, and that business model doesn't lend itself to making big bets and changing the industry.

Put another way, my job--both previous and current--has always been to think beyond short-term band aids and make sure that my employer has a clear and opinionated view of the technical approach that will be needed to address the challenges of HPC ten years in the future. I know who my peers are at the other Big 3 CSPs and leadership computing facilities across the world, and I know they're thinking hard about the same problems that I am. What worries me is that I do not know who my peers are at these smaller CSPs, and given their speed of growth and smaller margins, I worry that they aren't as prepared for the future as they will need to be. The AI industry as a whole will be better off when GPUs are no longer in such short supply, but the ecosystem surrounding some of these smaller GPU CSPs is going to take some damage when that day comes.

Other dribs and drabs

I also had a lot of interesting conversations and noticed a few subtle themes last week that don't neatly fit into any other category, but I'd love to hear more from others if they noticed the same or have more informed opinions.

APUs and superchips - are they really that useful?

Because I spent my booth duty standing next to one of Eagle's 8-way HGX H100 nodes, a lot of people asked me if I thought the Grace Hopper superchip would be interesting. I'm not an expert in either GPUs or AI, but I did catch up with a few colleagues who are smarter than me in this space last week, and here's the story as I understand it:

The Grace Hopper superchip (let's just call it GH100) is an evolution of the architecture developed for Summit, where V100 GPUs were cache-coherent with the CPUs through a special widget that converted NVLink to the on-chip coherence protocol for Power9. With GH100, the protocol used to maintain coherence across the CPU is directly compatible with the ARM AMBA coherence protocol, eliminating one bump in the path that Power9+V100 had. Grace also has a much more capable memory subsystem and NOC that makes accessing host memory from the GPU more beneficial.

Now, do AI workloads really need 72 cores per H100 GPU? Probably not.

What AI (and HPC) will need are some high-performance cores to handle all the parts of application execution that GPUs are bad at--divergent code paths, pointer chasing, and I/O. Putting capable CPU cores (Neoverse V2, not the N2 used in CPUs like Microsoft's new Cobalt 100) on a capable NOC that is connected to the GPU memory subsystem at 900 GB/s opens doors for using hierarchical memory to train LLMs in clever ways.

For example, naively training an LLM whose weights and activations are evenly scattered across both host memory and GPU memory won't go well since that 900 GB/s of NVLink C2C would be on the critical path of many computations. However, techniques like activation checkpointing could become a lot more versatile when the cost of offloading certain tensors from GPU memory is so much lower. In essence, the presence of easily accessible host memory will likely allow GPU memory to be used more efficiently since the time required to transfer tensors into and out of HBM is easier to hide underneath other computational steps during training.

Pairing an over-specified Grace CPU with a Hopper GPU also allows the rate of GPU development to proceed independently of CPU development. Even if workloads that saturate an H100 GPU might not also need all 72 cores of the Grace CPU, H200 or other future-generation GPUs can grow into the capabilities of Grace without having to rev the entire superchip.

I didn't get a chance to talk to any of my colleagues at AMD to get their perspective on the MI300 APU, but I'd imagine their story is a bit simpler since their memory space is flatter than NVIDIA's superchip design. This will make training some models undoubtedly more straightforward but perhaps leave less room for sophisticated optimizations that can otherwise cram more of a model into a given capacity of HBM. I'm no expert though, and I'd be happy to reference any explanations that real experts can offer! 

What about quantum?

Quantum computing has been a hot topic for many years of SC now, but it feels like a topic that is finally making its way out of pure CS research and into the minds of the everyday HPC facility leaders. I talked to several people last week who asked me for my opinion on quantum computing because they have come to the realization that they need to know more about it than they do, and I have to confess, I'm in the same boat as they are. I don't follow quantum computing advancements very closely, but I know an increasing number of people who do--and they're the sort who work in CTOs' offices and have to worry about risks and opportunities more than intellectual curiosities.

It's hard to say there've been any seismic shifts in the state of the art in quantum computing at SC23; as best I can tell, there's still a rich ecosystem of venture capital-backed startups who keep cranking out more qubits. But this year felt like the first year where HPC facilities who haven't yet started thinking about their position on quantum computing are now behind. Not everyone needs a quantum computer, and not everyone even needs a quantum computing researcher on staff. But everyone should be prepared with a strong point of view if they are asked "what will you be doing with quantum computing?" by a funding agency or chief executive.

NextSilicon

One of the least-stealthy stealth-mode startups in the HPC industry has been NextSilicon, a company that officially debuted from stealth at SC23, launched their new Maverick accelerator, and announced their first big win with Sandia National Laboratories' Vanguard II project.

What's notable about NextSilicon is that, unlike just about every other accelerator startup out there, they are not trying to go head-to-head with NVIDIA in the AI acceleration market. Rather, they've created a dataflow accelerator that aims to accelerate challenging HPC workloads that GPUs are particularly bad at--things like irregular algorithms and sparse data structures. They've paired this hardware with a magical runtime that continually optimizes the way the computational kernel is mapped to the accelerator's reconfigurable units to progressively improve the throughput of the accelerator as the application is running.

The concept of dataflow accelerators has always been intriguing since they're really the only alternative for improving computational throughput besides making larger and larger vectors. The challenge has always been that these accelerators are more like FPGAs than general-purpose processors, and they require similar amounts of hardcore CS expertise to use well. NextSilicon claims to have cracked that nut with their runtime, and it seems like they're hiring the right sorts of people--real HPC practitioners with respectable pedigrees--to make sure their accelerator can really deliver value to HPC workloads.

I/O benchmarking developments

At the IO500 BOF, there was rich discussion about adding new benchmarking modes to IOR and IO500 to represent a wider range of patterns.

More specifically, there's been an ongoing conversation about including a 4K random read test, and it sounds like its most outspoken critics have finally softened their stance. I've not been shy about why I think using IOPS as a measure of file system performance is dumb, but 4K random IOPS do establish a lower bound on the performance a real application might experience. Seeing as how IO500 has always been problematic as a representation of how a file system will perform in real-world environments, adding the option to run a completely synthetic, worst-case workload will give IO500 the ability to define a complete bounding box around the lower and upper limits of a file system's I/O performance.
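To make "completely synthetic, worst-case" concrete, here's a toy single-client sketch of what a 4K random read test boils down to. The file path and duration are made up, and a real benchmark would use many processes, O_DIRECT, and a file far larger than any client cache:

```python
# Toy 4 KiB random-read loop; reports single-threaded IOPS against one file
import os, random, time

PATH = "/mnt/fs/testfile"        # hypothetical large file on the file system under test
BLOCK = 4096                     # 4 KiB per read
DURATION = 10                    # seconds to run

fd = os.open(PATH, os.O_RDONLY)
blocks = os.fstat(fd).st_size // BLOCK
ios, start = 0, time.time()
while time.time() - start < DURATION:
    os.pread(fd, BLOCK, random.randrange(blocks) * BLOCK)   # aligned random offset
    ios += 1
os.close(fd)
print(f"{ios / DURATION:.0f} read IOPS")
```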

Hendrik Nolte from GWDG also proposed a few new and appealing IOR modes that approach more realistic workload scenarios.  The first was a new locally random mode where data is randomized within IOR segments but segments are repeated:

Compared to globally randomized reads (which is what IOR normally does), this is a much closer representation of parallel workloads that are not bulk-synchronous; for example, NCBI BLAST uses thread pools and work sharing to walk through files, and the resulting I/O pattern is similar to this new mode.
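My reading of that mode, sketched below with made-up segment and block counts: offsets are shuffled within each segment, but the segments themselves are still walked in order, so a thread-pool-style reader sees locality at the segment level. Treat this as my interpretation of the proposal rather than the actual IOR implementation:

```python
# Globally random vs. "locally random" block offsets for a file of
# n_segments * blocks_per_segment blocks
import random

def globally_random(n_segments, blocks_per_segment):
    offsets = list(range(n_segments * blocks_per_segment))
    random.shuffle(offsets)                    # randomized across the whole file
    return offsets

def locally_random(n_segments, blocks_per_segment):
    offsets = []
    for seg in range(n_segments):              # segments visited in order
        block = list(range(seg * blocks_per_segment, (seg + 1) * blocks_per_segment))
        random.shuffle(block)                  # randomness confined to one segment
        offsets.extend(block)
    return offsets

print(globally_random(3, 4))    # e.g. [7, 2, 10, 0, 5, ...]
print(locally_random(3, 4))     # e.g. [2, 0, 3, 1, 6, 4, 7, 5, 9, 11, 8, 10]
```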

He also described a proposal to run concurrent, mixed workloads in a fashion similar to how fio currently works.  Instead of performing a bulk-synchronous parallel write followed by a bulk-synchronous parallel read, his proposal would allow IOR to perform reads and writes concurrently, more accurately reflecting the state of multitenant storage systems. I actually wrote a framework to do exactly this and quantify the effects of contention using IOR and elbencho, but I left the world of research before I could get it published. I'm glad to see others seeing value in pursuing this idea.

The other noteworthy development in I/O benchmarking was presented by Sven Breuner at the Analyzing Parallel I/O BOF where he described a new netbench mode for his excellent elbencho benchmark tool. This netbench mode behaves similarly to iperf in that it is a network-level throughput test, but because it is part of elbencho, it can generate the high-bandwidth incasts and broadcasts that are typically encountered between clients and servers of parallel storage systems:

This is an amazing development because it makes elbencho a one-stop shop for debugging the entire data path of a parallel storage system. For example, if you're trying to figure out why the end-to-end performance of a file system is below expectation, you can use elbencho to test the network layer, the object or file layer, the block layer, and the overall end-to-end path separately to find out which layer is underperforming. Some file systems include specialized tools to perform the same network tests (e.g., nsdperf for IBM Spectrum Scale), but elbencho now offers a nice generic way to generate these network patterns for any parallel storage system.

Some personal thoughts

As with last year, I couldn't attend most of the technical program due to a packed schedule of customer briefings and partner meetings, but the SC23 Digital Experience was excellently done, and I wound up watching a lot of the content I missed during the mornings and after the conference (at 2x speed!). In that sense, the hybrid nature of the conference is making it easier to attend as someone who has to juggle business interests with technical interests; while I can't jump into public arguments about the definition of storage "QOS", I can still tell that my old friends and colleagues are still fighting the good fight and challenging conventional thinking across the technical program.

My Parallel I/O in Practice tutorial

This was the sixth year that I co-presented the Parallel I/O in Practice tutorial with my colleagues Rob Latham, Rob Ross, and Brent Welch. A conference photographer got this great photo of me in the act:

Presenting this tutorial is always an incredibly gratifying experience; I've found that sharing what I know is one of the most fulfilling ways I can spend my time, and being able to start my week in such an energizing way is what sustains the sleep deprivation that always follows. Giving the tutorial is also an interesting window into what the next generation of I/O experts is worrying about; for example, we got a lot of questions and engagement around the low-level hardware content in our morning half, and the I/O benchmarking material in the late afternoon seemed particularly well received. The majority of attendees came from the systems side rather than the user/dev side as well, perhaps suggesting that the growth in demand for parallel storage systems (and experts to run them) is outstripping the demand for new ways to perform parallel I/O. Guessing wildly, perhaps this means new developers are coming into the field higher up the stack, using frameworks like fsspec that abstract away low-level I/O.

Since I've jumped over to working in industry, it's been hard to find the business justification to keep putting work hours into the tutorial despite how much I enjoy it.  I have to confess that I didn't have time to update any of the slides I presented this year even though the world of parallel I/O has not remained the same, and I am going to have to figure out how to better balance these sorts of community contributions with the demands of a day job in the coming years.

An aside on COVID safety

At SC22, I fastidiously wore a KN95 mask while indoors and avoided all after-hours events and indoor dining to minimize my risk of catching COVID. At that time, neither my wife nor I had ever gotten COVID before, and I had no desire to bring it home to my family since my father died of COVID-related respiratory failure two years prior. Staying fully masked at SC22 turned out to be a great decision at the time since a significant number of other attendees, including many I spoke with, contracted COVID at SC22. By comparison, I maintained my COVID-free streak through 2022.

This year I took a more risk-tolerant approach for two reasons:

  1. My wife and I both broke our streaks this past summer and contracted COVID while on vacation, so if I got sick, we knew what to expect, and
  2. I got my gazillionth COVID and flu shots in October in anticipation of attending SC.

Part of my approach to managing risk was bringing my trusty Aranet4 CO2 sensor with me so that I could be aware of areas where air circulation was poor and the risk of contracting an airborne illness would be higher. I only wore a KN95 at the airport gates and while on the airplane at SC23, and despite going all-in on after-hours events, indoor dining, and copious meetings and tours of booth duty, I'm happy to report that I made it through the conference without getting sick.

I have no doubt that being vaccinated helped, as I've had several people tell me they tested positive for COVID after we had dinner together in Denver. But it's also notable that the Denver Convention Center had much better ventilation than Kay Bailey Hutchison Convention Center in Dallas where SC22 was held last year. To show this quantitatively, let's compare air quality measurements from SC22 to SC23.

My schedule for the day on which I give my tutorial is always the same: the tutorial runs from 8:30am to 5:00pm with breaks at 10:00, 12:00, and 3:00. Because of this consistent schedule, comparing the CO2 readings (which are a proxy for re-breathed air) for my tutorial day at SC22 versus SC23 shows how different the air quality was in the two conference centers. Here's what that comparison looks like:

What the plot shows is that CO2 (re-breathed air) steadily increased at the start of the tutorial at both SC22 and SC23, but Denver's convention center kicked on fresh air ventilation after an hour while Dallas simply didn't. Air quality remained poor (over 1,000 ppm CO2) throughout the day in Dallas, whereas Denver was pretty fresh (below 700 ppm) even during the breaks and the indoor luncheon. This relatively good air circulation inside the convention center at SC23 made me much more comfortable about going maskless throughout the week.

This isn't to say that I felt there was no risk of getting sick this year; there was at least one busy, upscale restaurant/bar in which I dined where the air circulation was no better than in a car or airplane. For folks who just don't want to risk being sick over Thanksgiving, wearing a mask and avoiding crowded bars was probably still the best option this year. And fortunately, Denver's weather was gorgeous, so outdoor dining was completely viable during the week.

AI's effects on the HPC community

Although AI has played a prominent role in previous SC conferences, this was the first year where I noticed that the AI industry is bleeding into the HPC community in weird ways.

For example, I had a bunch of journalists and media types accost me and start asking rather pointed questions while I was on booth duty. Talking to journalists isn't entirely unusual since I've always been supportive of industry press, but the social contract between practitioners like me and journalists has always been pretty formal--scheduling a call in advance, being invited to speak at an event, and things like that have long been the norm. If I was being interviewed on the record, I knew it.

This year though, it seemed like there was a new generation of younger journalists who approached me no differently than a casual booth visitor would. Some did introduce themselves as members of the press after we got chatting (good), but others did not (not good), which taught me a lesson: check names and affiliations before chatting with strangers, because the days when I could assume that all booth visitors would act in good faith are gone.

Now, why the sudden change?  I can think of three possible reasons:

  1. I'm getting older, and there are now tech industry journalists who are younger than me and think I am worth talking to since I've always been around. Maybe the old-school HPC folks that predate me have always had to deal with this.
  2. The proliferation of platforms like Substack make it financially viable to be an independent journalist, and conversely, anyone can be a journalist without editorial oversight.
  3. The spotlight on the massive AI industry is also illuminating the HPC industry. HPC and AI are both built on the same foundational technologies (GPUs, RDMA fabrics, HBM, and the like) so AI journalists now have a reason to start showing up at HPC community events.

It'd be fair to argue that #3 is a stretch and that this isn't an AI phenomenon if not for the fact that I was also accosted by a few venture capitalists for the first time this year. HPC has never been an industry that attracted the attention of venture capital in the way that AI does, so I have to assume being asked specific questions about the viability of some startup's technology is a direct result of the AI market opportunity.

While it's nice to have a broader community of attendees and more media coverage, the increasing presence of AI-focused media and VC types in the SC community means I can't be as open and honest as I once was. Working for a corporation (with secrets of its own to protect) doesn't help there either, so maybe getting cagier when talking to strangers is just a part of growing up.

SC23 as a milestone year

Attending SC23 this year coincided with two personal milestones for me as well.

This is the tenth year I've been in the HPC business, and the first SC I ever attended was SC13.  I can't say that this is my eleventh SC because I didn't attend in 2014 (on account of working at a biotech startup), but I've been to SC13, SC15 through SC19, SC20 and SC21 virtually, and SC22 and SC23 in-person.  At SC13 ten years ago, the weather was a lot colder:

But I still have the fondest memories of that conference because that was the week where I felt like I had finally found my community after having spent a decade as an unhappy materials science student.

SC23 is also a milestone year because it may be the last SC I attend as a storage and I/O guy. I recently signed on for a new position within Microsoft to help architect the next generation of supercomputers for AI, and I'll probably have to trade in the time I used to spend at workshops like PDSW for opportunities to follow the latest advancements in large-scale model training, RDMA fabrics, and accelerators. But I think I am OK with that.

I never intended to become an I/O or storage expert when I first showed up at SC13; it wasn't until I joined NERSC that I found that I could learn and contribute the most by focusing on storage problems. The world has changed since then, and now that I'm at Microsoft, it seems like the problems faced at the cutting edge of large language models, generative AI, and the pursuit of AGI are where the greatest need lies. As I said earlier in this post, AI has bigger problems to deal with than storage and I/O, and those bigger problems are what I'll be chasing. With any luck, I'll be able to say I had a hand in designing the supercomputers that Microsoft builds after Eagle. And as has been true for my last ten years in this business, I'll keep sharing whatever I learn with whoever wants to know.

A closer look at "training" a trillion-parameter model on Frontier


A paper titled "Optimizing Distributed Training on Frontier for Large Language Models" has been making its rounds over the last few weeks with sensational taglines saying the authors trained a trillion-parameter model using only a fraction of the Frontier supercomputer. The superficiality of the discourse around this paper seemed suspicious to me, so in the interests of embracing my new job in AI systems design, I decided to sit down with the manuscript and figure out exactly what the authors did myself.

As a caveat, I am by no means an expert in AI, and I relied on my friend ChatGPT to read the paper with me and answer questions I had along the way. It is from that perspective that I compiled the notes that follow, and I'm sharing them in the event that there are other folks like me who are interested in understanding how large-scale training maps to HPC resources but don't understand all the AI jargon.

Before getting too far into the weeds, let's be clear about what this study did and didn't do. Buried in the introduction is the real abstract:

"So, we performed a feasibility study of running [Megatron-DeepSpeed] on Frontier, ported the framework to Frontier to identify the limiting issues and opportunities, and prepared a training workflow with an optimized AI software stack."

"...our objective was not to train these models to completion for the purpose of achieving the highest possible accuracy. Instead, our approach was centered around understanding and enhancing the performance characteristics of training processes on HPC systems."

To spell this out:

  • The authors did not train a trillion-parameter model. They ran some data through a trillion-parameter model to measure training throughput, but the model wasn't trained at the end of it.
  • It's worth repeating - they did not train a trillion-parameter model! All the articles and headlines that said they did are written by people who either don't understand AI or didn't read the paper!
  • The authors did not create a novel trillion-parameter model at all. This paper wasn't about a new model. There is no comparison to GPT, Llama, or any other leading LLM.
  • The authors present a nice overview of existing parallelization approaches for training LLMs. For each approach, they also describe what aspect of the HPC system affects scalability.
  • The authors ported a very good LLM training framework from NVIDIA to AMD GPUs. This is a strong validation that all the investment in LLM training for NVIDIA also applies to AMD.
  • The authors present a good recipe for training LLMs on hundreds or thousands of GPUs. They tested their approach on transformers with up to a trillion parameters to show that their recipe scales.

This isn't a paper about a new trillion-parameter model. Rather, it is an engineering paper describing how the authors took:

  • existing parallelization techniques (data, tensor, and pipeline parallelism)
  • an existing training framework that implements those techniques (Megatron-DeepSpeed)
  • an existing model architecture that can be made arbitrarily large (a generic, GPT-style transformer)
  • existing GPU frameworks, libraries, and packages (CUDA, ROCm, PyTorch, APEX, DeepHyper)

and combined them all to work together, then showed that their approach scales up to at least a trillion parameters and at least a few thousand GPUs.

This paper is also a pretty good crash course on how LLM partitioning strategies translate into HPC system requirements. Let's focus on this latter point first.

Data requirements

Training dataset size

The paper starts with a few interesting nuggets about the data requirements for training large language models.  For example, the introduction states:

"Some studies also reported the loss scaling law, which states that an LLM model can keep learning from data up to 20x-200x of its parameter count [1, 6, 7]."

What this line doesn't state is that 20x-200x refers to the number of tokens in the overall training data you can train on before the LLM stops improving. Given that a typical token in an English-language body of data is somewhere between 3 bytes and 4 bytes, we can get a ballpark estimate for how much training data you'd need to train a trillion-parameter model:

  • On the low end, 1 trillion parameters * 20 tokens of training data per parameter * 3 bytes per token = 60 terabytes of tokenized data
  • On the high end, 1 trillion parameters * 200 tokens of training data per parameter * 4 bytes per token = 800 terabytes of tokenized data

Bear in mind that tokenized data are stored as numbers, not text. 60 TB of tokenized data may correspond to petabytes of raw input text.
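If you want to poke at those assumptions yourself, the arithmetic behind the two bounds is trivial to reproduce (the parameter count, tokens-per-parameter, and bytes-per-token below are just the figures quoted above):

```python
# Back-of-envelope estimate of tokenized training data for a 1T-parameter model
params = 1e12
for tokens_per_param, bytes_per_token in [(20, 3), (200, 4)]:
    tokens = params * tokens_per_param
    terabytes = tokens * bytes_per_token / 1e12
    print(f"{tokens_per_param}x tokens at {bytes_per_token} B/token = {terabytes:,.0f} TB")
# 20x tokens at 3 B/token = 60 TB
# 200x tokens at 4 B/token = 800 TB
```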

Computational power required

The introduction also contains this anecdote:

“A rough estimate [11] tells us that training a Trillion parameter model on 1-30 Trillion tokens will require [...] 6 - 180 Million exa-flops (floating point operations).”

The authors rightly point out that this estimate is rough; the actual requirements are a function of exactly how the LLM's layers are composed (that is, how those trillion parameters are distributed throughout the model architecture), the precisions being used to compute, the choice of hyperparameters, and other factors. That said, this establishes a good ballpark for calculating either the number of GPUs or the amount of time you need to train a trillion-parameter model.

The paper implicitly states that each MI250X GPU (or more pedantically, each GCD) delivers 190.5 teraflops. If 

  • 6 to 180,000,000 exaflops are required to train such a model
  • there are 1,000,000 teraflops per exaflop
  • a single AMD GPU can deliver 190.5 teraflops, or 190.5 × 10¹² ops per second

then a single AMD GPU would take between

  • 6,000,000,000,000 TFlop / (190.5 TFlops per GPU) = about 1,000 years
  • 180,000,000,000,000 TFlop / (190.5 TFlops per GPU) = about 30,000 years

This paper used a maximum of 3,072 GPUs, which would (again, very roughly) bring this time down to between 120 days and 9.8 years to train a trillion-parameter model, which is a lot more tractable. If all 75,264 GPUs on Frontier were used instead, these numbers come down to about 4.8 days and 145 days to train a trillion-parameter model.

To be clear, this performance model is suuuuuper sus, and I admittedly didn't read the source paper that described where this 6-180 million exaflops equation came from to critique exactly what assumptions it's making. But this gives you an idea of the scale (tens of thousands of GPUs) and time (weeks to months) required to train trillion-parameter models to convergence. And from my limited personal experience, weeks-to-months sounds about right for these high-end LLMs.
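For anyone who wants to poke at the same napkin math, here's a sketch that assumes perfect scaling efficiency and the paper's 190.5 TFLOPS-per-GCD figure; the 6 and 180 million exaflop endpoints come straight from the estimate quoted above:

```python
# Time-to-train estimate: total flops / (flops per GPU * number of GPUs)
SECONDS_PER_YEAR = 365 * 24 * 3600
FLOPS_PER_GPU = 190.5e12                    # one MI250X GCD, per the paper

for total_exaflops in (6e6, 180e6):         # 6 to 180 million exaflops
    total_flop = total_exaflops * 1e18
    for ngpus in (1, 3072, 75264):          # 1 GCD, the paper's max, all of Frontier
        seconds = total_flop / (FLOPS_PER_GPU * ngpus)
        print(f"{total_exaflops/1e6:>4.0f}M EF on {ngpus:>6} GPUs: "
              f"{seconds/86400:>10.1f} days ({seconds/SECONDS_PER_YEAR:>9.2f} years)")
```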

GPU memory required

The limiting factor for training LLMs on GPUs these days is almost always HBM capacity; effective training requires that the entire model (all trillion parameters) fit into GPU memory. The relationship between GPU memory and model parameter count used in this paper is stated:

"training a trillion parameter model requires 24 Terabytes of memory."

This implies that you need 24 bytes (192 bits) of memory per parameter.  The authors partially break these 24 bytes down into:

  • a 16-bit (2-byte) weight
  • a 32-bit (4-byte) gradient
  • a 32-bit (4-byte) copy of the weight
  • a 32-bit (4-byte) momentum (the optimizer state)

That's only 14 of the 24 bytes though, and the authors don't explain what the rest is. That said, other papers (like the ZeRO-DP paper) have a similar number (16 bytes per parameter) and spell out the requirements as:

  • a 16-bit (2-byte) weight
  • a 16-bit (2-byte) gradient
  • a 32-bit (4-byte) copy of the weight for the optimizer reduction
  • a 32-bit (4-byte) momentum (one part of the optimizer state)
  • a 32-bit (4-byte) variance (the other part of the optimizer state)

Of course, this is all subject to change as models begin to adopt 8-bit data types. The story also changes if you use a different optimizer (the above "optimizer state" components are required by the Adam optimizer), and storing models for inferencing can collapse this down much further since most of these per-parameter quantities are used only during training.

Back to a trillion-parameter model, 24 bytes per parameter would require 24 terabytes of GPU memory, and 16 bytes per parameter would require 16 terabytes of GPU memory. On Frontier, each GPU (well, each GCD) has 64 GB of HBM, meaning you'd need to distribute the model's parameters over at least 256 to 384 GPUs to get the required 16 to 24 TB of HBM required to train one copy of a trillion-parameter model.  Of course, training requires other stuff be stored in GPU memory as well, so the actual amount of GPU memory and GPUs would be higher.
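The division behind those numbers is equally simple; the raw quotients come out to 250 and 375 GCDs, which the rounder 256-to-384 range above reflects:

```python
# HBM needed to hold training state for a 1T-parameter model, and how many
# 64 GB Frontier GCDs that implies (ignoring activations and other overheads)
import math

params = 1e12
hbm_per_gcd = 64e9                                   # bytes of HBM per GCD

for bytes_per_param in (16, 24):
    total_bytes = params * bytes_per_param
    gcds = math.ceil(total_bytes / hbm_per_gcd)
    print(f"{bytes_per_param} B/param -> {total_bytes/1e12:.0f} TB -> at least {gcds} GCDs")
# 16 B/param -> 16 TB -> at least 250 GCDs
# 24 B/param -> 24 TB -> at least 375 GCDs
```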

LLMs and data structures

At its core, this paper describes how you can distribute this 24 TB model over 256 to 384 GPUs in a way that minimizes data transfer during the training process. To understand the different approaches to partitioning a model, we have to first understand the basic parts of an LLM that must be broken up.

Defining features of an LLM

The paper has a good overview of the transformer architecture, but it details aspects of LLMs that aren't relevant to the work done in the paper itself, which tripped me up. The authors used a decoder-only, GPT-style architecture for their trillion-parameter model, so even though transformers can have encoders and decoders as shown in their figures, all discussion of encoders can be ignored.

That said, let's talk about the parts of these GPT-style (decoder-only) transformer LLMs. Such transformers are comprised of repeating layers.

Confusingly, a layer of a transformer is not the same thing as a layer in other types of neural networks. Rather, a transformer layer is a repeating block that generally has two sub-components: a multi-head attention block and a feed-forward neural network.

The multi-head attention block is what receives input, and its job is to determine which parts of that input should get the most focus. It's called "multi-head" because one head establishes focus on one part of the input, and using multiple heads allows multiple areas of focus to be determined in parallel. The output of this attention block encodes information about how different parts of the input are related or depend on each other. Some places refer to a "masked" attention block; this masking is like telling the attention block that it shouldn't try reading an input sentence backwards to derive meaning from it (called "causal self-attention"). Without this masking, inputs are read forward, backward, and in every order in between (establishing "non-causal self-attention").

The feed-forward neural network takes the output of the attention block and runs it through what I think of as a simple multilayer perceptron with a single hidden layer. Note that this feed-forward neural network (FFNN) hidden layer is different than the transformer layer; each transformer layer contains a FFNN, so we have layers within layers here (confusing!).  ChatGPT tells me that the FFNN helps establish more complex patterns in the attention block's output.

There's some massaging and math going on between these two components as well, but this outlines the basics of a decoder-only transformer. In practice, you connect a bunch of these transformer layers in series, resulting in a model that, at a very high level, looks like this:

Conceptual diagram of a decoder-only, GPT-style transformer

The more transformer layers you use, the more parameters your model will have. It's conceptually very easy to make arbitrarily huge, trillion-parameter models as a result.
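If it helps to see those two sub-components as code, here is a bare-bones numpy sketch of one decoder layer: a single attention head, no layer norms or residual connections, and made-up sizes, so it's an illustration of the shapes rather than anything you'd actually train:

```python
# One decoder-only transformer layer: causal self-attention + a 4d-wide FFNN
import numpy as np

d, seq_len = 64, 8                        # hidden dimension, tokens in the sequence
rng = np.random.default_rng(0)

WK, WQ, WV = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))  # attention weights
W1 = rng.standard_normal((d, 4 * d)) * 0.02                          # FFNN: d -> 4d
W2 = rng.standard_normal((4 * d, d)) * 0.02                          # FFNN: 4d -> d

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(x):                 # x has shape (seq_len, d)
    K, Q, V = x @ WK, x @ WQ, x @ WV      # project input into keys, queries, values
    scores = Q @ K.T / np.sqrt(d)         # how strongly each token attends to the others
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    attn = softmax(scores + mask) @ V     # causal mask: no peeking at later tokens
    hidden = np.maximum(attn @ W1, 0)     # FFNN hidden layer, 4d wide, ReLU
    return hidden @ W2                    # back down to width d for the next layer

x = rng.standard_normal((seq_len, d))
print(transformer_layer(x).shape)         # (8, 64): same shape out as in
```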

Relating parameters to model architecture

The model weights (parameters) of an LLM are contained entirely within the attention block and feed-forward neural network of each layer, and the paper lays them all out.

The multi-head attention block has three sets of weight matrices: keys, queries, and values. These matrices have the same x and y dimension (i.e., they're square d × d matrices), and the size of this dimension d (called the hidden dimension) is pretty arbitrary. If you make it bigger, your model gets more parameters. So the number of parameters in each transformer layer's attention block is 3d².

The feed-forward neural network is a perceptron with a single hidden layer that's typically got 4d features (neurons). Since it takes its input directly from the attention block (which outputs d values) and outputs into the next transformer layer's attention block (which receives d values), the FFNN is comprised of three layers:

Conceptual diagram of the feed-forward neural network within a transformer layer

The parameters (weights) describe the interactions between the layers, resulting in this FFNN having two matrices containing parameters:

  1. The weights of the connections between the input layer and the hidden layer are a d × 4d matrix
  2. The weights of the connections between the hidden layer and the output layer are a 4d × d matrix

So, the number of parameters in each transformer layer's FFNN block is 4d² + 4d², or 8d². This four-to-one ratio of the hidden layer seems arbitrary, but it also seems pretty standard.

The total number of parameters for a single transformer layer is thus 11d² (3d² from the attention block and 4d² + 4d² from the FFNN). To make a bigger model, either increase the hidden dimension size d or stack more transformer layers (or both!).

The paper points out that "width" and "depth" are the terms used to describe these two dimensions:

"LLMs are transformer models whose shapes are determined linearly by the depth (number of layers) and quadratically by the width (hidden dimension)."

A wider model has a higher d, and a deeper model has more transformer layers. Understanding this is important, because parallelizing the training process happens along these two dimensions of width and depth.
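As a sanity check on that bookkeeping, here's the 11d² arithmetic with a hypothetical width and depth (numbers I made up, not the configuration used in the paper):

```python
# Parameters per transformer layer: 3*d^2 (attention) + 8*d^2 (FFNN) = 11*d^2
def params_per_layer(d):
    return 3 * d * d + 8 * d * d

width, depth = 25_600, 128                        # hypothetical hidden dim and layer count
total = depth * params_per_layer(width)
print(f"{total / 1e12:.2f} trillion parameters")  # ~0.92 trillion
# Widen d or stack more layers (or both) to push past a trillion.
```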

Distributing LLMs across GPUs

The paper goes on to describe three strategies for distributing a model across multiple GPUs:

  • Tensor parallelism
  • Pipeline parallelism
  • Sharded data parallelism

Oddly, the authors never describe regular (non-sharded) data parallelism even though they use it in the study. Perhaps they viewed it as too obvious to describe, but figuring out a data-parallel approach is an essential aspect of scaling out LLM training, so I'll provide my own interpretation of it below.

Following the format of the paper, let's talk about model partitioning from finest-grained to coarsest-grained parallelism.

Tensor parallelism

Tensor parallelism breaks up a model on a per-tensor (per-matrix) basis; in our depth-and-width parlance, tensor parallelism parallelizes along the width of the model.

The paper uses the notation WK, WQ, WV, W1, and W2 to denote the keys, queries, and values matrices and the two FFNN parameter matrices; these are what get partitioned and computed upon in parallel.  There's a diagram in the paper (Figure 3) which describes the attention half of this process, but it also shows a bunch of stuff that is never described, which added to my confusion.  To the best of my knowledge, this is what the tensor-parallel computation for an entire attention + FFNN transformer layer looks like:

  1. The input matrix going into the transformer layer is chopped up and distributed across GPUs.
  2. Each GPU computes a portion of the attention matrices (WK, WQ, and WV) in parallel.
  3. A global reduction is performed to create a single matrix that is output by the attention block. This is an expensive collective.
  4. The resulting matrix is then chopped up and redistributed across the GPUs.
  5. Each GPU then uses this chopped-up matrix to compute a portion of the FFNN parameter matrices (W1 and W2).
  6. Another global reduction is performed to create a single matrix that is squirted out of this layer of the transformer for the next layer to start processing.

I'm leaving out a lot of small transformations that occur between each step, but the high-level point is that tensor parallelism requires a significant number of collectives within each layer to distribute and recombine the parameter matrices.
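To illustrate the partition-compute-reduce pattern, here is a minimal sketch that simulates the GPUs as numpy array shards on a single host; real implementations do this with distributed all-reduce collectives (e.g., over NCCL or RCCL), and the dimensions here are made up for brevity.

```python
import numpy as np

# Minimal sketch of the column-parallel/row-parallel matrix multiplies behind
# tensor parallelism, simulated on one host with numpy shards standing in for
# GPUs. Real frameworks use distributed all-reduce collectives instead.
rng = np.random.default_rng(0)
n_gpus, tokens, d = 4, 8, 16

x = rng.standard_normal((tokens, d))       # activations entering the FFNN
w1 = rng.standard_normal((d, 4 * d))       # input -> hidden weights
w2 = rng.standard_normal((4 * d, d))       # hidden -> output weights

# Shard W1 by columns and W2 by rows so each "GPU" holds 1/n of each matrix.
w1_shards = np.split(w1, n_gpus, axis=1)
w2_shards = np.split(w2, n_gpus, axis=0)

# Each GPU computes its slice of the hidden layer (the elementwise activation
# can be applied locally), then its partial contribution to the output.
partials = [np.maximum(x @ w1_i, 0) @ w2_i
            for w1_i, w2_i in zip(w1_shards, w2_shards)]

# The global reduction (an all-reduce in practice) recombines the partials.
y = sum(partials)
assert np.allclose(y, np.maximum(x @ w1, 0) @ w2)
```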

In addition, the above steps only describe the forward pass through each transformer layer. Once data has finished flowing through all layers in the forward pass, gradients must be calculated, and the backward pass must occur. This means more repartitioning of matrices and global reductions to synchronize gradients for each layer.

The communication demands of tensor parallelism are the reason why NVLink (and Infinity Fabric) exists; the extreme bandwidths (hundreds of gigabytes per second) between GPUs are required to keep the communication overheads of tensor parallelism low enough to prevent the GPUs from stalling out. You effectively cannot implement tensor parallelism outside of a pool of GPUs interconnected with NVLink; conversely, the sum of all the HBM connected to a single NVLink pool limits the size of the LLM with which tensor parallelism can be used. If your model has too many parameters, you can't fit them all in a single NVLink domain with tensor parallelism alone.

The paper shows measurements to back this up; GPU throughput is halved as soon as tensors are split across multiple Infinity Fabric coherence domains on Frontier.

Pipeline parallelism

Pipeline parallelism (or layer parallelism) is a completely different approach. Whereas tensor parallelism partitions the model along the width dimension, pipeline parallelism partitions along the depth dimension. The transformer layers are partitioned and distributed across GPUs, and as data flows through the transformer's layers, it also moves through GPUs. The process goes something like this:

  1. The entire LLM is chopped up into partitions such that each partition has multiple consecutive transformer layers (entire attention blocks + feed-forward neural networks). These partitions are distributed across GPUs. For example, a twelve-layer LLM distributed over four GPUs would have layers 0-2 on GPU0, 3-5 on GPU1, 6-8 on GPU2, and 9-11 on GPU3.
  2. A minibatch of training data is chopped up finely into micro-batches.
  3. Micro-batches are fed into the pipeline of layers for the forward pass.
    1. Once GPU0 has passed the first micro-batch through its layers, GPU1 and its layers begin processing the data.
    2. At the same time, GPU0 can now begin processing the second micro-batch.
    3. GPU0 and GPU1 should finish at the same time since they both share equal fractions of the overall transformer. Output from GPU1 moves to GPU2, output from GPU0 moves to GPU1, and a third micro-batch is fed to GPU0.
  4. When a micro-batch reaches the end of the pipeline and exits the last layer of the transformer LLM, its gradients are calculated, and it begins the backward pass.
  5. Once the last micro-batch in a minibatch has completed its backward pass, a global reduction is performed to synchronize gradients across the entire model. Model weights are then updated using these gradients.

The communication is much less intensive than tensor parallelism because most of it occurs when a GPU hands its last layer's output matrices off to another GPU to use as inputs to its layers. The costly global synchronizations only happen after a bunch of micro-batches have been processed. That said, these global synchronizations introduce bubbles in training, since GPUs sit idle (1) while they wait for the first micro-batch to arrive and (2) after their last micro-batch leaves. Intuitively, the relative impact of this bubble increases as transformer layers are split over more GPUs, and optimal pipeline parallelism involves running as many micro-batches as you can through the pipeline before synchronizing to minimize the impact of the utilization bubble.

Like with tensor parallelism, the paper shows quantitative results that back up this qualitatively intuitive scaling trend.

If my description of pipeline parallelism and bubbles doesn't make sense without pictures, check out the PipeDream paper which introduced the above process.
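For a rough feel of how the bubble scales, here is a quick sketch using the widely cited GPipe-style estimate of the idle fraction, (p − 1)/(m + p − 1), for p pipeline stages and m micro-batches per minibatch; the stage and micro-batch counts below are hypothetical.

```python
# Back-of-the-envelope sketch of the pipeline "bubble": with p pipeline stages
# and m micro-batches per minibatch, the GPipe-style estimate of the idle
# fraction is (p - 1) / (m + p - 1). The counts below are hypothetical.
def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

for stages in (4, 8, 16):
    for micro_batches in (8, 32, 128):
        frac = bubble_fraction(stages, micro_batches)
        print(f"{stages:2d} stages, {micro_batches:3d} micro-batches: "
              f"{frac:5.1%} idle")
```

More stages make the bubble worse; more micro-batches per minibatch make it better, which matches the intuition above.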

Boring old normal data parallelism

Data-parallel training is the easiest way to scale out model training and it is well understood.  I strongly recommend reading Simon Boehm's post on the topic to understand its communication requirements and scalability, but in brief,

  1. Each GPU gets a complete replica of an entire model.
  2. A batch of training data is chopped up into minibatches, and each GPU (and each model replica) gets one minibatch.
  3. Each GPU runs its minibatch through the forward pass of the model. Losses are calculated.
  4. Each GPU begins the backward pass. After each layer finishes calculating its gradients, it kicks off a nonblocking global synchronization to accumulate all the gradients for that layer.
  5. After all GPUs have completed their backward passes, those gradients are also all collected and used to update model parameters, then the process repeats.

The paper doesn't describe this process and doesn't include any tests of its scalability, but this is a well-known partitioning strategy whose scalability is well documented across the Internet.
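As a minimal illustration of steps 3 through 5, here is a sketch that simulates data-parallel replicas as numpy arrays on one host; the "model" is a single linear layer with a squared-error loss, purely for illustration, and in practice the gradient averaging is an all-reduce issued per layer during the backward pass.

```python
import numpy as np

# Minimal sketch of data-parallel training, simulating the replicas as numpy
# arrays on one host. The "model" is a single linear layer with a squared-error
# loss; in practice the gradient averaging below is an all-reduce (e.g., over
# NCCL/RCCL) issued per layer during the backward pass.
rng = np.random.default_rng(1)
n_replicas, batch, d = 4, 32, 16
w = rng.standard_normal((d, 1))                   # every replica holds the same w

x = rng.standard_normal((n_replicas, batch, d))   # one minibatch per replica
y = rng.standard_normal((n_replicas, batch, 1))

# Each replica computes its own gradient from its own minibatch...
grads = [2 * x[i].T @ (x[i] @ w - y[i]) / batch for i in range(n_replicas)]

# ...then the gradients are averaged across replicas (the all-reduce) so that
# every replica applies the identical weight update and stays in sync.
w -= 1e-3 * (sum(grads) / n_replicas)
```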

Sharded-data parallelism

The paper describes sharded-data parallelism only briefly, and it doesn't do any sort of scalability measurements with it as was done for tensor and pipeline parallelism. However, it's a clever way to emulate the process of data parallelism in a way that is very memory-efficient on GPUs, allowing larger models to fit on fewer GPUs. It goes something like this:

  1. Every layer of the model is chopped up equally and distributed across all GPUs such that every GPU has a piece of every layer. This is similar to tensor parallelism's partitioning strategy.
  2. The batch of training data is chopped up into minibatches, and each GPU gets a minibatch. This is similar to boring data parallelism.
  3. To begin the forward pass, all GPUs perform a collective to gather all the pieces of the first layer which were distributed across all GPUs in step #1. This rehydrates a complete replica of only that first layer across all GPUs.
  4. All GPUs process their minibatch through the first layer, then throw away all of the pieces of that first layer that they don't own.
  5. All GPUs collectively rehydrate the next layer, process it, and so on. I don't see a reason why all GPUs must synchronously process the same layer, so my guess is that each GPU can share its pieces of each layer asynchronously with respect to whichever layer it is currently computing.

This process keeps going through all layers for the forward pass, losses are calculated, and then the backward pass is performed in a similar rehydrate-one-layer-at-a-time way. As with boring data parallelism, gradients are accumulated as each layer is processed in the backward pass.

This approach has the same effect as boring data parallelism because it ultimately chops up and trains on minibatches in the same way. However, it uses much less GPU memory since each GPU only has to store a complete replica of one layer instead of all layers. This allows larger models to fit in fewer GPUs in exchange for the increased communication required to rehydrate layers.

On the one hand, this increases the number of collective communications happening, but on the other, it reduces the size of the domain over which these collectives occur. I can see this being useful for designing fabrics that have tapering, since you can fit larger models into smaller high-bandwidth domains.
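Here is a minimal sketch of the rehydrate-one-layer-at-a-time idea (the style popularized by ZeRO-3/FSDP), again simulating GPUs with numpy on one host; it covers only the forward pass, and all sizes are made up for illustration.

```python
import numpy as np

# Each "GPU" owns one row shard of every layer's weights. Before computing a
# layer, the shards are gathered into a full matrix (an all-gather in a real
# implementation), used by every replica, and then dropped again.
rng = np.random.default_rng(2)
n_gpus, n_layers, d = 4, 3, 16

layer_shards = [np.split(rng.standard_normal((d, d)), n_gpus, axis=0)
                for _ in range(n_layers)]

def forward(per_gpu_minibatches):
    outs = list(per_gpu_minibatches)          # one minibatch per replica
    for shards in layer_shards:
        w = np.concatenate(shards, axis=0)    # rehydrate this layer everywhere
        outs = [np.maximum(o @ w, 0) for o in outs]
        del w                                 # discard the pieces not owned
    return outs

minibatches = [rng.standard_normal((8, d)) for _ in range(n_gpus)]
outputs = forward(minibatches)                # forward pass only, for brevity
```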

3D parallelism

The paper makes reference to "3D parallelism" which is really just combining the above partitioning schemes to improve scalability, and in reality, all massive models are trained using a combination of two or more of these approaches.  For example,

  • You might implement tensor parallelism within a single node so that tensors are distributed over eight GPUs interconnected by NVLink or Infinity Fabric.
  • You might implement pipeline parallelism across all nodes connected to the same network switch.
  • You might implement data parallelism across those switch-level groups of nodes, with each group holding one model replica.

In the above example, tensor parallelism would work exactly as I described earlier. Pipeline parallelism would involve passing matrices between entire nodes instead of individual GPUs as micro-batches made their way through the forward and backward passes. Data parallelism would have models being replicated over groups of nodes instead of individual GPUs.
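As a quick sanity check on how the three degrees of parallelism multiply together, here is a trivial sketch; the degrees below are hypothetical and not the configuration used in the paper.

```python
# The three degrees of parallelism multiply to give the total GPU count.
# These numbers are hypothetical, not the configuration used in the paper.
tensor_parallel = 8        # GPUs per NVLink/Infinity Fabric domain (one node)
pipeline_parallel = 12     # nodes that together hold one model replica
data_parallel = 16         # number of model replicas training in parallel

gpus_per_replica = tensor_parallel * pipeline_parallel
total_gpus = gpus_per_replica * data_parallel
print(f"{gpus_per_replica} GPUs per model replica, {total_gpus} GPUs in total")
```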

So what did they actually do?

The interesting parts of the paper begin in Section IV, where the authors describe using DeepHyper, a tool they developed in 2018, to perform sensitivity analysis on different model partitioning strategies. Their Figure 9 is where much of the money is, and they find that when combining tensor parallelism, pipeline parallelism, and data parallelism:

  1. Choice of micro-batch size is the most important factor for throughput. Intuitively, this makes sense; getting this wrong introduces bubbles into the training pipeline where GPUs idle for a time that's linearly proportional to the micro-batch size.
  2. Choice of tensor partitioning is second-most important. Again, not surprising since tensor parallelism is very communication-intensive. Interestingly, the authors did not test the sensitivity of this partitioning strategy outside of a single high-bandwidth coherence domain (8 GPUs interconnected with Infinity Fabric) since I presume they knew that would go poorly.
  3. Choice of layer partitioning is third-most important. This makes sense, as pipeline parallelism isn't as communication-intensive as tensor parallelism.
  4. Number of nodes follows. They say number of nodes, but my interpretation is that this is really the degree of data parallelism that results from the choice of pipeline and tensor partitioning. Since they used a fixed number of GPUs for this entire parameter sweep, the way they tested didn't really give much room to test the sensitivity to data partitioning as the total degree of parallelism increased. Combined with the fact that data parallelism is the least communication-intensive way to partition training, this low sensitivity isn't surprising.
  5. Using sharded-data parallelism is least impactful. Although this does introduce additional communication overheads (to reduce the GPU memory required to train), they used the least aggressive form of sharded-data parallelism and only distributed a subset of the matrices (the ones containing optimizer states). My guess is that this only saves a little memory and introduces a little extra communication, so the net effect is that it makes little difference on training throughput.

Based on these findings, they propose a very sensible recipe to use when training massive models: use lots of micro-batches, don't partition tensors across high-bandwidth NVLink/Infinity Fabric domains, and use optimized algorithms wherever possible.

Section V then talks about how they applied this formula to actually run training of a trillion-parameter decoder-only LLM on 3,072 MI250X GPUs (384 nodes) on Frontier. The section is rather short because they didn't run the training for very long. Instead, they just ran long enough to get steady-state measurements of their GPU utilization (how many FLOPS they processed) to show that their approach accomplished the goal of avoiding extensive GPU stalling due to communication.

What didn't they do?

They didn't say anything about storage or data challenges.

Why?

Their brief discussion of roofline analysis says why:

"For these models, our achieved FLOPS were 38.38% and 36.14%, and arithmetic intensities of 180+. The memory bandwidth roof and the peak flops-rate roof meet close to the arithmetic intensity of 1."

Training LLMs is ridiculously FLOPS-intensive; every byte moved is accompanied by over 100x more floating-point operations. This provides a lot of opportunity to asynchronously move data while the GPUs are spinning.

Going back to the start of this post, remember we estimated that a trillion-parameter model might have 80 TB to 800 TB of training data but take months to years to train. The time it takes to move 80 TB of data into hundreds of GPU nodes' local SSDs pales in comparison to the time required to train the model, yet that data transfer time is only incurred once.

But what about re-reading data after each epoch, you ask?

You don't re-read training data from its source at each epoch because that's really slow. You can simply statically partition batches across the SSDs in each node belonging to a model replica and re-read them in random order between epochs, or you can shuffle data between replicas as needed. The time it takes to do these shuffles of relatively small amounts of tokenized training data is not the biggest hurdle when trying to keep GPU utilization high during LLM training.
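For a rough sense of scale, here is the kind of back-of-the-envelope arithmetic behind that claim; the node count matches the Frontier run described above, while the per-node ingest bandwidth is an assumption for illustration, not a measurement.

```python
# Back-of-the-envelope estimate of how long it takes to stage a tokenized
# training dataset onto node-local SSDs. The node count matches the Frontier
# run described above; the per-node ingest bandwidth is an assumption.
dataset_tb = 80                  # low end of the earlier training-data estimate
nodes = 384
per_node_ingest_gb_per_s = 2.0   # assumed sustained ingest per node

seconds = (dataset_tb * 1e12) / (nodes * per_node_ingest_gb_per_s * 1e9)
print(f"~{seconds / 60:.0f} minutes to stage {dataset_tb} TB across {nodes} nodes,"
      " versus weeks or months of training")
```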

How valuable is this work?

How novel was the sensitivity analysis? To the AI industry, the answer is "not very." AI practitioners already have an intuitive sense of how the different approaches to parallelization scale since each approach was developed to overcome a previous scalability limitation.  Everyone training LLMs in industry is taking this general approach already; they just don't write it down since the industry tends to care more about the outcome of training (a fancy new model) than the mechanical approach taken. That said, I'm sure AI practitioners will find comfort in knowing that the bright minds at Oak Ridge couldn't find a better way to do what was already done, and now this process is documented in a way that can be easily handed off to new hires.

Relatedly, the scientific community will likely benefit from seeing this recipe spelled out, as it's far easier to get access to a large number of GPUs in the open science community than it is in private industry. I could easily see an ambitious graduate student wanting to train a novel LLM at scale, having a healthy allocation on Frontier, and accidentally training in a way that leaves the GPUs idle for 90% of the time.

DeepHyper also sounds like a handy tool for figuring out the exact partitioning within each model layer, across model layers, and across the training dataset during scale-up testing. Regardless of whether it's training an AI model or running a massive simulation, the work required to figure out the optimal way to launch the full-scale job is tedious, and the paper shows that DeepHyper helps short-circuit a lot of the trial-and-error that is usually required.

How impactful was the demonstration of training a trillion-parameter LLM on Frontier? I'd say "very."

Sure, running training on a generic, decoder-only, trillion-parameter LLM by itself isn't new or novel; for example, a 30-trillion-parameter model went through a similar process using only 512 Volta GPUs back in 2021. However, to date, these hero demonstrations have exclusively run on NVIDIA GPUs. What this study really shows is that you can train massive LLMs with good efficiency using an entirely NVIDIA-free supercomputer:

  1. AMD GPUs are on the same footing as NVIDIA GPUs for training.
  2. Cray Slingshot in a dragonfly is just as capable as NVIDIA InfiniBand in a fat tree.
  3. NVIDIA's software ecosystem, while far ahead of AMD's in many regards, isn't a moat since the authors could port DeepSpeed to their AMD environment.
  4. All of the algorithmic research towards scaling out training, while done using NVIDIA technologies, is transferable to other high-throughput computing technology stacks.

What's more, the fact that the authors largely used existing software like Megatron-DeepSpeed instead of creating their own speaks to how straightforward it is to get started on AMD GPUs. No heroic effort in software engineering or algorithmic development seemed to be required, and after reading this paper, I felt like you don't need an army of researchers at a national lab to make productive use of AMD GPUs for training huge LLMs.

With luck, the impact of this paper (and the work that will undoubtedly follow on Frontier) is that there will be credible competition in the AI infrastructure space. Though it might not relieve the supply constraints that make getting NVIDIA GPUs difficult, you might not need to wait in NVIDIA's line if you're primarily interested in LLM training. Frontier represents a credible alternative architecture, based on AMD GPUs and Slingshot, that can get the job done.

On the road to Hamburg for ISC'24


I have the great fortune of being able to attend the 2024 ISC High Performance conference in Hamburg next week, where I will be both speaking and listening throughout the contributed program and the workshop day. As I plan out where I want to be throughout the conference, I'll be updating this post as my personal, living agenda.

My contributions

There are a few events I signed up to help with this year, so I can guarantee I'll be at the following sessions.

ISC'24 Research Paper Session: Power, Energy, and Performance

I volunteered to chair this session of the technical program where three papers will be presented on topics related to increasing the efficiency of high-performance computing workloads:

  1. Power Consumption Trends in Supercomputers: A Study of NERSC's Cori and Perlmutter Machines, by Ermal Rrapaj et al. This paper was authored by my former group at NERSC based on power telemetry from systems that I helped design and optimize.
  2. EcoFreq: Compute with Cheaper, Cleaner Energy via Carbon-aware Power Scaling, by Kozlov et al. This paper presents a framework for scheduling workloads to run at times when energy is greener (e.g., when the sun is shining or the wind is blowing) to reduce the carbon impact of those jobs. Paying attention to the energy mix available when a job is running is becoming increasingly important in my own work, so I'm looking forward to hearing how this is being thought about in Europe.
  3. BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations, by Rich Graham et al. This paper details how the programmable offload capabilities of NVIDIA's BlueField smart NIC can be used to accelerate some collectives. Although not the focus of their paper, offloading computation to lower-power, purpose-built devices such as smart NICs can decrease the overall energy cost of running a workload. This paper is also stacked with notable authors.

Artificial Intelligence and Machine Learning for HPC Workload Analysis BOF

I will be presenting a talk titled "AI and ML for workload analysis in the public cloud" at this BOF. I want to present two halves to HPC workload analysis in the cloud environment:

  1. the traditional perspective, where you profile nodes and applications to understand what they are doing, and
  2. the cloud perspective, where trying to observe what an application is doing is a violation of customer privacy.

I'll show how Azure offers powerful ML tools to make #1 easy, then talk about some of the challenges we face with #2 and how we address them so that we know just enough about our customers' workloads to keep delivering new products that work well for them while preserving the safety of our infrastructure at scale.

HPC I/O in the Data Center Workshop

I will be co-presenting an expert talk titled "Debunking the I/O Myths of Large Language Model Training" along with two of my esteemed colleagues from VAST Data, Kartik and Sven Breuner. This talk resulted from a conversation that Kartik and I had after he wrote an insightful blog post on how the I/O demands of training large language models are extremely overblown. I gave a presentation to this effect at the HDF5 BOF at SC'23 last year, and Kartik did one better by developing a beautiful model that allows you to calculate exactly how much I/O performance you need to train a large language model like Llama or GPT.

I liked his model so much that I thought we should share it with the world since I had built a spiritually similar model for sizing all-flash Lustre file systems and presented it at HPC-IODC a few years ago. Kartik and Sven will present the model, and I will provide some perspective on how the most demanding I/O patterns of LLM training can be further optimized by using more sophisticated hierarchical checkpointing.

Workshop on Interactive and Urgent High-Performance Computing

I was invited to present a lightning talk about how Azure can support interactive and urgent high-performance computing workloads. I don't have a crisp title yet, but I am thinking of defying expectation and not presenting the same old boring "you can burst to the cloud if you need compute nodes for interactive or urgent computing!" story. Not only is that dull, but it's a narrative that's not always realistic for true HPC cloud resources because of how one builds clusters with RDMA fabrics like InfiniBand. I'll abandon the myth that the cloud is limitless, and instead focus on the interesting convergence I'm seeing between two unexpected forms of computation: e-commerce and tightly coupled, high-performance computing.

To make the talk a little concrete, I'll also try to find time to describe a specific workflow architecture, built on Azure services, that connects high-performance computing capabilities with the low-latency service infrastructure used to support things like distributed web services. The marriage of these two worlds allows scientists to run workflows that use genuine HPC resources (like InfiniBand-connected compute nodes) but get the same level of critical infrastructure support that businesses rely on for their most time-critical, throughput-sensitive, urgent workloads like Black Friday sales.

This talk is either going to be really interesting or make absolutely no sense!

Other sessions of note

I served on this year's program committee and reviewed six papers. Sadly, only one of the papers I reviewed was accepted, but it was an interesting study around DAOS performance across its multiple APIs that revealed some warts and gotchas. It is called Optimizing Metadata Exchange: Leveraging DAOS for ADIOS Metadata I/O and will be presented on Wednesday at 1:00pm. It's worth checking out, as it's the first published study that shows that DAOS still suffers from some of the complexities of parallel file systems when used in certain ways.

The Lustre BOF is also worth checking out on Tuesday at 1:00pm. I will be there as an attendee, but my colleague Brian Barbisch will be participating as a speaker. He is one of the technical leaders of the Lustre development team behind Azure Managed Lustre and a smart yet grounded engineer.

ISC’24 recap


I had the great pleasure of attending the ISC High Performance conference this month, marking the fifth time I've attended what has become one of my top must-attend industry conferences of the year. This year was particularly meaningful to me because it is the first time that:

  1. I attended ISC as a Microsoft employee. This is also the first time I've attended any HPC conference since I changed my focus from storage into AI infrastructure.
  2. I attended ISC in-person since before the pandemic. It's also the first time I've visited Hamburg which turned out to be an absolute delight.

Although registrations have been lower since the pandemic, this year's final registration count was over 3,400 attendees, and there was no shortage of old and new colleagues to bump into walking between the sessions at the beautiful Congress Center Hamburg.

Exterior of CCH with ISC banners flying

This year's theme was "Reinvent HPC," and that idea—that HPC needs to reinvent itself—was pervasive throughout the program. The whole industry had been pulling towards exascale for the better part of a decade, and now that there are two exaflop systems on Top500 and the dust is settling, it feels like everyone is struggling to figure out what’s next. Is it quantum? AI?

It was difficult for me to draw a line through all the topics worth reviewing at this year's ISC, as it was a very dense four days packed with a variety of topics, discussions, vendors, and events. I only experienced a fraction of everything there was to be seen since so many interesting sessions overlapped, but I thought it might be worthwhile to share my perspective of the conference and encourage others to do the same.

Reinventing HPC (and blast those hyperscalers!)

The need to reinvent HPC was the prevailing theme of the conference from the very first session; with the listing of Aurora as the second system on Top500 to break the 1 exaflops barrier, the community is in search of a new milestone to drive research (and funding!). At the same time, commercial AI has rapidly risen up largely in an independent, parallel effort with a speed and scale that begs the question: how important was the decade-long drive to break the exaflops barrier if the AI industry could catch up so quickly without the help of the institutions that have historically posted the top HPL scores? If the commercial AI industry overtakes scientific computing as the world leader in deploying at scale, how can “HPC” be reinvented so it can continue to claim leadership in another dimension?

Kathy Yelick's opening keynote

ISC’s opening keynote was given by Kathy Yelick, who provided commentary on two recent government-commissioned reports on the future of HPC:

  1. Charting a Path in a Shifting Technical and Geopolitical Landscape: Post-Exascale Computing for the National Nuclear Security Administration, commissioned by the National Academies
  2. Can the United States Maintain Its Leadership in High-Performance Computing?, commissioned by the US Department of Energy’s Advanced Scientific Computing Research program

Living up to her reputation, Dr. Yelick’s talk was fast and insightful, describing the insatiable demand for computing driven by scientific research, the struggle to expose ever-increasing amounts of parallelism to make use of newer processors, and some promising directions to address that disconnect. However, her talk started in a direction that I didn’t like when she went into describing the disruptors that necessitate reinventing HPC:

Three disruptors to HPC: AI, quantum, and cloud

The above slide implied that AI, quantum, or cloud may pose an existential threat to the HPC community gathered at ISC this year; this immediately raised my hackles, as it cast the relationship between “HPC” and “AI”/“cloud” as having some sort of adversarial tension. As the talk went on, I realized that “HPC” didn’t really mean “high-performance computing” to her. Rather, it was used to refer to something much more narrowly scoped—high-performance computing to solve scientific problems. Slide after slide, the presentation kept doubling down on this idea that “HPC” as the audience knows it is being threatened. For example, Yelick talked through this slide:

Market cap of traditional HPC vendors vs. hyperscalers

The picture she painted is that “HPC” (denoted by companies with blue bars) no longer has influence over technology providers because the “hyperscalers” (green bars) have such an outsized amount of investment. She then used this to call on the audience to think about ways “we” could influence “them” to produce technologies that are useful for both scientific computing and low-precision AI workloads.

Her talk culminated in this slide:

Post-exascale strategy is "us" versus "them."

Which was accompanied by this conclusion:

"So what’s a post-exascale strategic for the scientific community? It's the beat 'em or join 'em strategy. The beat 'em strategy says we’re going to design our own processors. [...] The join 'em strategy says let's leverage the AI hardware that's out there. [...] The sort of sneaky way of doing this is getting embedded in the AI community and trying to convince them that in order to make AI better for commercial AI applications, you really want to have certain features. Like don't throw away your 64-bit arithmetic and things like that."

I found myself getting increasingly unsettled through the keynote, because this "us versus them" mentality put me, a long-standing member of this HPC community, in the camp of "them." It was as if I was suddenly an outsider in a conference that I've been attending for years just because I no longer work for an organization that has been doing HPC since the early days of computing. Even though the clusters I support use the same NVIDIA and AMD GPUs, the same InfiniBand fabrics, and the same Lustre file systems that "HPC" uses, I am no longer in "HPC" because I am "hyperscale" or "cloud" or "AI."

The underlying message is one I get; GPUs are trending in a direction that favors massive gains in lower-precision computation over FP64 performance. And the cost of HBM is driving the overall value (in FP64 FLOPS per dollar) of accelerators backwards for the first time in the history of scientific computing. But the thesis that the scientific computing community needs to be sneaky to influence the hyperscale or AI players seemed way off the mark to me. What seemed absent was the recognition that many of the "hyperscalers" are her former coworkers and remain her colleagues, and "they" sit in the same audiences at the same conferences and share the same stages as the "HPC" community. All that is true because "HPC" is not somehow different than "cloud" or "AI" or "hyperscale." If there really is a desire to influence the hyperscale and AI industry, the first step should be to internalize that there is no "us" and "them."

Closing keynotes on the future

Just as the conference was opened with a talk about this "us versus them" mentality, it was closed with a talk about "us versus them" in a keynote session titled, "Reinventing HPC with Specialized Architectures and New Applications Workflows" which had two speakers followed by Q&A.

Chiplets for modular HPC

John Shalf gave one half of the closing keynote, where he gave his usual rally for investments in chiplets and specialized processors for HPC:

John Shalf's birds slide calling for processor specialization

He gives a variant of this talk at every ISC, but this year he lasered in on the notion that the "HPC" community needs to do what the "hyperscalers" do and use chiplets to develop custom ASICs. It was an energetic and impassioned talk, but the claim that hyperscalers are already executing on his idea for the future sounded a little funny to me, seeing as how I now work for one of those hyperscalers and his message didn't resonate.

John Shalf's concluding thoughts were off the mark

If you really follow the money, as Shalf suggested, a huge amount of it is flowing into GPUs, not specialized processors. It wasn't clear to me what specialization he was thinking of when he referred to custom silicon being developed by the likes of Meta, Google, AWS, and Microsoft; it's true that these companies are developing their own silicon, but those efforts are largely addressing cost, risk, and supply, not improving performance beyond more general-purpose silicon like GPUs. And it turns out that a significant fraction of the (non-US) HPC community is already developing custom silicon for the same reasons as the hyperscalers; Japan, China, and Europe are all developing their own indigenous processors or accelerators for scientific computing at leadership scales. In that sense, Shalf was preaching to the choir given that, on the international stage, his government is the odd one out of the custom silicon game.

He also suggested a dichotomy where the HPC community would either have to just (1) make every scientific problem an AI problem or (2) join this journey towards making domain-specific accelerators, ignoring the significant, unexplored runway offered by using mixed precision arithmetic in scientific applications. He called for partnering with hyperscalers, but his examples of implementing a RISC-V-based stencil accelerator and a SambaNova-based DFT processor didn't draw a clear line to the core missions of the large hyperscalers he extolled. He briefly said that partnering would benefit hyperscalers by addressing some capital cost challenges, but seeing as how the annual capital expenditures of the hyperscalers outstrips those of the US national HPC effort by orders of magnitude, I couldn't understand what the hyperscalers would stand to gain by partnering in this way.

Integrating HPC, AI, and workflows

Rosa Badia gave the second half of the closing keynote where she proposed ideas around complex scientific workflows and the novel requirements to support them. This talk felt a lot more familiar, as the focus was squarely on solving scientific computing challenges by connecting traditional HPC resources together in nontraditional ways using software whose focus goes beyond cranking out floating point arithmetic.

As she spoke, I couldn't help but see parallels between the challenges she presented and the sort of technologies we live and breathe every day in cloud services.  For example, she showed this slide:

Rosa Badia's HPC workflows as a service slide

Dr. Badia obviously wanted to make a cloud tie-in by calling this "HPC Workflows as a Service," but what I'm not sure she realized is that this model almost exactly describes platform-as-a-service frameworks that already exist in commercial clouds. For example,

  • What she calls a "Data Catalog" is a public or private object storage account (a blob container, an S3 bucket) or a PaaS abstraction built atop them
  • What she calls a "Software Catalog" is a container registry (Azure Container Registry, Amazon Elastic Container Registry) or an abstraction built atop them
  • A "Workflow Description" is something like an AzureML pipeline or SageMaker pipeline
  • A "Workflow Registry" is just a Github repository containing pipelines
  • The "Portal" is the web UI provided by AzureML or SageMaker

I don't think there's anything truly new here; the challenges she described lie in wedging these workflows into HPC infrastructure which lacks the platform features like robust identity and access management (i.e., something better than LDAP that supports more modern authentication and authorization flows and finer-grained access controls) and data management (i.e., something better than a parallel file system that depends on POSIX users, groups, and permissions and implicit trust of clients).

She went on to describe a workflow data management system that reinvented a bunch of infrastructure that is already baked into commercial cloud object stores like Azure Blob and AWS S3:

Rosa Badia's slide on data management for workflows

As she was describing the requirements for such a workflow data management layer, it struck me that what the scientific data community calls "FAIR principles" are the same basic requirements for operating in commercial environments where data may be subject to strict privacy and compliance regulations. The notion of findable data may be aspirational for scientific datasets, but when a company is having to find datasets because it's being sued or subpoenaed, findability is a bare-minimum requirement for any data management system. Similarly, tracking the provenance of data may be a nice-to-have for scientific data, but it is a hard requirement when establishing a secure software supply chain. Cloud storage systems solved many of these challenges a long time ago, and I can't help but wonder if this idea that workflows in HPC pose a new set of challenges is another manifestation of "us" not realizing "they" might have done something useful and applicable for science.

Badia's final slide had a particularly poignant statement which read, "Systems can only be justified if we have applications that need them." I think she was trying to call for more investment in application development to exploit new systems, but I think the inverse is also true. If modern scientific applications truly require more complex orchestration of compute and data, maybe the scientific computing community should stop building computing platforms that make it really difficult to integrate different systems.

Again, "HPC" is not the opposite of "cloud;" it's not an either/or decision. There are technologies and tools that were designed from the beginning to simplify the secure connection of services and resources; they just weren't invented by the HPC community.

Top500 and Aurora

One of the cornerstones of ISC is the semiannual release of the Top500 list, and unlike at SC, the Top500 announcements and awards do not overlap with any other sessions, so it tends to have a higher profile and draw all attendees. This go-around, there were no dramatic changes in the Top 10; the new Alps system at CSCS was the only new entry, and the order of the top five systems remained the same. Notably, though, Aurora posted a significantly higher score than at SC'23 and broke through the exaflops barrier using 87% of the system, cementing its place as the second exascale system listed. But let's start at the top.

#1 - Frontier

Frontier at Oak Ridge remained #1, but it squeezed twelve more petaflops out of the same node count and is now just over 1.2 EF. Nothing groundbreaking, but it's clear evidence that ORNL is continuing to tune the performance of Frontier at full system scale.

#2 - Aurora

Aurora, on the other hand, finally eked over the exaflops line with 1.012 EF using 87% of the system's total 63,744 GPUs. Rick Stevens gave a short talk about the achievement which is summed up on this slide:

Rick Stevens' slide summarizing their latest Top500 runs

I was a little surprised by how honest Stevens was in this talk; the typical game that is played is that you stand up on stage, talk about how great of a partnership you had with your partners to realize this achievement, extol the virtues of the technologies on which your system was built, and talk about how this HPL score is just the start of a lot of great science.

Stevens didn't do that though.

He started out by telling the conference that Intel had bad product names, then explained that their low Graph500 and HPCG scores were the result of their exclusive focus on breaking the exaflops barrier with HPL, implying they didn't have time or ability to run Graph500 or HPCG at the same 87%-89% scale as their HPL and HPL-MxP runs. Based on this, it sounds like Aurora is still a ways away from being stable at scale, and we're unlikely to see any Gordon Bell-nominated papers at SC'24 this November.

After this session, folks seemed to relish in dunking on Aurora; its window to be #1 is likely to have closed and it has some power efficiency issues. But I don't think anyone involved with the Aurora project needs to be told that; if what Stevens implied is true, the folks at ALCF, Intel, and HPE have been struggling for a long time now, and topping out over 10^18 was a hard-sought, major milestone to be celebrated. The Aurora project has been thrown more curveballs than I would have ever guessed a single HPC project could have, so all parties deserve credit for sticking it through all this way rather than just walking away. With any luck, Aurora will stabilize in the next six months, and we'll see full-scale runs of Top500, Graph500, HPCG, and science apps by November.

#3 - Eagle

The third highest system on the list was Eagle, whose HPL score was not updated since the system was first listed at SC'23 last year. Through a few twists of fate, I wound up being the person who accepted the award on-stage, and I now have a Top500 award for the #3 system sitting in my home office. Here's a photo of me goofing around with it:

It's not entirely inappropriate that I was the one to accept it since my teammates are the ones carrying pagers for the on-call rotation of that system, and we were also the hands-on-keyboard when that HPL run was conducted. Still, it was a bit surreal to walk on-stage to pick up such a noteworthy award immediately following two actually important people (both of whom have "director" in their titles) accepting the same award. By comparison, most of my career highlights to date have been just trolling HPC people on Twitter (as the esteemed Horst Simon actually said out loud as I was leaving the stage!)

It was weird.

That said, I take this to mean that it is now my duty to be the friendly face from Microsoft who can speak intelligently about the #3 system on Top500. To that end, I'll answer some questions that I was asked at ISC about the system and Azure HPC clusters in general below. None of this is new or secret information!

  • Why didn't you run HPL again and post a higher score to beat Aurora? Because the day after that HPL run completed, that system was put into production. Once systems are in production, people are paying to use them, and taking a time-out to re-run HPL costs a ton of money in either real dollars (if a customer runs it) or lost revenue (if the HPL run is blocking customer workloads). This is quite different from public-sector HPC systems which never have to pay for themselves.
  • Can I get access to Eagle for a Gordon Bell run or to test software? That's not really how it works. Whereas a traditional supercomputer might allow users to ssh in and submit jobs to a Slurm queue, cloud-based supercomputers allow users to deploy virtual machines through a REST API. Those virtual machines can allow ssh, run Slurm, and support MPI jobs like HPL, but that OS environment is managed by Azure users, not Azure itself. You can get a taste for what's required to run a basic MPI job by reading some instructions I wrote on provisioning an MPI cluster on Azure.
  • Is it just a bunch of GPU nodes scattered around a bunch of data centers? No, all the nodes on any given Azure HPC cluster (like Eagle) share an InfiniBand fabric. There are countless InfiniBand clusters in Azure, but each one is a real supercomputer by any definition of a supercomputer, and they are designed to run tightly coupled jobs across all their GPUs.
  • What parallel file system does it use? Don't think about it that way. You can provision a Lustre file system and mount that to any or all cluster nodes if you want to, or you can access data directly from object storage.
  • Are there any photos of it? You can see a photo of one of the Microsoft-designed nodes that comprise the system on my SC'23 recap blog post. Beyond that, there's not much to look at because Azure HPC clusters are not meant to be photogenic like, say, Cray supercomputers. There's no rack graphics (or even rack doors!). It's just tons and tons of air-cooled racks with InfiniBand optics coming out of each one. Maybe the only unique thing is that the racks are painted white instead of the typical black. Not sure why.

Getting back to that false separation between "HPC" and "cloud," Eagle is strong evidence that they aren't different. What the "hyperscalers" do is not that different from what traditional HPC centers do. Perhaps the biggest difference is that cloud supercomputers get all the benefits of cloud infrastructure: software-defined resources like virtual machines and virtual networking, integration with identity and access management that transcends simple Linux UIDs/GIDs, and the flexibility to integrate with whatever storage systems or ancillary services you want from any compute node.

Other notable tidbits

It is tradition for Erich Strohmaier to talk through some highlights and trends of the latest Top500 list every time a new one is announced, and in the past, I've been critical of how he's presented conclusions from the list with this implicit assumption that computers that never post to Top500 simply don't exist. This year felt different, because Dr. Strohmaier made the explicit statement that China has completely stopped submitting to Top500. Their exascale systems aren't listed, but neither are any new systems in the past three years at the bottom. They simply don't play the game anymore, making it undeniable that Top500 is no longer an authoritative list.

Just as the whole conference's theme was reinventing HPC, I felt a sense that even the most stalwart proponents of Top500 are now recognizing the need to reinvent the Top500 list. Kathy Yelick said as much during her keynote ("Shall we replace Top500? What are the metrics in post-exascale computing that are important?"), and Erich implored the audience to help expand the HPL-MxP (formerly HPL-AI; an HPL-like benchmark that can use the mixed-precision capabilities of tensor cores) list. Nobody seems to know how to quantify what makes a leadership supercomputer nowadays, but accepting that HPL scores (or appearing on the Top500 list!) won't cut it is a good first step.

That all said, Top500 is still a valuable way to track technology trends in the industry. For example, this edition of the list is where NVIDIA's new Grace Hopper nodes started appearing in force. The only new entrant in the Top 10 was the 270 PF GH200 component of CSCS's Alps system, and HPE had these EX254n GH200 blades on display on the show floor.

Photo of the Cray EX254n GH200 blade

To HPE/Cray's credit, they seem to have gotten the system up and running with Slingshot without the delays that plagued early Cray EX systems like Frontier and Aurora. Hopefully this is a sign that the Cray EX platform and Slingshot-11 have graduated from being risky and not-quite-production-ready.

The other notable entrants on this year's Top500 are a trio of early MI300A APU-based Cray systems being built around the El Capitan program at Lawrence Livermore National Laboratory. This is a positive sign that MI300A is up and running at modest scale, and HPE also had one of these EX255a blades on display at their booth:

Photo of a Cray EX255a blade

The strong showing of MI300A suggests that we may see El Capitan take the top spot in the next edition of the Top500 list coming in November.

Everyone is an AI expert!

Since I now work on a team responsible for AI infrastructure, I tried attending as many of the AI-focused talks and panels as I could this year. Unsurprisingly, these sessions largely carried the same undertones of "reinventing HPC," and speakers opined on how AI would affect scientific computing and offered examples of what their institutions were doing to extend their leadership in the HPC space into the AI space. There was a fair amount of grasping going on (as there always is when AI is discussed at non-AI conferences), but this year I was struck by how confused so many speakers and attendees were about concepts related to applying AI.

To be clear: I am no expert in AI. However, my day job requires that I be steeped in some of the largest AI training workloads on the largest AI supercomputers on the planet, and I have to have a cursory understanding of the latest model architectures and techniques to anticipate how future system designs will have to evolve. It's from this perspective that I made the following observation: there are a lot of HPC people speaking very confidently about AI based on an outdated understanding of the state of the art. The AI industry generally moves much faster than the government-funded research community, and I couldn't help but wonder if some community leaders assumed that the AI industry today is the same as it was the last time they wrote their AI grant proposal.

Of course, there were also some really insightful perspectives on AI for science shared as well. Let's talk through some examples of both.

The Exascale AI Synergies LLM Workflows BOF

This realization that the ISC community is not keeping up with the AI community first slapped me in the face when I ducked into a BOF session titled, "Tales of Exascales – AI and HPC Supercomputing Platforms Synergies for Large Language Models (LLMs) and Scientific Workflows." I sometimes wonder if the organizers who propose titles like that are intentionally creating word salad, but in this case, it was an apt session name; the discourse around HPC and AI was all over the board throughout the hour.

The session started on a strong, positive note by Simon McIntosh-Smith describing Bristol's new Isambard-AI system, a GH200-based Cray supercomputer funded under the broad charge of "AI research." While I'm usually skeptical of such nebulously defined "AI research" machines, Dr. McIntosh-Smith's description of the project quickly checked a bunch of boxes on how a real AI research platform should be developed. In particular,

Isambard-AI was developed and deployed at the pace of AI rather than HPC for scientific computing. Whereas government-funded, large-scale HPC systems typically take years to procure, Simon said that the first discussions started in August 2023, and in the nine months that followed, they had built the site, the team, and the system itself to the degree that a piece of the final system is already on Top500. By comparison, LLNL's El Capitan supercomputer also debuted on Top500 this month, but its contract was signed five years ago, and its procurement began at least two years before that. The AI industry would not exist if the systems it trains on took seven years to procure.

Isambard-AI deliberately avoided exotic AI accelerators to remain future-proof. Simon rightly pointed out that the AI industry moves too quickly to anticipate whether a bespoke AI accelerator would even be relevant to whatever the hottest model architecture will be in a year. GPUs were chosen because they are the most flexible way to accelerate the widest range of AI workloads, regardless of whether they involve dense models, sparse models, inferencing, training, or whatever level of quantization makes sense. The reality is that cutting-edge research is done on GPUs, so aligning an AI supercomputer on the same technology will ensure that the algorithms developed by industry are immediately usable for scientific research.

A reasonable definition of "AI for science" was defined from the outset. Rather than blurting out "we need to research AI!" and asking for a sack of money to buy GPUs, Simon outlined a vision of training AI models using data generated by physical simulation on a more conventional HPC system. Training models on models to create surrogate models is not particularly new, but it does establish a few reasonable architectural decisions such as having a robust data management and sharing platform, close coupling to the HPC system performing simulation, and aligning software stacks and programming environments as closely as possible.

Simon's contribution to the discussion stood out to me as the most impressive, but the discourse that followed seemed to fall into a trap of familiarity. Rather than focusing on the new and exciting prospects of AI, some panelists and audience members wanted to focus on the aspects of AI they understood. For example, an uncomfortable amount of time was spent on a back-and-forth on how HPC centers can support Kubernetes and random I/O (which is what defines AI vs. HPC?) instead of Slurm and Lustre. If your biggest challenge in delivering infrastructure to support AI workloads is figuring out how to deploy both Kubernetes and Slurm, you haven’t even reached the starting line. This is a trivial issue in cloud environments, where entire AI clusters can be built up and torn down in minutes. Again, this is evidence that the scientific computing community isn’t ready to keep pace with the AI industry.

I jotted down a few of the questions and comments that I heard during this BOF that seem to reflect the level of familiarity the average ISC attendee has with AI:

  • "Would be nice if there were more models for science." I wasn't sure sure what this means. All the leading LLMs are pretty good at "science," and domain-specific models aren't readily transferable between different science domains or problems.
  • Scientific problems "have to validate outputs for correctness, unlike LLMs." I think the speaker was making a sidelong reference to hallucinations, but like with any model (large language or physics-based), validating outputs for correctness is certainly necessary and readily possible.
  • "The demands of inference of LLMs are completely different from those for training. How do you buy inference infrastructure?" I wonder where this notion came from. If your infrastructure can train a model, it can definitely inference that model. Cost-optimizing infrastructure for inferencing is a separate matter (you can cut corners for inferencing that you wouldn't want to cut for training), as is building the service infrastructure around inferencing to deliver inferencing as a service. But I don't think that's what this question was about.
  • "Working safely with sensitive data / isolating workloads on big shared clusters." This is a problem that arises only when you try to wedge AI workloads into infrastructure designed for traditional physics-based simulation. If you have sensitive data, don't use big shared clusters. Provision separate clusters for each security domain on a shared, zero-trust infrastructure.
  • "How different are the files and filesystem access while training for LLMs, image generation models, reinforcement learning?" This question reflects a general misunderstanding of data and storage in HPC overall; how data is organized into files and how that data is accessed by a workload is an arbitrary decision made by the application developer. You can organize piles of text into one giant file or a million little files.

There were a few questions that came up that touched on deeper issues on which the HPC community should reflect:

  • "What are the first steps for scientific groups wanting to get ready for using AI in the future?" This is probably the purest question raised in the entire session, and I think this is something the scientific computing community as a whole needs to figure out. What does "using AI" really mean for scientific groups? Is it training models? Fine-tuning models? Inferencing using pre-trained models on HPC infrastructure? Is it integrating simulation applications with separately managed inferencing services? Who manages those inferencing services? Does inferencing even require HPC resources, or can suitable models run on a few CPU cores? I think the first step to answering this question is ensuring that the scientific computing community reaches a common baseline level of understanding of "using AI" means. And a lot of that probably means ignoring what some self-professed AI experts in the HPC community claim is the future.
  • "Care to predict what that ChatGPT moment will be for AI for Science? Had it already happened?" This question was addressed directly by panelist Séverine Habert who rightly pointed out that the ChatGPT moment occurred when a complex and esoteric topic was suddenly put in the hands of hundreds of millions of laypeople across the world. It was the moment that the common person walking on the street could suddenly interact with the most cutting-edge technology that had been previously understandable only to the headiest of researchers in industry and academia. That will likely never happen in AI for science because science, by definition, requires a higher baseline of education and understanding than the average layperson has.
  • "How to effectively train the existing workforce when we are already struggling to retain talent in research/academia?" This question strikes at the same theme that Kathy Yelick's opening keynote confronted: what is the role of the scientific computing community now that it turns out that you don't need decades of institutional experience to deploy and use HPC resources at leadership scale? As offensive as it may sound, perhaps the public-sector HPC community should accept that their role is not training future researchers and academics, but training future practitioners of AI in industry. This is how the wider tech industry generally works; neither startups nor tech giants make hires assuming those people will still be around in ten years. Why does the public-sector HPC industry think otherwise?

Finally, I was also struck by how fiercely the discourse clung to the idea that large language models are the answer to all AI problems in science. I get that this panel was focused on exascale, and LLM training is one of the rare cases where AI requires exascale computing capabilities. But there was no acknowledgment that trillion-parameter models are not actually a good idea for most scientific applications.

AI Systems for Science and Zettascale

This singular focus on creating massive LLMs for science was front-and-center in a talk given by Rick Stevens titled "The Decade Ahead: Building Frontier AI Systems for Science and the Path to Zettascale." The overall thesis that I heard was something like...

  1. Science needs its own trillion-parameter foundation models
  2. Training trillion-parameter foundation models requires a lot of GPUs
  3. We need $25 billion from the U.S. government

However, Stevens never answered a very basic question: what does a foundation model for science do that any other foundation model cannot do?

He showed slides like this which really don't sound like foundation models for science as much as generic AI assistants:

Foundation models for science, per Rick Stevens

Is the scientific computing HPC community really the most qualified bunch to reinvent what existing foundation models like GPT-4 or Claude 3 have already done? Even if you argue that these proprietary models aren't as good at "science" as they could be, who would have a better chance of addressing this with a billion dollars of federal funding: the companies who developed GPT or Claude, or a collection of government scientists starting from scratch?

I think the answer to this question was in other parts of Stevens' talk. For example, he started with this slide:

Rick Stevens' requirements gathering slide

While robust requirements are good when there's no urgency, this slide is also a tacit admission that the government takes years to form a perspective on AI. Do you think the creators of Llama-3 or Mistral Large gathered wide community input from over 1,300 researchers before deciding to build a supercomputer and train a model? Even if science needs its own foundation models, this slide is strong evidence that, by the time the scientific HPC community agrees on a path forward, that path will be years out of date relative to what the commercial AI industry is doing.

A great example of this already happening is the basic premise that creating a foundation model with a trillion parameters is the best way to apply AI to solve science problems. This certainly was the leading thought two years ago, when transformer scaling laws were published that suggested that the best way to get better-performing LLMs was to simply add more parameters to your transformer and train on more data. But there's a reason all the leading models have stopped advertising how many parameters they use.
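
For reference, the scaling laws in question (the compute-optimal "Chinchilla" result published in 2022, as I understand the reference) model pretraining loss roughly as L(N, D) ≈ E + A/N^α + B/D^β, where N is the parameter count, D is the number of training tokens, and E, A, B, α, and β are empirically fitted constants. The headline conclusion was that, for a fixed compute budget, parameters and training tokens should be scaled up roughly in proportion to each other, which is exactly the reasoning that fueled the race toward ever-larger dense models.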

Dealing with massive transformers is really expensive. They're not only really expensive to train, but they're really expensive to use for inferencing too. This has led to a bunch of innovation in model architectures and training approaches that deliver dramatically higher quality outputs from a fixed parameter count. Dense transformer architectures with a trillion parameters have come to look like a blunt instrument for developing foundation models since 2022, so it took me by surprise to hear Stevens put so much stock in the notion that a trillion-parameter model is essential for science.

To repeat myself, I am no expert in AI. I've never been called in front of Congress to talk about AI or been invited to give talks on the topic at ISC. There might be something basic that I am missing here. But when I look at the science drivers for AI:

Slide showing the cross-cutting themes in AI for science

I know that you do not need to train your own trillion-parameter model to do most of this stuff. Even the use cases that do require generative AI, like code generation and math theory, don't actually require trillions of parameters. Small language models, such as the one described in Textbooks Are All You Need (published in 2023, after the reports Stevens cited in his talk), can produce amazing results when you train them using high-quality data instead of garbage scraped from Reddit. And when you create or fine-tune a small language model for a specific science domain, not only do you save yourself from having to buy a billion-dollar supercomputer for training, but you get a model that is much more accessible to scientists around the world because they won't need a million dollars' worth of GPUs to inference with it.
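
To make that concrete, here is a minimal sketch of what domain-specific fine-tuning of a small open model could look like using the Hugging Face Trainer API. The model name, the curated JSONL corpus, and the hyperparameters are all placeholders I made up for illustration, not anything described in the talks.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder small model and placeholder curated dataset -- swap in your own.
model_name = "microsoft/phi-1_5"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumes a JSONL file with a "text" field containing curated, domain-specific documents.
ds = load_dataset("json", data_files={"train": "curated_domain_corpus.jsonl"})
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-domain-finetune",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-5,
                           bf16=True),          # assumes a GPU with bfloat16 support
    train_dataset=ds["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

A run like this fits on a single modern GPU, which is exactly the point: the barrier to entry for a domain-tuned small model is nowhere near a billion-dollar supercomputer.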

So, if there's one question that was never answered across any of the AI-themed sessions at ISC this year, it is this: Why does science need to train its own large language models? My intuition is that either fine-tuning existing large language models or training small language models for domain-specific applications would be a better investment in actually advancing science. However, if we cynically assume the real goal of LLMs-for-science is to justify buying massive GPU systems, suddenly a lot of the talks given at ISC on this topic make a lot more sense.

Real applications of generative AI for science

As frustrated as I got sitting through sessions on AI where it sometimes felt like the blind leading the blind, there was one really good session on actual applications of generative AI for science.

Mohamed Wahib of RIKEN gave an insightful presentation on the unique challenges of using generative AI in science. His summary slide touched on a lot of the key challenges:

Mohamed Wahib's slide on challenges of generative AI in science

And his actual talk focused largely on the model and data aspects of generative AI. What struck me is that the challenges he described reflected the experience of someone who has actually tried to do what many other AI experts at the conference were claiming would be the future. For example,

  • He recognized the importance of training scientific models with high-quality datasets, not just garbage scraped off of social media. This means not only scraping or generating high quality data for training, but curating and attributing that data and applying reinforcement learning with human feedback as the model is being trained. This is uniquely challenging when creating models for scientific applications, as managing the quality of scientific data requires deep domain expertise. This contrasts with a generic chat bot whose inputs and outputs can often be assessed by anyone with a basic education.
  • He also talked about the tendency of scientific data to be highly multimodal and multidimensional. Whereas multimodal chatbots may combine text and vision, scientific data often contains observations of the same phenomenon from many different sensors (for example, pressure, temperature, density, strain fields, ...), and the output of a generative model for science may require multiple modalities as well.  These capabilities are not well developed in LLMs designed for human language.
  • Dr. Wahib also pointed out that scientific datasets tend to be huge compared to text and images, and this may require developing ways for models to have context windows that can fit the tokens of multi-petabyte datasets in order to identify long-range correlations. Relatedly, he also pointed out that tokenization of scientific data presents a new set of challenges unique to this community, since industry has been focused on tokenizing low-dimensional data such as text, audio, and images.

The good news is that industry's quest towards both commercializing generative AI and achieving AGI will touch on some of these challenges soon. For example, training domain-specific models using high-quality datasets is an essential component of the small language models I described in the previous section, and these small language models are what will enable privacy-preserving and cost-effective generative AI on laptops and phones. Achieving effectively infinite context windows is also a major hurdle on the path to AGI, as industry is hard at work developing AI agents that can remember every conversation you've ever had with them. Finding more scalable approaches to attention that do not sacrifice accuracy is a part of this.

François Lanusse, currently at the Flatiron Institute, also gave a nice presentation that clearly explained how generative AI can be used to solve inverse problems—that is, figuring out the causes or conditions that resulted in a collection of measurements. One concrete example applied generative AI to figure out what an image distorted by gravitational lensing might look like in the absence of those distortions. As I understood it, he used simulations to train a diffusion model on the relationship between images that are affected by gravitational lensing and the masses that cause that lensing. He then used that model instead of an oversimplified Gaussian model as part of a larger method to solve the inverse problem of un-distorting the image.

The details of exactly what he did were a little over my head, but the key insight for me is that combining generative AI and science in practice is not as straightforward as asking ChatGPT what the undistorted version of a telescope image is. Rather, almost all of the standard, science-informed approach to solving the inverse problem remained the same; the role of generative AI was simply to replace an oversimplified component within the iterative process (annealed Hamiltonian Monte Carlo, in this case) to help it converge on better answers. It really is a combination of simulation and AI, rather than an outright substitution or surrogate model.
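
For the curious, the general shape of this kind of approach looks something like the sketch below. To be clear, this is not Lanusse's actual method (he used annealed Hamiltonian Monte Carlo); it's a simplified annealed Langevin loop that assumes a linear, square forward operator A and Gaussian measurement noise, with score_prior standing in for the learned diffusion-model prior.

```python
import numpy as np

def annealed_langevin_posterior(y, A, score_prior, sigmas, noise_sigma,
                                n_steps=50, eps=1e-4, rng=np.random.default_rng()):
    """Sample x from p(x|y) for y = A @ x + noise, using a learned prior score.

    score_prior(x, sigma) is assumed to be the gradient of log p(x) learned by a
    diffusion model trained on simulations; it replaces a hand-built Gaussian prior.
    sigmas should run from coarse (large) to fine (small) noise levels.
    """
    x = rng.standard_normal(y.shape)
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(n_steps):
            grad_lik = A.T @ (y - A @ x) / noise_sigma**2    # Gaussian likelihood score
            grad = grad_lik + score_prior(x, sigma)          # posterior score
            x = x + step * grad + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x
```

The physics stays in the forward operator and noise model; the generative model only supplies a better prior than a Gaussian, which is the division of labor I took away from the talk.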

Dr. Lanusse also showed this slide which demonstrated how this approach can be generalized to other scientific domains:

François Lanusse's view of how generative AI can improve scientific discovery

The general approach of pretraining, fine-tuning ("adapt"), and combining foundation models with other physics-based models seems reasonable, although I admit I have a difficult time wrapping my head around exactly how broadly scoped he envisions any given pretrained foundation model to be. I can see such a model trained on extensive sky survey data being useful for a number of astrophysical and cosmological tasks, but it's less clear to me how such a model might be useful in unrelated domains like, say, genomics.

You might also ask why I think this vision of foundation models for science is reasonable while Rick Stevens' vision didn't ring true; the difference is in scale! The foundation models cited on Lanusse's slide are vision transformers which have many orders of magnitude fewer parameters than the trillion-parameter models that others talk about. Whereas a trillion-parameter model might need to be distributed over dozens of H100 GPUs just to produce one inference result, the largest of the vision transformers can probably be squeezed on to a single high-end desktop GPU. Again, you don't need billion-dollar supercomputers to train these models for science.

Frank Noé from Microsoft Research then talked about how generative AI can be applied to solve problems in simulating biological systems. Like the speakers before him, Dr. Noé followed the pattern where a larger, physics-based framework has one statistical technique replaced by a method based on generative AI, and then a physics-based model is used to quantify the likelihood that the result is reasonable. He contrasted this with conventional approaches (to, say, protein folding) where you just simulate for really long times in the hopes that your simulation randomly wanders into a situation where you capture a rare event.

His talk wasn't as focused on generative AI as the previous speakers' were, but he offered a litany of ways in which AI models can be useful to molecular modeling:

  • Markov state models provide a statistical framework that lets you replace one long simulation (that hopefully captures every possible scenario) with a bunch of short, chopped-up simulations that hopefully capture every possible scenario in parallel (a toy sketch of the idea follows this list). He cited an example that took 20,000 GPU-days on V100 GPUs that would've otherwise taken a million GPU-years if done as one long simulation.
  • Coarse-grained models use machine learning to develop surrogate models to simulate the physics of relatively uninteresting parts of molecular systems. The example he used was simulating the water molecules surrounding a biomolecule; water can be very difficult to accurately model, and the example he cited led to a surrogate model that was 100x faster than directly simulating water molecules.
  • Boltzmann generators can generate 3D molecular structures based on a known probability distribution defined by the energy states of the system. This is another fast way to find rare but stable molecular configurations without having to throw darts at a dartboard.
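
To illustrate the first of these, here is a toy sketch of how a Markov state model stitches many short, discretized trajectories into long-timescale statistics. Real workflows use dedicated packages (deeptime, PyEMMA, and the like) with proper reversible estimators, so treat this purely as an illustration of the idea; it assumes every state appears in the input trajectories.

```python
import numpy as np

def msm_transition_matrix(trajs, n_states, lag=1):
    """Estimate a Markov state model transition matrix from many short,
    discretized trajectories (each a list of integer state indices)."""
    counts = np.zeros((n_states, n_states))
    for traj in trajs:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1
    counts += counts.T                                   # crude detailed-balance symmetrization
    return counts / counts.sum(axis=1, keepdims=True)    # row-normalize to probabilities

def stationary_distribution(T):
    """Long-timescale populations come from the leading left eigenvector of T."""
    w, v = np.linalg.eig(T.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()
```

The point is that none of the individual trajectories needs to be long enough to see a rare event on its own; the statistics of the ensemble recover the slow behavior.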

What struck me is that, in all these cases, the AI model is never generating results that are blindly trusted. Instead, they generate molecular configurations which are then fed into physics-based models which can quantify how likely they are to be valid.

Both Lanusse's and Noé's examples of combining AI and simulation painted a picture to me where generative AI can be really useful in solving problems where a researcher would otherwise have to make educated guesses about what physical phenomenon is really happening based on incomplete information. So long as there is a way to apply a physics-based model to check the accuracy of each guess, generative AI can be trained to predict the relationships between incomplete information and what's really going on and get to probable answers much faster than relying on physics alone.

More broadly, I couldn't help but think about the Sora video showing pirate ships battling in a cup of coffee as I left this session. Like that video, these talks demonstrated that it's possible to train generative AI models to reproduce physical phenomena (like the fluid dynamics of coffee) without explicitly embedding any laws of physics (like the Navier-Stokes equations) into the model itself and still get really compelling results. The part of this that was lacking from the Sora video—but was present in these talks—was closing the loop between generated results and the laws of physics by feeding those generated results back into the laws of physics to figure out if they are probable.

High Performance Software Foundation

ISC'24 wasn't all about AI though! I wound up attending the launch of the High Performance Software Foundation (HPSF), a new Linux Foundation effort spearheaded by Todd Gamblin and Christian Trott (from Livermore and Sandia, respectively) that aims to promote the sustainability of the software packages relied upon within the high-performance computing community.

I haven't paid close attention to HPC software in a long time since most of my work was in platform architecture and storage systems, so a lot of the background context remains a little murky to me. That said, it seems like HPSF was formed to be like the Cloud Native Computing Foundation for the HPC community in that:

  • it will serve as a neutral home for software projects that aren't tied to any single university or government institution
  • it provides mechanisms to ensure that critical HPC software can continue to be maintained if its original author gets hit by a bus
  • it will help with the marketing and promotion of HPC software

Its governance seems pretty reasonable, with different levels of membership being accompanied by different levels of rights and obligations:

HPSF governance structure 

The Governing Board is comprised of paying members (predominantly those who pay the most), while the Technical Advisory Council carries out the more technical tasks of forming working groups and onboarding projects.

There are three levels of membership, and the highest (premier) has a $175,000 per year buy-in and comes with a seat on the Governing Board. Right now, the founding seats are held by AWS, HPE, LLNL, and Sandia.

Below that is a general membership tier whose cost is on a sliding scale based on the organization size, and AMD, Intel, NVIDIA, Kitware, ORNL, LANL, and Argonne have all committed at this level.  The associate tier is below that, and it is free to nonprofits but comes with no voting rights.

It seemed like the exact functions that HPSF will have beyond this governing structure are not fully baked yet, though there were six "prospective" working groups that provide a general scope of what the HPSF will be doing:

HPSF's prospective working groups

My read of the description of these working groups is that

  • CI/testing will supply resources (GPUs) on which HPSF projects' code can be automatically tested.
  • Software stacks will maintain E4S.
  • User engagement sounds like it will figure out what users of HPSF projects' software are looking for. It sounds like this will provide some product management-like support for projects.
  • Facility engagement is probably like user engagement, but for the sites deploying code on behalf of their users. Again, this sounds like product management functions.
  • Security sounded like stewarding SBOM-like stuff for member projects' software.
  • Benchmarking would make a framework for benchmarking HPC applications.

That all said, it still wasn't clear what exactly HPSF would do; what would all those membership dues go towards supporting? Based on some Q&A during this BOF and follow-up afterwards, I pieced together the following:

  • HPSF will not be funding developers, much in the same way that OpenSFS doesn't fund Lustre development. That said, Todd Gamblin later said that not funding software development was a financial constraint more than a policy one, with the implication that if more members join, there may be opportunity for HPSF to fund projects.
  • HPSF likely will be hosting events and conferences (perhaps like the CNCF hosts KubeCon), providing scholarships, developing and providing training related to member projects, and "increasing collaboration" (whatever that may mean!).

HPSF also has some influence and ownership over its member projects:

  • HPSF will co-own its projects' GitHub repos to ensure continuity in case the other repo owner abandons it.
  • HPSF will own the domain for the project for the same reasons as above.
  • Member projects still manage their own software development, roadmaps, releases, and the like. The HPSF won't dictate the technical direction of projects.
  • HPSF will own the trademark and logos of its member projects so it can prevent corporations from profiting off of repackaging products without respecting trademark.

This establishes an interesting new direction for the sorts of software projects that are likely to become member projects. Historically, such projects developed by the member organizations (i.e., DOE labs) have been wholly controlled by the labs that funded the work, and those software projects lived and died at the whims of the government funding. The HPSF offers a new vehicle for software projects to live on beyond the end of the grants that created them, but at the same time, it requires that the DOE surrender control of the work that it sponsored.

I left the session still wondering a few pretty major things, likely borne out of my own ignorance of how similar organizations (like CNCF or the Apache Foundation) work:

  1. How does a software project actually become a member project? The HPSF folks said that the Technical Advisory Council onboards new projects, but what is the bar if I have an open-source project used by the community that I no longer want to maintain myself? I assume it's not a pay-to-play arrangement since that defeats the purpose of sustaining software after its seed funding runs out.
  2. What do stakeholders actually get out of joining HPSF? I see obvious value for organizations (like the DOE labs) who develop open-source software but may not want to be exclusively responsible for sustaining it forever. But would an HPC facility get any obvious benefit from joining and paying dues if it is simply a consumer of member projects' software? What does a cloud vendor like AWS get by being a premier member? Is HPSF just a way to get someone else to cover the overheads of maintaining open-source software that comes out of, say, R&D organizations rather than product organizations?

Hopefully the answers to these questions become clearer as the foundation gets off the ground and we get to see what member organizations contribute under the HPSF banner.

Ultimately though, I see this as a really positive direction for the HPC software community that might help resolve questions around key pieces of HPC software that have uncertain ownership. For example, I wound up as a maintainer of the IOR and mdtest benchmarks because I was the last one to touch them when their previous maintainer lost interest/funding. I don't even work in I/O performance anymore, but the community still uses these benchmarks in virtually every procurement of parallel file systems, either directly or through IO500. Given how important these tools are, it would be wonderful if they didn't rest on my shoulders alone and instead had a more concrete governance structure.

Quantum computing

Besides AI and cloud, quantum computing was cited in Kathy Yelick's opening keynote as the third disruptor to HPC for scientific computing. At the time, I thought citing quantum was just an obligation of any opening keynote speaker, but quantum computing was particularly high-profile at ISC this year. I was surprised to see over a dozen quantum computing companies on the vendor exhibition floor, many of whom were Europe-based startups.

In addition, this year's Hans Meuer award (for best research paper) was given to a paper on quantum computing by Camps et al. This is particularly notable since this is the first time that the Meuer award has ever been given to a paper on a topic that isn't some hardcore traditional HPC like MPI or OpenMP advancements; by comparison, this award has never been given to any papers on AI topics. Granted, the winning paper was specifically about how to use conventional HPC to solve quantum problems, but this recognition of research in quantum computing makes a powerful statement: quantum computing research is high-performance computing research.

Reinvent HPC to include urgent computing?

I was invited to give a lightning talk at the Workshop on Interactive and Urgent High-Performance Computing on Thursday, and urgent/interactive HPC is not something I'd really paid attention to in the past. So as not to sound like an ignorant fool going into that workshop, I opted to sit in on a focus session titled "Urgent Computing" on Tuesday. I had two goals:

  1. Make sure I understood the HPC problems that fall under urgent and interactive computing so I could hold an intelligent conversation on this topic at the Thursday workshop, and
  2. See if there are any opportunities for cloud HPC to provide unique value to the challenges faced by folks working in urgent HPC

I'll describe what I came away with through these lenses.

The Urgent Computing focus session

What I learned from the focus session is that urgent computing is not a very well-defined set of application areas and challenges. Rather, it's another manifestation of reinventing HPC to include any kind of computation for scientific purposes.

Much to my surprise, this "Urgent Computing" focus session was actually a session on IoT and edge computing for science. Several speakers spoke about getting data from edge sensors on drones or telephone poles into some centralized location for lightweight data analysis, and the "urgent" part of the problem came from the hypothetical use cases of analyzing this sensor data to respond to natural disasters. There wasn't much mention of anything requiring HPC-like computing resources; at best, a few talks made unclear references to using AI models for data analysis, but it felt like grasping at straws:

The above conclusion slide was presented by one of the speakers, and to be honest, I don't understand what any of it means. Granted, I know very little about urgent computing, IoT, or edge computing so there may be some domain jargon here that's throwing me off. But based on this, as someone working in the area of HPC and AI in the cloud, I don't think I have a role to play here. I'm sure cloud computing can help, but the challenges would be in general-purpose cloud rather than HPC.

The Interactive and Urgent HPC workshop

Fortunately for me, the Thursday workshop on Interactive and Urgent HPC was much less about edge/IoT and more about developing software infrastructure and workflows that allow scientific data analysis of large datasets to happen before the results become obsolete. It was a fascinating workshop for learning about specific science drivers that require fast access to HPC resources, and how different HPC providers are enabling that through non-traditional services and policies. Below are a few highlights.

Sam Welborn (NERSC) presented his team's efforts to convert a streaming data workflow from its current file-based approach into one that streamed directly into compute node memory. The specific use case was the initial data processing for image information coming off of a scanning transmission electron microscope at 480 Gbps, totaling 750 GB per shot. As he described it, the current technique involves streaming those data to files at the microscope, then copying those files to the parallel file system of a remote supercomputer, then reading, processing, and writing that data within the HPC environment to prepare it for downstream analysis tasks. And for what it's worth, this is how I've always seen "streaming" HPC workflows actually work: they rely on file transfers under the hood, and the performance of the file systems at both the source and the destination is in the critical path.

The problem with this approach is that parallel file systems on HPC systems tend to be super flaky, and there's no real reason to bounce data through a storage system if you're just going to pick it up and process it. So, Dr. Welborn showed a true streaming workflow that skipped this file step and used ZeroMQ push sockets at the microscope and pull sockets on the HPC compute nodes to do a direct memory-to-memory transfer:

Streaming memory workflow slide from Sam Welborn

Seeing software like ZeroMQ used to enable communication in an HPC environment instead of forcing this workflow to fit into the MPI paradigm is an encouraging sign in my eyes. ZeroMQ, despite not using purpose-built HPC technology like RDMA, is the right tool for this sort of job since it supports much better resilience characteristics than messaging libraries designed for tightly coupled HPC jobs. Workflows like this that combine beefy GPU nodes with software developed in the commercial tech space suggest that the world of HPC is willing to abandon not-invented-here ideology.
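
As an illustration of the pattern (and definitely not the actual NERSC code), a ZeroMQ push/pull pipeline really can be this small. The hostname, port, and the acquire_frames and process functions below are placeholders; the two halves would run as separate processes on the detector host and on the compute nodes.

```python
import zmq

ctx = zmq.Context()

# --- detector side: PUSH socket fans frames out to whichever workers are connected ---
push = ctx.socket(zmq.PUSH)
push.bind("tcp://*:5555")
for frame in acquire_frames():             # placeholder: yields each detector frame as bytes
    push.send(frame, copy=False)

# --- compute-node side: PULL socket receives frames straight into memory ---
pull = ctx.socket(zmq.PULL)
pull.connect("tcp://detector-host:5555")   # placeholder hostname
while True:
    frame = pull.recv()
    process(frame)                         # placeholder: downstream reconstruction/analysis
```

Because PUSH/PULL load-balances across however many pullers connect, scaling out the receive side is just a matter of launching more worker processes on more compute nodes.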

It wasn't clear to me that there's a great opportunity for cloud HPC to be uniquely useful in use cases like this; while you certainly can provision beefy CPU and GPU nodes with InfiniBand in Azure, cloud services can't obviously simplify this ZeroMQ-based workflow beyond just supplying general-purpose VMs on which the orchestration services can run. Had this team stuck with a file-based streaming mechanism, the performance SLAs on cloud storage (like object or ephemeral Lustre) would provide a more reliable experience to ensure the data transfer happened in near-real-time. But the better solution to unpredictable file system performance is to do exactly what was done here: skip the file system entirely.

Just to keep the speaker honest, I asked why this computation couldn't simply be done at the same place as the microscope generating the data. After all, if the microscope always generates 750 GB per shot, you should be able to buy a couple GPU servers that are ideally sized to process that exact workload in the time between images. There were actually two answers: one from Sam and one from an audience member:

  1. Sam said that you can process this workflow locally, but that the goal of this work was to prepare for a future microscope (or another instrument) that could not. He also insightfully pointed out that there's tremendous value in getting the data into the HPC environment because of all the services that can be used to work on that data later. I envisioned doing things like using a Jupyter notebook to further process the data, serve it up through a web UI, and similar tasks that cannot be done if the data is stuck inside a microscope room.
  2. An audience member also pointed out that sticking GPU nodes in the same room as electron microscopes can result in enough noise and vibration to disrupt the actual scope. This was a great point! In the days before I started working in HPC, I was training to become an electron microscopist, and I worked in a lab where we had water-cooled walls to avoid the problems that would be caused by air conditioning breezes. There's no way a loud server would've worked in there.

Toshio Endo (Tokyo Tech) gave an interesting talk on how they enable urgent/interactive compute jobs on their batch-scheduled TSUBAME4.0 supercomputer by doing, frankly, unnatural things. Rather than holding aside some nodes for interactive use as is common practice, his work found that a lot of user jobs do not completely use all resources on each compute node they reserve:

Toshio Endo's slide on TSUBAME3.0 GPU utilization

I had to do a double-take when I saw this: even though 65%-80% of the nodes on the supercomputer were allocated to user jobs, less than 7% of the GPUs were actually being utilized.

Dr. Endo's hypothesis was that if nodes were suitably subdivided and jobs were allowed to oversubscribe CPUs, GPUs, and memory on a compute node without impacting performance too much, they could deliver real-time access to HPC resources without having to create a separate pool of nodes only for interactive uses. He defined success as each shared job retaining at least 1/k of its standalone performance when k jobs share the same node; for example, if four jobs were all running on the same node, each one taking four times as long to complete would be acceptable, but any longer would not. He then went on to show that the best way to accomplish this is using Slurm's gang scheduling, where each job takes turns having exclusive access to all the CPUs and GPUs on a node. The alternative (just letting the OS context switch) was no good.
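
For context, the Slurm knobs involved look roughly like the fragment below. This is a generic sketch of gang scheduling with forced oversubscription, not TSUBAME4.0's actual configuration; the node names and limits are made up.

```
# slurm.conf sketch: let up to four jobs share a node and take timed turns running
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30          # seconds each job (or gang of jobs) runs before being suspended
PartitionName=shared Nodes=gpu[001-100] OverSubscribe=FORCE:4 MaxTime=12:00:00 Default=NO
```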

While a fascinating study in how to provide zero wait time to jobs in exchange for reduced performance, this whole mechanism of using gang scheduling to exploit low resource utilization seems like jamming a square peg into a round hole. If a workload doesn't (or can't) use all the GPUs on a node, then that's not the right node for the job; I feel like a more appealing solution would simply be to offer a heterogeneous mix of nodes based on the demands of the workload mix. This is hard to do if you're buying monolithic supercomputers since you're stuck with whatever node mix you've got for five years, but there is another way to buy supercomputers!

I won't pretend like dynamically provisioning different flavors of CPU- and GPU-based nodes interconnected with InfiniBand in the cloud doesn't come with a cost; the convenience of being able to slosh a cluster makeup between CPU-heavy and GPU-heavy nodes will be more expensive than committing to use the same makeup of node flavors for multiple years. But if you're paying for GPUs that are only being used 7% of the time, surely it's cheaper to pay a higher cost for GPUs when you need them if it also allows you to not pay for them 93% of the time when they're idle.

Bjoern Enders (NERSC) gave the first lightning talk where he presented the exploration they're making into enabling real-time and urgent computation. They're currently going in three parallel directions to provide this capability:

  1. Reservations, a process by which a user can request a specific number of nodes for a specific period of time, and Slurm ensures that many nodes are available for the exclusive use of that user by the time the reservation starts. He said that implementing this at NERSC is costly and rigid because it requires a human administrator to perform manual steps to register the reservation with Slurm (see the sketch after this list).
  2. Realtime queues, where a few nodes are held from the regular batch queue and only special real-time users can submit jobs to them. Dr. Enders said that NERSC is extremely selective about who can access this queue for obvious reasons: if too many people use it, it will back up just like the regular batch queues do.
  3. Jupyter Hub, which utilizes job preemption and backfill under the hood. If a user requests a Jupyter job, Slurm will pre-empt a job that was submitted to a preemptible queue to satisfy the Jupyter request. However, if there are no preemptible jobs running, the Jupyter job will fail to launch after waiting for ten minutes.
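
For anyone unfamiliar with the mechanics behind the first item, the manual step boils down to an administrator running something like the following (the reservation name, times, user, and node count are all made up):

```
scontrol create reservation ReservationName=beamline_run \
    StartTime=2025-06-20T09:00:00 Duration=04:00:00 \
    Users=beamuser NodeCnt=16 Flags=IGNORE_JOBS
```

It's not hard to imagine wrapping a command like this in a self-service web API, which is presumably where the "costly and rigid" complaint leads.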

To provide compute resources to back up these scheduling capabilities, they also deployed a new set of compute nodes that can be dynamically attached to different supercomputers they have to support urgent workloads even during downtimes.  Called "Perlmutter on Demand" (POD), it sounded like a separate set of Cray EX racks that can be assigned to the Perlmutter supercomputer or, if Perlmutter is down for maintenance, to their smaller Alvarez or Muller systems, which share the same Cray EX architecture. What wasn't clear to me is how the Slingshot fabrics of these nodes interact; perhaps POD has its own fabric, and only the control plane owning those racks is what changes.

He showed a slide of explorations they're doing with this POD infrastructure, but as with Dr. Endo's talk, this seemed a bit like a square peg in a round hole:

Bjoern Enders' slide on experiments with POD

All of this sounds aligned with the strengths of what HPC in a cloud environment can deliver, and some of the big challenges (like figuring out the ideal node count to reserve for interactive jobs) are problems specific to Slurm and its mechanism for scheduling. There's a lot more flexibility to rapidly provision HPC resources in cloud environments because, unlike the case where Slurm is scheduling jobs on a single cluster, cloud resource managers can schedule across any number of clusters independently. For example, if an urgent workload needing only four GPU nodes suddenly appears, it doesn't necessarily have to be scheduled on the same InfiniBand fabric that a large hero job is running on. Since the urgent job and the hero job don't need to talk to each other, cloud resource managers can go find a GPU cluster with a little more flex in it to provision those resources quickly.

Automating the process of reservations is also a bit of a game of catch-up, though my guess is that this is more a matter of someone having a weekend to sit down and write the REST service that manages incoming reservation requests. Although there's not a direct analog for reservations like this in Azure, AWS has a feature called AWS Capacity Blocks that does exactly this: if you know you'll want a certain number of GPU nodes sometime in the future, Capacity Blocks let you reserve them ahead of time through an API.

Finally, I represented Microsoft and gave a lightning talk that riffed on a lot of what I've been writing about in this blog post: HPC seems to be reinventing a lot of things that the cloud has already figured out how to do. The illustrious Nick Brown was kind enough to snap a photo of one of my slides and post it on Twitter:

A slide I presented on the parallels between inferencing as a service and urgent HPC workflows

My thesis was that the way urgent HPC workflows are triggered, scheduled, run, and reported on follows the same pattern used to implement inferencing-as-a-service offerings (like Copilot and ChatGPT) under the hood, right down to executing multi-node jobs on InfiniBand clusters. The difference is that these cloud workflows are built on the foundation of really nice cloud services that provide security, scalability, monitoring, and hands-free management that were originally developed for commercial (not HPC!) customers. My argument was that, even if you don't want to pay cloud providers to run urgent HPC workflows as a managed service, you can use these services (and the software infrastructure on which they're built) as a blueprint for how to build these capabilities in your own HPC environments.

Concluding thoughts

The ISC'24 conference was fantastic, and I am glad it has not lost the unique elements that made me want to attend in the years prior to the pandemic. It's still that smaller, intimate, and focused HPC conference that brings the community together. Although a lot of my synopsis above may sound critical of the content presented over the four days I attended, the fact that I've had so much to write down in this blog post is a testament to the value I really get out of attending: it makes me sit down and think critically about the way the HPC community is evolving, what the leading minds in the field are thinking, and where I might be able to contribute the most in the coming year.

I never much paid attention to the annual taglines of conferences like ISC, but this year's "Reinvent HPC" really resonated. The HPC community is at a crossroads. Exascale computing for science is now in the rear-view mirror, and large-scale AI is all the rage across the computing industry at large. But for the first time ever, this new direction in at-scale computing is happening without the inclusion of the people and organizations who've historically driven innovation in HPC. Whereas institutions like Oak Ridge, RIKEN, Cray, and Fujitsu defined the future of computing for decades, hundred-person startups like OpenAI and Anthropic are now paving the way in partnership with companies like Microsoft and Amazon.

HPC needs to be reinvented, if for no other reason than to decide whether the HPC community wants to be inclusive of new frontiers in computing that they do not lead. Does the HPC community want AI to be considered a part of HPC?

Judging from many speakers and panelists, the answer may be "no." To many, it sounded like AI is just another industry that's sucking all the air (and GPUs) out of the room; it's a distraction that is pulling funding and public interest away from solving real problems. It's not something worth understanding, it's not something that uses the familiar tools and libraries, and it's not the product of decades of steady, government-funded improvements. AI is "them" and HPC is "us."

Personally, I'd like the answer to be "yes" though. Now that I'm on the other side of the table, supporting AI for a cloud provider, I can say that the technical challenges I face at Microsoft are the same technical challenges I faced in the DOE. The desire to deeply understand systems, optimize applications, and put world-class computing infrastructure in the hands of people who do amazing things is the same. And as the days go by, many of the faces I see are the same; instead of wearing DOE or Cray badges, my lifelong colleagues are now wearing NVIDIA or Microsoft badges.

All this applies equally to whether cloud is HPC or not. The HPC community needs to reinvent itself to be inclusive of everyone working towards solving the same problems of computing at scale. Stop talking about people who work on commercial AI in cloud-based supercomputers as if they aren't in the room. They are in the room. Often near the front row, snapping photos, and angrily posting commentary on Twitter about how you're getting it all wrong.

Picture of me near the front row, angrily posting on Twitter

HPC has historically been used to solve scientific problems, whether to expand our understanding of the universe, to find the next best place to drill an oil well, or to model the safety of aging nuclear weapons. The fact that HPC is now being used to solve squishier problems related to natural language or image generation does not change the essence of HPC. And whether that HPC is delivered through physical nodes and networks or virtualized nodes and networks is irrelevant, as long as those resources are still delivering high performance. AI is just as much HPC as scientific computing is, and cloud is just as much HPC as OLCF, R-CCS, or CSCS is.

So perhaps HPC doesn't need to be reinvented as much as the mindset of its community does.

That all said, I am genuinely impressed by how quickly ISC'24 has been reinventing itself in recent years. It wasn't too long ago that all its keynote speakers were greybeards from a predictable pool of public HPC centers all saying the same things year after year. It's wonderful to see a greater diversity of perspectives on the main stage and torches passing on to the next generation of leading figures in the field. And it was not lost on me that, for the first time in the history of this conference, Thomas Sterling did not deliver the closing keynote. As much fun as I had poking fun at his meandering and often-off-the-mark conjectures every year, it was delightful to be exposed to something new this year.

I'm hopeful that ISC will continue to get better year over year, and ISC'25 will feel more inclusive of me despite the fact that I am now one of those hyperscale cloud AI people. So long as I still feel like it's my community, though, I will keep showing up in Germany every summer.


How has life after leaving the Labs been going?


June 2024 marked two years since I left my job at one of the world's most prestigious government HPC centers for a job in one of the world's largest technology corporations. In that time, the world of HPC has changed dramatically; just six months after I started, ChatGPT was released and triggered a gold rush in AI that is now overshadowing traditional scientific computing. This shift brought about massive HPC deployments led by hyperscalers, challenging the long-held belief that only national governments could deploy and operate world-leading supercomputers. My experiences at ISC'24 this past summer made clear to me that the traditional HPC community is now rethinking their role in the industry, and some individuals who built their careers in public HPC are revisiting their assumption that world-class HPC systems are limited to the public institutions that have historically dominated the top of the Top500 list. I had no idea things would unfold this way when I left my job at NERSC back in 2022, and I've been remarkably lucky to now be a part of one of the largest forces driving this huge shift in HPC.

One of my new offices. Nicer than my old government office, and it has free food, but it's a ninety-minute drive each way.

In the spirit of openness and helping others who are facing similar career decisions, I thought I would follow up on my Life and leaving NERSC post by sharing how my professional journey from DOE HPC into cloud HPC has been going. I'll first explain the path I've traveled over these past two years, then answer some of the most common questions I've been asked about this transition.

As a forewarning, this is not a typical technology-focused post, and most of this might be obvious to people who already work in Big Tech. Here are the questions on which I reflected:

  1. What happened during my first two years in Corporate America?
  2. So what do I actually do?
    1. Storage product management
    2. HPC/AI development
  3. Am I happy with my decision and the new job?
    1. Broadly, yes
    2. But for a long time, no
    3. Finally, yes
  4. What does industry do better than the Labs?
    1. Accountability
    2. Pace and decision making
    3. Relevance
    4. Technically: security
    5. But the pay is good, right?
    6. How's work-life balance?
  5. Do you miss anything about working at the lab?
    1. Freedom to have an off day
    2. Travel
    3. Openness
  6. Would you still have left NERSC knowing what you know now?

What happened during my first two years in Corporate America?

I published my Life and leaving NERSC blog post on a Thursday, which was my last day working at NERSC. The following Monday was my first day at the new job, and being hired as 100% remote, it didn't feel that different; I was just booting up a Lenovo laptop (yuck) instead of a MacBook, using Teams and Outlook instead of Slack, GSuite, and Zoom, and that sort of thing.

However, the job was undeniably different; whereas I used to be an engineer at NERSC, I was hired to be a "Principal Product Manager" within the cloud storage organization which was responsible for all object, disk, and file storage services. Although my title was "product manager," I wasn't a people manager, and I didn't manage any specific storage products. Rather, my responsibility was to act as an HPC-focused overlay across all cloud storage services, and my job was to represent the interests of HPC users to all the people who did manage specific storage products. I didn't define product or feature roadmaps myself, but I could help those responsible for each product or service understand how to shape their roadmaps to benefit HPC workloads.

I struggled in this position for a variety of reasons, so after I gave the new role an honest six to nine months, I decided that being a storage product manager just wasn't a good fit for me. Unfortunately, I reached this decision after the yield curve inverted and mass-layoffs and hiring freezes were implemented, so there weren't a lot of places to go other than back to a government lab. Although I wasn't thriving as a storage product manager, I did have allies that helped me navigate my day-to-day struggles, and I decided to wait until more opportunities opened up and learn as much about product management as I could in the meantime.

The yield curve inverted a month after I started my new job. Not great timing.

After a little over a year as a storage product manager, a new engineering role opened up within a sister team in our HPC/AI infrastructure organization. After discussing the needs and nature of the work with the hiring manager, I applied for the job, went through the interview process, and was eventually given a verbal offer to join his team in June 2023. Unfortunately, the global economic outlook was still uncertain, and I wound up sitting in a holding pattern (as a storage product manager) from June 2023 to November 2023. It wasn't until the week of SC'23 that I finally got the written offer letter, and I spent December wrapping up loose ends within the storage organization.

On January 2, 2024, I began my new (and current) role within the company. The move was completely lateral, but I changed job titles from "Product Manager" to "Software Engineer," and I changed organizations from storage to specialized compute.

I say all this because my experiences in making the professional transition from government HPC to cloud HPC are colored by the fact that I really changed jobs twice. I've had both product management and engineering/development roles, and I've been in both storage and HPC organizations.

So what do I actually do?

I've had two very different roles within the same orbit of HPC/AI infrastructure, so I'll describe them separately to give you a sense of the breadth of HPC roles possible.

Storage product management

As a storage product manager (PM), I was an inch deep but a mile wide on every storage service, every commercial HPC workload, and all the ways in which those two could touch each other. I'd guess that only 25% of my day-to-day work required deep expertise in HPC; the remainder was either business-centric or required only understanding HPC in broad strokes. This was quite unlike the things I'd done earlier in my career in the public sector, since there's not an equivalent to what a product manager does within the DOE Labs.

For example, I spent a lot of my time as a storage PM explaining the basics of HPC I/O to different teams within the company. When most cloud people think "storage," they are really thinking about either enterprise storage (things like virtual disks for virtual machines) or content distribution (think serving up content for web apps). The concept of hundreds or thousands of VMs all writing to the same place at the same time is standard practice in the HPC world, but in the cloud world, this is a DDoS attack. Since my organization was responsible for all storage, not just HPC storage, there were a lot of people who simply never had to think about the challenges that HPC people take for granted, and it could be challenging (as the new guy) to convince seasoned cloud storage PMs that some workloads legitimately need hundreds of gigabytes per second of bandwidth.

As a PM, I also wound up doing a fair amount of business reporting. For example, object storage is used by all manner of cloud customers, so prioritizing features that specifically help HPC customers required understanding how many HPC customers actually used it. How do you define whether a workload is really an HPC workload or not? In DOE, we'd waste hours debating stuff like this for no real purpose, but when I became a product manager, I had to define this to make the business case that we needed to develop a certain feature that would only be used by HPC workloads.

Finally, I did a fair amount of actual product and project management work. Get on the phone with a customer, write down what they do, and turn those into requirements. Do that a bunch of times, then synthesize a more general requirements document. Review it with leadership. Get approval to assign developers to work on the features to meet those requirements. Ask other teams to develop features you need for your feature. Negotiate with everyone on development priorities in the next six months. Track progress of the development team. Produce demos to show that progress is being made. Present progress to leadership. That sort of thing. It's similar to being a PI on a research grant, except I had customers, dependencies, and ultimate accountability.

As far as technical work, a lot of it revolved around meeting customers and internal partner teams where they were in terms of their knowledge of HPC. I did a fair amount of technical marketing; I would come up with the ways people should think about combining storage services together in their HPC workflows, then figure out how to communicate that to audiences with vastly different levels of technical understanding. For example, I didn't own our Lustre product, object storage product, or HPC CPU node product, but I owned the story around how we envisioned all three services working together. This meant I would create slides and narratives around this, then present them to anyone from our sales teams (who often had limited HPC-specific experience) to the world's leading HPC centers.

I also sometimes helped development teams accurately test their storage systems against HPC workloads. For example, when ChatGPT exploded, everyone wanted to know how well their storage service worked for training large language models. I would talk to the engineers who trained LLMs, infer what their I/O patterns would be based on their description of how they did training, then design a benchmark that our developers could follow to emulate that LLM training workflow. Since I understood both the workload and the storage technology, it was often faster for me to translate between AI engineers and storage engineers rather than have them speak directly.
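
To give a flavor of what "design a benchmark that emulates LLM training I/O" means in practice, a stripped-down version might look like the sketch below: each writer periodically dumps a large checkpoint shard, syncs it, and reports the bandwidth the storage service sustained. The sizes, intervals, and paths here are arbitrary placeholders, not any customer's actual workload.

```python
import os
import time

def emulate_checkpoint_writes(path, shard_bytes, interval_s, n_checkpoints):
    """Write a checkpoint-sized shard every interval_s seconds and report bandwidth."""
    chunk = os.urandom(64 << 20)                       # 64 MiB buffer, reused for every write
    for step in range(n_checkpoints):
        t0 = time.time()
        fname = os.path.join(path, f"ckpt_{step:04d}.bin")
        with open(fname, "wb") as f:
            written = 0
            while written < shard_bytes:
                f.write(chunk)
                written += len(chunk)
            f.flush()
            os.fsync(f.fileno())                       # make sure it actually hit storage
        gbps = shard_bytes / (time.time() - t0) / 1e9
        print(f"checkpoint {step}: {gbps:.2f} GB/s from this writer")
        time.sleep(interval_s)                         # stand-in for compute between checkpoints

# e.g., one 100 GB shard every ten minutes from each writer (placeholder numbers)
# emulate_checkpoint_writes("/mnt/benchmark", 100 * 10**9, 600, 10)
```

Run one of these per emulated writer across however many nodes the real training job would use, and you get a first-order picture of how a storage service would hold up.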

HPC/AI development

As an HPC/AI engineer, my work is a lot more technical and focused. I'm on a "white-glove support team" that works directly with large, strategic customers in HPC and AI, so rather than working with dozens of customers and connecting them to dozens of storage technologies, I work with one or two customers and the specific technologies on which they build their HPC or AI clusters. Because of this, I'd wager 95% of my day-to-day work is technical.

I don't spend much time in a terminal by virtue of my relative seniority. Instead, I sit in on a lot of internal meetings and represent the perspective of our strategic HPC and AI customers. For example, if we are trying to decide which CPU to include in our next HPC-optimized CPU node, I might work with our benchmarking engineers to develop a representative benchmark and then interpret the results with the node's product managers. I'm not the person running the benchmark myself; instead, I might ask hard questions that the customer might ask, help decide the next experiments to run, and backstop our engineers if the customer starts poking too many holes in the work.

I also function as a system architect at times; if a customer shows up with unusually large or complex HPC system requirements, I'll help translate the customer requirement (e.g., "We need 10 TB/s of storage bandwidth") for individual product teams (e.g., "they will be using N compute nodes and accessing storage via a network with this topology and tapering, likely running an application that has this pattern, ..."). This often requires understanding what the compute, network, and storage product teams are doing and being able to explain it all in whatever terms each team understands. I also wind up sitting in on customer meetings and asking critical questions so that we can make informed design tradeoffs.

I do write code, but no more than I did when I was a system architect at NERSC. For example, I might pull PDU telemetry from across a data center to help determine if oversubscribing the power for a future cluster would impact workloads. The code itself is pretty straightforward statistical analysis, but interpreting it requires an understanding of a bunch of things ranging from the workload running on the nodes to how nodes are distributed across PDUs, racks, rows, halls, and buildings.
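
As a hypothetical example of the sort of analysis I mean (the file name, schema, and power budget below are all made up):

```python
import pandas as pd

# Assumed telemetry schema: one row per sample with columns [timestamp, pdu_id, kw]
df = pd.read_parquet("pdu_telemetry.parquet")

per_pdu = df.groupby("pdu_id")["kw"]
summary = pd.DataFrame({
    "p50_kw": per_pdu.quantile(0.50),
    "p99_kw": per_pdu.quantile(0.99),
    "max_kw": per_pdu.max(),
})

BUDGET_KW = 30.0                                          # made-up per-PDU power budget
summary["headroom_at_p99"] = BUDGET_KW - summary["p99_kw"]
print(summary.sort_values("headroom_at_p99").head(10))    # the PDUs most at risk
```

The statistics are trivial; the real work is knowing which workloads ran on which nodes, and how those nodes map onto PDUs, racks, and rows, before you trust what the numbers are telling you.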

The remaining 5% of my work is not very technical and involves things I opt into because it's interesting or the right thing to do. This might be spending time providing historical context for a business strategy document or showing up at a meeting to help explain the customer perspective to a finance or sales team.

Am I happy with my decision and the new job?

Yes, no, and yes.

Broadly, yes

I am glad I made the decision to leave NERSC and take on a job in Big Tech for a couple of high-level reasons.

As a product manager, I learned a lot about how businesses and corporations work to a degree that I never did when I worked at a startup and I never would have if I stayed with the government. Not only do I now know what the difference between gross and operating margin is, but I get it because I've had to build COGS and pricing models that could sustain and grow a new product. I know exactly how to price cloud services (or any product or service, really) and where that money goes. I now pay much more attention to quarterly earnings reports, and I have a more confident opinion on what different elements of these reports say about a technology company's trajectory. This has equipped me with what feels like a much more complete understanding of the HPC industry overall.

I'm also glad to work at a company that generally tries to do the right things. It is investing heavily towards being carbon negative (rather than just buying carbon offsets) while others are burning gas inefficiently in a race to be #1. It also matches every donation I make to 501(c)3 nonprofits, which is a huge benefit that aligns with the ways in which I try to share my good fortune with others. And it beats employees over the heads with a strong, positive corporate culture which holds managers and leaders accountable for the wellness of their employees. These sorts of things don't meaningfully exist in government, and there are a lot of big corporations out there that prioritize short-term profits over the longer-term benefits that come from investing in sustainability and philanthropy.

But for a long time, no

However, I was unhappy for my first eighteen months.

I took a gamble on storage product management being as interesting and fulfilling as engineering when I decided to step into this new job, and I lost that bet. I quickly came to realize that there's a big difference between being a storage person in an HPC organization and being an HPC person in a storage organization.

When I worked in an HPC organization like NERSC, I was used to being the odd man out because parallel storage is a complicated topic that most HPC folks don't really understand. Despite that, everyone is still generally like-minded and appreciates the same things; everyone knows what MPI and InfiniBand are, and everybody knows what a checkpoint and restart might look like.

Conversely, when I worked in a storage organization, I was an odd man out because nobody really understood HPC. The average engineer only had a vague notion of what MPI or InfiniBand accomplished. If you don't understand that MPI is what lets hundreds of servers all work on the same distributed problem at once, it's easy to forget that an MPI application will also cause hundreds of servers to all write data at once. And if you've never used an MPI barrier, it's hard to internalize the fact that the whole application stops until the slowest process finishes writing.
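
A tiny mpi4py sketch shows why this matters: every rank sits at the barrier until the slowest writer finishes, so one laggard stalls the whole job. The output path is a placeholder.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank writes its own ~256 MiB chunk, as a checkpoint would.
data = np.random.rand(32 * 1024 * 1024)                  # 32M float64 values = 256 MiB
with open(f"/scratch/ckpt.{rank:05d}", "wb") as f:       # placeholder scratch path
    data.tofile(f)

# Nobody resumes computing until the slowest writer is done.
comm.Barrier()
if rank == 0:
    print("all ranks finished writing; compute resumes")
```

Multiply that pattern by a few thousand ranks and the "write data all at once" behavior that looks like a DDoS attack to a cloud storage team becomes obvious.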

Instead of worrying about tightly coupled applications, I realized that storage people worry about data availability and durability above all else. After all, storage's #1 job is to not lose data. In contrast, it's not unusual for an HPC user to have hundreds of terabytes of data vanish because they forgot to copy it off of scratch before it got purged. This sharp difference in priorities--data durability versus performance--causes friction, because at the end of the day, what's good for HPC (high bandwidth and low latency) is usually bad for storage (high durability and availability).

The landscape of storage for HPC and storage for enterprises as I see it. If you care about one but work with people who care about the other, expect friction.

These are technological differences, but they result in a persistent, elevated level of latent stress that never goes away. People tend to worry about the things they understand, and people tend to ask for help about the things that worry them. What this meant for me is that I spent a lot of time focusing on things that everyone understood (like market trends, revenue, and general indicators of performance) instead of hard problems unique to large-scale HPC. And because I was never solving the hard problems, I never got the gratification of feeling like I accomplished something that, as I learned, is an important motivator to me.

To be clear, I realize that I made the decision to focus on problems that other people brought me rather than carve out a space to work on the problems I felt were important. I'm sure that someone more tenacious and unafraid to pursue challenges that nobody else understood would have a very different experience as a PM. But after about a year, I realized that what I value and enjoy doing just isn't aligned with what it takes to be a successful storage PM. I realized I didn't want to keep doing what I was doing for another five years, so I decided to stop.

Finally, yes

I quite enjoy my role in HPC/AI engineering and development now, as it's similar to what I used to do in the DOE. I have to learn about how different hardware, software, and systems work, and I have a lot of room to focus on challenges that play to my strengths and interests. For example, I love engaging with the HPC community, and my job still allows me to go out to the big HPC conferences to do that. At the same time, I also like getting into the guts of system behavior, and I still get to spend at least an hour or two a week doing something quantitative.

My day-to-day is also steeped in that familiar feel of working in an HPC organization. Every cluster has a name that gets bandied about in meetings, and they have the same familiar challenges--fabric disruptions, firmware upgrades, flaky nodes, and the like. The standard responsibilities are also all there; some teams perform system administration, others support users, and some of us focus on future system designs. But the cluster names aren't nearly as creative as those in the public sector (Eagle's real name sounds like a serial number). And they look pretty boring too; there are no fancy rack graphics.

Five racks of a cloud GPU cluster that runs ND H100 v5-series VMs. It's mostly just boring servers and optical cables. Source

There are also teams that have no analogue in the traditional HPC world, like those who are responsible for things ranging from the smart NICs and software-defined networks to profits and losses. This is what keeps things interesting; I can just as easily spend an hour reviewing benchmark results from the latest GPU with my teammates as I can learning how the control systems for liquid heat exchangers affect system reliability or data center safety. When things are quiet and no fires are burning, going to work can sometimes feel like going to a big playground full of HPC and HPC-adjacent technology.

Don't get me wrong; it's still a job, and there are still unpleasant tasks and uncomfortable situations. Working at a cloud provider means a lot of processes are designed to be slow and steady, and some teams struggle to understand why anyone would want to reboot every node in a cluster at once--such an event would be a massive outage in general-purpose cloud! But working in an HPC organization means that when these situations arise, I'm no longer the odd HPC guy--I'm on the odd HPC team.

What does industry do better than the Labs?

Accountability

Organizational planning happens twice a year, and this planning is the time when teams all get on the same page about what work to prioritize in the next six months (a semester). Teams coordinate dependent work with each other, trade horses on the priority of each request, and at the end of planning, have committed agreements about what work will be done in the next semester. The progress on that work is tracked throughout the semester, delays and interrupts are accounted for, and there's an escalation path up through the ranks of management and leadership if priorities cannot be agreed upon by individual teams.

The DOE Labs operate much more loosely in my experience. There, people tend to work on whatever pet projects they want until they lose interest. If a project is funded by a research grant, there are loose deliverables and timelines (write X papers per year), but at the end of the day, nothing really bad happens if the work progresses slowly or its quality is poor. There's no penalty if a research grant results in a piece of software that nobody uses or a paper that nobody reads. The value of the work is largely intellectual, and as a result, it's perfectly possible to have a long career at a DOE lab, churning out papers and software, that lacks any lasting impact.

Tying money to the value of work can make accountability much more black and white. If you pay a team of engineers a million dollars a year to develop a new service that only increases revenue by a million dollars a year, that service is going to be scrutinized every time prioritization happens. Is there a way to increase its revenue through better features or better positioning? It'll be a product manager's job to go figure that out. If the answer comes back as "no," then that service might be put on a shelf and its engineering team reassigned to work on something that has a greater impact. Those engineers don't get to decide to keep working on a service with limited demonstrable value.

At the same time, managers are accountable for the wellbeing of their team and the teams underneath them. All employees fill out regular, semi-anonymized surveys on different aspects of job satisfaction, and the results of these surveys roll up all the way to the top of the company. If employees are disgruntled, their managers know it, and those managers' managers know it, and everyone up the chain is accountable for improving those scores. Sometimes that results in increased hiring so engineers don't feel overworked. Other times it means reorganizing people and teams to align them with the work they are good at performing. And if nothing works and a team's morale keeps declining, maybe it's because of the manager--and the manager gets replaced.

Pace and decision making

Because managers and leaders are accountable, I've also found them to be much more empowered to just do what they feel is the right thing to do. Whereas no big decision in the DOE Labs can be made without reviews, panels, strategic offsites, more reviews, and presentations to headquarters--all of which could add months or years to a project--direction in industry can change on a dime because all it takes is one executive to sign off and accept full responsibility for the consequences of their decision. Getting the approval to staff up and pursue a good idea often requires only winning over one or two key people, not an army of feds in Germantown or an anonymous review panel that isn't conversant in what you're proposing.

And again, sometimes money makes decisions much easier to make. For example, a few people at ISC'24 asked me why we didn't re-do the Top500 run for Eagle to beat Aurora since the SC'23 scoring was so close. The decision process can be as simple as this:

  • According to the Top500 list's raw data, Eagle achieved 561,200 TFlop/s using an Nmax of 11,796,480.
  • Knowing that HPL's walltime is (flop count / Rmax) and HPL's flop count is (2/3 * Nmax^3), you can calculate that the HPL walltime for this run was about 1,950 seconds, or 0.54 hours.
  • The public list price for an Eagle node (ND96isr H100 v5) is something like $60 an hour.
  • The HPL run used 1,800 such nodes.

Given the above, during the roughly half hour it would take to run HPL, those same nodes could be running a production workload that would have generated about $58,000 in revenue. That is, the opportunity cost of re-running HPL is at least $58,000 in lost revenue. In reality, it would take time to boot up and configure the cluster of virtual machines and do a few scale-up runs, which would tie up the nodes for a couple of hours, making this opportunity cost closer to a couple hundred thousand dollars.
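For anyone who wants to check my arithmetic, below is a minimal back-of-the-envelope sketch in Python. The Rmax, Nmax, node count, and hourly list price are the figures quoted above; the couple hours of setup and scale-up overhead is my rough assumption, not a measured number.

```python
# Back-of-the-envelope opportunity cost of re-running HPL on Eagle.
# Rmax, Nmax, node count, and list price are the public figures quoted above;
# the setup/scale-up overhead is a rough assumption on my part.

rmax_tflops = 561_200             # Rmax from the Top500 raw data, in TFlop/s
nmax = 11_796_480                 # Nmax from the Top500 raw data
nodes = 1_800                     # ND96isr H100 v5 nodes used for the run
price_per_node_hour = 60.0        # approximate public list price, $/node-hour

hpl_flops = (2.0 / 3.0) * nmax**3            # dominant term of HPL's flop count
walltime_s = hpl_flops / (rmax_tflops * 1e12)
walltime_hr = walltime_s / 3600.0

run_cost = nodes * price_per_node_hour * walltime_hr
print(f"HPL walltime: {walltime_s:,.0f} s ({walltime_hr:.2f} hours)")
print(f"Opportunity cost of the run itself: ${run_cost:,.0f}")

# Booting, configuring, and doing a few scale-up runs ties up the nodes longer.
overhead_hr = 2.0                 # assumed overhead, not a measured figure
total_cost = nodes * price_per_node_hour * (walltime_hr + overhead_hr)
print(f"With ~{overhead_hr:.0f} hours of overhead: ${total_cost:,.0f}")
```

Tweaking the assumed overhead or the list price moves the total around, but not enough to change the conclusion.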

Is getting a marginally higher Top500 score worth a couple hundred thousand dollars if your machine is already listed and had its day in the sun? I don't need an executive to answer that question. But in the public HPC space, who's to say what the opportunity cost is? If HPL wasn't running twice a year on Frontier, are the dozen or so lattice QCD jobs that would be running instead worth a couple hundred thousand dollars?

Relevance

I might be more vain than I thought when I worked for the government, because I really enjoy being able to talk about the work that I do with the general public now. When people ask, "What work do you do?" and I respond with, "Have you ever heard of Copilot or ChatGPT?" there is almost always a conversation that follows. People may not really understand how artificial intelligence and large language models work, but they've played with those technologies and have opinions and questions. Sometimes the conversation is about big-picture stuff like "will AI take over the world?" At other times it's specific like "what do you think about AI's effect on global climate change?" Because I am steeped in all aspects of AI in my day-to-day work, I can usually speak intelligently about any dimension of the AI industry when my neighbors ask.

Every blog post these days needs at least one AI-generated picture, so here is a picture generated by DALL-E that "captures the essence of explaining AI concepts to neighbors in a friendly, approachable setting." But more poignantly, my team directly supports the supercomputers that trained the model that generates these pictures.

This was a much bigger challenge when I worked in the public sector. When I told people that I worked at Lawrence Berkeley National Lab, nobody knew what I was talking about half of the time. The other half of the time, people would think I worked on nuclear weapons because Lawrence Livermore National Lab has a confusingly similar name and geography. And if the conversation ever got as far as what people did on the supercomputers I supported, it would rapidly tail off once all parties (including me) realized that cosmological hydrodynamics and quantum Monte Carlo don't really make for great conversation since they don't touch people's everyday lives.

This isn't to say that the work done at the Labs isn't important. But the general public doesn't understand it, and to a large degree, doesn't really care about it. I realize that being able to impress your neighbors with what you do isn't at the top of the list of most people's job requirements, but I get a lot of satisfaction out of it.

Technically: security

HPC doesn't really worry about cybersecurity. Every HPC center has a security group and does scans and threat modeling, but at the end of the day, the security practices on all the largest supercomputers in the public sector are roughly the same as they were twenty years ago. Users ssh into a login node, and once you're inside, you have access to everything. You can see everyone else who's logged in, you can see everyone who chmodded their home directory to be +777, and the only thing separating you from everyone else is the Linux kernel. Passwordless ssh is everywhere, and often times, passwordless ssh for the root user is everywhere.

This does not fly with paying commercial HPC and AI customers in the cloud who use supercomputing to develop better products faster than their competitors. For example, both Arm and AMD have publicly stated that they perform a lot of their silicon design simulations using HPC in the cloud. What would happen if both AMD and Arm used the same cluster and one accidentally made their project directory world-readable? Should domain scientists' understanding of how POSIX file permissions work really be the last line of defense against a next-generation CPU or GPU's specs being leaked to the competition?

I had to quickly learn about modern security practices when I started doing HPC in the commercial cloud out of necessity. I'm still nowhere close to being a security expert, but two years has been long enough for me to now cringe when I talk to my colleagues in the traditional HPC community about how they protect against threats. It's not really their fault that most of the HPC community hasn't adopted modern practices, because the tools and practices required to do it right aren't easy to set up, automate, and maintain from scratch.

For example, basic LDAP is a short path to allowing users to log into a cluster's nodes, but if those users also need to authenticate themselves to REST services that support an HPC workflow across multiple clusters, you have to start building a Rube Goldberg machine of software on top of LDAP. Similarly, sticking every user on their own overlay network is great to limit the blast radius of a compromised account. However, automating the configuration of VXLAN tunnel endpoints as nodes get allocated and deallocated to jobs requires a lot of fancy orchestration that is either very complicated to build and maintain yourself or very expensive to buy and maintain. As a result, HPC just accepts the risk. Cloud has figured all this out though, and the price of providing this security infrastructure is included in the cost of cloud-based supercomputers.

But the pay is good, right?

Like I said when I left the public sector, my base salary is comparable to what I got at the lab. It's actually gotten less competitive since then because all salaries were frozen the year I was first eligible for a raise. So, after considering the effects of inflation, my paycheck is a little lower than what it was in the government two years ago.

What's different is the bonus structure which simply does not exist in the government or university world. For those who aren't familiar with how bonuses work in the tech industry, I'll share how it works for me:

  • In the first year, I was awarded two signing bonuses: one in cash, one in stock. Half of the cash bonus was paid out up-front, and the other half was paid out after I had been there a year. The stock grant could not be touched during the first year because it had a one-year "cliff."
  • On my one-year anniversary, I got the second half of my cash signing bonus, and my signing stock grant began "vesting."
  • After a year, I was also eligible for an annual performance-based raise, cash bonus, and stock bonus.
    • Because of the economy, my annual raise was zero.
    • The cash bonus was paid out in a lump sum, similar to my cash signing bonus.
    • The stock bonus was awarded all at once but follows a multi-year "vesting schedule" which means I am only actually given fractions of the total award over time. However, these bonuses don't have a "cliff" and begin vesting immediately.
  • Every year thereafter, I am eligible for an annual raise, cash bonus, and another stock bonus.

The way stock bonuses work was the least intuitive part to me, but since it's such a significant part of total compensation, it's worth spelling out for anyone who's considering an offer that includes this:

  • Stock bonuses are defined in terms of dollar values. For example, let's say I got a signing stock bonus of $1000 with a one-year cliff that vests quarterly (every three months) over five years.
  • On the day that stock bonus is awarded, my employer converts that $1000 value into company stock based on the market value that day. If stocks are $50 per share, I am awarded 20 shares. My employer hangs on to those shares on my behalf, so I can't actually do anything with them yet.
  • Since I have a five-year vesting schedule and the award vests quarterly, my shares will vest twenty times (four quarters per year over five years). Coincidentally, since I have 20 shares, I will get one share per quarter.
  • However, because I have a one-year cliff, I get all four quarters of my first year at my one-year anniversary. So, four shares should appear in my brokerage account on my one-year anniversary. Once a share is in my brokerage account, I can do whatever I want with it (like sell it immediately!)
  • Every quarter thereafter, one more share vests and appears in my brokerage account.
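
To make those mechanics concrete, here is a minimal sketch of that hypothetical $1,000 grant (awarded at $50 per share, one-year cliff, quarterly vesting over five years). The numbers are the same illustrative ones from the list above, not an actual award.

```python
# Hypothetical stock grant from the example above: $1,000 awarded at $50/share,
# one-year cliff, vesting quarterly over five years. Illustrative numbers only.

award_value = 1000.0
share_price_at_award = 50.0
years = 5
vests_per_year = 4
cliff_quarters = 4                 # nothing is delivered until the one-year mark

total_shares = award_value / share_price_at_award           # 20 shares
shares_per_vest = total_shares / (years * vests_per_year)   # 1 share per quarter

delivered = 0.0
for quarter in range(1, years * vests_per_year + 1):
    if quarter < cliff_quarters:
        continue                   # inside the cliff: nothing hits the brokerage account
    if quarter == cliff_quarters:
        delivered += shares_per_vest * cliff_quarters        # first year's worth lands at once
    else:
        delivered += shares_per_vest
    print(f"After quarter {quarter:2d}: {delivered:4.1f} of {total_shares:.0f} shares vested")
```

The same arithmetic applies to a grant of any size; only the dollar values scale.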

Assuming I get a stock bonus as part of my overall annual bonus, this means that stock awards pile up and vest every year. This is tricky for two reasons:

  1. Although my initial stock award was $1,000 in the above example, that amount was converted to stock the day it was awarded. Assuming I am doing a good job and increasing the value of my employer's stock, the value of those shares will increase while they're vesting. This means by the time the first four shares of my award vested at my one-year anniversary, they were worth more than the $50 per share they represented when they were awarded. More broadly, the value of a stock bonus tends to increase over time, making the true cash value of a $1000 stock bonus worth a lot more than $1000 by the time it completely vests.
  2. Every year's stock award comes with its own multi-year vesting period, which means at any given time, I have multiple years' bonuses all vesting at once. This also means that at any given time, I have a bunch of unvested stock that's worth a lot of money that I can't yet spend. If I quit my job though, all these unvested shares vanish into thin air.

These two factors make up the golden handcuffs that people often talk about in industry. The longer I stick around, the more unvested stock I have hanging over my head, and it usually becomes increasingly valuable (yet inaccessible!) over time. The reality is that if you've put in a few years in Big Tech, you might have years' worth of base salary tied up in unvested stock that all goes away if you quit.
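
As a toy illustration of how those overlapping grants pile up, here is a sketch that assumes a new stock award of the same dollar value every year, each vesting quarterly over five years with no cliff, and a flat stock price. All of these are assumptions for illustration only; real award sizes, schedules, and prices vary.

```python
# Toy model of the "golden handcuffs": a new stock award of the same dollar
# value every year, each vesting quarterly over five years (no cliff on the
# refresher grants), with a flat stock price. All numbers are illustrative.

annual_award = 1000.0              # assumed dollar value of each year's stock bonus
vest_years = 5
vests_per_grant = vest_years * 4   # 20 quarterly vesting events per grant

def unvested_after(years_employed):
    """Dollar value still unvested after a given number of years on the job."""
    unvested = 0.0
    for grant_year in range(years_employed):
        quarters_elapsed = (years_employed - grant_year) * 4
        vested_fraction = min(quarters_elapsed, vests_per_grant) / vests_per_grant
        unvested += annual_award * (1.0 - vested_fraction)
    return unvested

for year in range(1, 9):
    print(f"After year {year}: ${unvested_after(year):,.0f} still unvested")
```

Even in this flat-price toy model, the unvested balance settles at about twice a single year's stock award, and in practice a rising share price only makes that pile bigger (and harder to walk away from).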

The end result is that although base salary is competitive with what you can make in a government HPC facility, there's a significant cash bonus that falls out of the sky once a year, and an appreciable amount of stock appears in your brokerage account every couple of months which you can turn around and sell for more cash. Depending on seniority and performance, these bonuses can add up to a significant fraction of base salary.

Finally, the above is consistent with what I've seen firsthand at two companies in Big Tech but may be different based on the role and the company. For example, field-facing roles in sales and support may be completely different beasts, and private companies and startups load things differently due to the value of equity.

How's work-life balance?

It hasn't been much different from working in the government. Just like at a lab or university, some people work around the clock while others stick pretty close to the standard workday. There may be a higher concentration of Type A personalities in Big Tech who put in a lot of time, and this may pressure others to keep up and also put in long hours, but there's rarely been an occasion where a manager expects staff to routinely work nights and weekends. Doing so would probably result in negative employee satisfaction scores which would roll up and eventually have to be addressed.

Of course, there are cases where working odd hours is required to get the job done. Because I work for a global organization, I've had to get up early to meet with teams or customers in Europe. I've also had to stay up late to meet with teams or customers in Asia. And on some particularly annoying days, I've had to do both and ended up working from 5am to 8pm. But I never felt that I had no choice in the matter; I pulled these hours because it was the right thing to do at the time. And I don't see this as being too different from the days when I'd work sixteen-hour days, seven days a week, for the entire month of March to put together a paper for SC. Or the days when I'm at SC and am preparing talks, meeting with partners, and otherwise hustling from 8am to 1am for five days straight.

One big difference is the fact that my employer offers discretionary time off ("unlimited vacation"). This is a divisive topic in industry, but I see it as a positive for work-life balance because it underscores an emphasis on outcomes rather than output. I can take an afternoon off or enjoy a long weekend with little fanfare, because productivity is infinitely more valuable than presence. As long as I do what needs to get done, I don't have to worry about timing vacations to ensure I am banking enough time off in between.

Do you miss anything about working at the lab?

Absolutely. There are a bunch of appealing things about working in a DOE lab (or an NSF center) that I've had to give up since coming to industry.

Freedom to have an off day

Right before I finished graduate school, I had a conversation about life at the Labs with Professor Edmund Webb, who had just become a professor after a decade-long career at Sandia National Laboratories. He said that, after becoming a professor, he lost the ability to just close the door to his office and focus on something he needed to get done for a day. I didn't really grasp what this meant at the time, but I totally get it now. The DOE might be one of the few places where you can take a day--maybe even a week--and just close your door to everything else that's going on around you to focus on what you want to do. In the case of professorship, there are always students requiring attention; in industry, it's customers and partners.

I think this difference results from two factors: very few things in public HPC are very urgent, and the Labs are stocked full of independent, free-thinking Ph.D. types. There's rarely a penalty if something is late by a day (or two years! Remember when Aurora was called "A21"?), but there can be a huge payoff in prestige if one of your wacky side projects turns out to be something useful (this is how Shifter came to be). By comparison, working at a giant corporation often means there are a bunch of interdependencies on others, and the odds of any one of your 200,000 coworkers sending you a Teams message asking for help are just a lot higher than they are at a 70-person supercomputer center. The culture is much more team-oriented, and being a one-person army isn't incentivized as much.

Travel

Part of my job within the DOE complex was to go around the country (and the world) and be smart, and secondarily, show that my lab hired smart people and did smart things. If headquarters wanted to make sure that the supercomputer they were about to spend $500M on was technically sound, I'd sometimes get invited to go sit in on a review and try to poke holes in the design. If a European HPC project wanted to ensure they were including a global perspective on some dimension of future HPC strategy, I'd sometimes get invited to give a talk about how I view the world of data. And if these reviews and workshops happened to be in awesome places around the world--oh well!

I feel a lot more self-conscious about requesting approval to attend these sorts of boondoggles as an engineer now because the first question I have to answer is, "Is this trip business critical?" If there's a direct line of sight between me giving a talk at a workshop and a specific business strategy, I can say "yes" with a straight face. But it's hard to accept an invitation to fly off to Switzerland to give a 30-minute talk when I know that my attendance isn't going to move any needles.

Openness

Just like it's no longer my job to travel the world and just be smart, it's not my job to write about the work that I (or my team) do. I miss writing papers and giving technical talks, because the process of putting together coherent thoughts around a technical topic is one of the ways I really come to understand it. There are also a lot of really wild ideas that we're pursuing at scale that the scientific computing community has never considered, but two factors work against being open about these things:

  1. In terms of prioritization, my time is always better spent solving problems, or at least documenting them for internal audiences who fully grasp the context around them, than writing about them in a way that the rest of the world can understand. It's hard to justify the time to write a retrospective or a study unless there's a strategic advantage behind it.
  2. The customers I support typically do not want the world knowing what they're doing. There is an AI arms race happening right now, and having the technical sophistication to utilize massive-scale supercomputers effectively is a competitive advantage. In the traditional HPC community, only national security is comparable to the level of secrecy involved, and none of the intelligence agencies are openly contributing to the state of the art in HPC either.

So instead of making conference papers and presentations, these days I make more internal papers and presentations. I'm trying to figure out ways to publish interesting technical anecdotes on my website (for example, I maintain a collection of LLM training requirements as I am exposed to them), but it's a lot of extra work to disentangle the proprietary bits from my work notes to do this.

Related to openness is the freedom to speak my mind in public forums. I had the most latitude to blast my opinions out on to the Internet when I was still early in my career and nobody listened to me, but I've had to get progressively less opinionated over the years. At this point, I abide by a written corporate social media policy which, although very reasonable in what it requests (don't slander competitors, always be transparent about who employs you), stops me from commenting on news as much as I used to since so many tech companies qualify as competitors in some dimension.

Would you still have left knowing what you know now?

Yes. I still stand by just about everything I wrote in my original blog post; at the time, I just needed a change, and I found the change that I was looking for. Without immersing myself in the world of cloud, I would have never learned about virtualization, physical infrastructure, or modern security to the degree that I have. And the fact that I stumbled into what has become one of the leading companies in AI at the dawn of generative AI was an extremely lucky coincidence.

However, this doesn't mean that I now turn my nose up at doing HPC in the public sector. There are many unique aspects to working at a DOE lab or NSF center that have no parallel in industry. I also believe that I am the sum of the experiences that led me to where I work today, and I would never have gotten the opportunity to write this retrospective if I didn't learn everything I did working in the DOE and NSF.

And perhaps above all else, there is something attractive about public service that I haven't been able to shake in the last two years. I still dial in to ASCAC meetings to see what the world of public HPC and scientific computing is thinking and doing, and I still try to contribute time and attention to working groups like NITRD's MAGIC. I write lengthy blog posts in a futile attempt to caution the leaders in public-sector HPC against dismissing AI workloads in commercial clouds as something other than HPC. And every time I learn some slick way we deal with hard technological or sociological issues at work, I still file it away in the "good ideas for when I go back" folder in the back of my mind.

I don't have any near-term plans on going anywhere though. Like I said before, there are still plenty of days when dialing into work is like going to the playground. Amazing things are happening in the world of HPC infrastructure at scale now that the world is pouring money into AI, and the rate of scale and innovation is no longer constrained to 40 MW and $500M per supercomputer like it was when public-sector HPC was setting the bar for leadership. There is a whole new exciting world of challenges and possibilities when you start thinking about building supercomputers that consume hundreds of megawatts of power.

Like I wrote two years ago, I don't think any government has the appetite to build data centers for scientific computing that are larger than today's 50 MW exascale facilities. This means that government HPC centers will never have a reason to explore the exciting world of 100+ MW supercomputers or work on the wacky problems that arise at that scale. Consequently, the biggest and most challenging problems in HPC--at least in terms of infrastructure and systems design at scale--are becoming unique to industry, not public HPC.

I got into HPC because I enjoy working on large, complex systems. Considering where I am at this stage of my life, what I want to accomplish in the rest of my career, and what gets me out of bed in the morning, I feel like I wound up in the right place for now. I have no regrets.

A critique of the call for public AI

As I spend more time in the AI infrastructure business, I've been thinking more and more about the government's role in AI. There's no shortage of opinions and position papers on the topic, and sadly, most of them are written from the perspective of the government rather than the AI industry. As a result, they are often full of misunderstandings, misleading statements, or ideas about the world that are six months out of date—the blink of an eye to the government, but an eternity in the AI industry.

The latest to cross my desk is a report titled The National Security Case for Public AI, which was released on September 27, 2024 by the Vanderbilt Policy Accelerator. At a high level, the authors try to make the case that the U.S. government should build its own vertically integrated AI stack (from silicon to data centers to applications) to compete with ("complement") efforts in private industry. They also suggest regulating the AI industry in a way analogous to how public utilities are regulated. It's full of hyperbole, no doubt to stimulate thought and debate, and is a pretty easy read with a helpful bunch of references I'd never read before.

But as often happens when I read these sorts of things, I started angrily annotating the PDF as I read. It's full of gaps, logical fallacies, and dishonest representation sprinkled throughout, and by the time I realized that it probably wasn't worth my time to peel apart such a flawed position paper, I was already committed to marking up the whole thing. So as not to feel like I completely wasted an evening on this, I've decided to post my annotations here.

The biggest hurdle is workforce

Although it isn't the lynchpin of the argument, the most egregious problem with this paper is its frequent call to hire more AI expertise to develop public AI capabilities. The tone is completely ignorant of what it's really like to work in industry versus government in a high-tech space, and the paper repeatedly makes recommendations that imply that hiring and retaining staff who are leading experts in all aspects of the AI stack is just a matter of cutting a bigger check.

As I have been whining about for years, this is a misguided belief, and a vision built on this premise is a house of cards. Until the people who write these sorts of papers understand why it's hard to attract and retain people whose skills are in demand in both the public and private sectors, these grand visions of going head-to-head with the AI industry will make incremental progress at best. If you are reading this and are ever tempted to write a position paper that includes hiring more HPC or AI experts into the government, please talk to someone who's worked in both worlds first!

The obliviousness to partnership

The position of the authors is also a very stark, black-and-white view where the government is unambiguously good and private industry is just out to stack cash, fleece the government at every opportunity, and let the world deteriorate around them. I kept finding myself making the following points:

  1. The defense industry and its subcontracting relationships are not the only way the government can partner with industry. The authors completely ignore the fact that the NSF and DOE each have their own successful models for funding national-scale infrastructure for the public good, and those contractors (UT-Battelle, which runs Oak Ridge; LANS, which runs Los Alamos) or their subcontractors who build specific computing solutions (IBM, HPE, etc.) have decades of history partnering with the technology industry.
  2. The existing "AI stack" in private industry is not 100% proprietary. Significant pieces of it--likely the majority of it--are open-source, openly developed, and managed through a neutral foundation of supporters. The PyTorch foundation is a prime example of this; it is the foundation on which much of the training at scale has been done, and there's nothing stopping anyone (including the government) from participating in its development.

These two points paint a picture of authors trying to apply what they know (likely work in the defense sector, building physical widgets) to something they only peripherally understand (developing hardware and software technologies, and deploying and maintaining them at scale).

My detailed notes

That all said, I am not an expert in any of this either; I am neither an AI expert nor a policy wonk, and I don't really understand how the government (especially the parts outside of DOE and NSF!) works. What follows is just a loose collection of quotes from the paper and my personal thoughts in response. All the usual disclaimers apply as well: these views are mine alone, they do not reflect those of any past or present employers, and so on.

Let's dive in.

Altman frames the choice as between two futures: “Will it be one in which the United States and allied nations advance a global AI that spreads the technology's benefits and opens access to it, or an authoritarian one, in which nations or movements that don't share our values use AI to cement and expand their power?”

...

By public AI, we mean two things: publicly-provided, -owned and -operated layers in the AI tech stack, such as cloud infrastructure, data, and model development; and public utility-style regulation of the private AI industry that fosters competition and prevents abuses of power.

What would stop AI innovation from moving to countries that simply do not impose public utility-style regulation and that aren't the hyperbolic “authoritarian” governments described above? Honest question.

If space, power, cooling, and money are the only things stopping private AI industry, I can think of several places in the world (that aren't the USA) that could make attractive landing spots. Given the political extremism and volatility in the US, one could reasonably argue that there are better places than the US where such innovation could happen.

Regulation won't work nearly as well when the workforce is remote and the regulations are not necessarily aligned with global societal norms.

Investing in people with technological expertise has the potential to create a virtuous cycle: a more affordable mission-driven staff would not only build public-interested AI systems for a wide variety of public uses but could also evaluate private sector AI services more accurately and reduce the likelihood that government contracts will suffer from cost and quality problems.

This is one of the most nonsensical things I've read on this topic, and it reads like the perspective of someone who's never worked on both sides of technology. Likening leadership in AI innovation to the rollout of a generic web service like healthcare.gov reflects a complete lack of understanding of how AI, and the specialized expertise it requires, differs from general IT functions.

There is no such thing as “more affordable mission-driven staff” when it comes to AI. Do you think people working at Meta, OpenAI, and other leading AI labs are “affordable” by the typical American taxpayer's standards?

I believe in the mission more than most people working in Big Tech, but this claim is patently absurd.

Here, too, it seems that the DOE will rely on some private sector AI infrastructure and partnerships (including cloud, data centers, and likely software designers).

This implies that private sector partnership is anathema to the notion of public AI. I have bad news—the public sector cannot stand on its own and create its own shadow version of what the AI industry has collectively done.

Part of this is because the AI industry itself has benefited tremendously from public-private partnership. The authors here are ignorant of open-source software and the effective public-private partnership in which industry, governments, and universities all contribute to a common foundation.

More robust federal investment in the infrastructure and human capacity for public AI is needed.

The government cannot “invest” its human capacity problems away. This is such a simplistic view.

The reason is obvious: the railroad would only serve its own vertically-integrated coal company or would charge prohibitive prices to competitors, thereby pushing them out of business. A competitive coal sector required preventing vertical integration with railroads. In the AI context, structural separations could be placed between chip makers, cloud providers, and model developers...

Is this a real threat? History has shown that being vertically integrated is often a very bad thing; compare Intel, which is vertically integrated in chip design and fab, and NVIDIA, which is not.

The coal and railroad analogy is imperfect because railroads and coal are both independently useful to many market segments. By comparison, a data center is not useful unless there are GPUs in it, and a GPU isn't useful unless there is a model to train on it. Again, this betrays the fact that the authors do not actually know how the supply chain underneath AI models actually works.

Nondiscrimination rules, or neutrality mandates, require that infrastructural providers serve all comers neutrally without favoritism or price discrimination.

So GSA prices for everyone?

Nondiscrimination rules ensure a level competitive playing field for entrepreneurs and non-profit, academic, or public sector customers to access critical resources. In the AI context, these rules would apply to...

Doesn't antitrust cover this, since the authors claim that the whole AI industry is monopolistic?

Without competition or regulation, an AI oligopoly is likely to box out innovative start-ups, lose their innovative edge, offer worse quality of service to government clients, and raise costs for the American taxpayer. Regulating market structure to prevent the abuses of monopoly...

At what level of the stack is “AI oligopoly” being defined here? Or is it all of them?

What in the world is an “innovative start-up” when it comes to building multi-billion-dollar data centers?

What is an “innovative start-up” in the context of chipmakers who all rely on TSMC fab capacity to make their chips, and who can design chips from locations around the world?

The problem with the public utility analogy is that public utilities are geographically anchored to their consumers in the US. By comparison, the AI supply chain faces global competition. American AI companies will not “lose their innovative edge” because they're getting fat off of government contracts; they'll lose it because other countries are playing on the same field and can move faster.

First and foremost, public AI would bolster innovation. As Mariana Mazzucato has shown, the federal government has been an engine of innovation – and particularly technological innovation – throughout its history. Research and development programs, national missions and industrial policies, and other publicly-resourced and often publicly-run programs have led to considerable breakthroughs. We should...

I'd love to hear the long-form version of the argument that the AI industry would move faster if the government was involved.

This is such a broad, nonspecific argument that washes over all the nuanced differences between AI as a societally revolutionary technology and other revolutionary technologies that got off the ground with government support.

It is textbook economics that firms facing little competition and no regulation to discipline them will both abuse their power and fail to innovate.

There is an AI land grab happening right now between a combination of AI startups and large technology firms. How can the authors say that there is “little competition” in one of the most fiercely competitive technology races that private industry has ever seen?

If there is “little competition,” why are so many people in the AI business working 60+ hour weeks?

Of course, I can envision a few cases where the government might feel like it's getting a bad deal from AI companies. For example, imagine an opportunity to work with the government was presented to an AI company, but it was written with the presumptions of this paper. Ridiculous claims implying that finding talent is just a matter of money, and that the government expects "affordability," would signal to any private bidder that this is a customer with unreasonable and misguided expectations. Why would a company, which is itself struggling to retain talent and build infrastructure faster than its competitors, divert its constrained resources to work with a customer who is that far out of touch?

Perhaps the issue is that there is too much competition, and other customers are willing to pay higher prices and come with more reasonable expectations than the government does. That isn't a sign of "little competition and no regulation," it's a sign that the government needs to catch up.

...we should expect these firms to continue pursuing anticompetitive actions that undermine innovation as they move into the AI space. Robust, independent public AI capacity also allows for more bespoke...

This is quite disingenuous, because it implies that these companies' existing businesses and the markets in which they compete are completely transferable to the AI industry.

The AI industry does not even have a clear path to net profitability yet, so how can the authors claim that monopolies or oligopolies will form unless the government steps in? There are plenty of arguments to make for the government to regulate AI, but this isn't one of them.

Consider Elon Musk's control of Starlink for example. Whatever one thinks of Musk's political views or the war in Ukraine, should one person – or one firm – be able to undermine U.S. government policy with respect to a major conflict simply because they want to?

I would like to understand how this was undermining U.S. policy since Elon Musk isn't an agent of the government. Does this statement argue that the federal government should be operating or regulating its own Starlink? If so, why isn't it?

...quite real prospect of a contractor withholding critical products and services if the firm's leadership has a policy or political difference with the U.S. government.

Citation needed. When has a major tech firm ever done this? This is a genuine question; I may be naive since I've only worked in the high-end supercomputing space of the government.

That said, companies last a lot longer than presidencies. The damage of withholding critical products and services to spite one president or congressional session would endure far beyond, and I can't picture a successful company ever doing this.

dependence by government or critical infrastructure entities (such as utilities or airlines) on sole source providers for foundational operations services creates national security risk.

But these weren't cases where there was a sole-source provider.

  • You can't dual-source email service, and there is no top-down mandate that all government agencies use one email service over another. For every hack of an Exchange account, there is a hack of a Gmail account.
  • Similarly, not all airlines were affected by Crowdstrike, because not all airlines chose to use it. In fact, the access that Crowdstrike had to cause the failure that it caused was a result of Microsoft opening up kernel access so other companies could compete with Microsoft's own security software. If the argument is that not everyone should use Windows, well, why hasn't the government addressed this by regulating the operating system business or mandating an alternative? Honest question.

And for what it's worth, I don't use Windows at work.

Public AI stacks create an independent option for government, one free from conflicts of interest or the whims of powerful private citizens. It ensures that national security goals cannot be dictated or determined by private actors.

To claim that anything government-made will be “free from conflicts of interest or the whims of powerful private citizens” is patently absurd given the country's campaign finance regulations and the tendency for some people in power (public or private) to abuse that power for personal profit.

To claim that never happens, which this statement does, undercuts a significant chunk of the whole argument here. It indicates the authors make this argument from an idealized world, not the one in which we live.

When government does need to leverage the private sector, a robust, independent public AI capacity will improve its ability to effectively partner with industry to advance the national interest.

How? With a magic wand? I don't understand this claim.

In short, these regulations would help keep the AI ecosystem healthy for the situations in which contracting out is necessary.

This is a good place to point out that much of this report is a giant slap in the face to the DOE and NSF supercomputing programs.

These organizations rely heavily on contractors and subcontractors to deliver the closest thing to a national AI infrastructure today. To suggest that they should be absorbed into the federal government—and be even more constrained than they currently are in the choices they can make, the costs they must incur for compliance, and the excess oversight and process that erodes their agility—is completely out of touch with reality.

DOE and NSF have shown that the government does not need to be vertically integrated and own all its own chipmaking, system integration, data centers, and applications to advance science for the public good. Perhaps more than any other single sentence in this report, the tone of this statement makes me question whether it was a good use of my time to even respond to this report, because it is in no way grounded with the decades of success that the government has already had in maintaining technology and infrastructure, largely through public-private partnership, for the national interest.

The tech platform example is instructive: countless hours and billions of dollars have been spent optimizing what videos and advertisements people should see. Far less effort in our age of technological progress has gone toward improving veterans benefits or social welfare programs – because that's not where the money is.

Quite hyperbolic, but fine.

However, this is an application of AI which is at the very tip of the vertically integrated public AI stack that this paper is calling for. Billions of dollars invested in putting eyeballs on ads is not the same as hundreds of billions of dollars invested in building out nation-wide infrastructure to support these applications.

And as I'll detail below, improving the lives of people is where the money is for companies that can afford to compete at the top end of the AI game. Business is good when society is happy and productive.

Cost-overruns and delivery delays are standard. Quality of the output is sometimes a problem.

This is a non sequitur. Is this because of contracting, or is it incidental to contracting?

I don't understand how bringing these capabilities in-house will somehow make the process on-time and under-budget. What are examples of government functions which are handled in-house that are successful and efficient? Jury duty and going to the DMV?

Even if the system does not replicate all of these pathologies, once national security needs are identified, contracting to private actors still takes a considerable amount of time compared to in-house development and delivery of solutions.

Citation needed. This is not true.

One can imagine researchers and developers using public AI resources to develop and deploy AI solutions to address thorny problems of poverty and food insecurity, climate change, and disease – and without the imperative to commercialize those solutions or achieve a return on the investment of time and money.

Again, these are applications of AI. The majority of the investment required to make a vertically integrated public AI stack is not in developing AI applications to solve public problems! The majority of the investment is in duplicating the massive infrastructure build-out, operations, and development of models which can be used by applications.

And to suggest that these “thorny problems” are not of interest to the corporations who can build the needed AI infrastructure is near-sighted. Food insecurity, climate change, and disease are good for nobody. If people are starving and dying, profits are down. It is true that some companies do not see societal challenges as aligned with shareholder value, but those companies are playing the short game and are unlikely to have the vision and capital required to build AI infrastructure in the first place.

If private companies understand that the government has the ability to develop national and homeland security solutions in-house, they would have to be more competitive in their pricing and more sensitive to delivering on time and on budget.

So the claim here is that private companies are late and over budget because the government lets them?

Show me a case in the history of leadership supercomputing where this was true. Stuff is late because measured bets are made and developing first-of-a-kind technology to solve groundbreaking problems is fundamentally hard and risky.

I feel like the authors want it both ways; they either want to develop in-house alternatives to commodities available on the open market so they aren't fleeced by nefarious subcontractors, or they want to compete directly with a fast-paced global AI industry developing new technologies at unprecedented cost and scale. Which is it? One comes with competitive pricing, and the other comes with risk-adjusted pricing.

...reliance on outsourcing to contractors and consultants saps the government of knowledge, talented people, and focus on public problems.

Explain how DOE ASCR and NNSA/SC programs work given this statement.

You cannot apply generic findings from the defense sector and claim they apply to AI when a much more realistic analog (national supercomputing efforts in DOE, NSF, and other agencies) already exists within the government.

Moreover, having serious in house AI expertise and capacity will improve federal agencies' capacity to evaluate private contractors' AI proposals and products, and in turn, ensure that the government gets the products and services it needs at a fair price. This is one reason why experts have recommended building up federal tech capacity and personnel across agencies.

Again—wave a magic wand and it will be so.

You can't go down to the local Walmart and just buy AI expertise. You also cannot train up AI expertise and expect them to not consider other options when they realize their skills are in demand and met with higher value in the private sector. Until the government provides

  • Competitive total compensation
  • Clear, compelling mission
  • A workplace culture that is supportive of the highest performers

there will be a net egress of AI (and tech) talent from the government to private sector.

At best, big tech companies have a mixed record when it comes to public safety and welfare and democratic practices. The list of inadequate...

The same thing could be said about the government with equal weight and credibility. Any long-lived organization is going to have blemishes; to present this as if it's unique to Big Tech--and that Big Tech therefore cannot be trusted and government is the only alternative--is disingenuous.

Some frontier AI companies have already been sued for training their models using massive amounts of copyrighted materials without permission or payment.

A little off-topic, but this rings a little hollow given how much research for the public good gets locked behind the paywalls of journals and major publishers.

As I said above, so many of these points about how Big Tech isn't to be trusted can be turned right back around at the government. These problems are not unique to the private sector; they are a function of the way the country and society incentivize the behavior of people regardless of who employs them.

Of course, the federal government is not perfect either, especially in the national security context. The U.S. government has undertaken its fair share of undemocratic and rights-abusing actions from domestic surveillance of civil rights leaders to bulk data collection. For this reason alone, public AI efforts should be accompanied by strict privacy rules and independent oversight to ensure Americans' rights. But in creating a public option for AI, lawmakers have the opportunity to advance, rather than diminish, democratic values and establish layers of oversight and transparency, which importantly - and unlike private companies - are democratically accountable.

This started out good and then took a hard turn. Why is public AI the only one that should be accompanied by strict privacy rules? This statement reads like “we should have public AI so that we can regulate data privacy” when the real statement should be “we should regulate data privacy.”

Also, “democratically accountable” doesn't exactly mesh well with all the claims that private sector is only out to “maximize shareholder profits.” I think I get what the authors are trying to say here, but it's not as if there's no accountability. When a company does something that's bad for society, generally speaking, its share price reflects that. There are exceptions, of course.

Some firms also seem to treat AI safety as an afterthought, which has led to a number of alternative firms created by disaffected and worried former employees. Leading figures in the AI sector, including the heads of frontier AI companies, have warned that generative AI models pose catastrophic and potentially existential risks to humanity - including the risk of “large-scale destructions” within a few years. Some have even declared that the future generative AI models will be so powerful and risk-laden that they should not be in private hands.

Doesn't this statement undercut the idea that private industry cannot be trusted to care about AI safety? It didn't take a government to tell these people to create their own firms or to get venture capitalists to fund them. The problem is being addressed exclusively by private industry, and by the same evil Big Tech and VC firms that "treat AI safety as an afterthought."

Regarding “the risk of large-scale destruction,” that's not what the testimony says.

And citing a podcast, which has a financial incentive to drive listenership by making controversial claims, as an authority on the risks of AI severely undercuts the credibility of this paragraph. Shame on the authors.

then the U.S. government should be at the cutting edge of AI safety research. And to conduct cutting-edge AI safety research, the federal government needs its own AI capabilities on which public employees and outside independent non-profit researchers can build frontier models and conduct safety testing.

Unless the government prevents it, frontier models will be proprietary, so collaboration with private industry will be necessary to actually have a material impact on AI safety to prevent “large-scale destruction.”

Developing its own vertically integrated AI safety capabilities means necessarily going head-to-head with the largest AI companies in the world to develop models that can be deeply inspected. This is not tractable, full stop.

The focus should be on building trust with industry through partnership, not decrying private sector as nefarious and claiming you'll just do what they do but better, faster, and cheaper. Developing parallel capabilities to train frontier models just makes no sense here. It's really expensive, even by government standards.

Moreover, if existential risks or emergent properties do materialize, it would likely be better for the first people to encounter and engage with such models to be public sector AI developers and national security professionals, who can be held publicly accountable, rather than corporate engineers and executives with primarily economic incentives. There are three reasons for this. First, the government would most likely encounter and engage with any so-called AI “superintelligence” in a closed, classified facility rather than a more open corporate environment.

There is no “open corporate environment” in which a superintelligence will be developed. The authors clearly have no clue how leading-edge AI development is happening. The security of the facilities training frontier models is at least as comprehensive as that of classified data centers, because these companies are just as worried about their secrets being stolen by adversarial state actors as the government is. To suggest otherwise is ignorant.

Second, corporate incentives will likely push in the direction of release without sufficient testing or controls.

Again, the authors have clearly never talked to anyone who is credibly working on AGI. I don't know anyone in the industry who has this in their business plan when AGI or superintelligence is reached.

A system running a superintelligence will be phenomenally expensive to own and operate. To suggest that any person off the street would be given access to a superintelligence as soon as it is activated ignores the financial realities of how this will play out.

As much as it would make this narrative more convincing, the AI industry is not this carefree and reckless.

Third, and relatedly, the government has decades of experience (and is generally quite good at) maintaining security for extremely dangerous materials and sensitive information – from nuclear and cyber weapons to disease samples and state secrets. Indeed, this is one reason why these activities are either publicly run and public managed capabilities or are highly regulated.

Do you think corporations aren't good at keeping secrets too? Show me evidence that the government is better than industry at these things.

The specific cases mentioned here are places where the private sector is not allowed to compete. Of course the government will have a better track record, because nobody else is on the track.

...tech companies seek to maximize profits for their shareholders. But the profit motive does not necessarily overlap with the United States's national security interests or with the public interest.

They do not necessarily, but they often do. American tech companies require a stable and successful nation to “maximize profits for their shareholders” so acting in the national interest is often aligned with financial incentives.

Arguments about tech patriotism in the AI race with China are particularly questionable given that most of the big tech companies operate in China, are dependent on China for production of their hardware, or have consistently attempted to get into Chinese markets (and simply been thwarted by Chinese officials).

I agree with the sentiment, but I don't think this statement is as true as the authors wish it to be. As relations between the US and China get frostier, companies have a natural incentive to distance themselves.

It is not unrealistic to worry that such commercial ties to adversarial or diplomatically transactional countries could, if enough money or market share was at stake, undermine or at least complicate American firms' services to the U.S. government.

I don't disagree with this. There is a concerning amount of “free money” flowing into the US tech sector from nations with checkered human rights records, for example. This is geopolitical and far beyond the scope of AI though.

Rather, it is simply to say that profit seekers are likely to argue for policies that benefit their shareholders, not the American public, when these two sets of interests are at odds.

A reasonable person could argue that a profit seeker could also be president, a member of Congress, or any other elected or career member of the US government at any given time. This is a pretty weak basis for claiming that the government will do a better job than corporations or startups.

First, the sprint to build public AI would complement – not prevent, preclude, or crowd out – private AI infrastructure and investment. It would coexist with the private sector and address national security challenges and public goods.

There is zero threat that public AI would “crowd out” private AI. And “complement” is very hard to distinguish from “compete against, poorly” when it comes to paying smart people to do innovative things that have dual use.

It would also ensure a dedicated, resilient, and uncompromised AI capacity that would meaningfully strengthen national security and advance public AI capacity.

Resilient? How will that work when existing government HPC resources are completely unresilient? I would say that the government's ability to deliver resilient, large-scale infrastructure for HPC is far behind the capabilities of commercial AI supercomputers. I would love to see a Top 10 supercomputer at a government lab train a trillion-parameter model to convergence (as opposed to training it for just a few steps and writing a paper about it!). It would be an eye-opening experience for the government.

What does “uncompromised AI capacity” even mean?

To put a fine point on this: what happens to this infrastructure when Congress can't pass a budget? When this happened during my time in government, I was fortunate to be a contractor and have my employer carry my salary until the politicians got their act together. Do you know how much money is lost when a data center full of GPUs goes idle for days or weeks in the private sector? Industry, and the shareholders holding it accountable, would not stand for that level of dysfunction.

the U.S. Government has historically been a transformational innovator and enabler of public-interested technological innovation where there is an urgent and compelling national interest. Finally, to the extent that building public AI would require transforming government - by hiring many new people with technological experience and expertise and increasing state capacity for public activities - this is a feature, not a bug. For too long, the government's capacity to act, and especially to act on technology, has been underdeveloped, slow, and outsourced.

The U.S. Government has historically been a transformational innovator when there is no commercial interest in doing something. Going to the moon is not profitable. Nuclear weapons are not profitable (because they're so highly regulated). AI is profitable and transformational because it is a feature of products that are already profitable. To liken the government's role in AI to the government's role in the moon landing is a joke.

As far as "hiring many new people with technological experience and expertise," should someone (Congress?) just wave their magic wand and make working in government at least as desirable as working in private industry for AI research?

There are so many things wrong with this.

  • The government is slow to move because it works by consensus. Do you think AI innovation would happen if it moved at the pace of the slowest thinker?
  • What would motivate a smart and ambitious AI practitioner to work in a slow-moving environment, mired in bureaucracy, where the penalty for underperforming is a lifelong salary with no critical responsibilities? This is a demoralizing environment to work in when it happens, and the government offers little recourse when one bad employee poisons the well.
  • Pay is an obvious challenge. How can the government justify its highest-paid employees (who would have to be paid more than the US president to be competitive with industry) working on nebulous AI initiatives, dictated in part by clueless bureaucrats, that are in direct competition with a focused and driven private sector?

Like I said earlier - you can't just go to Walmart and buy AI expertise. The authors completely fail to acknowledge that and speak as if they have a magic wand.

Our current, largely unregulated ecosystem of one GPU manufacturer, three Big Tech cloud providers, and a handful of AI labs at or affiliated with Big Tech companies will not provide the AI that the United States needs to safeguard national security and serve the public.

This seems intentionally hyperbolic.

  • Don't tell AMD investors that there's only one GPU provider. Their quarterly financials don't seem to reflect that.
  • Even if competition in the GPU market were more evenly matched, what will you do about TSMC? This isn't a one-dimensional issue.
  • Don't tell Meta AI that they are affiliated with a cloud provider. Or Anthropic. In fact, OpenAI and Google are the only two AI shops that fit this categorization, and OpenAI is already branching out.

FASST will be DOE's opportunity to adapt, align, or...


These are some personal thoughts I’ve had in response to the Notice of Request for Information (RFI) on the Frontiers in AI for Science, Security, and Technology (FASST) Initiative.

The premise of the RFI includes the following:

The Department of Energy’s Office of Critical and Emerging Technologies (CET) seeks public comment to inform how DOE and its 17 national laboratories can leverage existing assets to provide a national AI capability for the public interest.

This RFI seeks public input to inform how DOE can partner with outside institutions and leverage its assets to implement and develop the roadmap for FASST, based on the four pillars of FASST: AI-ready data; Frontier-Scale AI Computing Infrastructure and Platforms; Safe, Secure, and Trustworthy AI Models and Systems; and AI Applications; as well as considerations for workforce and FASST governance.

If you are reading this, you are reading an incomplete version of this post, as I am still picking away at it. I will delete this caveat once I am satisfied that this introduction is sufficiently loaded up with context and disclaimers (I am writing this as my personal opinion and as a concerned citizen; any overlap with the interests of my employer is coincidental) and I have a satisfying final section. However, most of the content below will remain unchanged aside from minor grammatical fixes.

    1. Data

    I do not have an informed opinion on scientific data, so I have no response to these questions.

    2. Compute

    (a) How can DOE ensure FASST investments support a competitive hardware ecosystem and maintain American leadership in AI compute, including through DOE’s existing AI and high-performance-computing testbeds?

    DOE must first define “American leadership in AI compute” very precisely. At present, American leadership in AI has happened in parallel to the US Exascale efforts; the race to achieve artificial general intelligence (and the AI innovation resulting from it) is being funded exclusively by private industry. For example, NVIDIA Tensor Cores first appeared in the Summit supercomputer in 2018, but the absence of this capability in Summit’s launch press release and subsequent scientific accomplishments paint a picture that, despite being the first flagship supercomputer to feature Volta GPUs, Summit had no bearing on the hardware innovation that resulted in the now-indispensable Tensor/Matrix Cores found in data center GPUs.

    Directly supporting a competitive hardware ecosystem for AI compute will be a challenge for FASST. Consider that NVIDIA, which holds an overwhelming majority of the market share of AI accelerators, recently disclosed in a 10-Q filing that almost half of its quarterly revenue came from four customers who purchased in volumes that exceed the purchasing power of ASCR and NNSA programs. It follows that the hardware ecosystem is largely shaped by the needs of a few key corporations, and the DOE no longer serves as a market maker with the purchasing power to sustain competition by itself.

    Thus, the DOE should acknowledge this reality and align its approach to AI technology with the needs of the AI industry to the fullest extent possible. Areas for alignment include:

    • Computational approaches such as using the same model architectures, approaches to scaling jobs, and using available arithmetic logic units. Numerical approaches to solving physical problems may have to fundamentally change to realize the next generation of scientific insights from modeling and simulation.
    • Orchestration and management of resources which includes using existing approaches to security, authentication, and federation. I estimate that at least 75% of the software required to realize national infrastructure like IRI already exists in commercial computing, and retrofitting or modernizing DOE supercomputers to work with that infrastructure is likely easier than reinventing a parallel infrastructure designed for the peculiar ways in which HPC approaches orchestration and management.
    • Infrastructural philosophies such as optimizing more holistically across the entire AI technology value chain by co-designing hardware not only with applications, but with power, cooling, data center, real estate, energy providers, and global supply chain. National-scale infrastructure must be viewed as a holistic, national-scale optimization.
    • Policy approaches that avoid the substantial oversight and lengthy reviews that precede one-time capital acquisitions and inhibit agility to adapt to rapidly changing technology needs that accompany the breakneck pace of AI innovation. This agility comes at a higher cost/performance ratio than DOE's belt-and-suspenders approach to supercomputing, but the AI industry is betting that the realized value will outweigh those costs.

    There are undoubtedly more opportunities for alignment, and there is overlap between the above as well. But the unifying theme is that DOE's approach to AI technology should not continue its long history of taking a top-down approach to supercomputing. The DOE cannot approach HPC technology with the attitude of "DOE's way is the answer, what is the question?" as it wades into the AI technology space.

    (b) How can DOE improve awareness of existing allocation processes for DOE’s AI-capable supercomputers and AI testbeds for smaller companies and newer research teams? How should DOE evaluate compute resource allocation strategies for large-scale foundation-model training and/or other AI use cases?

    The DOE’s ERCAP model for allocations is already aligned with the way in which the private sector matches AI compute consumers with AI compute providers. When an AI startup gets its first funding round, it is often accompanied by connections to one or more GPU service providers as part of the investment, since such startups’ success is contingent upon having access to reliable, high-performance computing capabilities (see examples here and here). Continuing this model through FASST is the most direct way to raise awareness amongst those small businesses and researchers who stand to benefit most from FASST resources.

    Evaluating allocation strategies should follow a different model, though. Recognizing that the centroid of AI expertise in the country lies outside of the government research space, FASST allocations should leverage AI experts outside of the government research space as well. This approach will have several benefits:

    • It reduces the odds of allocated resources being squandered on research projects that, while novel to the scientific research community, may have flaws that are well known to the AI community.
    • It also keeps DOE-sponsored AI research grounded to the mainstream momentum of AI research, which occurs beyond the ken of federal sponsorship.

    DOE should also make the process fast because AI moves quickly. This may require DOE accepting a higher risk of failure that arises from less oversight but higher research velocity.

    (d) How can DOE continue to support the development of AI hardware, algorithms, and platforms tailored for science and engineering applications in cases where the needs of those applications differ from the needs of commodity AI applications?

    To the extent that scientific uses for AI diverge from industry’s uses for AI, the DOE should consider partnering with other like-minded consumers of AI technology with similarly high tolerances for risk to create a meaningful market for competition.

    Collaborations like the now-defunct APEX and CORAL programs seemed like a step in this direction, and cross-agency efforts such as NAIRR also hold the potential for the government to send a unified signal to industry that there is a market for alternate technologies. If formally aligning FASST with parallel government efforts proves untenable, FASST should do all in its power to avoid contradicting those other efforts and causing destructive interference in the voice of the government to industry.

    The DOE should also be very deliberate to differentiate:

    1. places where science and engineering applications truly diverge from industry AI applications, and
    2. places where science and engineering applications prefer conveniences that are not offered by hardware, algorithms, and platforms tailored for industry AI applications

    This is critical because the success of FASST is incompatible with the pace of traditional scientific computing. Maintaining support for the multi-decadal legacy of traditional HPC is not a constraint carried by the AI industry, so the outdated, insecure, and inefficient use modalities around HPC resources must not make their way into the requirements of FASST investments.

    As a specific example, the need for FP64 by science and engineering applications is often stated as a requirement, but investments in algorithmic innovation have shown that lower-precision data types can provide scientifically meaningful results at very high performance. Instead of a starting position of “FP64 is required” in this case, FASST investments should start from places like, “what will it take to achieve the desired outcomes using BFLOAT16?”

    This aligns with the AI industry’s approach to problems; the latest model architectures and algorithms are never perfectly matched with the latest AI hardware and platforms due to the different pace at which each progresses. AI model developers accept that their ideas must be made to work on the existing or near-future compute platforms, and hard work through innovation is always required to close the gaps between ambition and available tools.
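    To make the "start from BFLOAT16" framing concrete, below is a minimal sketch of mixed-precision iterative refinement, the idea behind benchmarks like HPL-MxP: factor the system in low precision, then recover near-FP64 accuracy with cheap residual corrections. NumPy/SciPy have no native BF16, so FP32 stands in for the low-precision type here, and the test matrix and tolerance are illustrative.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_mixed_precision(A, b, iters=10, tol=1e-12):
    """Solve Ax = b with a low-precision factorization refined in FP64.

    FP32 stands in for BF16/FP16, which NumPy/SciPy do not provide natively.
    """
    lu, piv = lu_factor(A.astype(np.float32))           # expensive step, low precision
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in FP64
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        dx = lu_solve((lu, piv), r.astype(np.float32))   # cheap low-precision correction
        x += dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)          # well-conditioned test problem
b = rng.standard_normal(n)
x = solve_mixed_precision(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))     # ~1e-15, i.e. FP64-level accuracy
```

    The general pattern of doing the bulk of the work in a low-precision data type and recovering accuracy with a high-precision correction is one way AI-optimized hardware can still serve applications that nominally demand FP64.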

    How can DOE partner with other compute capability providers, including both on-premises and cloud solution providers, to support various hardware technologies and provide a portfolio of compute capabilities for its mission areas?

    The DOE may choose to continue its current approach to partnership where, to a first-order approximation, it is a customer who buys goods and services from compute capability providers. In this scenario, the role of those providers is to reliably deliver those goods and services, and as a part of that, periodically perform non-recurring engineering or codesign with its customers to align their products with the needs of their customers. There is a wealth of infrastructure providers who either sell hardware platforms or provide GPUs-as-a-hosted-service who will happily operate in this familiar mode, and the cost of their capital or services will be similarly aligned with the level of AI-specific value they deliver to the DOE.

    However, partnership with true AI technology providers--those who develop hardware platforms for their own AI research, development, and service offerings--will bring forth two new challenges: misalignment of mission and mismatch of priorities.

    Alignment of mission

    The DOE Office of Science’s mission is “to deliver scientific discoveries and major scientific tools to transform our understanding of nature and advance the energy, economic, and national security of the United States.” Broadly, its mission is to benefit society.

    This mission naturally maps to the mission statements of the technology companies that have traditionally partnered with DOE. For example,

    These technology companies’ missions are to help other companies realize their visions for the world. Partnership comes naturally, as these companies can help advance the mission of the DOE.

    However, consider the mission statements of a few prominent AI companies:

    AI companies’ missions are to benefit society directly, not businesses or customers. The AI industry does not need to partner with the DOE to realize its vision, because its mission is to “do,” not “help those who are doing.”

    As such, the DOE and the AI industry are on equal footing in their ambition to directly impact everyday lives. It is not self-evident why the AI industry would want to partner with DOE, so if it is the ambition of the DOE to partner with the AI industry, it is incumbent upon DOE to redefine its role and accept some aspect of being the “helper” rather than exclusively the “doer.” The DOE must answer the question: how will the DOE help its AI industry partners achieve their mission?

    The tempting, cynical answer may be “revenue,” but this would only be true of companies whose mission is to “help” (and sell), not “do.” The following was stated on Microsoft’s Q1 FY2025 earnings call by Microsoft CEO Satya Nadella:

    One of the things that may not be as evident is that we are not actually selling raw GPUs for other people to train. In fact, that’s a business we turn away, because we have so much demand on inference…

    The motives of the AI industry are to solve problems through inferencing using world-class models. Selling AI infrastructure is not a motive, nor is training AI models. AI infrastructure and training frontier models are simply prerequisites to achieving their stated goals. Thus, the DOE should not treat these AI industry partners as vendors, nor should it expect the AI industry to react to traditional partnerships (capital acquisitions, non-recurring engineering contracts) with the same level of enthusiasm as HPC vendors. It must rethink the definition of partnership.

    Mismatch of priorities

    Once the DOE and its AI industry partners have found common foundations for productive partnership, the DOE must be ready to move with the same agility and urgency as its partners. Historically, the DOE has not done this; for example, while the step from hundreds of petaflops (Summit) to exaflops (Frontier) was accomplished in four years at OLCF, Microsoft accomplished the same in less than two. Similarly, xAI and NVIDIA were able to deploy an exascale supercomputer, from first rack to first job, in less than three weeks. This is not to say the DOE is incapable of moving with this speed; rather, it is a reflection of differing priorities.

    For example, keeping pace with the AI industry will require DOE to accept the compromises that, in the context of scientific computing, were not justifiable. The Frontier RFP was issued in 2018, and the system first appeared on Top500 in June 2022. Consider: what would it have taken to cut this time in half by imposing a 2020 deployment deadline? What additional science could have been accomplished, albeit at reduced scale, had Frontier been up and running for the two years between 2020 and 2022? 

    Alternatively, what would Frontier look like if the procurement didn't begin until 2020 with a 2022 deployment date? What risks and change-order requests would have been avoided had Frontier been based on more well-defined, near-term technologies on vendor roadmaps?

    Ultimately, what advantages did those extra two years of procurement lead-time offer, and were the outcomes afforded by those two years worth the opportunity cost? Would those advantages still be present in the world of AI, where major technology decisions are being made much closer to delivery?

    Another example of this mismatch is in the productive outputs of DOE's stakeholders when compared to the AI industry. Many DOE researchers' careers are graded based on metrics such as papers published and grants received, and this incentivizes DOE to spend time writing, serving on review panels, creating quad charts, and focusing on other tasks that either generate these outputs or help others generate outputs. The outcomes of these outputs, frankly, are less clear--how many times do these outputs get used outside of the "Related Work" sections of other papers? By comparison, AI practitioners in industry are incentivized to contribute to products that find use; writing a paper is rarely as valuable as shipping a useful feature, and there is a much higher ratio of heads-down-engineers to principal-investigator-like managers.

    This is not to say that one way has more societal value or impact than the other, but it does pose significant friction when those two worlds come together in partnership. A DOE engineer pursuing a novel idea with the goal of authoring a paper will struggle to get the support of an industry AI engineer facing a customer deadline. In the HPC context, the DOE engineer would be the customer, so the partnership (and the monetary exchange underlying the partnership) provided equal incentive for both parties. However, as stated above, partnering with the AI industry will not necessarily be the same customer-supplier relationship, and the DOE will have to find ways in which it can align its constituents' values with those of their industry partners. There are ways in which this could work (for example, DOE ideates and industry productizes, or industry ideates and DOE publishes), but the implications on intellectual property, risks, costs, and other non-technical factors will not fit the same mold as traditional partnerships.

    There are many more examples of this mismatch, and as with other aspects of public-private partnership, there may be valid reasons for DOE to continue doing what it has been doing. However, it is critical that DOE not assume that the "DOE way" is the right way when approaching AI. The discussion within DOE should start from a perspective that assumes that the "AI way" is the right way. What would have to change about the way DOE defines its mission and the way scientists use DOE resources in order to adopt it? The "DOE way" may truly offer the highest overall value even in the context of AI for science, but it is incumbent upon the DOE to prove that (to the rigor of the AI industry) if it intends to enter partnership with a "DOE-way" position.

    3. Models

    (b) How can DOE support investment and innovation in energy efficient AI model architectures and deployment, including potentially through prize-based competitions?

    While energy-efficient technologies are undoubtedly important, the DOE's laser focus on energy efficiency (and downstream metrics such as PUE) reflects an unsettling blindness to the real issue at hand--carbon emissions, not energy consumption, drive global climate change. For example, a 20 MW supercomputer sited in California may have the same carbon footprint as...

    • a 16 MW supercomputer in Illinois
    • a 13 MW supercomputer in Tennessee
    • a 9 MW supercomputer in New Mexico

    according to a rough approximation from the 2022 eGRID data.
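    To make the arithmetic behind those equivalences explicit: carbon footprint scales as power draw times the grid's carbon intensity, so two sites are equivalent when those products match. The sketch below back-calculates the relative intensities from the equivalences above (normalized to California); the values are illustrative ratios, not actual eGRID figures.

```python
# Carbon footprint ~ power draw x grid carbon intensity, so two sites are
# equivalent when P_a * CI_a == P_b * CI_b.  Relative intensities here are
# back-calculated from the 20/16/13/9 MW equivalences above, normalized to
# California = 1.0; they are illustrative, not actual eGRID values.
relative_intensity = {
    "California": 1.0,
    "Illinois":   20 / 16,   # ~1.25x California
    "Tennessee":  20 / 13,   # ~1.54x California
    "New Mexico": 20 / 9,    # ~2.22x California
}

def equivalent_power(p_ref_mw, site_ref, site_other):
    """Power at site_other with the same carbon footprint as p_ref_mw at site_ref."""
    return p_ref_mw * relative_intensity[site_ref] / relative_intensity[site_other]

for site in ("Illinois", "Tennessee", "New Mexico"):
    print(f"{equivalent_power(20, 'California', site):.0f} MW in {site}"
          " has the same footprint as 20 MW in California")
```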

    Really addressing the underlying environmental challenges of AI will require DOE to accept that developing the most energy-efficient algorithms and hardware in the world is a half-measure as long as it keeps housing its supercomputers in parts of the country that still rely on coal and natural gas for over half of their energy fuel mix.

    By contrast, leading AI infrastructure providers all have aggressive climate goals:

    Furthermore, these companies are not limiting these ambitions to the carbon footprint of their direct operations and the energy their datacenters consume (Scope 2 emissions). Rather, their goals include managing Scope 3 emissions, which include carbon emissions from shipping, employee travel, waste streams like packaging, and everything else that touches their business.

    DOE's focus on energy-efficient AI model architectures and deployment seems grossly insufficient by comparison. If DOE is to take seriously the environmental impacts of AI, it must look beyond nudging algorithms and silicon since the AI industry is already pursuing goals that are significantly more aggressive and comprehensive than the DOE. Instead, it should look towards partnerships that allow it to leverage the investments made by the AI industry in achieving these net-zero or net-negative emissions goals. This may include divorcing itself from deploying supercomputers in specific congressional districts, moving beyond simplistic metrics such as PUE, and aligning itself with commercial partners with broad sustainability strategies that address Scope 3 emissions.

    4. Applications

    (b) How can DOE ensure foundation AI models are effectively developed to realize breakthrough applications, in partnership with industry, academia, and other agencies?

    The public discourse from leading experts within the DOE often conflates "foundation models for science" with the foundation models that power leading AI applications such as GPT-4o. Presentations discuss "trillion-parameter" "foundation models specifically optimized for science" on one slide, then cite papers that present "foundation models for science" to justify the success. However, these promising "foundation models for science" do not have anywhere near trillions of parameters; rather, they have a few billion parameters and can be trained in just a couple of hours. As such, it is unclear what "foundation AI models" means to DOE. The DOE must clarify whether it wants to train trillion-parameter models, or if it wants to train foundation models for science. The two have little overlap.

    If the DOE wishes to define foundation models to be N-trillion-parameter models, it effectively does not have to do anything new to ensure that they are effectively developed to realize breakthrough applications. Although purpose-trained or purpose-fine-tuned LLMs can produce better results than non-fine-tuned models of equivalent size, LLM scaling laws have consistently held and shown that a fine-tuned, domain-specific LLM is simply not as good as a larger, non-domain-specific LLM.
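    As a rough illustration of what those scaling laws say, here is the Chinchilla-style parametric loss from Hoffmann et al. (2022), L(N, D) = E + A/N^alpha + B/D^beta. The fitted constants below are the published values as I recall them, so treat them as approximate; the point is only that parameter count N is hard to beat, and this comparison says nothing about fine-tuning per se.

```python
# Chinchilla-style parametric scaling law (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# with N = parameters and D = training tokens.  Constants are the published
# fit, quoted from memory -- treat them as approximate.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

# A small "domain" model trained on an enormous corpus still plateaus above
# a much larger general-purpose model:
print(loss(7e9, 10e12))    # 7B parameters, 10T tokens  -> ~1.97
print(loss(405e9, 15e12))  # 405B parameters, 15T tokens -> ~1.82
```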

    DOE's focus should not be on training or fine-tuning N-trillion-parameter models, as this is exceptionally expensive, and industry is already doing this at scales that surpass DOE's current or future capabilities. For example, Meta's Llama-3.1 405-billion-parameter model took 54 days of computation to train on an exascale (~950 FP64 PFLOPS estimated) supercomputer. To train this same model on Frontier would have taken, at minimum, anywhere from three to six months of uninterrupted, full-system computation. The opportunity cost (all the scientific workloads that would not run during this time) is staggering, and pursuing a trillion-parameter model would double this at minimum.
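    For a sense of where the three-to-six-month figure comes from, here is a back-of-envelope estimate using the common C ≈ 6·N·D approximation for dense-transformer training FLOPs and Meta's reported figures for Llama-3.1 405B (roughly 15.6 trillion training tokens over 54 days). The sustained-throughput values assumed for Frontier are for illustration only, not measurements.

```python
# Rough training-time estimate using the common C ~= 6 * N * D rule of thumb
# for dense transformers.  Parameter and token counts are Meta's reported
# Llama-3.1 405B figures; sustained throughputs are illustrative assumptions.
params = 405e9           # model parameters
tokens = 15.6e12         # training tokens (as reported by Meta)
flops = 6 * params * tokens                      # ~3.8e25 FLOPs total

def days_to_train(sustained_flop_per_s):
    return flops / sustained_flop_per_s / 86400

# Meta's 54-day run implies roughly this sustained throughput:
print(f"implied sustained rate: {flops / (54 * 86400):.2e} FLOP/s")   # ~8e18

# If Frontier sustained somewhere in the 2.5-5 EFLOP/s range on this workload
# (an assumption, not a measurement), the wall time lands at three to six months:
for rate in (5.0e18, 2.5e18):
    print(f"{days_to_train(rate):.0f} days at {rate:.1e} FLOP/s sustained")
```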

    If the DOE wishes to define foundation models to be much smaller, domain-specific models, its strongest assets are in the scientific data required to train such a model. A successful approach to training smaller foundation models for science may involve combining DOE's domain expertise with industry-developed frontier models to perform distillation, the process by which high-quality, synthetic data is generated to train small, efficient models. Data generation would require access to frontier models which may only be available through an industry-provided inferencing API, but training and downstream tasks can be performed on DOE computational resources at smaller scales.
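    A minimal sketch of what such a distillation workflow could look like is below: a frontier model behind a partner-provided inferencing API generates synthetic domain text, and a small student model is fine-tuned on that corpus locally. The query_frontier_model function is a placeholder for whatever API a partner exposes, and the tiny GPT-2 student plus Hugging Face transformers are just one plausible local stack, not a prescription.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def query_frontier_model(prompt: str) -> str:
    """Placeholder for a partner-provided frontier-model inferencing API."""
    # In a real pipeline this would call the partner's API; a canned string
    # keeps the sketch self-contained.
    return "A synthetic, frontier-model-quality answer would appear here."

# 1. Generate a synthetic corpus from DOE-curated domain prompts
#    (the expensive, API-bound step).
domain_prompts = [
    "Explain neutron transport in one paragraph.",
    "Summarize the role of turbulence closures in climate models.",
]
corpus = [p + "\n" + query_frontier_model(p) for p in domain_prompts]

# 2. Fine-tune a small student model on that corpus using local (DOE) compute
#    (the cheap step).  GPT-2 is a stand-in for whatever small model is chosen.
tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(student.parameters(), lr=5e-5)

student.train()
for text in corpus:
    batch = tok(text, return_tensors="pt", truncation=True)
    loss = student(**batch, labels=batch["input_ids"]).loss  # next-token prediction loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```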

    These notional opportunities are contingent on DOE clarifying its ambitions though--partnerships cannot be formed without clearly aligned goals at the outset.

    5. Workforce

    (a) DOE has an inventory of AI workforce training programs underway through our national labs. What other partnerships or convenings could DOE host or develop to support an AI ready scientific workforce in the United States?

    Brad Smith, president of Microsoft, recently published an essay titled "The Next Great GPT: Advancing Prosperity in the Age of AI" that builds upon a book, "Technology and the Rise of Great Powers," by Jeffrey Ding and describes the critical steps in which transformative technologies pervade society. Although not the goal of the essay, it paints a compelling picture of what is required to establish an AI-ready scientific workforce, and the DOE is uniquely positioned to lead many of those efforts.

    Brad's essay states that "an advanced skilling infrastructure is indispensable in expanding the professions that create applications that make broad use of new technologies," and cites the example of ironworking in the 18th century, which benefitted England disproportionately due to its adoption of trade societies and other opportunities for hands-on learning. He further advocates for national AI skilling strategies that are designed not only to train engineers and system designers but also to enhance AI literacy throughout the entire economy. Bridging the gap between deep technical topics and the general public is something that DOE is uniquely experienced with over the course of its long history at the frontier of particle physics, nuclear technology, cosmology, and similarly difficult-to-understand technical areas, and increasing AI fluency is a mutual interest where industry and DOE have complementary strengths.

    Brad also highlights that "new technology never becomes truly important unless people want to use it," and a major barrier to this is trust and safety. Citing the example of Edison and Westinghouse competing to show how safe their visions of electrification were, he rightly points out that a prerequisite to an AI-ready workforce is assurance that AI is aligned with societal and ethical values. While it is incumbent upon the AI industry to establish and maintain public confidence, the DOE has broader experience in maintaining public trust in dual-use technologies such as nuclear fission across multiple decades, demographics, and administrations. As such, public-private partnership towards safe and trustworthy AI that is aligned with the interests of the public is a key area of promise.

    More tactically, the DOE can leverage its global reputation and extensive network to establish AI talent exchange programs, mirroring its successful collaborations in the HPC field with European and Asian counterparts. AI, like other Big Science domains, is a high-stakes technology being aggressively pursued by leading nations for its transformative societal and economic benefits. Attracting top talent from around the world will be critical to maintaining a globally competitive, AI-ready workforce in the United States.

    6. Governance

    (a) How can DOE effectively engage and partner with industry and civil society? What are convenings, organizational structures, and engagement mechanisms that DOE should consider for FASST?

    This has been described in Section 2(a). DOE must find the ways in which it can keep pace with the AI industry, recognizing that the AI industry does not limit itself with months-long FOAs and years-long procurement cycles. Industry also does not gauge progress based on traditional peer review in conferences and journals (as evidenced by the number of foundational AI papers only published on arXiv). Partnerships with industry will require finding the areas where the interests, pace, and values of DOE and the AI industry overlap, and the DOE must acknowledge that this overlap may be small.

    (b) What role should public-private partnerships play in FASST? What problems or topics should be the focus of these partnerships?

    FASST must establish vehicles for public-private partnerships that go beyond the conventional large-scale system procurements and large-scale NRE projects. Both of these vehicles reinforce a customer-supplier relationship where money is exchanged for goods and services, but the new reality is that the AI industry is least constrained by money. FASST must provide incentives that outweigh the opportunity cost that partners face when dedicating resources to DOE and its mission instead of the commercial AI industry, and the promise of low-margin large-scale capital acquisitions is simply not sufficient.

    To put this in concrete terms, consider the dilemma faced by individuals in industry who are conversant in both DOE HPC and AI: is their time better spent writing a response to a government RFI such as this one, or is it better spent developing insight that will improve the overall reliability of the next flagship training cluster?

    Realistically, investing time in the government may slightly increase the chances of being awarded a major HPC system procurement, but that business would have

    • High personnel cost due to the dozens of pages of technical requirements that touch on dozens of different competencies and experts
    • High risk due to the need to price and bid on technologies that are beyond any credible hardware roadmap
    • Weak value proposition due to the value of the procurement being heavily biased towards delivering commoditized hardware rather than reliability, agility, and speed of deployment

    By comparison, a similarly sized opportunity could surface with a twenty-person AI startup that offers a much lower opportunity cost:

    • Low personnel cost due to the general requirement of a specific AI training capability (X exaflops and Y petabytes of storage to train a model of Z capability) with an understanding that the solution will work by the delivery deadline, or the customer can simply walk away and go to another provider
    • Moderate risk due to the delivery date aligning with current- or next-gen (not next-next-gen) hardware which is already well-defined and in development (or already in production and hardened)
    • Strong value proposition because the procurement values a rapid deployment (weeks, not years), the agility to move on to the next-generation processor after a year, and the reliability required to run a full-system job for weeks, not hours.

    When confronted with this choice, the right answer is usually to prioritize the latter opportunity because it aligns most closely with the values and competencies of the AI industry.

    SC'24 recap


    The premier annual conference of the high-performance computing community, SC24, was held in Atlanta last week, and it attracted a record-shattering number of attendees--nearly 18,000 registrants, up 28% from last year! The conference felt big as well, and there seemed to be a lot more running between sessions, meetings, and the exhibition floor. Despite its objectively bigger size though, the content of the conference felt more diffuse this year, and I was left wondering if this reflected my own biases or was a real effect of the AI industry beginning to overflow into AI-adjacent technology conferences like SC.

    Illuminated SC24 sign on display in the convention center

    Of course, this isn't to say that SC24 was anything short of a great conference. Some exciting new technologies were announced, a new supercomputer beat out Frontier to become the fastest supercomputer on the Top500 list, and I got to catch up with a bunch of great people that I only get to see at shows like this. I'll touch on all of these things below. But this year felt different from previous SC conferences to me, and I'll try to talk about that too.

    There's no great way to arrange all the things I jotted down in my notes, but I've tried to arrange them by what readers may be interested in. Here's the table of contents:

    1. My approach to SC this year
    2. New technology and announcements
      1. Top500 and a new #1 system
        1. #1 - El Capitan
        2. #5 - Eni HPC6
        3. #16 and #17 - SoftBank CHIE-2 and CHIE-3
        4. #18 - Jülich's JUPITER Exascale Transition Instrument (JETI)
        5. #32 - Reindeer!
      2. Technology on the exhibit floor
        1. GB200
        2. Slingshot 400
        3. Grace-Grace for storage?
        4. Microsoft and AMD's new HBM CPU
    3. The HPC industry overall
      1. What I learned about the average SC technical program attendee
        1. People think sustainability and energy efficiency are the same thing
        2. AI sessions are really scientific computing sessions about AI
        3. AI for operations is not yet real in scientific computing
        4. Some are beginning to realize that HPC exists outside of scientific computing
      2. NSF's broad front vs. DOE's big bets in HPC and AI
      3. Exhibitor trends
        1. Booths by the numbers
        2. Proliferation of GPU-as-a-Service providers
    4. Community and connections
      1. Getting to know people
      2. Talking to early career people
      3. Shift in social media
    5. So what's the takeaway?

    Before getting into the details though, I should explain how my perspective shaped what I noticed (and missed) through the conference.

    My approach to SC this year

    Although this is the eleventh SC conference I've attended, it was the first time that I:

    1. attended as a practitioner of hyperscale AI rather than traditional HPC and scientific computing
    2. attended as a Microsoft engineer (I represented Microsoft as a product manager at SC22 and SC23)
    3. did not attend SC as a designated storage person (since 2013)

    Because of these changes in my identity as an attendee, I approached the conference with a different set of goals in mind:

    As a hyperscale/AI person, I felt that I should prioritize attending all the cloud and AI sessions whenever forced to choose between one session or another. I chose to focus on understanding the traditional HPC community's understanding of hyperscale and AI, which meant I had to spend less time in the workshops, panels and BOFs where I built my career.

    As an engineer rather than a product manager, it wasn't my primary responsibility to run private briefings and gather HPC customers' requirements and feedback. Instead, I prioritized only those meetings where my first-hand knowledge of how massive-scale AI training works could have a meaningful impact. This meant I focused on partners and practitioners who also operate in the realm of hyperscale--think massive, AI-adjacent companies and the HPC centers who have historically dominated the very top of the Top500 list.

    One thing I didn't anticipate going into SC24 is that I've inherited a third identity: there is a new cohort of people in HPC who see me as a long-time community member. This resulted in a surprising amount of my time being spent talking to students and early career practitioners who were looking for advice.

    These three identities and goals meant I don't have many notes to share on the technical program, but I did capture more observations about broader trends in the HPC industry and community.

    New technology and announcements

    HPC is all about cutting-edge technology, so that's a fine place to start talking about what was new.

    Top500 and a new #1 system

    A cornerstone of every SC conference is the release of the new Top500 list on Monday, and this is especially true in years when a new #1 supercomputer is announced. As was widely anticipated in the weeks leading up to SC24, El Capitan unseated Frontier as the new #1 supercomputer this year, posting an impressive 1.74 EFLOPS of FP64. In addition though, Frontier grew a little (it added 400 nodes), there was a notable new #5 system (Eni's HPC6), and a number of smaller systems appeared that are worth calling out.

    #1 - El Capitan

    The highlight of the Top500 list was undoubtedly the debut of El Capitan, Lawrence Livermore National Laboratory's massive new MI300A-based exascale supercomputer. Its 1.74 EF score resulted from a 105-minute HPL run that came in under 30 MW, and a bunch of technical details about the system were disclosed by Livermore Computing's CTO, Bronis de Supinski, during an invited talk during the Top500 BOF. Plenty of others summarize the system's speeds and feeds (e.g., see The Next Platform's article on El Cap), so I won't do that. However, I will comment on how unusual Bronis' talk was.

    Foremost, the El Capitan talk seemed haphazard and last-minute. Considering the system took over half a decade of planning and cost at least half a billion dollars, El Capitan's unveiling was the most unenthusiastic description of a brand-new #1 supercomputer I've ever seen. I can understand that the Livermore folks have debuted plenty of novel #1 systems in their careers, but El Capitan is objectively a fascinating system, and running a full-system job for nearly two hours across first-of-a-kind APUs is an amazing feat. If community leaders don't get excited about their own groundbreaking achievements, what kind of message should the next generation of HPC professionals take home?

    In sharp contrast to the blasé announcement of this new system was the leading slide that was presented to describe the speeds and feeds of El Capitan:

    I've never seen a speaker take the main stage and put a photo of himself literally in the center of the slide, in front of the supercomputer they're talking about. I don't know what the communications people at Livermore were trying to do with this graphic, but I don't think it was intended to be evocative of the first thing that came to my mind:

    The supercomputer is literally named "The Captain," and there's a photo of one dude (the boss of Livermore Computing, who is also standing on stage giving the talk) blocking the view of the machine. It wasn't a great look, and it left me feeling very uneasy about what I was witnessing and what message it was sending to the HPC community.

    In case it needs to be said, HPC is a team sport. The unveiling of El Capitan (or any other #1 system before it) is always the product of dozens, if not hundreds, of people devoting years of their professional lives to ensuring it all comes together. It was a big miss, both to those who put in the work, and those who will have to put in the work on future systems, to suggest that a single, smiling face comes before the success of the system deployment.

    #5 - Eni HPC6

    The other notable entrant to the Top 10 list was HPC6, an industry system deployed by Eni (a major Italian energy company) built on MI250X. Oil and gas companies tend to be conservative in the systems they buy since the seismic imaging done on their large supercomputers informs hundred-million to billion-dollar investments in drilling a new well, and they have much less tolerance for weird architectures than federally funded leadership computing does. Thus, Eni's adoption of AMD GPUs in this #5 system is a strong endorsement of their capability in mission-critical commercial computing.

    #16 and #17 - SoftBank CHIE-2 and CHIE-3

    SoftBank, the Japanese investment conglomerate that, among other things, owns a significant stake in Arm, made its Top500 debut with two identical 256-node DGX H100 SuperPODs. While not technologically interesting (H100 is getting old), these systems represent a significant investment in HPC by private industry in Japan and signal that SoftBank is following the lead of large American investment groups in building private AI clusters for the AI startups in their portfolios. In doing this, SoftBank's investments aren't dependent on third-party cloud providers to supply the GPUs that make these startups successful, which reduces SoftBank's overall risk.

    Although I didn't hear anything about these SoftBank systems at the conference, NVIDIA issued a press statement at the NVIDIA AI Summit Japan the week prior to SC24 that discussed SoftBank's investment in large NVIDIA supercomputers. The statement says that these systems will be used "for [SoftBank's] own generative AI development and AI-related business, as well as that of universities, research institutions and businesses throughout Japan." The release also suggests we can expect B200 and GB200 SuperPODs from SoftBank to appear as those technologies come online.

    #18 - Jülich's JUPITER Exascale Transition Instrument (JETI)

    Just below the SoftBank systems was the precursor system to Europe's first exascale system. I was hoping that JUPITER, the full exascale system being deployed at FZJ, would appear in the Top 10, but it seems like we'll have to wait for ISC25 for that. Still, the JETI system ran HPL across 480 nodes of BullSequana XH3000, the same node that will be used in JUPITER, and achieved 83 PFLOPS. By comparison, the full JUPITER system will be over 10x larger ("roughly 6000 compute nodes" in the Booster), and projecting the JETI run (173 TF/node) out to this full JUPITER scale indicates that JUPITER should just squeak over the 1.0 EFLOPS line.
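    The projection itself is trivial; the sketch below just makes the arithmetic explicit (a naive linear extrapolation that ignores any scaling losses at 12x the node count).

```python
# Naive linear projection of JETI's per-node HPL number to full JUPITER scale.
jeti_nodes, jeti_rmax_pflops = 480, 83
per_node_tflops = jeti_rmax_pflops * 1000 / jeti_nodes   # ~173 TF/node
jupiter_nodes = 6000                                     # "roughly 6000 compute nodes"
projected_eflops = per_node_tflops * jupiter_nodes / 1e6
print(f"{per_node_tflops:.0f} TF/node -> {projected_eflops:.2f} EFLOPS projected")
# ~1.04 EFLOPS -- just over the exaflop line, before any scaling losses.
```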

    In preparation for JUPITER, Eviden had a couple of these BullSequana XH3000 nodes out on display this year:

    And if you're interested in more, I've been tracking the technical details of JUPITER in my digital garden.

    #32 - Reindeer!

    Waay down the list was Microsoft's sole new Top500 entry this cycle, an NVIDIA H200 system that ran HPL over 120 ND H200 v5 nodes in Azure. It was one of only two conventional (non-Grace) H200 clusters that appeared in the top 100, and it had a pretty good efficiency (Rmax/Rpeak > 80%). Microsoft also had a Reindeer node on display at its booth:

    An astute observer may note that this node looks an awful lot like the H100 node used in its Eagle supercomputer, which was on display at SC23 last year. That's because it's the same chassis, just with an upgraded HGX baseboard.

    Reindeer was not super exciting, and there were no press releases about it, but I mention it here for a couple reasons:

    • One of my teammates did the HPL run and submission, and his group got to come up with the name of the system for the purposes of HPL. As it turns out, generating a public name for a Top500 submission involves a comical amount of legal and marketing process when it comes from a giant corporation like Microsoft. And as it turns out, naming a cluster "Reindeer" has a low probability of offending anyone.
    • Reindeer is pretty boring--it's a relatively small cluster with a bunch of GPUs. But when you're building out AI infrastructure at a pace of 5x Eagles (70,000 GPUs!) per month, you want the clusters that those GPUs go into to be as boring, predictable, and automatable as possible. Seeing as how Reindeer only used 960 GPUs but still got #32, it doesn't require much math to realize that the big hyperscalers could flood the Top500 list with these cookie-cutter GPU clusters and (in this case) make any ranking below #32 completely irrelevant. Heaven help the Top500 list if they ever publish an API for submitting new systems; cloud providers' build validation automation could tack a Top500 submission on at the end of burn-in and permanently ruin the list.

    On a personal note, the supercomputer grant that gave me my first job in the HPC business debuted at #48. It's mind-boggling that I now work in a place where standing up a #32 system is just day-to-day business.

    Technology on the exhibit floor

    The exhibit floor had a few new pieces of HPC technology on display this year that are worthy of mention, but a lot of the most exciting HPC-centric stuff actually had a soft debut at ISC24 in May. For example, even though SC24 was MI300A's big splash due to the El Capitan announcement, some MI300A nodes (such as the Cray EX255a) were on display in Hamburg. However, Eviden had their MI300A node (branded XH3406-3) on display at SC24, which was new to me:

    I'm unaware of anyone who's actually committed to a large Eviden MI300A system, so I was surprised to see that Eviden already has a full blade design. But as with Eni's HPC6 supercomputer, perhaps this is a sign that AMD's GPUs (and now APUs) have graduated from being built-to-order science experiments to a technology ecosystem that people will want to buy off the rack.

    There was also a ton of GH200 on the exhibit hall floor, but again, these node types were also on display at ISC24. This wasn't a surprise since a bunch of upcoming European systems have invested in GH200 already; in addition to JUPITER's 6,000 GH200 nodes described above, CSCS Alps has 2,688 GH200 nodes, and Bristol's Isambard-AI will have 1,362 GH200 nodes. All of these systems will have a 1:1 CPU:GPU ratio and an NVL4 domain, suggesting this is the optimal way to configure GH200 for HPC workloads. I didn't hear a single mention of GH200 NVL32.

    GB200

    SC24 was the debut of NVIDIA's Blackwell GPU in the flesh, and a bunch of integrators had material on GB200 out at their booths. Interestingly, they all followed the same pattern as GH200 with an NVL4 domain size, and just about every smaller HPC integrator followed a similar pattern where

    • their booth had a standard "NVIDIA Partner" (or "Preferred Partner!") placard on their main desk
    • they had a bare NVIDIA GB200 baseboard (superchip) on display
    • there wasn't much other differentiation

    From this, I gather that not many companies have manufactured GB200 nodes yet, or if they have, there aren't enough GB200 boards available to waste them on display models. So, we had to settle for these bare NVIDIA-manufactured, 4-GPU + 2-CPU superchip boards:

    What struck me is that these are very large FRUs--if a single component (CPU, GPU, voltage regulator, DRAM chip, or anything else) goes bad, you have to yank and replace four GPUs and two CPUs. And because all the components are soldered down, someone's going to have to do a lot of work to remanufacture these boards to avoid throwing out a lot of very expensive, fully functional Blackwell GPUs.

    There were a few companies who were further along their GB200 journey and had more integrated nodes on display. The HPE Cray booth had this GB200 NVL4 blade (the Cray EX154n) on display:

    It looks remarkably sparse compared to the super-dense blades that normally slot into the Cray EX line, but even with a single NVL4 node per blade, the Cray EX cabinet only supports 56 of these blades, leaving 8 blade slots empty in the optimal configuration. I assume this is a limitation of power and cooling.

    The booth collateral around this blade suggested its use case is "machine learning and sovereign AI" rather than traditional HPC, and that makes sense since each node has 768 GB of HBM3e which is enough to support training some pretty large sovereign models. However, the choice to force all I/O traffic on to the high-speed network by only leaving room for one piddly node-local NVMe drive (this blade only supports one SSD per blade) will make training on this platform very sensitive to the quality of the global storage subsystem. This is great if you bundle this blade with all-flash Lustre (like Cray ClusterStor) or DAOS (handy, since Intel divested the entire DAOS development team to HPE). But it's not how I would build an AI-optimized system.

    I suspect the cost-per-FLOP of this Cray GB200 solution is much lower than what a pure-play GB200 for LLM training would be. And since GB200 is actually a solid platform for FP64 (thanks to Dan Ernst for challenging me on this and sharing some great resources on the topic), I expect to see this node do well in situations that are not training frontier LLMs, but rather fine-tuning LLMs, training smaller models, and mixing in traditional scientific computing on the same general-purpose HPC/AI system.

    Speaking of pure-play LLM training platforms, though, I was glad that very few exhibitors were trying to talk up GB200 NVL72 this year. It may have been the case that vendors simply aren't ready to begin selling NVL72 yet, but I like to be optimistic and instead believe that the exhibitors who show up to SC24 know that the scientific computing community likely won't get enough value out of a 72-GPU coherence domain to justify the additional cost and complexity of NVL72. I didn't see a single vendor with a GB200 NVL36 or NVL72 rack on display (or a GH200 NVL32, for that matter), and not having to think about NVL72 for the week of SC24 was a nice break from my day job.

    Perhaps the closest SC24 got to NVL72 was a joint announcement at the beginning of the week by Dell and CoreWeave, who announced that they have begun bringing GB200 NVL72 racks online. Dell did have a massive, AI-focused booth on the exhibit floor, and they did talk up their high-powered, liquid-cooled rack infrastructure. But in addition to supporting GB200 with NVLink Switches, I'm sure that rack infrastructure would be equally good at supporting nodes geared more squarely at traditional HPC.

    Slingshot 400

    HPE Cray also debuted a new 400G Slingshot switch, appropriately named Slingshot 400. I didn't get a chance to ask anyone any questions about it, but from the marketing material that came out right before the conference, it sounds like a serdes upgrade without any significant changes to Slingshot's L2 protocol.

    There was a Slingshot 400 switch for the Cray EX rack on display at their booth, and it looked pretty amazing:


    It looks way more dense than the original 200G Rosetta switch, and it introduces liquid-cooled optics. If you look closely, you can also see a ton of flyover cables connecting the switch ASIC in the center to the transceivers near the top; similar flyover cables are showing up in all manner of ultra-high-performance networking equipment, likely reflecting the inability to maintain signal integrity across PCB traces.

    The port density on Slingshot 400 remains the same as it was on 200G Slingshot, so there's still only 64 ports per switch, and the fabric scale limits don't increase. In addition, the media is saying that Slingshot 400 (and the GB200 blade that will launch with it) won't start appearing until "Fall 2025." Considering 64-port 800G switches (like NVIDIA's SN5600 and Arista's 7060X6) will have already been on the market by then though, Slingshot 400 will be launching with HPE Cray on its back foot.

    However, there was a curious statement on the placard accompanying this Slingshot 400 switch:

    It reads, "Ultra Ethernet is the future, HPE Slingshot delivers today!"

    Does this suggest that Slingshot 400 is just a stopgap until 800G Ultra Ethernet NICs begin appearing? If so, I look forward to seeing HPE Cray jam third-party 800G switch ASICs into the Cray EX liquid-cooled form factor at future SC conferences.

    Grace-Grace for storage?

    One of the weirder things I saw on the exhibit floor was a scale-out storage server built on NVIDIA Grace CPUs that the good folks at WEKA had on display at their booth.

    Manufactured by Supermicro, this "ARS-121L-NE316R" server (really rolls off the tongue) uses a two-socket Grace superchip and its LPDDR5X instead of conventional, socketed CPUs and DDR. The rest of it seems like a normal scale-out storage server, with sixteen E3.S SSD slots in the front and four 400G ConnectX-7 or BlueField-3 NICs in the back. No fancy dual-controller failover or anything like that; the presumption is that whatever storage system you'd install over this server would implement its own erasure coding across drives and servers.

    At a glance, this might seem like a neat idea for a compute-intensive storage system like WEKA or DAOS. However, one thing you typically want in a storage server is high reliability and repairability, and neither was a design priority for these Grace superchips. Specifically,

    • The Grace-Grace superchip turns both CPU sockets into a single FRU. This means that if one CPU goes bad, you're shipping the whole board back to NVIDIA rather than just doing a field swap of a single socket.
    • Grace uses LPDDR5X, whose ECC is not as robust as that of standard DDR5. I'm not an expert on memory architecture, but my understanding is that the ECC scheme on Grace does not protect against full-chip (ChipKill-style) or row failures. And as with a CPU failure, if a single DRAM chip goes bad, you're throwing out two CPUs and all the DRAM.
    • There's no way to value-engineer the exact quantity of cores, clock, and DRAM to be optimal for the storage software installed on top of these servers.

    On the upside, though, there might be a cost advantage to using this Grace-Grace server over a beefier AMD- or Intel-based server with a bunch of traditional DIMMs. And if you really like NVIDIA products, this lets you deploy NVIDIA storage servers to go with your NVIDIA network and NVIDIA compute. As long as your storage software can work with the reliability characteristics of such a server (e.g., it supports rebuild-on-read) and the 144 Neoverse V2 cores are a good fit for its computational requirements (e.g., calculating complex erasure codes), this server makes sense. But building a parallel storage system on LPDDR5X still gives me the willies.

    I could also see this thing being useful for certain analytics workloads, especially those which may be upstream of LLM training. I look forward to hearing about where this turns up in the field.

    Microsoft and AMD's new HBM CPU

    The last bit of new and exciting HPC technology that I noted came from my very own employer in the form of HBv5, a new, monster four-socket node featuring custom-designed AMD CPUs with HBM. STH wrote up an article with great photos of HBv5 and its speeds and feeds, but in brief, this single node has:

    • 384 physical Zen 4 cores (352 accessible from within the VM) that clock up to 4 GHz
    • 512 GB of HBM3 (up to 450 GB accessible from the VM) with up to 6.9 TB/s STREAM bandwidth
    • 4x NDR InfiniBand NICs clocked at 200G per port
    • 200G Azure Boost NIC (160G accessible from the VM)
    • 8x 1.84 TB NVMe SSDs with up to 50 GB/s read and 30 GB/s write bandwidth

    The node itself looks kind of wacky as well, because there just isn't a lot on it:

    There are the obvious four sockets of AMD EPYC 9V64H, each with 96 physical cores and 128 GB of HBM3, and giant heat pipes on top of them since it's 100% air-cooled. But there's no DDR at all, no power converter board (the node is powered by a DC bus bar), and just a few flyover cables to connect the PCIe add-in-card cages. There is a separate fan board with just two pairs of power cables connecting to the motherboard, and that's really about it.

    The front end of the node shows its I/O capabilities which are similarly uncomplicated:

    There are four NDR InfiniBand cards (one localized to each socket) which are 400G-capable but cabled up at 200G, eight E1.S NVMe drives, and a brand-new dual-port Azure Boost 200G NIC. Here's a close-up of the right third of the node's front:


    This is the first time I've seen an Azure Boost NIC in a server, and it looks much better integrated than the previous-generation 100G Azure SmartNIC that put the FPGA and hard NIC on separate boards connected by a funny little pigtail. This older 100G SmartNIC with pigtail was also on display at the Microsoft booth in an ND MI300X v5 node:

    And finally, although I am no expert in this new node, I did hang around all week with the people who are, and I repeatedly heard them answer the same few questions:

    • Is this MI300C? It is if you want it to be. You can call it Sally if you want; I don't think it will care. But Microsoft calls it HBv5, and the processor name will show up as AMD EPYC 9V64H in /proc/cpuinfo.
    • Is its InfiniBand 1x800 port, 2x400 ports, ...? There are four NDR InfiniBand HCA cards, and each card has one full 400G NDR InfiniBand port. However, each port is only connected up to top-of-rack switching at 200G. Each InfiniBand HCA hangs off of a different EPYC 9V64H socket so that any memory address can get to InfiniBand without having to traverse Infinity Fabric. Running four ports of NDR InfiniBand at half speed is an unusual configuration, but that's what's going on here.
    • How can I buy this CPU? EPYC 9V64H are "custom AMD EPYC processors only available in Azure." This means the only way to access it is by provisioning an HBv5 virtual machine in Azure.

    Amidst all the unrelenting news about new GPUs optimized for AI workloads, it was nice to see something new and unique launched squarely for the benefit of traditional scientific computing workloads.

    The HPC industry overall

    New technology announcements are always exciting, but one of the main reasons I attend SC and ISC is to figure out the broader trends shaping the HPC industry. What concerns are top of mind for the community, and what blind spots remain open across all the conversations happening during the week? Answering these questions requires more than just walking the exhibit floor; it involves interpreting the subtext of the discussions happening at panels and BOF sessions. However, identifying where the industry needs more information or a clearer picture informs a lot of the public-facing talks and activities in which I participate throughout the year.

    What I learned about the average SC technical program attendee

    The biggest realization that I confirmed this week is that the SC conference is not an HPC conference; it is a scientific computing conference. I sat in a few sessions where the phrase "HPC workflows" was clearly a stand-in for "scientific workflows," and "performance evaluation" still really means "MPI and OpenMP profiling." I found myself listening to ideas or hearing about tools that were intellectually interesting but ultimately not useful to me because they were so entrenched in the traditions of applying HPC to scientific computing. Let's talk about a few ways in which this manifested.

    People think sustainability and energy efficiency are the same thing

    Take, for example, the topic of sustainability. There were talks, panels, papers, and BOFs that touched on the environmental impact of HPC throughout the week, but the vast majority of them really weren't talking about sustainability at all; they were talking about energy efficiency. These talks often use the following narrative:

    1. Energy use from datacenters is predicted to reach some ridiculous number by 2030
    2. We must create more energy-efficient algorithms, processors, and scheduling policies
    3. Here is an idea we tested that reduced the energy consumption without impacting the performance of some application or workflow
    4. Sustainability achieved! Success!

    The problem with this approach is that it declares victory when energy consumption is reduced. This is a great result if all you care about is spending less money on electricity for your supercomputer, but it completely misses the much greater issue that the electricity required to power an HPC job is often generated by burning fossil fuels, and that the carbon emissions that are directly attributable to HPC workloads are contributing to global climate change. This blind spot was exemplified by this slide, presented during a talk titled "Towards Sustainable Post-Exascale Leadership Computing" at the Sustainable Supercomputing workshop:


    I've written about this before and I'll write about it again: FLOPS/Watt and PUE are not meaningful metrics by themselves when talking about sustainability. A PUE of 1.01 is not helpful if the datacenter that achieves it relies on burning coal for its power. Conversely, a PUE of 1.5 is not bad if all that electricity comes from a zero-carbon energy source. The biggest issue that I saw being reinforced at SC this year is that claims of "sustainable HPC" are accompanied by the subtext of "as long as I can keep doing everything else the way I always have."

    There were glimmers of hope, though. Maciej Cytowski from Pawsey presented the opening talk at the Sustainable Supercomputing workshop, and he led with the right thing--he acknowledged that 60% of the fuel mix that powers Pawsey's supercomputers comes from burning fossil fuels:

    Rather than patting himself on the back over a low PUE, Dr. Cytowski described how they built their datacenter atop a large aquifer from which they draw water at 21°C and return it at 30°C, avoiding energy-intensive chillers. To further reduce the carbon impact of this water loop, Pawsey also installed over 200 kW of solar panels on its facility roof to power the water pumps. Given that Pawsey cannot relocate to somewhere with a higher ratio of zero-carbon energy on account of its need to be physically near the Square Kilometer Array, Cytowski's talk felt like the most substantive discussion on sustainability in HPC that week.

    Most other talks and panels on the topic really wanted to equate "sustainability" to "FLOPS per Watt" and pretend like where one deploys a supercomputer is not a part of the sustainability discussion. The reality is that, if the HPC industry wanted to take sustainability seriously, it would talk less about watts and more about tons of CO2. Seeing as how the average watt of electricity in Tennessee produces 2.75x more carbon than a watt of electricity in Washington, the actual environmental impact of fine-tuning Slurm scheduling or fiddling with CPU frequencies is meaningless when compared to the benefits that would be gained by deploying that supercomputer next to a hydroelectric dam instead of a coal-fired power plant.
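
    To put some rough numbers behind that, here is a minimal back-of-the-envelope sketch in Python; the 20 MW power draw and the grid carbon intensities are assumptions I picked for illustration, not measured values for any particular grid or system.

```python
# Back-of-the-envelope: annual CO2 for a 20 MW supercomputer on two different grids.
# Both the power draw and the grid carbon intensities are illustrative assumptions.
GRID_KG_CO2_PER_KWH = {
    "coal-heavy grid": 0.80,   # assumed kg CO2 per kWh
    "hydro-heavy grid": 0.10,  # assumed kg CO2 per kWh
}

POWER_MW = 20.0            # hypothetical system power draw
HOURS_PER_YEAR = 24 * 365

for grid, intensity in GRID_KG_CO2_PER_KWH.items():
    kwh_per_year = POWER_MW * 1000 * HOURS_PER_YEAR
    tonnes_co2 = kwh_per_year * intensity / 1000
    print(f"{grid}: {tonnes_co2:,.0f} tonnes of CO2 per year")
```

    Under these assumptions, the siting decision alone is worth over 120,000 tonnes of CO2 per year, which dwarfs anything you could claw back by tuning schedulers or clock frequencies.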

    I say all this because there are parts of the HPC industry (namely, the part in which I work) who are serious about sustainability. And those conversations go beyond simply building supercomputers in places where energy is low-carbon (thereby reducing Scope 2 emissions). They include holding suppliers to high standards on reducing the carbon impact of transporting people and material to these data centers, reducing the carbon impact of all the excess packaging that accompanies components, and being accountable for the impact of everything in the data center after it reaches end of life (termed Scope 3 emissions).

    The HPC community--or more precisely, the scientific computing community--is still married to the idea that the location of a supercomputer is non-negotiable, and "sustainability" is a nice-to-have secondary goal. I was hoping that the sessions I attended on sustainability would approach this topic at a level where the non-scientific HPC world has been living. Unfortunately, the discussion at SC24, which spanned workshops, BOFs, and Green 500, remains largely stuck on the idea that PUE and FLOPS/Watt are the end-all sustainability metrics. Those metrics are important, but there are global optimizations that have much greater effects on reducing the environmental impact of the HPC industry.

    AI sessions are really scientific computing sessions about AI

    Another area where "HPC" was revealed to really mean "scientific computing" was in the topic of AI. I sat in on a few BOFs and panels around AI topics to get a feel for where this community is in adopting AI for science, but again, I found the level of discourse to degrade to generic AI banter despite the best efforts of panelists and moderators. For example, I sat in the "Foundational Large Language Models for High-Performance Computing" BOF session, and Jeff Vetter very clearly defined what a "foundational large language model" was at the outset so we could have a productive discussion about their applicability in HPC (or, really, scientific computing):

    The panelists did a good job of outlining their positions. On the upside, LLMs are good for performing source code conversion, documenting and validating code, and maximizing continuity in application codes that get passed around as graduate students come and go. On the downside, they have a difficult time creating efficient parallel code, and they struggle to debug parallel code. And that's probably where the BOF should have stopped, because LLMs, as defined at the outset of the session, don't actually have a ton of applicability in scientific computing. But as soon as the session opened up to audience questions, the session went off the rails.

    The first question was an extremely basic and nonspecific question: "Is AI a bubble?"

    It's fun to ask provocative questions to a panel of experts. I get it. But the question had nothing to do with LLMs, any of the position statements presented by panelists, or even HPC or scientific computing. It turned a BOF on "LLMs for HPC" into a BOF that might as well have been titled "Let's just talk about AI!" A few panelists tried to get things back on track by talking about the successes of surrogate models to simulate physical processes, but this reduced the conversation to a point where "LLMs" really meant "any AI model" and "HPC" really meant "scientific simulations."

    Perhaps the most productive statement to come out of that panel was when Rio Yokota asserted that "we" (the scientific community) should not train their own LLMs, because doing so would be "unproductive for science." But I, as well as anyone who understands the difference between LLMs and "AI," already knew that. And the people who don't understand the difference between an LLM and a surrogate model probably didn't pick up on Dr. Yokota's statement, so I suspect the meaning of his contribution was completely lost.

    Walking out of that BOF (and, frankly, the other AI-themed BOFs and panels I attended), I was disappointed at how superficial the conversation was. This isn't to say these AI sessions were objectively bad; rather, I think it reflects the general state of understanding of AI amongst SC attendees. Or perhaps it reflects the demographic that is drawn to these sorts of sessions. If the SC community is not ready to have a meaningful discussion about AI in the context of HPC or scientific computing, attending BOFs with like-minded peers is probably a good place to begin getting immersed.

    But what became clear to me this past week is that SC BOFs and panels with "AI" in their title aren't really meant for practitioners of AI. They're meant for scientific computing people who are beginning to dabble in AI.

    AI for operations is not yet real in scientific computing

    I was invited to sit on a BOF panel called "Artificial Intelligence and Machine Learning for HPC Workload Analysis" following on a successful BOF in which I participated at ISC24. The broad intent was to have a discussion around the tools, methods, and neat ideas that HPC practitioners have been using to better understand workloads, and each of us panelists was tasked with talking about a project or idea we had in applying AI/ML to improve some aspect of workloads.

    What emerged from our lightning talks is that applying AI to operations--in this case, understanding user workloads--is still nascent. Rather than talking about how we use AI to affect how we design or operate supercomputers, all of us seemed to focus more on how we are collecting data and beginning to analyze it using ML techniques. And maybe that's OK, because AI won't ever do anything for workload characterization until you have a solid grasp of the telemetry you can capture about those workloads in the first place.

    But when we opened the BOF up to discussion, the audience had very little to offer despite the room being packed. Our BOF lead, Kadidia Konaté, tried to pull discussion out of the room on a couple of different fronts by asking what tools people were using, what challenges they were facing, and things along those lines. However, it seemed to me that the majority of the audience was in that room as spectators; they didn't know where to start applying AI towards understanding the operations of supercomputers. Folks attended to find out the art of the possible, not to talk about their own challenges.

    As such, the conversation wound up bubbling back up to the safety of traditional topics in scientific computing--how is LDMS working out, how do you deal with data storage challenges of collecting telemetry, and all the usual things that monitoring and telemetry folks worry about. It's easy to talk about the topics you understand, and just as the LLM conversation reverted back to generic AI for science and the sustainability topic reverted back to FLOPS/Watt, this topic of AI for operations reverted back to standard telemetry collection.

    Some are beginning to realize that HPC exists outside of scientific computing

    Despite the pervasive belief at SC24 that "HPC" and "scientific computing" are the same thing, there are early signs that the leaders in the community are coming to terms with the reality that there is now a significant amount of leadership HPC happening outside the scope of the conference. This was most prominent at the part of the Top500 BOF where Erich Strohmaier typically discusses trends based on the latest publication of the list.

    In years past, Dr. Strohmaier's talk was full of statements that strongly implied that, if a supercomputer is not listed on Top500, it simply does not exist. This year was different though: he acknowledged that El Capitan, Frontier, and Aurora were "the three exascale systems we are aware of," now being clear that there is room for exascale systems to exist that simply never ran HPL, or never submitted HPL results to Top500. He explicitly acknowledged again that China has stopped making any Top500 submissions, and although he didn't name them outright, he spent a few minutes dancing around "hyperscalers" who have been deploying exascale class systems such as Meta's H100 clusters (2x24K H100), xAI's Colossus (100K H100), and the full system behind Microsoft's Eagle (14K H100 is a "tiny fraction").

    Strohmaier did an interesting analysis that estimated the total power of the Top500 list's supercomputers so he could compare it to industry buzz around hyperscalers building gigawatt-sized datacenters:

    It was a fun analysis: he concluded that there are between 500 and 600 megawatts of supercomputers on the Top500 list, and after you factor in storage, PUE, and other ancillary loads, the whole Top500 list adds up to roughly what hyperscalers are talking about sticking into a single datacenter facility.
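
    For the curious, here's a rough sketch of how one might reproduce that kind of estimate, assuming you've downloaded the Top500 list as a spreadsheet; the filename and column name below are assumptions about the export format, so adjust them to whatever the actual download uses.

```python
import pandas as pd

# Assumed filename and column name; check them against the actual Top500 export.
df = pd.read_excel("TOP500_202411.xlsx")
power_kw = df["Power (kW)"]

reported = power_kw.notna().sum()
total_mw = power_kw.sum() / 1000.0

print(f"{reported} of {len(df)} systems report power")
print(f"Total reported power: {total_mw:.0f} MW")

# Not every system reports power, and this excludes storage and facility
# overheads (PUE), so the true total is higher than this lower bound.
```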

    Although he didn't say it outright, I think the implication here is that the Top500 list is rapidly losing relevance in the broad HPC market, because a significant amount of the world's supercomputing capacity and capability are absent from the list. Although specific hyperscale supercomputers (like Meta's, xAI's, and Microsoft's) were not mentioned outright, their absence from the Top500 list suggests that this list might already be more incomplete than it is complete--the sum of the FLOPS or power on the Top500 supercomputers may be less than the sum of the giant supercomputers which are known but not listed. This will only get worse as the AI giants keep building systems every year while the government is stuck on its 3-5 year procurement cycles.

    It follows that the meaning of the Top500 is sprinting towards a place where it is not representative of HPC so much as it is representative of the slice of HPC that serves scientific computing. Erich Strohmaier was clearly aware of this in his talk this year, and I look forward to seeing how the conversation around the Top500 list continues to morph as the years go on.

    NSF's broad front vs. DOE's big bets in HPC and AI

    My career was started at an NSF HPC center and built up over my years in the DOE, so I feel like I owe a debt to the people who provided all the opportunities and mentorship that let me get to the place of privilege in the hyperscale/AI industry that I now enjoy. As a result, I find myself still spending a lot of my free time thinking about the role of governments in the changing face of HPC (as evidenced by my critiques of thinktank reports and federal RFIs...) and trying to bridge the gap in technical understanding between my old colleagues (in DOE, NSF, and European HPC organizations) and whatever they call what I work on now (hyperscale AI?).

    To that end, I found myself doing quite a bit of business development (more on this later) with government types since I think that is where I can offer the most impact. I used to be government, and I closely follow the state of their thinking in HPC, but I also know what's going on inside the hyperscale and AI world. I also have enough context in both areas to draw a line through all the buzzy AI press releases to demonstrate how the momentum of private-sector investment in AI might affect the way national HPC efforts do business. So, I did a lot of talking to both my old colleagues in DOE and their industry partners in an attempt to help them understand how the hyperscale and AI industry thinks about infrastructure, and what they should expect in the next year.

    More importantly though, I also sat in on a couple of NSF-themed BOFs to get a better understanding of where their thinking is, where NAIRR is going, how the NSF's strategy contrasts with DOE's strategy, and where the ambitions of the Office of Advanced Cyberinfrastructure might intersect with the trajectory of hyperscale AI.

    What I learned was that NSF leadership is aware of everything that the community should be concerned about: the growth of data, the increasing need for specialized silicon, the incursion of AI into scientific computing, new business models and relationships with industry, and broadening the reach of HPC investments to be globally competitive. But beyond that, I struggled to see a cohesive vision for the future of NSF-funded supercomputing. 

    A BOF with a broad range of stakeholders probably isn't the best place to lay out a vision for the future of NSF's HPC efforts, and perhaps NSF's vision is best expressed through its funding opportunities and awards. Whichever the case may be, it seems like the NSF remains on a path to make incremental progress on a broad front of topics. Its Advanced Computing Systems and Services (ACSS) program will continue to fund the acquisition of newer supercomputers, and a smorgasbord of other research programs will continue funding efforts across public access to open science, cybersecurity, sustainable software, and other areas. My biggest concern is that peanut-buttering funding across such a broad portfolio will make net forward progress much slower than taking big bets. Perhaps big bets just aren't in the NSF's mission though.

    NAIRR was also a topic that came up in every NSF-themed session I attended, but again, I didn't get a clear picture of the future. Most of the discussion that I heard was around socializing the resources that are available today through NAIRR, suggesting that the pilot's biggest issue is not a lack of HPC resources donated by industry, but awareness that NAIRR is a resource that researchers can use. This was reinforced by a survey whose results were presented in the NAIRR BOF:

    It seems like the biggest challenge facing the NSF community relying on NAIRR (acknowledging that the survey has its own sample bias) is that researchers don't really know where to start, even though they have AI resources (both GPUs and model API services) at their disposal. In a sense, this is a great position for the NSF since

    1. its users need intellectual help more than access to GPU resources, and the NSF has been great at promoting education, training, and workforce development.
    2. its users are unlikely to demand the same cutting-edge GPUs that AI industry leaders are snapping up. For example, the largest pool of GPUs in NAIRR are A100 GPUs that NVIDIA donated via DGX Cloud; the big AI companies moved off of Ampere a year ago and are about to move off of Hopper.

    However, it also means that there's not a clear role for partnership with many industry players beyond donating resources to the NAIRR pilot today in the hopes of selling resources to the full NAIRR tomorrow. I asked what OAC leadership thought about moving beyond such a transactional relationship between NSF and industry at one of the BOFs I attended, and while the panelists were eager to explore specific answers to that question, I didn't hear any ideas that would approach some sort of truly equitable partnership where both parties contributed in-kind.

    I also walked away from these NSF sessions struck by how different the NSF HPC community's culture is from that of the DOE. NSF BOF attendees seemed focused on getting answers and guidance from NSF leadership, unlike the typical DOE gathering, where discussions often revolve around attendees trying to shape priorities to align with their own agendas. A room full of DOE people tends to feel like everyone thinks they're the smartest person there, while NSF gatherings appear more diverse in the expertise and areas of depth of its constituents. Neither way is inherently better or worse, but it will make the full ambition of NAIRR (as an inter-agency collaboration) challenging to navigate. This is particularly relevant as DOE is now pursuing its own multi-billion-dollar AI infrastructure effort, FASST, that appears to sidestep NAIRR.

    Exhibitor trends

    There's no better way to figure out what's going on in the HPC industry than walking the exhibit floor each year, because booths cost money and reflect the priorities (and budgets) of all participants. This year's exhibit felt physically huge, and walking from one end to the other was an adventure. You can get a sense of the scale from this photo I took during the opening gala:


    Despite having almost 18,000 registrants and the opening gala usually being a crush of people, the gala this year felt and looked very sparse simply because people and booths were more spread out. There was also a perceptibly larger number of splashy first-time exhibitors promoting downstream HPC technologies like data center cooling and electrical distribution, and there was healthy speculation online about whether the hugeness of the exhibit this year was due to these new power and cooling companies.

    To put these questions to rest, I figured out how to yank down all the exhibitor metadata from the conference website so I could do some basic analysis on it.

    Booths by the numbers

    The easiest way to find the biggest companies to appear this year was to compare the exhibitor list and booth sizes from SC23 to this year and see whose booth went from zero to some big square footage.

    I only took the top twenty new vendors, but they broadly fall into a couple of categories:

    • Power and cooling: Stulz, Delta, Airedale, Valvoline, Boundary Electric, Schneider Electric, Mara
    • Server manufacturing: Wistron, AMI, Pegatron
    • Higher ed: Tennessee Tech, SCRCC

    There were a couple other companies that must've just missed last SC but aren't new to the show (NetApp, Ansys, Samsung, Micron, Broadcom). And curiously, only one new GPU-as-a-Service provider (Nebius) showed up this year, suggesting last year was the year of the GPU Cloud.

    But to confirm what others had speculated: yes, a significant amount of the new square footage of the exhibit floor can be attributed to companies focused on power and cooling. This is an interesting indicator that HPC is becoming mainstream, largely thanks to AI demanding ultra-high density of power and cooling. But it's also heartening to see a few new exhibitors in higher education making an appearance. Notably, SCRCC (South Carolina Research Computing Consortium) is a consortium between Clemson, the University of South Carolina, and Savannah River National Laboratory that formed just last year, and I look forward to seeing what their combined forces can bring to bear.

    We can also take a look at whose booths grew the most compared to SC23:

    This distribution is much more interesting, since the top 20 exhibitors who grew their footprint comprise the majority of the growth in existing exhibitors. Cherry-picking a few interesting growers:

    • Power and cooling: USystems, Midas, Vertiv
    • Data center/GPUaaS: iM, Iris Energy, and (arguably) Oracle
    • Software: Arc Compute and CIQ
    • Companies facing serious financial or legal troubles: I count at least three! Impressive that they are still pouring money into their SC booths.

    It's also interesting to see HLRS, the German national HPC center, grow so significantly. I'm not sure what prompted such a great expansion, but I take it to mean that things have been going well there.

    Finally, Dell had a massive booth and showing this year. Not only did they grow the most since SC23, but they had the single largest booth on the exhibit floor at SC24. This was no doubt a result of their great successes in partnering with NVIDIA to land massive GPU buildout deals at places like xAI and CoreWeave. They also had "AI factory" messaging emblazoned all over their marketing material and debuted a nice 200 kW liquid-cooled rack that will be the basis for their GB200 NVL72 solution, clearly leaning into the idea that they are leaders in AI infrastructure. Despite this messaging being off-beat for the SC audience as I've described earlier, their booth was surprisingly full all the time, and I didn't actually get a chance to get in there to talk to anyone about what they've been doing.

    Equally interesting are the vendors who reduced their footprint at SC24 relative to SC23:

    Reading too much into any of these big shrinkers is pretty easy; while a reduction in booth size could suggest business hasn't been as good, it could equally mean that an exhibitor just went overboard at SC23 and downsized to correct this year. A few noteworthy exhibitors to call out:

    • Penguin and the Korea Semiconductor Industry Association both cut way back from massive 50x50 booths to 30x30. Their booths this year were both big, but they weren't massive. Viridien, formerly known as CGG, also shrunk from a massive booth to a less-massive 30x40.
    • Juniper still kept an independent booth, but it is in the process of being absorbed into HPE. Shrinking makes sense.
    • Major cloud providers Google and AWS scaled back, but Microsoft did not.
    • GPU-as-a-Service cloud providers CoreWeave and Lambda both scaled back. Since these GPUaaS providers' business models typically rely on courting a few big customers, it may make sense to cut back on booth space.
    • Major AI storage companies DDN, VAST, and (to a lesser degree) Pure also scaled back, while WEKA did not. I know business for DDN and VAST has been great this past year, so these may just reflect having gone overboard last year.

    Overall, almost twice as many vendors grew their booths than scaled back, so I'd caution anyone against trying to interpret any of this as anything beyond exhibitors right-sizing their booths after going all-in last year.

    Finally, there are a handful of vendors who disappeared outright after SC23:

    It is critical to point out that the largest booths to vanish outright were all on the smaller side: SUSE, Tenstorrent, and Symbiosys Alliance all disappeared this year, but their booths last year were only 20x30. I was surprised to see that Tenstorrent and Arm didn't have booths, but the others are either companies I haven't heard of (suggesting the return on investment of showing at SC might've been low), are easy to rationalize as only being HPC-adjacent (such as SNIA and DigitalOcean), or simply went bankrupt in the last year.

    As we say at the business factory, the net-net of the exhibit hall this year is that the square footage of booth space increased by 15,000 square feet, so it was in fact bigger, it did take longer to walk from one end to the other, and there definitely were a bunch of new power and cooling companies filling out the space. Some exhibitors shrank or vanished, but the industry as a whole appears to be moving in a healthy direction.

    And if you're interested in analyzing this data more yourself, please have a look at the data and the Jupyter notebook I used to generate the above treemaps on GitHub. If you discover anything interesting, please write about it and post it online!
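
    For the sake of illustration, here's a minimal sketch of the kind of year-over-year comparison I did, assuming you've scraped each year's exhibitor list into a CSV of names and booth dimensions. The file and column names are hypothetical; the real notebook linked above works off the conference website's metadata instead.

```python
import pandas as pd

# Hypothetical CSVs scraped from the SC23 and SC24 exhibitor listings;
# each row has an exhibitor name plus booth dimensions in feet.
sc23 = pd.read_csv("sc23_exhibitors.csv")  # columns: name, width_ft, depth_ft
sc24 = pd.read_csv("sc24_exhibitors.csv")

for df in (sc23, sc24):
    df["area_sqft"] = df["width_ft"] * df["depth_ft"]

merged = sc23.merge(sc24, on="name", how="outer", suffixes=("_sc23", "_sc24"))
merged[["area_sqft_sc23", "area_sqft_sc24"]] = (
    merged[["area_sqft_sc23", "area_sqft_sc24"]].fillna(0)
)
merged["delta"] = merged["area_sqft_sc24"] - merged["area_sqft_sc23"]

print("Biggest new or growing booths:")
print(merged.nlargest(20, "delta")[["name", "delta"]])

print("Biggest shrinkers and departures:")
print(merged.nsmallest(20, "delta")[["name", "delta"]])
```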

    Proliferation of GPU-as-a-Service providers

    As an AI infrastructure person working for a major cloud provider, I kept an eye out for all the companies trying to get into the GPU-as-a-Service game. I described these players last year as "pure-play GPU clouds," and it seems like the number of options available to customers who want to go this route is growing. But I found it telling that a lot of them had booths that were completely indistinguishable from each other. Here's an example of one:


    As best I can tell, these companies are all NVIDIA preferred partners with data centers and a willingness to deploy NVIDIA GPUs, NVIDIA SmartNICs, and NVIDIA cloud stack, and sell multi-year commitments to consume those GPUs. I tried to accost some of these companies' booth staff to ask them my favorite question ("What makes you different from everyone else?"), but most of these companies' booths were staffed by people more interested in talking to each other than me.

    These GPUaaS providers tend to freak me out, because, as Microsoft's CEO recently stated, these companies are often "just a bunch of tech companies still using VC money to buy a bunch of GPUs." I can't help but feel like this is where the AI hype will come back to bite companies who have chosen to build houses upon sand. Walking the SC24 exhibit floor is admittedly a very narrow view of this line of business, but it seemed like some of these companies were content to buy up huge booths, hang a pretty banner above it, and otherwise leave the booth empty of anything beyond a few chairs and some generic value propositions. I didn't feel a lot of hunger or enthusiasm from these companies despite the fact that a bunch of them have hundreds of millions of dollars of GPUs effectively sitting on credit cards that they are going to have to make payments on for the next five years.

    That all said, not all the companies in the GPUaaS space are kicking back and letting the money pour in. In particular, I spent a few minutes chatting up someone at the CoreWeave booth, and I was surprised to hear about how much innovation they're adding on top of their conventional GPUaaS offering. For example, they developed Slurm on Kubernetes (SUNK) with one of their key customers to bridge the gap between CoreWeave exposing its GPU service through Kubernetes and the fact that many AI customers have built their stacks around Slurm, pyxis, and enroot.

    In a weird twist of fate, I later ran into an old acquaintance who turned out to be one of the key CoreWeave customers for whom SUNK was developed. He commented that SUNK is the real deal and does exactly what his users need which, given the high standards that this person has historically had, is a strong affirmation that SUNK is more than just toy software that was developed and thrown on to GitHub for an easy press release. CoreWeave is also developing some interesting high-performance object storage caching software, and all of these software services are provided at no cost above whatever customers are already paying for their GPU service.

    I bring this up because it highlights an emerging distinction in the GPUaaS market, which used to be a homogenous sea of bitcoin-turned-AI providers. Of course, many companies still rely on that simple business model: holding the bill for rapidly depreciating GPUs that NVIDIA sells and AI startups consume. However, there are now GPUaaS providers moving up the value chain by taking on the automation and engineering challenges that model developers don't want to deal with. Investing in uncertain projects like new software or diverse technology stacks is certainly risky, especially since they may never result in enough revenue to pay for themselves. But having a strong point of view, taking a stance, and investing in projects that you feel are right deserves recognition. My hat is off to the GPUaaS providers who are willing to take these risks and raise the tide for all of us rather than simply sling NVIDIA GPUs to anyone with a bag of money.

    Community and connections

    As much as I enjoy increasing shareholder value, the part of SC that gives me the greatest joy is reconnecting with the HPC community. Knowing I'll get to chat with my favorite people in the industry (and meet some new favorite people!) makes the long plane rides, upper respiratory infections, and weird hotel rooms completely worth it.

    I wound up averaging under six hours of sleep per night this year in large part because 9pm or 7am were often the only free times I had to meet with people I really wanted to see. I have this unhealthy mindset where every hour of every day, from the day I land to the day I leave, is too precious to waste, and it's far too easy for me to rationalize that spending an hour talking to someone interesting is worth losing an hour of sleep.

    But like I said at the outset of this blog post, this year felt different for a few reasons, and a lot of them revolve around the fact that I think I'm getting old. Now, it's always fun to say "I'm getting old" in a mostly braggadocious way, but this feeling manifested in concrete ways that affected the way I experienced the conference:

    1. I hit my limit on Monday night and couldn't get home without spending 15 minutes sitting in an unlit playground across from the World of Coke. I've always gotten blisters and fatigue, but this was the first time I couldn't just cowboy up and muscle through it. To avoid a repeat of this, I wound up "wasting" (see above) a lot more time to just get off my feet this year.
    2. This year, I reached the point where I need to start time-boxing how much time I spend chatting with the folks I bump into. I used to just let the good times roll if I ran into someone I knew, but this year I wound up spending as much time attending sessions as I did missing sessions because I got caught up in conversations. This isn't a bad thing per se, but I did feel a little sour when I realized I'd made a bad bet by choosing to chat instead of attending a session or vice versa, and this bad feeling lingered in the back of my mind just about every day.
    3. There weren't a lot of surprises for me at the conference this year, and I worry that I am at risk of losing touch with the technical aspects of the conference that get newer attendees excited. Instead of hearing about, say, the latest research in interconnects, more of my time was spent mucking it up with the sorts of people in the HPC community who I used to find intimidating. On the one hand, hooray me for making it into old boys' clubs. But on the other, I don't want to become some HPC greybeard whose last meaningful contribution to the industry was twenty years ago.
    4. This is the first year where I've had people accost me and ask me for advice. I've long been accosted by strangers because of my online presence, but those interactions were always lighthearted exchanges of "I follow you on Twitter" and "Great to meet you. Have an @HPC_Guru pin." This year, I had people specifically ask me for advice on industry versus postdoc, AI versus HPC, and what my master plan was when I left NERSC. Even though I didn't have any sage advice, I still found it really hard to tell bright-eyed students to go kick rocks just so I wouldn't be late for yet another mushy panel on AI.

    If you read this all and think "boo hoo, poor Glenn is too popular and wise for his own good," yeah, I get it. There are worse problems to have. But this was the first year where I felt like what I put into the conference was greater than what I got out of it. Presenting at SC used to be at least as good for my career as it was useful for my audiences, but it just doesn't count for much given my current role and career stage. It felt like some of the magic was gone this year in a way I've never experienced before. 

    Getting to know people

    As the years have gone on, I spend an increasing amount of my week having one-on-one conversations instead of wandering aimlessly. This year though, I came to SC without really having anything to buy or sell:

    • I am not a researcher, so I don't need to pump up the work I'm doing to impress my fellow researchers.
    • I no longer own a product market segment, so I don't directly influence the customers or vendors with whom my employer works.
    • I don't have any bandwidth in my day job to support any new customers or partnerships, so I don't have a strong reason to sell people on partnering with me or my employer. 

    Much to my surprise though, a bunch of my old vendor/partner colleagues still wanted to get together to chat this year. Reflecting back, I was surprised to realize that it was these conversations--not the ones about business--that were the most fulfilling this year.

    I learned about people's hobbies, families, and their philosophies on life, and it was amazing to get to know some of the people behind the companies with whom I've long dealt. I was reminded that the person is rarely the same as the company, and even behind some of the most aggressive and blusterous tech companies are often normal people with the same concerns and moments of self-doubt that everyone else has. I was also reminded that good engineers appreciate good engineering regardless of whether it's coming from a competitor or not. The public persona of a tech exec may not openly admire a competitor's product, but that doesn't mean they don't know good work when they see it.

    I also surprised a colleague whose career has been in the DOE labs with an anecdote that amounted to the following: even though two companies may be in fierce competition, the people who work for them don't have to be. The HPC community is small enough that almost everyone has got a pal at a competing company, and when there are deals to be made, people looove to gossip. If one salesperson hears a juicy rumor about a prospective customer, odds are that everyone else on the market will hear about it pretty quickly too. Of course, the boundaries of confidentiality and professionalism are respected when it matters, but the interpersonal relationships that are formed between coworkers and friends don't suddenly disappear when people change jobs.

    And so, I guess it would make sense that people still want to talk to me even though I have nothing to buy or sell. I love trading gossip just as much as everyone else, and I really enjoyed this aspect of the week.

    Talking to early career people

    I also spent an atypically significant amount of my week talking to early career people in HPC who knew of me one way or another and wanted career advice. This is the first year I recall having the same career conversations with multiple people, and this new phase of my life was perhaps most apparent during the IEEE TCHPC/TCPP HPCSC career panel in which I was invited to speak this year.

    It was an honor to be asked to present on a career panel, but I didn't feel very qualified to give career advice to up-and-coming computer science graduate students who want to pursue HPC. I am neither a computer scientist nor a researcher, but fortunately for me, my distinguished co-panelists (Drs. Dewi Yokelson, Olga Pearce, YJ Ji, and Rabab Alomairy) had plenty of more relevant wisdom to share. And at the end of the panel, there were a few things we all seemed to agree on as good advice:

    1. Knowing stuff is good, but being able to learn things is better. Being eager to learn and naturally curious makes this much easier as well.
    2. The life of a researcher sometimes requires more than working a standard nine-to-five, so it'll be hard to be really successful if your heart isn't in it.
    3. People will forget what you did or what you said, but they remember how you made them feel. Don't be a jerk, because this community is small.

    In both this panel and the one-on-one conversations I had with early-career individuals, the best I could offer was the truth: I never had a master plan that got me to where I am; I just try out new things until I realize I don't like doing them anymore. I never knew what I wanted to be when I grew up, and I still don't really, so it now makes me nervous that people have started approaching me with the assumption that I've got it all figured out. Unless I torpedo my career and go live on a goat farm, though, maybe I should prepare for this to be a significant part of my SC experiences going forward.

    Shift in social media

    One last, big change in the community aspect of SC this year was the mass-migration of a ton of HPC folks from Twitter to Bluesky during the week prior to the conference. I don't really understand what prompted it so suddenly; a few of us have been trying for years to get some kind of momentum on other social platforms like Mastodon, but the general lack of engagement meant that all the excitement around SC always wound up exclusively on Twitter. This year was different though, and Bluesky hit critical mass with the HPC community.

    I personally have never experienced an SC conference without Twitter; my first SC was in 2013, and part of what made that first conference so exciting was being able to pull up my phone and see what other people were seeing, thinking, and doing across the entire convention center via Twitter. Having the social media component to the conference made me feel like I was a part of something that first year, and as the years went on, Twitter became an increasingly indispensable part of the complete SC experience for me.

    This year, though, I decided to try an experiment and see what SC would be like if I set Twitter aside and invested my time into Bluesky instead.

    The verdict? It was actually pretty nice.

    It felt a lot like the SC13 days, where my day ended and began with me popping open Bluesky to see what new #SC24 posts were made. And because many of the tech companies and HPC centers hadn't yet made it over, the hashtag wasn't clogged up by a bunch of prescheduled marketing blasts that buried the posts written by regular old conference attendees who were asking important questions:

    Which booths at #sc24 have coffee? I noticed oracle do. Anyone else?

    — Mike Croucher (@walkingrandomly.bsky.social) November 18, 2024 at 3:02 PM

    Of course, I still clogged Bluesky up with my nonsense during the week, but there was an amazing amount of engagement by a diversity of thoughtful people--many who came from Twitter, but some whose names and handles I didn't recognize.

    The volume of traffic on Bluesky during the week did feel a little lower than what it had been on Twitter in years past, though. I also didn't see as many live posts of technical sessions as they happened, so I couldn't really tell whether I was missing something interesting in real time. This may have contributed to why I felt a little less connected to the pulse of the conference this year than I had in the past. It also could've been the fact that the conference was physically smeared out across a massive space; the sparsity of the convention center was at least on par with the sparsity on Bluesky.

    At the end of the week, I didn't regret the experiment. In fact, I'll probably be putting more effort into my Bluesky account than my Twitter account going forward. To be clear though, this isn't a particularly political decision on my part, and I pass no judgment on anyone who wants to use one platform over the other. It's just that I like the way I feel when I scroll through my Bluesky feeds, and I don't get that same feeling when I use Twitter.

    So what's the takeaway?

    SC this year was a great conference by almost every measure, as it always is, but it still felt a little different for me. I'm sure that some of that feeling is the result of my own growth, and my role with respect to the conference seems to be evolving from someone who gets a lot out of the conference to someone who is giving more to the conference. That's not to say that I don't get a lot out of it, though; I had no shortage of wonderful interactions with everyone from technology executives to rising stars who are early in their career, and I learned a lot about both them and me as whole people. But SC24, more than any SC before it, is when I realized this change was happening.

    On the technological front, we saw the debut of a new #1 system (emblazoned with the smiling face of Bronis...) and a growing crop of massive, new clusters deployed for commercial applications. The exhibit floor was quantitatively bigger, in large part due to new power and cooling companies who are suddenly relevant to the HPC world thanks to the momentum of AI. At the same time, the SC technical program is clearly separating itself out as a conference focused on scientific computing; the level of discourse around AI remains largely superficial compared to true AI conferences, and the role of hyperscalers in the HPC industry is still cast more as a threat than an opportunity.

    For my part, I'm still trying to get a grasp on where government agencies like DOE and NSF want to take their AI ambitions so I can try to help build a better mutual understanding between the scientific computing community and the hyperscale AI community. However, it seems like the NSF is progressing slowly on a wide front, while the DOE is doing what DOE does and charging headfirst into a landscape that has changed more than I think they realize.

    There's a lot of technical content that I know I missed on account of the increasing time I've been spending on the people and community aspect of the conference, and I'm coming to terms with the idea that this just may be the way SC is from now on. And I think I'm okay with that, since the support of the community is what helped me go from being a bored materials science student into someone whose HPC career advice is worth soliciting in the short span of eleven years. Despite any or all of the cynicism that may come out in the things I say about this conference, SC is always the highlight of my year. I always go into it with excitement, gladly burn the candle at both ends all week, and fly home feeling both grateful for and humbled by everything the HPC community has done and continues to do to keep getting me out of bed in the morning.

    LLM training without a parallel file system


    The illustrious Jeff Denworth recently posted a hot take across social media, claiming that training large language models (LLMs) doesn't require massive, expensive parallel file systems:


    As someone who's been working on one of the largest supercomputers on the planet--one that has no parallel file system at all--I was surprised by how many incredulous or curious responses followed. I guess supercomputers and parallel file systems are like peas and carrots in so many people's minds that the idea of being able to run a massive parallel compute job without a massive parallel file system is so unintuitive that it is unbelievable.

    I've given talks about how LLM training uses storage in the past, but I realized I've never written it down. So, for the benefit of humankind, let's talk about how these supercomputers without parallel file systems work.

    The workload

    Though the actual model training on giant GPU supercomputers gets all the attention, the full process of training an LLM is a little more involved. A colleague of mine at Microsoft gave a great overview of this storage-centric, end-to-end picture at SNIA SDC24; broadly, training an LLM involves the following steps:

    1. Data ingestion: This is where crawlers scrape the Internet and pull down raw html, images, videos, and other media. These raw data are indexed and shoved into a data warehouse. At scale, this can be hundreds or thousands of petabytes of data for frontier models.
    2. Data preparation: This is where the raw data is converted into tokenized data. It amounts to a huge data analytics problem that uses well-documented text and image processing pipelines that filter, deduplicate, and otherwise clean the raw garbage on the Internet using frameworks like Apache Spark. The hundreds of petabytes of input get reduced down by 10x-1000x.
    3. Model training: This is where the tokenized data is shoveled through the LLM on giant GPU clusters in little batches. As the data is processed, the model weights are updated, and those weights are checkpointed to storage. If a compute node crashes and the job fails, that checkpoint is used to restart, just like a traditional scientific HPC application. There might be fine-tuning and the like happening as part of this too, but I won't talk about that.
    4. Model deployment and inferencing: This is where the final model is copied across giant fields of inferencing servers, and a web service sits in front of it all to transform REST API requests into actual inferencing queries that run on the GPUs. This isn't training, but we'll talk about it anyway.

    To understand why a parallel file system offers no particular benefit to any of these steps, let's take a closer look at what's going on in each one.

    Data ingestion

    Data ingestion is a widely distributed process that involves minimal computation; you just need a lot of Internet-facing network connectivity and CPU cores to drive independent processes connecting to other people's public HTTP servers. I don't know a lot about what this process looks like, because it never relies on anything resembling a supercomputer.

    To the best of my knowledge, data ingestion just pulls HTML, images, or video streams from the Internet and packs them into data containers. As it is packing webpages into these files, it is building a separate index that stores metadata about the webpage (URL, encoding, date of access) and its location (the file in which the webpage's contents are stored and the byte offset within that file). Thousands of VMs might be performing these tasks completely independently, and because they do not need to synchronize with each other at any step, it can be better to distribute these scrapers around the world rather than centralize all of them in a single datacenter.

    While one could store each scraped HTML page as its own file in a parallel file system, accessing those files would be very slow--a full crawl of all the data would require scanning hundreds of billions of little files. So instead of implementing data containers using files and the index using a file system directory tree, it's better to implement data containers on top of object stores and use a distributed key-value store for the index. The fact that scraped data is write-once (and therefore doesn't need features like file locking or read-modify-write) makes it a natural fit for object stores' design around object immutability.
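
    To make the container-plus-index idea concrete, here is a minimal sketch that assumes an S3-compatible object store accessed through boto3. The container format, bucket name, and index schema are purely illustrative rather than what any particular crawler actually uses; real pipelines use formats along the lines of WARC plus a distributed key-value store for the index.

```python
# Minimal sketch of the pack-and-index pattern described above. The container
# format, bucket name, and index schema are illustrative only.
import io
import time

import boto3  # assumes an S3-compatible object store

s3 = boto3.client("s3")

def pack_and_index(pages, bucket, container_key):
    """Pack many scraped pages into one large object; return index records."""
    buf = io.BytesIO()
    index = []
    for url, html in pages:                      # pages: iterable of (url, bytes)
        offset = buf.tell()
        buf.write(html)
        index.append({
            "url": url,
            "container": container_key,          # which object holds this page
            "offset": offset,                    # byte offset within that object
            "length": len(html),
            "fetched_at": time.time(),
        })
    # One large, immutable PUT instead of millions of tiny files.
    s3.put_object(Bucket=bucket, Key=container_key, Body=buf.getvalue())
    return index                                 # rows destined for the KV index
```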

    Data preparation

    Once raw data is indexed and saved in object stores, the first phase of computation comes into play. I've documented this data processing pipeline on my LLM training datasets page, but a lot of it amounts to running Apache Spark-like pipelines that chew through all the raw data in a trivially parallel way.

    These data processing pipelines are very well defined from the days when Hadoop was all the rage, and their data access patterns map well to the strengths of object stores. Each processing task might read a couple hundred megabytes of data from an object all at once, process it in-memory, then dump it back out to objects all at once. File systems offer no benefit here, because each task reads once and writes once rather than skipping around inside individual objects.
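
    As a rough illustration of this read-once, write-once pattern, here is a minimal PySpark sketch. The bucket paths and the length filter are invented for the example, and exact duplicate removal stands in for the far more elaborate filtering that real pipelines perform.

```python
# Minimal sketch of the read-once / write-once data prep pattern in PySpark.
# Paths and thresholds are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

spark = SparkSession.builder.appName("llm-data-prep").getOrCreate()

# Each task reads whole objects in large transactions...
raw = spark.read.text("s3a://training-data/raw/*.txt")

cleaned = (
    raw
    .filter(length(col("value")) > 200)   # drop fragments and boilerplate
    .dropDuplicates(["value"])            # exact dedup; fuzzy dedup is a separate pass
)

# ...and writes the much smaller result back out in one shot.
cleaned.write.mode("overwrite").parquet("s3a://training-data/cleaned/")
```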

    There is a significant compute workload here, and there are points in the data processing pipeline where global synchronization of all tasks is required. Specifically, the process of deduplicating input data--which is a critical step to getting a high-quality model these days--requires comparing every piece of data to every other piece of data. As a result, this data preparation phase is often done in a centralized location that is adjacent to the object store containing all the raw data scraped in the previous step. The clusters used for data processing can resemble traditional CPU-based supercomputers (think a system like TACC's Frontera), and in some cases, they might even have full RDMA fabrics to accelerate the all-to-all deduplication step.
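
    One common building block for this fuzzy deduplication is MinHash: each document is reduced to a compact signature, and signatures rather than raw documents are compared all-to-all. The toy sketch below is only meant to illustrate the idea; production pipelines use optimized libraries and locality-sensitive hashing so they never literally compare every pair.

```python
# Toy illustration of MinHash signatures, a common building block for fuzzy
# deduplication. Not how any production pipeline actually implements it.
import hashlib

def shingles(text, n=5):
    """Break a document into overlapping n-word chunks."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text, num_perm=64):
    """Return a compact signature; similar documents get similar signatures."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                s.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest(), "little")
            for s in shingles(text)
        ))
    return sig

def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity; near-duplicates score close to 1.0."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```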

    Critically, this data processing step is not done on the GPU nodes that actually train the model. Data processing is usually limited by I/O bandwidth to storage, and you never want your GPUs stalling out because they're waiting for data. Parallel file system vendors might tell you that the only way to avoid this GPU starvation issue is to plug every GPU node into a super-fast parallel file system, but the reality is that people just do this I/O-heavy step on completely separate supercomputers before training on GPUs ever begins.

    CPU nodes are significantly cheaper than GPU nodes, so buying cheap object storage and a cheap CPU cluster is more cost-effective than buying an expensive file system and wasting your GPU nodes on trivially parallel text processing tasks. To illustrate this, consider some normalized list prices from Azure:

    • $1.00 gets you a 96-core general-purpose VM with 384 GB of RAM
    • $1.65 gets you a 176-core HPC-optimized VM with NDR InfiniBand and 768 GB of RAM
    • $22.55 gets you a 96-core, 8x H100 GPU VM with NDR InfiniBand

    Given that GPUs don't give you a 13x-22x speedup for data processing despite costing 13x-22x as much ($22.55 versus $1.65 or $1.00), it makes no sense to perform this data processing on GPU nodes inline with training.

    One could argue that the GPUs are sitting idle while the data processing cluster is working anyway, but rest assured that AI model shops have no shortage of work to keep their GPUs busy. Data processing for the next model on a CPU cluster often happens at the same time the current model is being trained on the GPU cluster. In cases where there isn't enough work to keep both CPU and GPU clusters busy around the clock, also remember that most of this stuff happens in the cloud, and cloud providers can sell those idle CPU or GPU cycles to another customer in between training campaigns.

    Model training

    Huge, distributed training jobs are where most people would think a fast parallel file system is required for both reading input data and writing out checkpoints. After all, the need for fast checkpointing and restart was the primary driver behind the creation of parallel file systems.

    While parallel file systems certainly can be used for training, they are not the most cost-effective or scalable way to train across tens of thousands of GPUs. To illustrate why, let's consider the processes of reading inputs and writing checkpoints separately.

    Reading training data

    Training a model on GPUs, whether it be on one or a thousand nodes, follows a simple cycle (this is a "step" in LLM training parlance) that's repeated over and over:

    1. A batch of tokenized data is loaded into GPU memory
    2. That data is then processed through the neural network and the model weights are adjusted
    3. All GPUs synchronize their updated weights

    It's tempting to imagine the I/O load generated by step #1 as being the same as it would be for a traditional HPC job: data is read from a parallel file system into compute memory at the start of every single step:

    Animation showing a naive approach to loading data directly from shared storage into the GPU nodes of an InfiniBand cluster.

    In years past, storage vendors would've insisted that this repeated, random re-reading of input data at every step requires a super-fast parallel file system to keep up. However, two factors make that untrue:

    1. The input data isn't millions of little text or image files. As described in the data ingest and data processing steps, these small files are packaged into large objects before the GPUs ever see them.
    2. Tokenized data is very dense compared to raw input, so the number of bytes being read over the course of hundreds or thousands of steps is actually quite small.

    To quantify #2, consider the Llama-3 405b model, which was trained on a significant fraction of the public Internet--15.6 trillion tokens. That sounds like a lot of information until you realize that a typical token is only 3 to 5 bytes depending on the tokenizer and encoding. This means that training the entire 405-billion-parameter Llama-3 model, which used 16,000 GPUs, only required loading about 60 TB of tokens from storage. That divides out to 3.75 GB of tokens processed by each GPU over the entire course of a 54-day run.
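
    The back-of-the-envelope math behind those figures looks like this (using roughly 3.85 bytes per token, a midpoint consistent with the ~60 TB total quoted above):

```python
# Back-of-the-envelope math for the figures quoted above. The bytes-per-token
# value is approximate and depends on the tokenizer and encoding.
tokens = 15.6e12                 # Llama-3 405b training tokens
bytes_per_token = 3.85           # roughly 3-5 bytes per token
gpus = 16_000
days = 54

total_bytes = tokens * bytes_per_token        # ~60 TB of tokenized data
per_gpu = total_bytes / gpus                  # ~3.75 GB per GPU for the whole run
avg_rate = per_gpu / (days * 86_400)          # well under 1 kB/s per GPU on average

print(f"{total_bytes / 1e12:.0f} TB total, {per_gpu / 1e9:.2f} GB per GPU, "
      f"{avg_rate:.0f} B/s average per GPU")
```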

    When you consider how few bytes are required to train an LLM, it should become clear that the biggest I/O challenge in the performance-critical training loop isn't raw bandwidth; it's performance variability. As such, the best way to ensure that GPUs do not stall out due to read requests is to eliminate as much I/O performance variability as possible. To do this, you have to minimize the sources of contention that might arise between the storage devices and the network that connects them to the GPUs. While you can do this using sophisticated quality-of-service in both the storage servers and interconnect, there is an easier way.

    The secret ingredient is local NVMe.

    Just stick some local SSDs in every GPU node.

    This ensures that no contention will occur when loading data from storage into the GPU, because the only network between them is the PCIe on the node. In addition, using node-local NVMe allows storage capacity and storage performance to scale linearly with GPU performance. By comparison, a remote storage system (whether it be parallel file or object) won't get any bigger or faster as you add more GPUs to the training job, so each GPU loses more and more efficiency to I/O as the job scales out.

    In practice, model training uses local SSDs like this:

    Animation showing how data can be read once from shared storage into GPU nodes, then distributed within the GPU nodes' InfiniBand cluster.

    At the start of a training job, data is read from remote storage into the local SSDs in a distributed fashion once. Because the tokenized data is so small, many replicas of the entire dataset can be stored across the job's GPU nodes as well; for example, if you were to train Llama-3 405b on NVIDIA DGX H100 nodes, you could fit the entire training dataset (all 60 TB of it) on just three nodes since each node comes with 30 TB of local SSD. Given that the model was trained on 16,000 GPUs (2,000 nodes), that translates to storing hundreds of replicas of the entire training set. This has a few major benefits:

    1. GPUs never have to wait for shared storage to return data before they can compute. Everything they need is on the local SSDs.
    2. When a GPU node fails, its input data can be recovered from a surviving GPU node over the backend InfiniBand. After training starts, input data never has to be read from shared storage again.
    3. It's common to scale up training over time by adding more GPUs (more data-parallel domains) to the job as it stabilizes. When this happens, I/O performance scales linearly because these new GPUs never have to fight over shared storage.
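
    As a concrete (if simplified) illustration of the stage-once pattern, the sketch below has each node download a disjoint slice of the tokenized dataset from an S3-compatible object store onto its local NVMe before training starts. The bucket layout, local path, and NODE_RANK/NUM_NODES environment variables are all assumptions made for the example.

```python
# Minimal sketch of staging tokenized data to node-local NVMe exactly once
# before training begins. Paths and environment variables are illustrative.
import os

import boto3

LOCAL_DIR = "/local_nvme/tokens"     # node-local SSD mount (illustrative)
BUCKET = "training-data"
PREFIX = "tokenized/"

def stage_shard(node_rank: int, num_nodes: int):
    s3 = boto3.client("s3")
    os.makedirs(LOCAL_DIR, exist_ok=True)
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET,
                                                             Prefix=PREFIX):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    # Deterministic striping: every node downloads a disjoint slice of objects,
    # so the dataset is read from shared storage exactly once.
    for key in sorted(keys)[node_rank::num_nodes]:
        s3.download_file(BUCKET, key,
                         os.path.join(LOCAL_DIR, os.path.basename(key)))

if __name__ == "__main__":
    stage_shard(int(os.environ["NODE_RANK"]), int(os.environ["NUM_NODES"]))
```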

    A reasonable critique of this approach is that data management becomes more complicated; either the training framework has to keep track of which SSDs and nodes have copies of which input data, or a distributed, client-side shared namespace like WEKA Converged Mode or CoreWeave LOTA has to sit between your application and your data. In practice though, frontier models are trained for exactly one epoch; that is, every input token is processed exactly one time to achieve optimal model quality. Because no two GPUs will ever need to read the same input token, there's never a need to copy input tokens between nodes inside the training loop. 

    I also acknowledge that the above description is greatly simplified; the entire node-local SSD capacity cannot be filled with input data, as space is also needed for checkpoints and other temporary data. However, the fact remains that super high-bandwidth or super high-capacity parallel file systems are not necessary for loading input tokens during training. AI training clusters are built with a ton of local SSDs to do the heavy lifting, and the input data for LLMs is small enough to fit in just a handful of GPU nodes.

    Writing model checkpoints

    Though the read workload of LLM training is modest at best, the write workload can be quite intense at scale because the probability of failure increases superlinearly with the size of the training job. However, unlike with scientific HPC jobs, the checkpoint size does not scale as a function of the job size; the checkpoint for a 405 billion-parameter model trained on 16,000 GPUs is the same size as the checkpoint for that model trained on three nodes. This is a result of the fact that every training step is followed by a global synchronization which makes each data-parallel copy of the model identical. Only one copy of those model weights, which amounts to under a hundred terabytes for state-of-the-art LLMs, needs to be saved:

    Animation showing a naive approach to model checkpointing where a single model replica dumps its model weights directly to shared storage. 

    Kartik and Colleen Tartow at VAST wrote a quantitative breakdown of the true I/O requirements of checkpointing, and they illustrate how even a trillion-parameter model can achieve 99.7% forward progress (only 0.3% time spent checkpointing) when training across 3,072 GPUs with a modest 273 GB/s file system. A parallel file system is not required to get that level of performance; for example, HDD-based Azure Blob achieved over 1 TB/s when benchmarked with IOR for writes at scale.

    As with reading input tokens though, the real goal for checkpointing at scale is to remove any dependence on shared storage from the training loop entirely. And again, the best way to do this is to simply checkpoint to node-local storage. However, special care must be taken to ensure that the checkpoints don't get lost when a node crashes.

    In practice, LLM training is now done with asynchronous, multilevel checkpointing. This technique provides the scalability of checkpointing to node-local storage and the durability of shared storage:

    An animation showing an approach to hierarchical checkpointing where data is first copied from GPU memory into CPU memory, then from CPU memory to the local SSD of a partner node. After that, data is collectively copied from local SSD to shared storage.

    The key to this checkpointing process is hierarchical data synchronization:

    1. Model weights are first copied from GPU memory into the node's CPU memory after every training step. This checkpoint is governed by the CPU-GPU bandwidth (either PCIe or NVLink/Infinity Fabric), and a 500 GB checkpoint can complete in a second. The benefit of checkpointing to DRAM is that the GPU can unblock and begin computing the next step very quickly. However, this checkpoint in DRAM is not protected and will be lost if the node crashes.
    2. To protect against node crashes, the checkpoint is then asynchronously copied from CPU DRAM to a neighbor node's local SSD using RDMA. Now if a node crashes, it can restore from a checkpoint that is stored on its neighboring node's SSD via InfiniBand. Reading and writing a 500 GB checkpoint to neighboring SSDs might take ten seconds, so this asynchronous replication might be done for every tenth DRAM checkpoint.
    3. To store many checkpoints long-term, checkpoints are also asynchronously copied from node-local SSD to shared storage. This might take a minute or two per 500 GB checkpoint, so this last-level checkpoint copy might be done once every ten minutes.
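
    For concreteness, here is a minimal sketch of that three-tier scheme in PyTorch-flavored Python. The paths, intervals, and the partner-node copy (shown here as a plain file copy) are illustrative stand-ins; real training frameworks implement the DRAM snapshot, RDMA replication, and object-store drain for you.

```python
# Minimal sketch of asynchronous, multilevel checkpointing following the three
# tiers above. Paths, intervals, and the partner-node copy are placeholders.
import os
import shutil
import threading

import boto3
import torch

LOCAL_DIR   = "/local_nvme/ckpt"     # tier 2a: this node's SSD
PARTNER_DIR = "/partner_nvme/ckpt"   # tier 2b: stand-in for a neighbor node's SSD
BUCKET      = "llm-checkpoints"      # tier 3: durable shared object store

def drain(cpu_state, step, ssd_every=10, remote_every=100):
    """Runs in a background thread, off the training loop's critical path."""
    path = os.path.join(LOCAL_DIR, f"step{step:08d}.pt")
    if step % ssd_every == 0:
        os.makedirs(LOCAL_DIR, exist_ok=True)
        torch.save(cpu_state, path)              # persist to local SSD
        shutil.copy(path, PARTNER_DIR)           # replicate to the partner node
    if step % remote_every == 0:                 # remote_every is a multiple of
        boto3.client("s3").upload_file(          # ssd_every, so `path` exists
            path, BUCKET, os.path.basename(path))

def checkpoint(model, step):
    # Tier 1: snapshot the weights into host DRAM so the GPU can resume compute
    # almost immediately (a ~500 GB-per-node copy over PCIe/NVLink takes ~1 s).
    cpu_state = {k: v.detach().to("cpu", non_blocking=True)
                 for k, v in model.state_dict().items()}
    torch.cuda.synchronize()                     # make sure the copies landed
    threading.Thread(target=drain, args=(cpu_state, step), daemon=True).start()
```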

    This hierarchical checkpointing scheme allows the GPUs to spend only a second checkpointing while being able to recover from job, node, and even cluster-level failures by tailoring the checkpoint tiering frequencies to the performance of each storage tier being used. The cost of recovering from a catastrophic failure might be re-computing up to ten minutes' worth of training, but given the rarity of such events, this scheme balances the speed (and fragility) of checkpointing to DRAM against the low cost (and low performance) of a durable, HDD-based object store.

    To this latter point, the requirements of the shared storage system at the bottom of this checkpointing hierarchy are very modest:

    • The checkpoint only needs to complete in the time between successive last-level checkpoint copies. If the 500 GB checkpoint is drained to shared storage only once every ten minutes, our shared storage only needs to deliver 1 GB/s of total bandwidth.
    • The write pattern from node-local NVMe to shared storage is arbitrary, because it is a simple copy operation of a fully formed checkpoint file. Unlike direct-to-storage checkpoints, there are no weirdly shaped tensors being serialized into a file on the fly; rather, opaque bits are streaming from a local checkpoint file into a remote object using whatever transfer size and parallelism gives the highest write bandwidth.

    This combination of modest write bandwidth and simple, sequential, large-block writes is ideally suited for object stores. This isn't to say a parallel file system cannot work here, but this checkpointing scheme does not benefit from directory structure, fine-grained consistency semantics, or any of the other complexities that drive up the cost of parallel file systems.
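
    As a sketch of what that last-level drain can look like, the snippet below streams a fully formed checkpoint file into an S3-compatible object store using whatever part size and concurrency yields the best write bandwidth. The bucket, key, and tuning values are illustrative.

```python
# Sketch of the last-level checkpoint drain: opaque bits streamed from local
# NVMe into a single remote object. Names and tuning values are illustrative.
import boto3
from boto3.s3.transfer import TransferConfig

cfg = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,   # large, sequential 64 MiB parts
    max_concurrency=16,                     # 16 parallel PUT streams
)

boto3.client("s3").upload_file(
    "/local_nvme/ckpt/step00001000.pt",     # fully formed checkpoint on local SSD
    "llm-checkpoints",
    "run42/step00001000.pt",                # one big, immutable remote object
    Config=cfg,
)
```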

    The catch, of course, is that checkpointing using these schemes can be complicated to implement. Fortunately, a growing number of training frameworks support both writing and restoring checkpoints using asynchronous and hierarchical approaches. Model developers never have to worry about interacting with specific files or objects; instead, the framework manages data locality during checkpoint and restart underneath a high-level API.

    Model deployment and inferencing

    Once a model is trained, putting it into production as an inferencing service is the final step of its lifecycle. From a storage and I/O standpoint, this is a lot more complicated than training because it marries an enterprise service delivery model (failover, load balancing, authentication, and scaling) with copies of a trained model running across HPC infrastructure. When you hear vendors talking about key-value stores, vector databases, and RAG, that is all happening at this stage.

    Setting aside everything but the storage attached to the GPU cluster though, the I/O requirements of inferencing are relatively straightforward:

    1. When provisioning a GPU node for inferencing, model weights must be loaded from shared storage as fast as possible.
    2. When using an LLM to search documents, a vector database is required to perform the similarity search that augments the LLM query with the relevant documents. This is the basis for RAG.
    3. Key-value caches are often used to reduce the latency for different parts of the inferencing pipeline by storing context including the conversation or frequently accessed contextual documents.
    4. As the inferencing demand evolves, different models and weights may be swapped in and out of individual GPU servers.

    A parallel file system is not particularly useful for any of these; the only place where its high bandwidth would be a benefit is in loading and re-loading model weights (#1 and #4). But as with hierarchical checkpointing, those I/O operations are whole-object, read-only copies that are a natural fit for object APIs. Complex directory structures and strong consistency simply aren't necessary here.
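
    For example, provisioning an inference node can be as simple as pulling all of a model's weight shards from the object store onto local NVMe in parallel before the serving process starts. The sketch below assumes sharded weight files under a prefix in an S3-compatible bucket; all of the names are invented for illustration.

```python
# Minimal sketch of bulk-loading model weights onto an inference node: list the
# weight shards under a prefix and download them concurrently to local NVMe.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET, PREFIX, LOCAL_DIR = "model-weights", "llama-405b/", "/local_nvme/weights"

def fetch_all():
    s3 = boto3.client("s3")
    os.makedirs(LOCAL_DIR, exist_ok=True)
    keys = [obj["Key"]
            for page in s3.get_paginator("list_objects_v2").paginate(
                Bucket=BUCKET, Prefix=PREFIX)
            for obj in page.get("Contents", [])]
    # Whole-object, read-only GETs; no locking or directory semantics needed.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for key in keys:
            pool.submit(s3.download_file, BUCKET, key,
                        os.path.join(LOCAL_DIR, os.path.basename(key)))

if __name__ == "__main__":
    fetch_all()
```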

    Objects are good enough, maybe better

    None of the steps in this model training lifecycle uniquely benefit from the capabilities that parallel file systems offer:

    • Data ingestion involves hundreds of petabytes of small documents, but they are immediately packaged and indexed into large data containers. Their metadata is stored in a separate key-value store, so the directory hierarchy of a file system isn't used, and once data has been packaged and indexed, it's never modified in-place. The bandwidth requirements are modest as well since web crawling is the rate-limiting step.
    • Data processing is an I/O-intensive data analytics workload. Read bandwidth is critical here, but data is accessed in large transactions and most of the computation is embarrassingly parallel. This workload runs on standalone analytics clusters, so even though the read bandwidth here is rate-limiting, slower storage is not going to impact GPU utilization on training clusters in any way. This step also reduces data by 100x or more, so the write requirements are also modest.
    • Training requires both loading input tokens and checkpointing model weights. However, both of these workloads lean on node-local NVMe in every node to eliminate slowdowns due to noisy neighbors. Input data is staged to node-local storage only once at the beginning of a training campaign, and checkpoints are asynchronously bled out to shared storage without impacting GPU utilization.
    • Inferencing involves infrequent, read-only, bulk loading of model weights into GPU nodes. While key-value caches and vector databases are also used in inferencing, parallel file systems offer no particular benefit for them.

    The I/O patterns of each of these steps map nicely to object storage since they are predominantly write-once and whole-file transactions. Parallel file systems certainly can be used, and workloads will benefit from the high bandwidth they offer. However, they come with the cost of features that aren't necessary--either literal costs (in the case of appliances or proprietary software) or figurative costs (allocating people to manage the complexities of debugging a parallel file system).

    The importance of this latter point is hard to appreciate if you've never used a supercomputer without a parallel file system. However, I recently sat in on the validation of a brand-new H200 training cluster where various InfiniBand congestion and routing issues were being worked out. It wasn't until someone said "eviction" in some nontechnical context that I realized that the sporadic file system evictions that normally accompany fabric instability were simply a non-issue. There was no cleanup of mount points after major fabric events because there was no persistent, fragile client-server state being maintained. I/Os between GPU nodes or between nodes and storage might have failed during a rough patch, but they recovered and resumed on their own as soon as the fabric came back. Similarly, identity didn't matter, and all tests could be run as root because there was no implicit trust between the client kernel and remote storage. Removing the dependence between compute nodes, LDAP, and healthy file system mounts completely eliminates many of the challenges of standing up new clusters quickly.

    An ideal AI training cluster architecture

    The workloads I described above form a rough outline for an AI training infrastructure which has:

    1. A bunch of GPU nodes with a strong RDMA backend like InfiniBand. Each node should have at least enough node-local SSD to store a substantial amount of the input tokens to be used for training, enough space for hierarchical checkpointing, and enough I/O bandwidth to these SSDs to support draining checkpoints from partner nodes' DRAM in just a few seconds. A separate frontend network that connects to storage is also a good idea; it ensures that asynchronous checkpoint draining won't interfere with weight synchronization in the training loop.
    2. A separate CPU cluster for data processing pipelines. A strong backend network will benefit the deduplication step (which is critical to producing high-quality training datasets), but more emphasis should be placed on optimizing large-transaction reads from storage. Given that CPU nodes are so much cheaper than GPU nodes, separating the data processing nodes from training nodes allows you to cut more corners when optimizing this CPU cluster. Keeping data processing out-of-band of actual model training means your most data-intensive step (data processing) is decoupled from your most expensive step (training).
    3. A scalable object store that supports basic write-once semantics with modest I/O bandwidth at scale. This matches the needs of the workloads with the price-performance of the storage system and simplifies the recovery process if the interconnect between compute and storage gets congested. It can also serve the data needs of all stages of the training pipeline: hundreds of petabytes of raw training data, hundreds of terabytes of input tokens, and tens of terabytes of model weights all have similar performance needs and can be stored on the same infrastructure with the appropriate QOS settings.
    4. A pool of general-purpose compute infrastructure for hosting the raw training data indices. This can also be used to support vector databases, raw context documents for RAG, and any other ancillary services required for production inferencing.

    By eschewing a high-performance parallel file system and localizing I/O performance to inside the GPU cluster with node-local NVMe, a vanilla network between the GPU cluster and the other subsystems is sufficient. Although lower-performance, these non-critical bits (ideally) carry lower complexity and lighter maintenance and support burdens as well, allowing (again, ideally) more resources to be sloshed towards supporting the high-value GPU infrastructure.

    Incidentally, this architecture happens to be how most of the largest AI training clusters on which I work are designed.

    But parallel file systems aren't all bad

    Of course, having no parallel file system presents some usability challenges if users are expecting to be able to SSH into a login node and have a complete user environment ready. The user experience for the above infrastructure works best for those who are comfortable developing software in containers and launching pods rather than developing software in vim and submitting Slurm jobs. I do not advocate for throwing out parallel file systems if they're already ingrained in users' workflows!

    In addition, the latest crop of modern, distributed file systems all now support multi-protocol data access. For example, WEKA, VAST, and Qumulo all support S3 (object) interfaces as first-class citizens. Users who want the traditional HPC experience can play with their data using a file mount as they always have, while those who are coming in from the cloud-native side have equal access to those same data as objects. Supporting multiprotocol access to data in AI environments doesn't reduce the need to overbuild infrastructure or support stateful file mounts across all compute nodes, but it does provide an onramp for users to get comfortable moving away from the traditional HPC user experience.

    Finally, a few of the leading-edge parallel-file-system-turned-AI-storage platforms are also shipping features that make them valuable for the deployment and inferencing part of the lifecycle. For example, WEKA has their WARRP reference architecture for RAG, and VAST has its InsightEngine--both use the unique architectures underneath their file interfaces to accelerate vector queries far beyond what you would get from running a vector database on, say, Lustre. These so-called "AI data platforms," despite starting as parallel file systems, are spreading their relevance out to the entire LLM lifecycle, filling needs for file, object, and structured data with a single storage system.

    This is all to say that parallel file systems aren't bad, and they aren't going anywhere. But they aren't required to train frontier models either, and as I've tried to describe above, some of the largest supercomputers on the planet are designed not to require them.
