
Hadoop's Uncomfortable Fit in HPC

Hadoop has come up in a few conversations I've had in the last few days, and it's occurred to me that the supercomputing community continues to have a difficult time fully understanding how Hadoop currently fits (and should fit) into scientific computing.  HPCwire was kind enough to run a piece a few months ago that let me voice my perspective on the realities of Hadoop use in HPC--that is, scientists are still getting a feel for Hadoop and what it can do, and it just isn't seeing widespread adoption in scientific computing yet.  This contrasts with the tremendous buzz surrounding the "Hadoop" brand and ultimately gives rise to strange dialogue, originating from the HPC side of the fence, like this:


I'm not sure if this original comment was facetious and dismissive of the Hadoop buzz or if it was a genuinely interested observation.  Regardless of the intent, both interpretations reveal an important fact: Hadoop is being taken seriously only at a subset of supercomputing facilities in the US, and at a finer granularity, only by a subset of professionals within the HPC community.  Hadoop is in a very weird place within HPC as a result, and I thought it might benefit the greater discussion of its ultimate role in research computing if I laid out some of the factors contributing to Hadoop's current awkward fit.  The rest of this post will strive to answer two questions: Why does Hadoop remain at the fringe of high-performance computing, and what will it take for it to be a serious solution in HPC?

#1. Hadoop is an invader

I think what makes Hadoop uncomfortable to the HPC community is that, unlike virtually every other technology that has found successful adoption within research computing, Hadoop was not designed by HPC people.  Compare this to the other technologies that are core to modern supercomputing, virtually all of which grew out of the HPC community itself--its national labs, its universities, and its vendors.

By contrast, Hadoop was developed by Yahoo, and the original MapReduce was developed by Google.  They were not created to solve problems in fundamental science or national defense; they were created to provide a service for the masses.  They weren't meant to interface with traditional supercomputers or domain scientists; Hadoop is very much an interloper in the world of supercomputing.

The notion that Hadoop's commercial origins make it contentious for stodgy people in the traditional supercomputing arena may sound silly without context, but the fact is, developing a framework for a commercial application rather than a scientific application leaves it with an interesting amount of baggage.

#2. Hadoop looks funny

The most obvious baggage that Hadoop brings with it to HPC is the fact that it is written in Java.  One of the core design features of the Java language was to allow its programmers to write code once and be able to run it on any hardware platform--a concept that is diametrically opposite to the foundations of high-performance computing, where code should be compiled and optimized for the specific hardware on which it will run.  Java made sense for Hadoop due to its origins in the world of web services, but Java maintains a perception of being slow and inefficient.  Slow and inefficient codes are, frankly, offensive to most HPC professionals, and I'd wager that a majority of researchers in traditional HPC scientific domains simply don't know the Java language at all.  I sure don't.

The idea of running Java applications on supercomputers is beginning to look less funny nowadays with the explosion of cheap genome sequencing.  Some of the most popular foundational applications in bioinformatics (e.g., GATK and Picard) are written in Java, and although considered an "emerging community" within the field of supercomputing, bioinformatics is rapidly outgrowing the capabilities of lab-scale computing.  Perhaps most telling are Intel's recent contributions to GATK, which facilitate much richer use of AVX operations for variant calling.

With that being said though, Java is still a very strange way to interact with a supercomputer.  Java applications don't compile, look, or feel like normal applications in UNIX as a result of their cross-platform compatibility.  Its runtime environment exposes a lot of very strange things to the user for no particularly good reason (-Xmx1g?  I'm still not sure why I need to specify this to see the version of Java I'm running, much less do anything else) and it doesn't support shared-memory parallelism in an HPC-oriented way (manual thread management, thread pools...yuck).  For the vast majority of HPC users coming from traditional domain sciences and the professionals who support their infrastructure, Java applications remain unconventional and foreign.

#3. Hadoop reinvents HPC technologies poorly

For those who have taken a serious look at the performance characteristics of Hadoop, the honest truth is that it re-invents a lot of functionality that has existed in HPC for decades, and it does so very poorly.  Consider the following examples:
  1. Hadoop uses TCP with a combination of REST and RPC for inter-process communication.  HPC has been using lossless DMA-based communication, which provides better performance in all respects, for years now.
  2. Hadoop doesn't really handle multi-tenancy and its schedulers are terrible.  The architecture of Hadoop is such that, with a 3x replication factor, a single cluster can only support three concurrent jobs at a time with optimal performance.  Its current scheduler options have very little in the way of intelligent, locality-aware job placement.
  3. Hadoop doesn't support scalable interconnect topologies.  The rack-aware capabilities of Hadoop, while powerful for their intended purpose, do not support scalable network topologies like multidimensional meshes and toruses.  They handle Clos-style network topologies, period.
  4. HDFS is slow and obtuse.  Parallel file systems like Lustre and GPFS have been an integral part of HPC for years, and HDFS is both slower and more difficult to use by comparison.  The lack of a POSIX interface means getting data in and out is tedious, and its vertical integration of everything from replication and striping to centralized metadata in Java makes it rather unresponsive.
However, these poor reinventions are not the result of ignorance; rather, Hadoop's reinvention of a lot of HPC technologies arises from reason #1 above: Hadoop was not designed to run on supercomputers, and it was not designed to fit into the existing matrix of technologies available to traditional HPC.  Instead, it was created to interoperate with web-oriented infrastructure.  Specifically addressing the above four points,
  1. Hadoop uses TCP/IP and Ethernet because virtually all data center infrastructure is centered around these technologies, not high-speed RDMA.  Similarly, REST and RPC are used across enterprise-oriented services because they are simple protocols.
  2. Multi-tenancy arises when many people want to use a scarce resource such as a supercomputer; in the corporate world, resources should never be a limiting factor because waiting in line is what makes consumers look elsewhere.  This principle and the need for elasticity is what has made the cloud so attractive to service providers.  It follows that Hadoop is designed to provide a service for a single client such as a single search service or data warehouse.
  3. Hadoop's support for Clos-style (leaf/spine) topologies models most data center networks.  Meshes, toruses, and more exotic topologies are exclusive to supercomputing and had no relevance to Hadoop's intended infrastructure.
  4. HDFS implements everything in software to allow it to run on the cheapest and simplest hardware possible--JBODs full of spinning disk.  The lack of a POSIX interface is a direct result of Hadoop's optimization for large block reads and data warehousing.  By making HDFS write-once, a lot of complex distributed locking can go out the window because MapReduce doesn't need it.
This loops back around to item #1 above: Hadoop came from outside of HPC, and it carries this baggage with it.

#4. Hadoop evolution is backwards

A tiny anecdote

I gave two MapReduce-related consultations this past month which really highlighted how this evolutionary path of Hadoop (and MapReduce in general) is not serving HPC very well.

My first meeting was with a few folks from a large clinical testing lab that was beginning to incorporate genetic testing into their service lineup. They were having a difficult time keeping up with the volume of genetic data being brought in by their customers and were exploring Hadoop BLAST as an alternative to their current BLAST-centric workflow. The problem, though, is that Hadoop BLAST was developed as an academic project when Hadoop 0.20 (which has evolved into Hadoop 1.x) was the latest and greatest technology. Industry has largely moved beyond Hadoop version 1 onto Hadoop 2 and YARN, and this lab was having significant difficulties in getting Hadoop BLAST to run on their brand new Hadoop cluster because its documentation hasn't been updated in three years.

The other meeting was with a colleague who works for a multinational credit scoring company.  They were deploying Spark on their Cloudera cluster for the same reason as the aforementioned clinical testing company: their data collection processes were outgrowing their computational capabilities, and they were exploring better alternatives for data exploration.  The problem they encountered was not one caused by their applications being frozen in time after someone finished their Ph.D.; rather, their IT department had botched the Spark installation.

Generally speaking, the technologies at the core of HPC all follow a similar evolutionary path into broad adoption.  Both software and hardware technologies arise as disparities between available and necessary technologies widen.  Researchers often hack together non-standard solutions to these problems until a critical mass is achieved, and a standard technology emerges to fill the gap.  OpenMP is a great example--before it became standard, there were a number of vendor-specific pragma-based multithreading APIs; Cray, Sun, and SGI all had their own versions that did the same thing but made porting codes between systems very unpleasant.  These vendors ultimately all adopted a standard interface which became OpenMP, and that technology has been embraced because it provided a portable way of solving the original motivating problem.

The evolution of Hadoop has very much been a backwards one; it entered HPC as a solution to a problem which, by and large, did not yet exist.  As a result, it followed a common, but backwards, pattern by which computer scientists, not domain scientists, got excited by this new toy and invested a lot of effort into creating proof of concept codes and use-cases.  Unfortunately, this sort of development is fundamentally unsustainable by itself, and as the shine of Hadoop wore off, researchers moved on to the next big thing and largely abandoned these model applications.  This has left a graveyard of software, documentation, and ideas that are frozen in time and rapidly losing relevance as Hadoop moves on.

Consider this evolutionary path of Hadoop compared to OpenMP: there were no OpenMP proofs-of-concept.  There didn't need to be any; the problems had already been defined by the people who needed OpenMP, so by the time OpenMP was standardized and implemented in compilers, application developers already knew where it would be needed.

Not surprisingly, the innovation in the Hadoop software ecosystem remains where it was developed: data warehousing and data analytics.

How can Hadoop fit into HPC?

So this is all why Hadoop is in such an awkward position, but does it mean Hadoop (and MapReduce) will never be welcome in the world of HPC?  Alternatively, what would it take for Hadoop to become a universally recognized core technology in HPC?

I'll say up front that there are no easy answers--if there were, I wouldn't be delivering this monologue.  However, solutions to a few of the four major barriers I outlined above are being developed and attempted.

Reimplement MapReduce in an HPC-oriented way

This idea has been tried in a number of different ways (see MPI MapReduce and Phoenix), but none have really gained traction.  I suspect this is largely the result of one particular roadblock: there just aren't that many problems which are so burdensome in the traditional HPC space that reimplementing a solution in a relatively obscure implementation of MapReduce becomes worth the effort.  As I mentioned in point #4 above, HPC vendors haven't been creating their own MapReduce APIs to address the demands of their customers, so Hadoop's role in HPC is not clearly addressing a problem that needs an immediate solution.

This is not to say that the data-oriented problems at which Hadoop excels do not exist within the domain sciences.  Rather, there are two key roles that Hadoop/MapReduce will play in scientific computations:
  • Solving existing problems:  The most activity I've seen involving Hadoop in domain sciences comes out of bioinformatics and observational sciences.  Bioinformatics as a consumer of HPC cycles is still in its infancy, but the data sets being generated by next-generation sequencers are enormous--the data to describe a single human genome, even when compressed, takes up about 120 GB.  Similarly, advances in imaging and storage technology have allowed astronomy and radiology to generate extremely large collections of data.
  • Enabling new problems: One of Hadoop's more long-term promises is not solving the problems of today, but giving us a solution to problems we previously thought to be intractable.  Although I can't disclose too much detail, an example of this lies in statistical mechanics: many problems involving large ensembles of particles have relied on data sampling or averaging to reduce the sheer volume of numerical information into a usable state.  Hadoop and MapReduce allow us to start considering what deeper, more subtle patterns may emerge if a massive trajectory through phase space could be dumped and analyzed with, say, machine learning methods (a rough sketch of this kind of whole-dataset analysis follows this list).
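
To make that second point slightly more concrete, here is a minimal pyspark sketch of the kind of whole-dataset analysis described above.  Everything in it is an illustrative assumption--the plain-text dump format (one "frame x y z" record per line), the HDFS path, and the bin width--rather than a real workflow:

# Run in the pyspark shell, where sc is already provided.
dump = sc.textFile('hdfs://master.ibnet0/user/glock/trajectory.dat')

def parse(line):
    fields = line.split()
    frame = int(fields[0])
    x, y, z = [float(f) for f in fields[1:4]]
    return (frame, (x, y, z))

# Histogram the z coordinate over every frame in the trajectory instead of
# sampling a handful of frames and hoping they are representative.
binwidth = 0.5
histogram = (dump.filter(lambda line: line.strip())       # skip empty lines
                 .map(parse)
                 .map(lambda kv: (int(kv[1][2] // binwidth), 1))
                 .reduceByKey(lambda a, b: a + b)
                 .collect())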

Unfortunately, reimplementing MapReduce inside the context of existing HPC paradigms represents a large amount of work for a relatively small subset of problems.

Incorporate HPC technologies in Hadoop

Rather than reimplementing Hadoop/MapReduce as an HPC technology, I think a more viable approach forward is to build upon the Hadoop framework and correct some of its poorly reinvented features I described in item #3 above.  This will allow HPC to continuously fold in new innovations being developed in Hadoop's traditional competencies--data warehousing and analytics--as they become relevant to scientific problems.  Some serious effort is already being made to this end, most notably in getting Hadoop to communicate over RDMA-capable fabrics like InfiniBand and in backing it with parallel file systems like Lustre instead of HDFS.

In addition to incorporating these software technologies from HPC into Hadoop, there are some really innovative things you can do with hardware technologies that make Hadoop much more appealing to traditional HPC.  I am working on some exciting and innovative (if I may say so) hardware designs that will further lower the barrier between Hadoop and HPC, and with any luck, we'll get to see some of these ideas go into production in the next few years.

Make MapReduce Less Weird

The very nature of MapReduce is strange to supercomputing--it solves a class of problems that the world's fastest supercomputers just weren't designed to solve.  Rather than making raw compute performance the most important capability, MapReduce treats I/O scalability as paramount and CPU performance as secondary.  As such, it will always be weird until the day comes when science faces an equal balance of compute-limited and data-limited problems.  Fundamentally, I'm not sure that such a day will ever come.  Throwing data against a wall to see what sticks is good, but deriving analytical insight is better.

With all that being said, there's room for improvement in making Hadoop less weird.  Spark is an exciting project because it sits at a nice point between academia and industry; developed at Berkeley but targeted directly at Hadoop, it feels like it was developed for scientists, and it treats high performance as a first-class citizen by providing the ability to use memory a lot more efficiently than Hadoop does.  It also doesn't have such a heavy-handed Java-ness to it and provides a reasonably rich interface for Python (and R support is on the way!).  There are still a lot of rough edges (this is where the academic origins shine through, I think), but I'm hopeful that it cleans up under the Apache project.

Perhaps more than (or inclusive of) the first two paths forward in increasing MapReduce adoption in research science, Spark holds the most promise in that it feels less like Hadoop and more normal from the HPC perspective.  It doesn't force you to cast your problem in terms of a map and a reduce step; the way in which you interact with your data (your resilient distributed dataset, or RDD, in Spark parlance) is much more versatile and is more likely to directly translate to the logical operation you want to perform.  It also supports the basic things Hadoop lacks such as iterative operations.
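
As a purely illustrative sketch (the file names, the outlier threshold, and the ten-pass loop below are all made up), the same pyspark session can express both a classic map-and-reduce word count and the kind of cached, iterative refinement that plain Hadoop makes painful:

# Run in the pyspark shell, where sc is already provided; paths are made up.
text = sc.textFile('hdfs://master.ibnet0/user/glock/input.txt')

# The classic Hadoop-style word count, expressed as a map and a reduce
counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs://master.ibnet0/user/glock/wordcounts')

# An iterative calculation over an RDD that stays cached in memory, so each
# pass avoids re-reading HDFS the way a chain of Hadoop jobs would
values = sc.textFile('hdfs://master.ibnet0/user/glock/values.txt') \
           .map(lambda line: float(line)).cache()
mean = values.mean()
for i in range(10):
    values = values.filter(lambda x, m=mean: abs(x - m) < 10.0).cache()
    mean = values.mean()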

Moving Forward

I think I have a pretty good idea about why Hadoop has received a lukewarm, and sometimes cold, reception in HPC circles, and many of the underlying reasons are wholly justified.  Hadoop's from the wrong side of the tracks from the purists' perspective, and it's not really changing the way the world will do its high-performance computing.  There is a disproportionate amount of hype surrounding it as a result of its revolutionary successes in the commercial data sector.

However, Hadoop and MapReduce aren't to be dismissed outright either.  There is a growing subset of scientific problems that are running up against a scalability limit in terms of data movement, and at some point, solving these problems using conventional, CPU-oriented parallelism amounts to using the wrong tool for the job.  The key, as is always the case in this business, is to understand the job and understand that there are more tools in the toolbox than just a hammer.

As these data-intensive and data-limited problems gain a growing presence in traditional HPC domains, I hope the progress being made on making Hadoop and MapReduce more relevant to research science continues.  I mentioned above that great progress is being made towards truly bridging the gap of utility and making MapReduce a serious go-to solution to scientific problems, and although Hadoop remains on the fringe of HPC today, it won't pay to dismiss it for too much longer.

Massive Parallelism and the Outlook for CUDA


Massive Parallelism

The future of high performance computing (and the so-called exascale computing) lies squarely in the realm of massively parallel computing systems. When I first began hearing about "massively parallel" programming and computing, I had (perhaps naïvely) thought massively parallel was just a buzzword to describe more of the same--more cores, more memory, and bigger networks with faster interconnects. As I have since learned, there is a lot more to massively parallel computing than stuffing cores into a rack; such an approach has really reached its limits with super-massive clusters like RIKEN's K-computer, whose LINPACK benchmarks are only matched by its absurd power draw, equivalent to $12,000,000 a year alone.

Massively parallel systems take a different (and perhaps more intelligent) approach that, in my opinion, reflects a swinging of the supercomputing pendulum from clusters of general-purpose commodity hardware back to specialized components. Whereas the building block of traditional parallel computing is the general-purpose CPU (good at everything but great at nothing), the building blocks for massive parallelism are indivisible bundles of low-power, minimally functioning cores. In the context of GPU computing, these are "thread blocks" or "work groups" (or perhaps warps, depending on your viewpoint), and in the context of Blue Gene, these are I/O nodes+.

The parallel processing elements within these building blocks have very limited functionality; unlike a CPU in a cluster, they do not run an OS*, and they rely on other hardware to perform many duties such as scheduling and I/O. Also unlike a standard CPU, these elements are clocked low and have very poor serial performance. Their performance arises solely from the fact that they are scheduled in bundles, and instead of scheduling one core to attack a problem as you would in conventional parallel computing, you are typically scheduling over 100 compute elements (let's call them threads for simplicity) at a time in massively parallel computing.  The fact that you are guaranteed to have hundreds of threads in flight means you can begin to do things like layer thread execution so that if one thread is stalled at a high-latency event (anything from a cache miss to an MPI communication), other threads can execute simultaneously and use the otherwise idle computational capabilities of the hardware.

In fact, this process of "hiding latency" underneath more computation is key to massively parallel programming. GPUs incorporate this concept into their scheduling hardware at several levels (e.g. during the so-called zero-overhead warp execution and block scheduling within the streaming multiprocessors), and Blue Gene/Q's A2 cores have 4-way multithreading and innovative hardware logic like "thread-level speculation" and "thread wakeup" that ensure processing cycles don't get wasted.  In addition to hardware-based latency hiding, additional openings exist (e.g., PCIe bus transfers in GPUs or MPI calls in BG/Q) where the programmer can use nonblocking calls to overlap latency with computation.
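
As a toy sketch of that last point--overlapping a high-latency operation with useful computation by posting nonblocking calls first--here is a two-rank example written with mpi4py purely for brevity (mpi4py, the message size, and the "useful work" are assumptions of this illustration, not anything specific to BG/Q or CUDA):

# overlap.py -- run with: mpirun -np 2 python overlap.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                             # assumes exactly two ranks

sendbuf = np.full(1 << 20, rank, dtype='d')
recvbuf = np.empty(1 << 20, dtype='d')

# Post the nonblocking send and receive first...
requests = [comm.Isend(sendbuf, dest=peer, tag=0),
            comm.Irecv(recvbuf, source=peer, tag=0)]

# ...then do useful local work while the messages are in flight...
local_sum = np.sum(np.sin(sendbuf))

# ...and only block once there is nothing left to overlap with.
MPI.Request.Waitall(requests)
print(rank, local_sum, recvbuf[0])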

While I firmly believe that massive parallelism is the only way to move forward in scientific computing, there isn't a single path forward that has been cut.  IBM's Blue Gene is a really innovative platform that is close enough to conventional parallel computing that it can be programmed using standard APIs like MPI and OpenMP; the jump from parallel to massively parallel on Blue Gene is in the algorithms rather than the API.  GPGPU programming (i.e., CUDA) is a lot more restrictive in the sense that both your algorithm and the API (and in fact the programming language) are tightly constrained by the GPU hardware.  Intel's upcoming MIC/Knights Corner/Xeon Phi accelerators are supposed to be x86 compatible, but I don't know enough to speak on how they will need to be programmed.



Practical Limitations of CUDA

With all this being said, I don't see CUDA being one of those paths into the exascale; I don't think it has a long-term future in scientific computing.

I may be poisoned by the fact that I've been taking a few CUDA courses; the format of these things tends to be a few minutes explaining a technique followed by an hour of cautionary tales, things you can't do, and ways the technique won't work under various conditions.  Here are some examples.

GPUs have painfully limited memory

GPU architecture is a lot more awkward than NVIDIA indicates up front, and a part of this is how the various tiers of memory function.  While you can get Fermi accelerators with 6 GB of RAM, the memory bandwidth feeding the thread blocks that handle core execution is nowhere near enough to realize peak performance.  You wind up having to tune your algorithm to operate on only a few kilobytes of data at a time, and the exact amount of local memory you can use varies.  This leads into the next problem.

Algorithms are tightly coupled to hardware

The amount of resources (registers, threads, local memory) available to each thread block is governed by how you choose to dice up your numerical problem.  While a given problem divided a certain way will usually run across any CUDA-capable hardware, the performance can vary widely depending on what hardware you choose to use.  NVIDIA has done a good job of keeping certain implementation-specific hardware features like the warp size constant; however, warp scheduling is not even a part of CUDA despite its significant effect on how efficient algorithms are written, and most literature is careful to note that warps are only 32 threads large "for now."

While it's a reasonable expectation to have to re-tool code for new and improved architectures or hardware revisions, changes in GPU hardware have the potential to make optimized code sub-optimal by changing the capacities of each streaming multiprocessor.  The outlook for more generic APIs like OpenCL is no better; although an OpenCL code may run across different types of accelerators, it will need to be retooled to achieve optimum performance on each individual type.

GPUs just aren't suited to many parallel problems

All vendors selling massively parallel computing solutions are guilty of understating the difference between peak performance and expected performance, but GPUs seem to have far more suboptimal conditions that can keep your code performance far from peak.  As I see it, GPUs are really only suited for solving dense matrix operations; everything else only performs well under very specific circumstances.

What I feel is understated in GPU whitepapers is that, although a Tesla card has hundreds of cores, they really don't operate like CPU cores.  Take for example the NVIDIA Tesla C2070.  It has 448 cores, but the truth is that these cores are scheduled in 32-core groups (warps) that act very much like a 32-lane vector processor.  All 32 cores go through a CUDA kernel in lock-step, meaning you suffer severe penalties for divergent code within a warp and your work size must be a multiple of 32 for peak performance.  In the case of sparse data (n-body problems come to mind), you have to be very deliberate about how your data is sorted before launching a CUDA kernel or you wind up wasting a significant number of cycles.

What this amounts to is that GPUs perform well only if the problem is extremely parallel and the data dependencies are extremely local.
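
To make the divergence point a bit more tangible, here is a minimal sketch written with numba.cuda (an assumption of this illustration--the discussion above is about CUDA C--and it requires an NVIDIA GPU with the numba package installed):

# divergence.py -- requires an NVIDIA GPU plus numba
import numpy as np
from numba import cuda

@cuda.jit
def kernel(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        # Threads in the same 32-thread warp that disagree on this branch get
        # serialized: one side of the branch executes, then the other.
        if x[i] > 0.0:
            out[i] = x[i] * 2.0
        else:
            out[i] = -x[i]

x = np.random.randn(1 << 20)
out = np.zeros_like(x)
threads = 128                               # a multiple of the 32-thread warp
blocks = (x.shape[0] + threads - 1) // threads
kernel[blocks, threads](x, out)             # numba copies the arrays for us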

In the case of n-body problems (specifically, molecular dynamics code), good CUDA acceleration really doesn't exist in a general sense.  You can find benchmarks that show the trademark massive acceleration (e.g., GPU acceleration with NAMD), but I've found these acceleration figures tend to quietly ignore two important points:

  1. MD codes that boast very good CUDA performance seem to come from groups that are extensively funded by NVIDIA in terms of equipment, money, and expertise.  This suggests that effective CUDA acceleration requires significant capital and close contact with experts at NVIDIA.  I am sure there are exceptions, but the codes that hold significant market share follow this trend.
  2. MD benchmarks stating very high acceleration tend to be apples-to-oranges comparisons.  They take a given problem and divide it in a way that would be extremely slow on a CPU (e.g., 864,000 Lennard-Jones atoms per node) but far more efficient on any parallel platform and compare the performance.
It seems like n-body problems like molecular dynamics really only begin to perform well on GPUs when the underlying algorithms are extremely expensive (ellipsoidal particles, complex multibody potentials), but even then, writing optimized CUDA kernels to do this is not easy and does require re-tooling with every successive hardware update.


Outlook

This isn't all to say GPUs are a waste of time and money; the fact that an increasing number of the top 500 supercomputers are using GPUs as a cheap (both in terms of money and energy) source of FLOPS is a testament to their utility in number crunching.  However, GPUs aren't generic enough to effectively accelerate many scientific problems, and I really see them as a stopgap solution until the pendulum swings back towards massively parallel accelerators that, like GPUs, are targeted at scientific computing, but unlike GPUs, are designed specifically for the task of getting maximum generic floating-point throughput.  To their credit, GPGPUs created the demand for specialized high-throughput, massively-parallel floating point accelerators for scientific computing, but as the industry matures, I'm positive that better approaches will emerge.


+ This is true for Blue Gene/P.  Blue Gene/Q allows "block subdivision" which enables localized MPI jobs to run within a block by sharing an I/O node.
* Sort of.  GPUs don't run an OS; Blue Gene runs CNK which doesn't support virtual memory, multitasking, scheduling, or context switching.

Quick MPI Cluster Setup on Amazon EC2


Premise

I was recently tasked with doing some MPI benchmarks on Amazon EC2 to get a general idea of how well EC2's Cluster Compute capabilities perform.  I had never used EC2 before, so the notion of setting up a working set of EC2 instances that have the necessary configuration to run MPI applications was initially quite daunting.  I had no idea what the difference between an AMI and an instance was, what AMI and instance to use, and if I should mess with any automatic provisioning/configuration tools (like Starcluster) to quickly spin up a cluster.

Most guides online are kind of unhelpful in that they try to illustrate some proof of concept in how easy it is to get a fully configured cluster-in-the-cloud setup using some sort of provisioning toolchain.  They gloss over the basics of exactly how to start these instances up and what to expect as far as their connectivity.  Fortunately it only took me a morning to get MPI up and running, and for the benefit of anyone else who just wants to get MPI applications running on EC2 with as little fuss as possible, here are my notes.

It is worth pointing out that I use the term "instance" and "node" (or "compute node") interchangeably in the following discussion.

Step 1: Establish an EC2 Account

Go to http://aws.amazon.com/ and click "My Account / Console" at the top right, then navigate to the "AWS Management Console." You'll be asked to log in, and if you already have an Amazon.com account, you can use that one.  You are then prompted to sign up for AWS, which is a process that requires a credit card number on file to charge for instances, and a robocall to verify your identity.  Once this is done, find your way to the AWS Management Console and click the "EC2" option:


This should take you to the EC2 Dashboard from where you can launch VMs.

Step 2: Create your Instances

Click the big blue "Launch Instances" button to get into the wizard.


I used the Classic Wizard.

Select a VM Image


The first thing the wizard asks you to do is select an AMI (Amazon Machine Image).  Scroll down and choose the Cluster Compute Amazon Linux AMI.


This is sort of like a VM image that establishes your OS distribution, base software stack, and initial login user.  What threw me off is that there is a regular "Amazon Linux" and the "Cluster Compute Amazon Linux," yet they both have the same description.  As far as I understand, the Cluster Compute image uses a virtualization layer (called "HVM") that is a lot thinner than the one used by non-compute images. You wind up getting an instance that runs on an entire physical compute node, and only the I/O adapters are virtualized.  With that being said, this "Cluster Compute Amazon Linux" AMI says nothing about the hardware you want your instances to use.

Select an Instance Type


This next step is where hardware comes in.


The cc2.8xlarge instance has a configuration reasonably close to what SDSC Gordon has: 2x Intel Xeon E5-2670 processors and ~60 GB of RAM per node, so that's what I chose.  I also didn't want to run up a huge bill while I was getting this process stamped out, so I only wanted to launch two nodes to start--the minimum necessary to test MPI communications.

Also, Amazon lets you launch these instances like normal at a rate of $2.40 per instance per hour, or you can potentially get a significantly discounted rate by using spot instances.  Despite the possibility that my spot instances could be shut off at any time, the 90% discount was worth the risk so I went with the spot instances and bid the absolute lowest rate.  This turned out to be sufficient to have these virtual clusters running for a few hours, and I've never actually seen the spot instance bid rise above the value we paid.  I guess these things are not in high demand.  For what it's worth, the entire exercise of figuring all of this EC2 stuff out on a 2-node cluster, then spinning up a 4-node cluster and running a specific application benchmark only cost us $3.04.

Define a Placement Group


Either way, the next step is to define the instances' Placement Group, which is a critical component.


Instances which are in the same placement group share the same network subnet and get the full bisection bandwidth available to these Cluster Compute instances' 10 GigE connections.  Give this cluster's placement group a name, and if you want to add nodes to this cluster on the fly later, you have the possibility of starting more instances and requesting that they be added to this same subnet.

The User Data field is unimportant for our purposes, and you may want to change the Shutdown Behavior to "terminate." This will ensure that you aren't being billed for the VMs while they are "stopped" but not "terminated."

Configure Storage


The next step is defining your instances' storage volumes.


Even though the Cluster Compute AMI is backed by EBS (block storage), I don't think it's persistent since persistent EBS typically costs additional money.  I accept the defaults here, but you can change the "Delete on Termination" value to false.  It appears that AWS charges you all of the regular EBS rates for the default (required?) EBS-backed root volumes used by the Cluster Compute instances, so it might be possible to make the disks on these instances non-volatile.  I haven't played with this very much though.

If you have the money to burn, you can create a persistent EBS store and NFS-mount it across all compute nodes.  I quickly cover how to do this at the bottom of this guide.

Define VM Tags


The next screen lets you define tags to manage your instances.


This is not necessary for this simple test, so just continue past this screen.

Create a Key Pair


The next step, creating your key pair, is critical to being able to get into your instances once they boot up.


Give your key an arbitrary name, then download it as a .pem file.  This is the private key that will allow you to ssh into the instance once it's up; if you don't have this, you will simply be locked out of your own instance.

Set Proper Security Group


The next step of customizing your cluster's security group is also essential.


The default behavior of the wizard is to have you create a new security group called "quick-start-1" that will open port 22 (SSH) to the world but block all traffic everywhere else.  You need to let all instances (nodes) within your cluster communicate freely for MPI to work, so you must add a new rule to this security group that opens all ports (1-65535) to all other instances within the same security group (quick-start-1).  Once you enter these two parameters, you must then click the Add Rule button or this rule will not be applied and MPI will not work!

Launch Your Instances


This is the end of the instance wizard.


Click Launch, and your instances will begin booting up.  Go back to the EC2 Dashboard and wait for your VMs to reach the "Running" state.

Step 3: Configure the Virtual Cluster


Once the instances are booted up, you can get their public IP addresses from the EC2 Dashboard by selecting their entry under the "Instances" page.  The default user on these Cluster Compute Amazon Linux images is "ec2-user", so ssh into one of the instances using the .pem file downloaded earlier:
$ ssh -i ~/Downloads/glock.pem ec2-user@ec2-54-245-13-102.us-west-2.compute.amazonaws.com
Once logged into the instance, getting set up is a pretty quick process.  The general procedure involves making changes on a single master node (whether that means installing packages, compiling software, uploading data, etc.) and then pushing those changes out to all the compute nodes by hand.  This is a very rough approach, but when it comes down to it, you don't need very much to get MPI running.  The goal here is to do as little as it takes to run an MPI benchmark across Amazon EC2 Cluster Compute nodes.

Preliminary Cluster Setup


The Cluster Compute instances are a "cluster" inasmuch as they are all booted to the same OS and all share the same 10 GigE subnet with full bisection bandwidth; on the software front, there is close to nothing that makes the instances a cluster when they are booted up.  The very first thing I did was set up a few things as a matter of convenience.

1. Establish a useful /etc/hosts

Add all nodes' internal IP addresses (10.x.x.x) to /etc/hosts under aliases like node2, node3, node4, etc.  This test cluster has only two nodes, so only two entries need to be added.

2. Create a script to push files across the cluster

Create a simple script to push files out from this master node (node1) to the worker nodes (node2 and any others you may want to have booted).  Mine was a file called pushout that looked like this:
#!/bin/bash
# Copy the named file to the same path on every worker node.
for node in node2 node3 node4
do
  scp "$1" "$node:$1"
done
If you want to get fancy, you can make a similar script to execute a command across all nodes as well.
#!/bin/bash
# Run the given command on every worker node.
for node in node2 node3 node4
do
  ssh $node "$@"
done

3. Enable password-less ssh between instances

Transfer the .pem onto the nodes so they can use password-less ssh to communicate.  I copied mine onto all my instances as /home/ec2-user/.ssh/id_rsa so that ssh/scp used it by default.

Installing and Running MPI


The Amazon Linux AMI is a Red Hat derivative and uses the yum package manager.  Amazon provides both OpenMPI and mpich RPMs in their repository, so installing MPI is a simple matter of

sudo yum install openmpi-devel
or
sudo yum install mpich-devel
This will pull in the GCC compiler, the MPI runtimes, and the mpicc wrapper.  You have to do this on all compute nodes.  In addition, you have to then add the following lines to .bashrc:

export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib

Be sure to then push this new .bashrc file out to all of your compute nodes or your mpi jobs will throw "bash: orted: command not found" errors as soon as you issue mpirun.

With this, you can now build applications with mpicc and run them with mpirun.  Given that our compute cluster here has two nodes (called node1 and node2), each with 16 cores (actually 16 cores + 16 hyperthreads), running a 32-way MPI job is just a matter of creating a hostfile compatible with OpenMPI's mpirun, e.g.,

$ cat ~/nodefile
node1 slots=16
node2 slots=16

and, assuming our mpi binary is ~/simulation.x and has been pushed out to all compute nodes, doing

mpirun -np 32 -hostfile ~/nodefile ~/simulation.x

The hostfile format is a little different for mpich's mpirun, but using mpich instead of OpenMPI is much the same.
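
Before running a real benchmark, it can be worth sanity-checking the 32-way launch itself.  One way to do that (purely a suggestion; it assumes mpi4py has been installed on every node, which the stock AMI does not provide) is a trivial hello-world script:

# hello.py -- print one line per MPI rank
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(), socket.gethostname()))

Push hello.py out to all nodes and run it with

mpirun -np 32 -hostfile ~/nodefile python ~/hello.py

and you should see sixteen lines from each of node1 and node2.  A compiled MPI hello-world built with mpicc works just as well.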

Getting Fancy: NFS Filesystem

Creating a persistent EBS block storage volume is pretty straightforward.  The rates seem pretty cheap ($0.11 per GB per month, and $0.11 per million IOPS, however much that is), so I made one in the same Availability Zone as my cluster (this is essential!), attached it to my master node (node1) from the EC2 Management Console, and checked dmesg on node1 to see what the device name was (/dev/xvdf).  A quick few commands made it available:

# mkfs.ext4 /dev/xvdf
# mkdir /ebs
# mount /dev/xvdf /ebs
$ mount|grep /ebs
/dev/xvdf on /ebs type ext4 (rw)

Sharing this filesystem across the nodes is pretty simple, and you don't really need a separate EBS volume to export a filesystem over NFS.  You can just mkdir /ebs and use the volatile root EBS volume that comes with your instances if you don't want to mess with adding EBS volumes.

Either way, let's assume we're sharing /ebs.  First install nfsd on the master:

# yum install nfs-utils
...
Dependencies Resolved
================================================================================
 Package            Arch        Version                    Repository      Size
================================================================================
Installing:
 nfs-utils          x86_64      1:1.2.3-36.13.amzn1        amzn-main      415 k
Installing for dependencies:
 libevent           x86_64      2.0.18-1.10.amzn1          amzn-main      278 k
 libgssglue         x86_64      0.1-11.7.amzn1             amzn-main       23 k
 libtirpc           x86_64      0.2.1-5.7.amzn1            amzn-main       87 k
 nfs-utils-lib      x86_64      1.1.5-6.9.amzn1            amzn-main       76 k
 rpcbind            x86_64      0.2.0-11.5.amzn1           amzn-main       56 k
Transaction Summary
================================================================================
Install       6 Package(s)
Total download size: 936 k
Installed size: 2.0 M
Is this ok [y/N]: y


Then configure the export by adding this line to /etc/exports:

/ebs (rw,insecure)

Start rpcbind and nfsd on your master node:

# /etc/init.d/rpcbind start
# /etc/init.d/nfs start

And NFS should be ready to go.  Log into each other node and install NFS, create the mountpoint, and mount it:

# yum install nfs-utils
# mkdir /ebs
# mount node1:/ebs /ebs

This then saves the hassle of having to sync up runtime files across nodes.  If you used a persistent EBS block store underneath this NFS share, you can also terminate/restart your instances without having to spend the money and time on EC2 bandwidth into and out of the cloud.

Spark on Supercomputers: A Few Notes

I've been working with Apache Spark quite a bit lately in an effort to bring it into the fold as a viable tool for solving some of the data-intensive problems encountered in supercomputing.  I've already added support for provisioning Spark clusters to a branch of the myHadoop framework I maintain so that Slurm, Torque, and SGE users can begin playing with it, and as a result of these efforts, I've discovered a number of interesting issues with Spark running on traditional supercomputers.

At this point in time, Spark is very rough around the edges.  The core implementation of resilient distributed datasets is all there and works wonderfully, but I've found that it doesn't take long to start discovering bugs and half-implemented features that can get very confusing very quickly.  Perhaps half of the problems I've faced are the result of the fact that I have been trying to run Spark in non-traditional ways (for example, over hosts' TCP-over-InfiniBand interfaces and with non-default config directories), and although the documentation claims to support all of the features necessary to make this possible, the reality is a bit different.

What follows are just some incoherent notes I've taken while porting Spark to the myHadoop framework.  Spark is rapidly developing and it is constantly improving, so I hope this post becomes outdated as the Spark developers make the framework more robust.

Control Script Problems

Hadoop and Spark both ship with "control scripts" or "cluster launch scripts" that facilitate the starting and stopping of the entire cluster of daemons.  At the highest level, this includes start-all.sh and stop-all.sh, which make calls to start-dfs.sh and start-yarn.sh (in Hadoop) and start-master.sh and start-slaves.sh (in Spark).  In Hadoop, these scripts work wonderfully, but Spark's implementation of these control scripts is still quite immature because they carry implicit assumptions about users' Spark configurations.

Like Hadoop, Spark supports a spark-env.sh file (located in $SPARK_CONF_DIR) which defines environment variables for all of the remote Spark workers that are spawned across the cluster.  This file is an ideal place to put the following environment variable definitions:
  • SPARK_MASTER_IP - the default value for this is `hostname` which is generally not a great default on most clusters.  On Rocks, we append ".ibnet" to the hostname to get Spark to operate over the InfiniBand fabric.
  • SPARK_LOCAL_IP - again, ensure that this is set up to use the correct interface on the cluster.  We append .ibnet on Rocks.
  • SPARK_HOME, SPARK_PREFIX, and SPARK_CONF_DIR should also be defined here since spark-env.sh will usually override the variables defined by spark-config.sh (see below)
$SPARK_HOME/sbin/spark-config.sh is where much of the Spark control scripts' "intelligence" comes from as far as defining the environment variables that Spark needs to launch.  In particular, spark-config.sh defines the following variables before reading spark-env.sh:
  • SPARK_PREFIX
  • SPARK_HOME
  • SPARK_CONF_DIR
The problem is that spark-config.sh will stomp all over anything the user defines for the above variables, and since spark-config.sh is called from within all of the Spark control scripts (both invoked by the user and invoked by sub-processes on remote hosts during the daemon spawning process), trying to get Spark to use non-default values for SPARK_CONF_DIR (e.g., exactly what myHadoop does) gets to be tedious.

The Spark developers tried to work around this by having the control scripts call spark-env.sh after spark-config.sh, meaning you should be able to define your own SPARK_CONF_DIR in spark-env.sh.  Unfortunately, this mechanism of calling spark-env.sh after spark-config.sh appears as

. "$sbin/spark-config.sh"

if [ -f "${SPARK_CONF_DIR}/spark-env.sh" ]; then
. "${SPARK_CONF_DIR}/spark-env.sh"
fi

That is, spark-config.sh will stomp all over any user-specified SPARK_CONF_DIR, and then use the SPARK_CONF_DIR from spark-config.sh to look for spark-env.sh.  Thus, there is no actual way to get the Spark control scripts (as of version 0.9) to honor the user-specified SPARK_CONF_DIR.  It looks like the latest commits to Spark have started to address this, but a cursory glance over the newest control scripts suggests that this remains broken.

Anyway, as a result of this, myHadoop's Spark integration eschews the Spark control scripts and handles spawning the daemons more directly using the manual method of spawning slaves.  Doing this averts the following issues:
  1. start-slaves.sh can't find any slaves because it always looks for $SPARK_HOME/etc/slaves.  This can be worked around by passing SPARK_SLAVES=$SPARK_CONF_DIR/slaves to start-slaves.sh for a non-default SPARK_CONF_DIR.
  2. stop-master.sh doesn't do anything useful because you still need to kill -9 the master process by hand.  Not sure why this is the case.

Deciphering Spark Errors

Here are various cryptic stack traces I've encountered while working on Spark.  I kept these mostly for myself, but I've started meeting people that hit the same problems and thought it might be worthwhile to share the diagnoses I've found.

In general, Spark seems to work best when used conservatively, but when you start doing things that do not strictly fall within the anticipated use case, things break in strange ways.  For example, if you try to write an RDD with an empty element (e.g., a text file with empty lines), you would get this really crazy error that does not actually say anything meaningful:

14/04/30 16:23:07 ERROR Executor: Exception in task ID 19
scala.MatchError: 0 (of class java.lang.Integer)
     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
     at org.apache.spark.scheduler.Task.run(Task.scala:53)
     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
     at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:722)

I filed a bug report about this particular problem and the issue has been fixed, but it's just one of those edge cases where Spark will fail catastrophically (I had to look at the source code to figure out what "scala.MatchError" meant).  Usually you wouldn't be operating on empty data sets, but I discovered this error when I was trying to quickly determine if my Spark slaves were communicating with my master correctly by issuing

file = sc.textFile('hdfs://master.ibnet0/user/glock/input.txt')
file.saveAsTextFile('hdfs://master.ibnet0/user/glock/output')

That is, simply reading in a file and writing it back out with pyspark would cause catastrophic failure.  This is what I meant when I said Spark's still rough around the edges.
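
For anyone stuck on an affected version, one workaround (an ad hoc suggestion on my part, not an official fix) is to strip the empty elements out of the RDD before writing it back out:

file = sc.textFile('hdfs://master.ibnet0/user/glock/input.txt')
file.filter(lambda line: len(line.strip()) > 0).saveAsTextFile('hdfs://master.ibnet0/user/glock/output')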

Here are a few more errors I've encountered.  They're not problems with Spark, but the stack traces and exceptions thrown can be a little mysterious.  I'm pasting it all here for the sake of googlers who may run into these same problems.

If you try to use Spark built against Hadoop 2 with a Hadoop 1 HDFS, you'll get this IPC error:

>>> file.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in saveAsTextFile
    keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
  File "/home/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/home/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
     at org.apache.hadoop.ipc.Client.call(Client.java:1070)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
     at $Proxy7.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)


If your Pythons aren't all the same version across the nodes when Spark workers are instantiated, you might get a cryptic error like this when trying to call the count() method on an RDD:

14/04/30 16:15:11 ERROR Executor: Exception in task ID 12
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.py", line 182, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.py", line 171, in _batched
    for item in iterator:
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 493, in func
    if acc is None:
TypeError: an integer is required

     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
     at org.apache.spark.scheduler.Task.run(Task.scala:53)
     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
     at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:722)


If you try to write an RDD to a file with mismatched Python versions, or if you were using anything earlier than Python 2.7 (e.g., 2.6) with any Spark version earlier than 1.0.0, you'd see this:

14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 677, in func
    if not isinstance(x, basestring):
SystemError: unknown opcode

     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
     at org.apache.spark.scheduler.Task.run(Task.scala:53)
     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
     at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:722)


If your HDFS URI is wrong, the error message actually makes sense.  It is buried quite deeply though.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", line 682, in saveAsTextFile
    keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile.
: java.lang.IllegalArgumentException: java.net.UnknownHostException: s12ib.ibnet0
     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
     at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:231)
     at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
     at org.apache.hadoop.mapred.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:193)
     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:685)
     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:572)
     at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:894)
     at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:355)
     at org.apache.spark.api.java.JavaRDD.saveAsTextFile(JavaRDD.scala:27)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
     at py4j.Gateway.invoke(Gateway.java:259)
     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
     at py4j.commands.CallCommand.execute(CallCommand.java:79)
     at py4j.GatewayConnection.run(GatewayConnection.java:207)
     at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.UnknownHostException: s12ib.ibnet0
     ... 29 more
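
For reference, this is roughly what a well-formed write to HDFS looks like from PySpark.  It's just a minimal sketch; the master URL, namenode hostname, port, and output path below are all placeholders, not values from any real cluster.

    from pyspark import SparkContext

    # Hypothetical values; use whatever fs.defaultFS advertises in your
    # core-site.xml, and make sure it resolves from every worker node.
    hdfs_uri = "hdfs://namenode.example.com:54310/user/glock/output"

    sc = SparkContext("spark://master.example.com:7077", "save-rdd-example")
    rdd = sc.parallelize(range(1000)).map(lambda x: str(x))

    # If the scheme, host, or port is wrong, this is the call that throws
    # the UnknownHostException buried in the traceback above.
    rdd.saveAsTextFile(hdfs_uri)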

Perspectives on the Current State of Data-Intensive Scientific Computing

I recently had the benefit of being invited to attend two workshops in Oakland, CA, hosted by the U.S. Department of Energy (DOE), that shared the common theme of emerging trends in data-intensive computing: the Joint User Forum on Data-Intensive Computing and the High Performance Computing Operational Review.  My current employment requires that I stay abreast of all topics in data-intensive scientific computing (I wish there were an acronym to abbreviate this...DISC perhaps?) so I didn't go in with the expectation of being exposed to a world of new information.  As it turned out though, I did gain a very insightful perspective on how data-intensive scientific computing (DISC), and I daresay Big Data, is seen by the people who operate some of the world's largest supercomputers.

The DOE perspective is surprisingly realistic, application-oriented, and tightly integrated with high-performance computing.  There was the obligatory discussion of Hadoop and how it may be wedged into machines at LLNL with Magpie, ORNL with Spot Hadoop, and SDSC with myHadoop, of course, and there was also some discussion of real production use of Hadoop on bona fide Hadoop clusters at some of the DOE labs.  However, Hadoop played only a minor role in the grand scheme of the two meetings for all of the reasons I've outlined previously.

Rather, these two meetings had three major themes that crept into all aspects of the discussion:
  1. Scientific workflows
  2. Burst buffers
  3. Data curation
I found this to be a very interesting trend, as #1 and #2 (workflows and burst buffers) aren't topics I'd heard come up at any other DISC workshops, forums, or meetings I've attended.  The connection between DISC and workflows wasn't immediately evident to me, and burst buffers are a unique aspect of cyberinfrastructure that have only been thrust into the spotlight with the NERSC-8/LANL Trinity RFP last fall.  However, all three of these topics will become central to both data-intensive scientific computing and, by virtue of their ability to produce data, exascale supercomputers.

Scientific workflows

Workflows are one of those aspects of scientific computing that have been easy to dismiss as the toys of computer scientists because traditional problems in high-performance computing have typically been quite monolithic in how they are run.  SDSC's own Kepler and USC's Pegasus systems are perhaps the most well-known and highly engineered workflow management systems, and I have to confess that when I'd first heard of them a few years ago, I thought they seemed like a very complicated way to do very simple tasks.

As it turns out though, both data-intensive scientific computing and exascale computing (by virtue of the output size of exaflop calculations) tend to follow patterns that look an awful lot like map/reduce at a very abstract level.  This is a result of the fact that most data-intensive problems are not processing giant monoliths of tightly coupled and inter-related data; rather, they are working on large collections of generally independent data.  Consider the recent talk I gave about a large-scale genomic study on which I consulted; the general data processing flow was
  1. Receive 2,190 input files, 20 GB each, from a data-generating instrument
  2. Do some processing on each input file
  3. Combine groups of five input files into 438 files, each 100 GB in size
  4. Do more processing 
  5. Combine 438 files into 25 overlapping groups to get 100 files, each 2.5 GB in size
  6. Do more processing
  7. Combine 100 files into a single 250 GB file
  8. Perform statistical analysis on this 250 GB file for scientific insight
The natural data parallelism inherent in the instrument's output means that any collective insight to be gleaned from this data requires some sort of mapping and reduction, and the process of managing this large volume of distributed data is where scientific workflows become a necessary part of data-intensive scientific computing.  Managing terabytes or petabytes of data distributed across thousands or millions of logical records (whether they be files on a file system, rows in a database, or whatever else) very rapidly becomes a problem that nobody will want to do by hand.  Hadoop/HDFS delivers an automated framework for managing these sorts of workflows if you don't mind rewriting all of your processing steps against the Hadoop API and building out HDFS infrastructure, but if you do mind, alternative workflow management systems begin to look very appealing.

The core debate was not whether or not workflow management systems were a necessary component in DISC; rather, I observed two salient, open questions:
  1. The systems in use at DOE (notably Fireworks and qdo) are primarily used to work around deficiencies in current HPC schedulers (e.g., Moab and SLURM), which cannot handle scheduling hundreds of thousands of tiny jobs concurrently.  Thus, should these workflow managers be integrated into the scheduler to address these shortcomings at their source?
  2. How do we stop every user from creating his or her own workflow manager scripts and adopt an existing solution instead?  Should one workflow manager rule them all, or should a Darwinian approach be taken towards the current diverse landscape of existing software?
Question #1 is a highly technical question that has several dimensions; ultimately though, it's not clear to me that there is enough incentive for resource manager and scheduler developers to really dig into this problem.  They haven't done this yet, and I can only assume that this is a result of the perceived domain-specificity and complexity of each workflow.  In reality, a large number of workflows can be accommodated by two simple features: support for directed acyclic graphs (DAGs) of tasks and support for lightweight, fault-tolerant task scheduling within a pool of reserved resources.  Whether or not anyone will rise to the challenge of incorporating this in a usable way is an open question, but there certainly is a need for this in the emerging realm of DISC.
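
To make those two features a bit more concrete, below is a minimal sketch of a DAG of tasks being executed by a lightweight scheduler inside a single pool of reserved resources.  This is not how Fireworks, qdo, or any other real workflow manager works; the task names and the thread-pool "scheduler" are purely illustrative.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical tasks loosely mirroring the pipeline above; each task
    # lists the tasks whose output it needs before it can start.
    tasks = {
        "preprocess": ([], lambda: "filtered reads"),
        "merge":      (["preprocess"], lambda: "merged files"),
        "analyze":    (["merge"], lambda: "summary statistics"),
    }

    def run_dag(tasks, workers=4):
        """Run a DAG of tasks on a fixed worker pool, i.e., lightweight
        task scheduling inside a single resource reservation."""
        done = set()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(done) < len(tasks):
                # Any task whose dependencies are all satisfied may run now.
                ready = [name for name, (deps, _) in tasks.items()
                         if name not in done and all(d in done for d in deps)]
                futures = {name: pool.submit(tasks[name][1]) for name in ready}
                for name, future in futures.items():
                    print(name, "->", future.result())
                    done.add(name)

    run_dag(tasks)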

Question #2 is more interesting to me since this problem of multiple people cooking up different but equivalent solutions to the same problems is pervasive throughout computational and computer science. This is in large part due to the fatal assumption held by many computer scientists that good software can be simply "thrown over the fence" to scientists and it will be adopted.  This has never worked; rather, the majority of widely adopted software technologies in HPC have been a result of the standardization of a landscape of similar but non-standard tools.  This is something I touched on in a previous post when outlining the history of MPI and OpenMP's successes.

I don't think the developers behind this menagerie of workflow managers are ready to settle on a standard, as the field is not mature enough to have a holistic understanding of all of the issues that workflows need to solve.  Despite the numerous presentations and discussions of various workflow solutions being used across DOE's user facilities, my presentation was the only one that considered optimizing workflow execution for the underlying hardware.  Given that the target audience of these talks was users of high-performance computing, the lack of consideration given to the performance aspects of workflow optimization is a testament to this immaturity.

Burst buffers

For those who haven't been following the details of one of DOE's more recent procurement rounds, the NERSC-8 and Trinity request for proposals (RFP) explicitly required that all vendor proposals include a burst buffer to address the capability of multi-petaflop simulations to dump tremendous amounts of data in very short order.  The target use case is for petascale checkpoint-restart, where the memory of thousands of nodes (hundreds of terabytes of data) needs to be flushed to disk in an amount of time that doesn't dominate the overall execution time of the calculation.

The concept of what a burst buffer is remains poorly defined.  I got the sense that there are two outstanding definitions:
  • The NERSC burst buffer is something more tightly integrated on the compute side of the system and may be a resource that can be allocated on a per-job basis
  • The Argonne burst buffer is something more tightly integrated on the storage side of the system and acts in a fashion that is largely transparent to the user.  This sounded a lot like the burst buffer support being explored for Lustre.
In addition, Los Alamos National Laboratory (LANL) is exploring burst buffers for the Trinity procurement, and it wasn't clear to me whether they had settled on a definition or were still exploring all angles.  One commonality is that DOE is going full-steam ahead on providing this burst buffer capability in some form or another, and solid-state storage is going to be a central enabling component.

Personally, I find the NERSC burst buffer concept a lot more interesting since it provides a more general purpose flash-based resource that can be used in novel ways.  For example, emerging software-defined storage platforms like EMC's Vipr can potentially provide very fine-grained access to flash as-needed to make better overall use of the underlying SSDs in HPC environments serving a broad user base (e.g., NERSC and the NSF centers).  Complementing these software technologies are emerging hardware technologies like DSSD's D5 product which will be exposing flash to compute systems in innovative ways at hardware, interconnect, and software levels.

Of course, the fact that my favorite supercomputer provides dynamically allocatable SSDs in a fashion not far removed from these NERSC burst buffers probably biases me, but we've demonstrated unique DISC successes enabled by our ability to pile tons of flash onto single compute nodes.  This isn't to say that the Argonne burst buffer is without merit; given that the Argonne Leadership Computing Facility (ALCF) caters to capability jobs rather than capacity jobs, their user base is better served by providing a uniform, transparent burst I/O capability across all nodes.  The NERSC burst buffer, by comparison, is a lot less transparent and will probably be much more susceptible to user disuse or misuse.  I suspect that when the dust settles, both takes on the burst buffer concept will make their way into production use.

A lot of the talk and technologies surrounding burst buffers are shrouded in NNSA secrecy or vendor non-disclosures, so I'm not sure what more there is to be said.  However, the good folks at HPCwire ran an insightful article on burst buffers after the NERSC-8 announcement for those who are interested in more detail.

Data curation

The final theme that bubbled just beneath the surface of the DOE workshops was the idea that we are coming upon an era where scientists can no longer save all their data from all their calculations in perpetuity.  Rather, someone will have to become the curator of the scientific data being generated by computations and figure out what is and is not worth keeping, and how or where that data should be stored and managed.  This concept of selectively retaining user data manifested in a variety of discussions ranging from in-place data sharing and publication with Globus Plus and science DMZs to transparently managing online data volumes with hierarchical storage management (HSM).  However, the common idea was that scientists are going to have to start coming to grips with data management themselves, as facilities will soon be unable to cope with the entirety of their users' data.

This was a particularly interesting problem to me because it very closely echoed the sentiments that came about from Datanami's recent LeverageBIGDATA event which had a much more industry-minded audience.  The general consensus is that several fields are far ahead of the pack in terms of addressing this issue; the high-energy physics community has been filtering data at its genesis (e.g., ignoring the data from uninteresting collision events) for years now, and enterprises seem comfortable with retaining marketing data for only as long as it is useful.  By comparison, NERSC's tape archive has not discarded user data since its inception several decades ago; each new tape system simply repacks the previous generation's tape to roll all old data forward.

All of the proposed solutions for this problem revolve around metadata.  The reality is that not all user data has equal importance, and there is a need to provide a mechanism for users (or their applications) to describe this fact.  For example, the principal use case for the aforementioned burst buffers is to store massive checkpoint-restart files; while these checkpoints are important to retain while a calculation is running, they have limited value after the calculation has completed.  Rather than rely on a user to manually recognize that these checkpoints can be deleted, the hope is that metadata attributes can be attached to these checkpoint files so that automated curation systems can recognize they are not critical data that must be retained forever.

The exact way this metadata would be used to manage space on a file system remains poorly defined.  A few examples of how metadata could be used to manage data volume in data-intensive scientific computing environments include
  • tagging certain files or directories as permanent or ephemeral, signaling that the file system can purge certain files whenever a cleanup is initiated;
  • tagging certain files with a set expiration date, either as an option or by default.  When a file ages beyond a certain point, it would be deleted;
  • attributing a sliding scale of "importance" to each file, so that files of low importance can be transparently migrated to tape via HSM
Some of these concepts are already implemented, but the ability for users and applications to attach extensible metadata to files in a file system-agnostic way does not yet exist.  I think this is a significant gap in technology that will need to be filled in very short order as pre-exascale machines begin to demonstrate the ability to generate tremendous I/O loads.  Frankly, I'm surprised this issue hasn't been solved in a broadly deployable way yet.
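
As a rough illustration of what this could look like, Linux extended attributes are one place such metadata could live today on file systems that support them.  The attribute names and the thirty-day policy below are entirely invented for this sketch; no standard for curation metadata exists, which is precisely the gap.

    import os
    import time

    checkpoint = "/scratch/glock/run042/checkpoint_0001.h5"  # hypothetical file

    # Tag the file as ephemeral and give it an expiration date 30 days out.
    # The "user.curation.*" namespace is made up for this example.
    os.setxattr(checkpoint, b"user.curation.ephemeral", b"1")
    expires = str(int(time.time()) + 30 * 86400).encode()
    os.setxattr(checkpoint, b"user.curation.expires", expires)

    # A purge or HSM daemon could then make decisions like this one:
    def is_expired(path):
        try:
            expiry = int(os.getxattr(path, b"user.curation.expires"))
        except OSError:      # no expiration tag; leave the file alone
            return False
        return time.time() > expiry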

The good news here is that the problem of curating digital data is not new; it is simply new to high-performance computing.  In the spirit of doing things the right way, DOE invited the director of LANL's Research Library to attend the workshops, and she provided valuable insights into how methods of digital data curation may be applied to these emerging challenges in data-intensive scientific computing.

Final Thoughts

The products of the working groups' discussions at the HPC Operational Review are being assembled into a report to be delivered to DOE's Office of Science, and it should be available online at the HPCOR 2014 website as well as the usual DOE document repository in a few months.  Hopefully it will reflect what I feel was the essence of the workshop, but at any rate, it should contain a nice perspective on how we can expect the HPC community to address the new demands emerging from the data-intensive scientific computing (DISC) community.

In the context of high-performance computing, 
  • Workflow management systems will continue to gain importance as data sets become larger, more parallel, and more unwieldy.
  • Burst buffers, in one form or another, will become the hardware solution to the fact that all exascale simulations will become data-intensive problems.
  • Data curation frameworks are the final piece of the puzzle and will provide the manageability of data at rest.
None of these three legs are fully developed, and this is simply an indication of data-intensive scientific computing's immaturity relative to more traditional high-performance computing:  
  • Workflows need to converge on some sort of standardized API or feature set in order to provide the incentive to users to abandon their one-off solutions.
  • Burst buffer technology has diverged into two solutions centered at either the compute or storage side of a DISC platform; both serve different workloads, and the underlying hardware and software configurations remain unfinished.
  • Effective data curation requires a metadata management system that will allow both users and their applications to identify the importance of data to automate sensible data retention policy enforcement and HSM.
Of course, I could be way off in terms of what I took away from these meetings seeing as how I don't really know what I'm talking about.  Either way, it was a real treat to be invited out to hang out with the DOE folks for a week; I got to meet some of my personal supercomputing heroes, share war stories, and make some new pals.

I also got to spend eight days getting to know the Bay Area.  So as not to leave this post entirely without a picture,


I also learned that I have a weird fascination with streetcars.  I'm glad I was introduced to supercomputers first.

Exascale in perspective: RSC's 1.2 petaflop rack

Russian supercomputing manufacturer RSC generated some buzz at ISC'14 last week when they showed their 1.2 PF-per-rack Xeon Phi-based platform.  I was aware of this system from when they first announced it a few months prior, and I referenced it in a piece of a blog post I was writing about the scarier aspects of exascale computing.  Given my impending career change though, it is unclear whether I will ever have the time to finish that post before it becomes outdated.  Since RSC is back in the spotlight, I thought I'd post the piece I wrote up to illustrate how wacky this 1.2 PF rack really is in terms of power consumption.  Power consumption, of course, is the limiting factor standing between today and the era of exascale computing.

So, to put a 400 kW, 1.2 PF rack into perspective, here is that piece:



The Importance of Energy Efficiency

Up through the petascale era in which we currently live, the raw performance of high-performance components--processors, RAM, and interconnect--was what limited the ultimate performance of a given high-end machine.  The first petaflop machine, Los Alamos' Roadrunner, derived most of its FLOPs from high-speed PowerXCell 8i processors pushing 3.2 GHz per core.  Similarly, the first 10 PF supercomputer, RIKEN's K computer, derived its performance from its sheer size of 864 cabinets.  Although I don't mean to diminish the work done by the engineers that actually got these systems to deliver this performance, the petascale era really was made possible by making really big systems out of really fast processors.

By contrast, exascale represents the first milestone where the limitation does not lie in making these high-performance components faster; rather, performance is limited by the amount of electricity that can be physically delivered to a processor and the amount of heat that can be extracted from it.  This limitation is what has given rise to massively parallel processors that eschew a few fast cores for a larger number of low-powered ones.  By keeping clock speeds low and densely packing many compute cores (dozens or hundreds) on a single silicon die, these massively parallel processors are now realizing power efficiencies (flops per watt) that are an order of magnitude higher than what traditional CPUs can deliver.

The closest technologies on the market that will probably resemble the future's exaflop machines are based on accelerators--either NVIDIA GPUs or Intel's MICs.  The goal will be to jam as many of these massively parallel processors into as small a space and with as tight of an integration as possible.  Recognizing this trend, NERSC has opted to build what I would call the first "pre-exascale" machine in its NERSC-8 procurement, which will feature a homogeneous system of manycore processors.

However, such pre-exascale hardware doesn't actually exist yet, and NERSC-8 won't appear until 2016.  What does exist, though, is a product by Russia's RSC Group called PetaStream: a rack packed with 1024 current-generation Xeon Phi (Knights Corner) coprocessors that has a peak performance of 1.2 PF/rack.  While this sounds impressive, it also highlights the principal challenge of exascale computing: power consumption.  One rack of RSC PetaStream is rated for 400 kW, delivering 3 GFLOPS/watt peak.  Let's put this into perspective.

Kilowatts, megawatts, and gigawatts in perspective

During a recent upgrade to our data center infrastructure, three MQ DCA220SS-series diesel generators were brought in for the critical systems.  Each is capable of producing 220 kVA according to the spec sheets.
Three 220 kVA diesel generators plugged in during a PM at SDSC
It would take three of these diesel generators to power a single rack of RSC's PetaStream.  Of course, these backup diesel generators aren't a very efficient way of generating commercial power, so this example is a bit skewed.

Let's look at something that is used to generate large quantities of commercial power instead.  A GE 1.5-77 wind turbine, which is GE's most popular model, is advertised as delivering 1.5 megawatts at wind speeds above 15 miles per hour.

GE 1.5 MW wind turbine.   Source: NREL
Doing the math, this means that the above pictured turbine would be able to power only three racks of RSC PetaStream on a breezy day.

To create a supercomputer with a peak capability of an exaflop using RSC's platform, you'd need over 800 racks of PetaStream and over 300 MW of power to turn it all on.  That's over 200 of the above GE wind turbines and enough electricity to power about 290,000 homes in the U.S.  Wind farms of this size do exist; for example,

300 MW Stateline Wind Farm.  Source: Wikimedia Commons
the Stateline Wind Farm, which was built on the border between Oregon and Washington, has a capacity of about 300 MW.  Of course, wind farms of this capacity cannot be built in any old place.
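
The back-of-the-envelope arithmetic behind those numbers is straightforward.  The only figure below that doesn't come from this post is the average U.S. household consumption, which I'm assuming to be roughly 10,000 kWh per year.

    rack_pflops = 1.2                                   # peak PF per PetaStream rack
    rack_kw = 400.0                                     # power draw per rack

    racks_per_exaflop = 1000.0 / rack_pflops            # ~834 racks
    exaflop_mw = racks_per_exaflop * rack_kw / 1000.0   # ~333 MW

    turbines = exaflop_mw / 1.5                         # ~222 GE 1.5-77 turbines

    avg_home_kw = 10000.0 / 8760.0                      # assumed ~10,000 kWh/year
    homes = exaflop_mw * 1000.0 / avg_home_kw           # ~292,000 homes

    print(racks_per_exaflop, exaflop_mw, turbines, homes)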

Commercial nuclear power plants can be built in a variety of places though, and they typically generate on the order of 1 gigawatt (GW) of power per reactor.  In my home state of New Jersey, the Hope Creek Nuclear Generating Station has a single reactor that was built to deliver about 1.2 GW of power:

1.2 GW Hope Creek nuclear power station.  The actual reactor is housed in the concrete cylinder to the bottom left.  Courtesy of the Nuclear Regulatory Commission.

This is enough to power almost 4 exaflops of PetaStream.  Of course, building a nuclear reactor for every exaflop supercomputer would be extremely costly, given the multi-billion dollar cost of building reactors like this.  Clearly, the energy efficiency (flops/watt) of computing technology needs to improve substantially before we can arrive at the exascale era.

Storage Utilization in the Long Tail of Science

Shameless Advertising
If the things I discuss in this blog post sound like something you'd like to do professionally, you're in luck!  NERSC is looking to fill a position in user services in conjunction with the Joint Genome Institute.  The folks at NERSC are a world-class bunch, and this position is, in my biased opinion, one of the best roles to have in the world of HPC and bioinformatics.

This particular position is at the exciting crossroads of next-generation sequencing and high-performance computing, and you'll get to work on some of the largest scientific datasets being generated.  DOE and NERSC are great employers and organizations, and I seriously can't say enough good things about them.

Introduction

Since changing careers and moving up to the San Francisco Bay Area in July, I haven't had nearly as much time to post interesting things here on my blog—I guess that's the startup life. That isn't to say that my life in DNA sequencing hasn't been without interesting observations to explore though; the world of high-throughput sequencing is becoming increasingly dependent on high-performance computing, and many of the problems being solved in genomics and bioinformatics are stressing aspects of system architecture and cyberinfrastructure that haven't gotten a tremendous amount of exercise from the more traditional scientific domains in computational research.

Take, for example, the biggest and baddest DNA sequencer on the market: over the course of a three-day run, it outputs around 670 GB of raw (but compressed) sequence data, and this data is spread out over 1,400,000 files. This would translate to an average file size of around 500 KB, but the reality is that the file sizes are a lot less uniform:

Figure 1. File size distribution of a single flow cell output (~770 gigabases) on Illumina's highest-end sequencing platform

After some basic processing (which involves opening and closing hundreds of these files repeatedly and concurrently), these data files are converted into very large files (tens or hundreds of gigabytes each), which are then reduced, over the course of hundreds of CPU hours, down to data that is more digestible. As one might imagine, this entire process is very good at taxing many aspects of file systems, and on the computational side, most of this IO-intensive processing is not distributed, so performance benefits most from single-stream, single-client throughput.

As a result of these data access and processing patterns, the storage landscape in the world of DNA sequencing and bioinformatics is quite different from conventional supercomputing. Some large sequencing centers do use the file systems we know and love (and hate) like GPFS at JGI and Lustre at Sanger, but it appears that most small- and mid-scale sequencing operations are relying heavily on network-attached storage (NAS) for both receiving raw sequencer data and being a storage substrate for all of the downstream data processing.

I say all of this because these data access patterns—reading and writing large quantities of small files, along with large files accessed with a high degree of random IO—are a common trait of many scientific applications used in the "long tail of science." The fact is, the sorts of IO for which parallel file systems like Lustre and GPFS are designed are tedious (if not difficult) to program, and for the majority of codes that don't require thousands of cores to make new discoveries, simply reading and writing data files in a naïve way is "good enough."

The Long Tail

This long tail of science is also using up a huge amount of the supercomputing resources made available to the national open science community; to illustrate, 98% of all jobs submitted to the XSEDE supercomputers in 2013 used 1024 or fewer CPU cores, and these modest-scale jobs represented over 50% of all the CPU time burned up on these machines.

Figure 2. Cumulative job size distribution (weighted by job count and SUs consumed) for all jobs submitted to XSEDE compute resources in 2013

The NSF has responded to this shift in user demand by awarding Comet, a 2 PF supercomputer designed to run these modest-scale jobs. The Comet architecture limits its full-bisection bandwidth interconnectivity to groups of 72 nodes, and these 72-node islands will actually have enough cores to satisfy 99% of all the jobs submitted to XSEDE clusters in 2013 (see above). By limiting the full-bisection connectivity to smaller islands and using less rich connectivity between islands, the cost savings in not having to buy so many mid-tier and core switches are then turned into additional CPU capacity.

What the Comet architecture doesn't address, however, is the question of data patterns and IO stress being generated by this same long tail of science—the so-called 99%. If DNA sequencing is any indicator of the 99%, parallel file systems are actually a poor choice for high-capacity, mid-scale jobs because their performance degrades significantly when facing many small files. Now, the real question is, are the 99% of HPC jobs really generating and manipulating lots of small files in favor of the large striped files that Lustre and GPFS are designed to handle? That is, might the majority of jobs on today's HPC clusters actually be better served by file systems that are less scalable but handle small files and random IO more gracefully?

Some colleagues and I set out to answer this question last spring, and a part of this quest involved looking at every single file on two of SDSC's Data Oasis file systems. This represented about 1.7 PB of real user data spread across two Lustre 2.4 file systems—one designed for temporary scratch data and the other for projects storage—and we wanted to know if users' data really consisted of the large files that Lustre loves or if, like job size, the 99% are really working with small files.  Since SDSC's two national resources, Gordon and Trestles, restrict the maximum core count for user jobs to modest-scale submissions, these file systems should contain files representative of long-tail users.
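
The actual survey was driven off of the file systems' own metadata, but conceptually the binning behind the figures that follow boils down to something like this sketch, which assumes you already have a flat list of file sizes in bytes (from a recursive walk or a file system scan).

    import math
    from collections import Counter

    def size_bin(size_bytes):
        """Assign a file to the smallest power-of-two bin that holds it."""
        if size_bytes <= 0:
            return 0
        return 2 ** math.ceil(math.log2(size_bytes))

    def summarize(sizes):
        """Return per-bin file counts and per-bin total capacity."""
        counts, capacity = Counter(), Counter()
        for size in sizes:
            b = size_bin(size)
            counts[b] += 1
            capacity[b] += size
        return counts, capacity

    # Toy input; the real analysis covered every file on two file systems.
    counts, capacity = summarize([512, 4096, 3 * 2**20, 2**30, 200 * 2**30])
    for b in sorted(counts):
        print(b, counts[b], capacity[b])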

Scratch File Systems

At the roughest cut, files can be categorized based on whether their size is on the order of bytes and kilobytes (smaller than 1 MB), megabytes (smaller than 1 GB), gigabytes (smaller than 1 TB), or terabytes (anything larger). Although pie charts are generally a terrible way to show relative compositions, this is how the files on the 1.2 PB scratch file system broke down:

Figure 3. Fraction of file count consumed by files of a given size on Data Oasis's scratch file system for Gordon


The above figure shows the number of files on the file system classified by their size, and there is clearly a preponderance of small files less than a gigabyte in size. This is not terribly surprising, as counting files inherently biases the picture towards smaller files; that is, you can fit a thousand one-megabyte files in the same space that a single one-gigabyte file would take up. Another way to show this data is by how much file system capacity is taken up by files of each size:

Figure 4. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon


This makes it very apparent that the vast majority of the used space on this scratch file system—a total of 1.23 PB of data—is taken up by files on the order of gigabytes and megabytes. There were only seventeen files that were a terabyte or larger in size.

Incidentally, I don't find it too surprising that there are so few terabyte-sized files; even in the realm of Hadoop, median job dataset sizes are on the order of a dozen gigabytes (e.g., Facebook has reported that 90% of its jobs read in under 100 GB of data). Examining file sizes with much finer granularity reveals that the research data on this file system isn't even of Facebook scale though:

Figure 5. Number of files of a given size on Data Oasis's scratch file system for Gordon.  This data forms the basis for Figure 3 above


While there are a large number of files on the order of a few gigabytes, it seems that files on the order of tens of gigabytes or larger are far more scarce. Turning this into relative terms,

Figure 6. Cumulative distribution of files of a given size on Data Oasis's scratch file system for Gordon


we can make more meaningful statements. In particular,

  • 90% of the files on this Lustre file system are 1 megabyte or smaller
  • 99% of files are 32 MB or less
  • 99.9% of files are 512 MB or less
  • and 99.99% of files are 4 GB or less

The first statement is quite powerful when you consider the fact that the default stripe size in Lustre is 1 MB. The fact that 90% of files on the file system are smaller than this means that 90% of users' files really gain no advantage from living on Lustre. Furthermore, since this is a scratch file system that is meant to hold temporary files, it would appear that either user applications are generating a large number of small files, or users are copying in large quantities of small files and improperly using it for cold storage. Given the quota policies for Data Oasis, I suspect there is a bit of truth to both.

Circling back a bit though, I said earlier that comparing just the quantity of files can be a bit misleading since a thousand 1 KB files will take up the same space as a single 1 MB file. We can also look at how much total space is taken up by files of various sizes.

Figure 7. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon.  This is just a more finely diced version of the data presented in Figure 4 above.

The above chart is a bit data-dense so it takes some staring at to understand what's going on. First looking at the purple line, we can pull out some pretty interesting facts:

  • Half of the file system's used capacity (50%) is consumed by files that are 1 GB or less in size
  • Over 20% of the file system's used capacity is taken up by files smaller than 64 MB
  • About 10% of the capacity is used by files that are 64 GB or larger

The blue boxes represent the derivative of that purple line—that is, how much space is taken up by files of only one specific size. The biggest chunk of the file system (141 TB) is taken up by 4 GB files, but it appears that there is a substantial range of file sizes that take up very similarly sized pieces of the pie. 512 MB files take up a total of 139 TB; 1 GB, 2 GB, and 8 GB files all take up over 100 TB of total space each as well. In fact, files ranging from 512 MB to 8 GB comprise 50% of the total file system capacity.

Why the sweet spot for space-consuming files is between 512 MB and 8 GB is unclear, but I suspect it has more to do with the human element in research. In my own research, I worked with files in this range simply because it was enough data to be statistically meaningful while still small enough to quickly re-analyze or transfer to a colleague. For file sizes above this range, the mass of the data made it difficult to manipulate using the "long-tail" cyberinfrastructure available to me. But, perhaps as more national-scale systems come online to meet the needs of these sorts of workloads, this sweet spot will creep out to larger file sizes.

Projects Storage

The above discussion admittedly comes with a lot of caveats.  In particular, the scratch file system we examined was governed by no hard quotas, which led some people to leave data resident for longer than they probably should have.  However, the other file system we analyzed was SDSC's Data Oasis projects storage, which was architected for capacity over performance and featured substantially more disks per OSS.  This projects storage also came with 500 GB quotas by default, forcing users to be a little more mindful of what was worth keeping.

Stepping back to the coarse-grained kilobyte/megabyte/gigabyte/terabyte pie charts, here is how projects storage utilization compared to scratch storage:

Figure 8. Fraction of file count consumed by files of a given size on Data Oasis's projects file system (shared between Gordon and Trestles users)

On the basis of file counts, it's a bit surprising that users seem to store more small (kilobyte-sized) files in their projects space than in their scratch space.  This may imply that the beginning and end data bookending simulations aren't as large as the intermediate data generated during the calculation.  Alternatively, it may be a reflection of user naïveté; I've found that newer users were often afraid to use the scratch space because of the perception that their data may vanish from there without advance notice.  Either way, gigabyte-sized files comprised a few hundredths of a percent of files, and terabyte-sized files were scarcer still on both file systems.  The trend was uniformly towards smaller sizes on projects space.

As far as space consumed by these files, the differences remain subtle.

Figure 9. Fraction of file system capacity consumed by files of a given size on Data Oasis's projects file system

There appears to be a trend towards users keeping larger files in their projects space, and the biggest change is the decrease in megabyte-sized files in favor of gigabyte-sized files.  However, this trend is very small and persists across a finer-grained examination of file size distributions:

Figure 10. File system capacity consumed by files of a given size on Data Oasis's projects file system

Half of the above plot is the same data shown above, making this plot twice as busy and confusing.  However, there's a lot of interesting data captured in it, so it's worth the confusing presentation.  In particular, the overall distribution of mass with respect to the various file sizes is remarkably consistent between scratch and projects storage.  We see the same general peak of file size preference in the 1 GB to 10 GB range, but there is a subtle bimodal divide in projects storage that reveals a preference for 128 MB-512 MB and 4 GB-8 GB files, which manifests in the integrals (red and purple lines) as a visibly greater slope in these regions.

The observant reader will also notice that the absolute values of the bars are smaller for projects storage than for scratch storage; this is because the projects file system is subject to quotas and, as a result, is not nearly as full of user data.  To complicate things further, the projects storage represents user data from two different machines (each with unique job size policies, to boot), whereas the scratch storage is only accessible from one of those machines.  Despite these differences though, user data follows very similar distributions between both file systems.

Corollaries

It is probably unclear what to take away from these data, and that is with good reason.  There are fundamentally two aspects to quantifying storage utilization--raw capacity and file count--because they represent two logically separate things.  There is some degree of interchangeability (e.g., storing a whole genome in one file vs. storing each chromosome in its own file), and this is likely contributing to the broad peak in file size between 512 MB and 8 GB.  With that being said, it appears that the typical long-tail user stores a substantial number of decidedly "small" files on Lustre, and this is exemplified by the fact that 90% of the files resident on the file systems analyzed here are 1 MB or less in size.

This alone suggests that large parallel file systems may not actually be the most appropriate choice for HPC systems that are designed to support a large group of long-tail users.  While file systems like Lustre and GPFS certainly provide a unique capability in that some types of medium-sized jobs absolutely require the IO capabilities of parallel file systems, there are a larger number of long-tail applications that do single-thread IO, and some of these perform IO in such an abusive way (looking at you, quantum chemistry) that they cannot run on file systems like Lustre or GPFS because of the number of small files and random IO they use.

So if Lustre and GPFS aren't the unequivocal best choice for storage in long-tail HPC, what are the other options?

Burst Buffers

I would be remiss if I neglected to mention burst buffers here since they are designed, in part, to address the limitations of parallel file systems.  However, their actual usability remains unproven.  Anecdotally, long-tail users are generally not quick to alter the way they design their jobs to use cutting-edge technology, and my personal experiences with Gordon (and its 300 TB of flash) were that getting IO-nasty user applications to effectively utilize the flash was often a very manual process that introduced new complexities, pitfalls, and failure modes.  Gordon was a very experimental platform though, and Cray's new DataWarp burst buffer seems to be the first large-scale productization of this idea.  It will be interesting to see how well it works for real users when the technology starts hitting the floor for open science in mid-2016, if not sooner.

High-Performance NAS

An emerging trend in HPC storage is the use of high-performance NAS as a complementary file system technology in HPC platforms.  Traditionally, NAS has been a very poor choice for HPC applications because of the limited scalability of the typical NAS architecture--data resides on a traditional local file system, with network service provided by an additional software layer like NFS, and the ratio of storage capacity to network bandwidth out of the NAS is very high.

The emergence of cheap RAM and enterprise SSDs has allowed some sophisticated file systems like ZFS and NetApp's WAFL to demonstrate very high performance, especially in delivering very high random read performance, by using both RAM and flash as a buffer between the network and spinning rust.  This allows certain smaller-scale jobs to enjoy substantially better performance when running on flash-backed NAS than a parallel file system.  Consider the following IOP/metadata benchmark run on a parallel file system and a NAS head with SSDs for caching:

Figure 11. File stat rate on flash-backed NAS vs. a parallel file system as measured by the mdtest benchmark

A four-node job that relies on statting many small files (for example, an application that traverses a large directory structure such as the output of one of the Illumina sequencers I mentioned above) can achieve a much higher IO rate on a high-performance NAS than on a parallel file system.  Granted, there are a lot of qualifications to be made with this statement and benchmarking high-performance NAS is worth a post of its own, but the above data illustrate a case where NAS may be preferable over something like Lustre.

Greater Context

Parallel file systems like Lustre and GPFS will always play an essential role in HPC, and I don't want to make it sound like they can be universally replaced by high-performance NAS.  They are fundamentally architected to scale out so that increasing file system bandwidth does not require adding new partitions or using software to emulate a single namespace.  In fact, the single namespace of parallel file systems makes the management of the storage system, its users, and the underlying resources very flexible and straightforward.  No volume partitioning needs to be imposed, so scientific applications' and projects' data consumption do not have to align with physical hardware boundaries.

However, there are cases where a single namespace is not necessary at all; for example, user home directories are naturally partitioned with fine granularity and can be mounted in a uniform location while physically residing on different NAS heads with a simple autofs map (a minimal sketch of such a map appears at the end of this section).  In this example, leaving user home directories on a pool of NAS filers offers two big benefits:

  1. Full independence of the underlying storage mitigates the impact of one bad user.  A large job dropping multiple files per MPI process will crush both Lustre and NFS, but in the case of Lustre, the MDS may become unresponsive and block IO across all users' home directories.
  2. Flash caches on NAS can provide higher performance on IOP-intensive workloads at long-tail job sizes.  In many ways, high-performance NAS systems have the built-in burst buffers that parallel file systems are only now beginning to incorporate.
Of course, these two wins come at a cost:
  1. Fully decentralized storage is more difficult to manage.  For example, balancing capacity across all NAS systems is tricky when users have very different data generation rates that they do not disclose ahead of time.
  2. Flash caches can only get you so far, and NFS will fall over when enough IO is thrown at it.  I mentioned that 98% of all jobs use 1024 cores or fewer (see Figure 2), but 1024 cores all performing heavy IO on a typical capacity-rich, bandwidth-poor NAS head will cause it to grind to a halt.
Flash-backed high-performance NAS is not an end-all storage solution for long-tail computational science, but it also isn't something to be overlooked outright.  As with any technology in the HPC arena, its utility may or may not match up well with users' workloads, but when it does, it can deliver less pain and better performance than parallel file systems.
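
For anyone unfamiliar with the autofs trick mentioned above, the idea is just a map that sends each user's home directory to whichever NAS filer actually holds it.  The hostnames, export paths, and mount options below are made-up examples, not a recommended configuration.

    # /etc/auto.master -- mount user homes on demand under /home
    /home   /etc/auto.home

    # /etc/auto.home -- each user can live on a different NAS head
    alice   -rw,hard,intr   nas01.example.com:/export/home/alice
    bob     -rw,hard,intr   nas02.example.com:/export/home/bob

    # wildcard fallback: any other user is looked up on a default filer
    *       -rw,hard,intr   nas01.example.com:/export/home/&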

Acknowledgments 

As I mentioned above, the data I presented here was largely generated as a result of an internal project in which I participated while at SDSC.  I couldn't have cobbled this all together without the help of SDSC's HPC Systems group, and I'm really indebted to +Rick, +Haisong, and +Trevor for doing a lot of the heavy lifting in terms of generating the original data, getting systems configured to test, and figuring out what it all meant when the dust settled (even after I had left!).  SDSC's really a world-class group of individuals.

Reality Check on Cloud Usage for HPC

The opinions and analysis expressed here are solely my own and do not reflect those of my employer or the National Science Foundation.

I get very bent out of shape when people start speaking authoritatively about emerging and media-hyped technologies without having gone into the trenches to see if the buzz they're perpetuating is backed by anything real.  Personally, I am extremely pessimistic about two very trendy technologies that vendors have thrust into the HPC spotlight: GPGPU computing and cloud computing.  I wrote a post a while ago about why diving head-first into GPUs as the next big thing is unwise, and I recently posted some numbers that showed that, for tightly coupled problems (i.e., traditional HPC), Amazon EC2 cannot compete with Myrinet 10G.

This is not to say that there aren't segments of scientific computing that can be served by cloud services; loosely coupled problems and non-parallel batch problems do well on compute instances, and I honestly could've made good use of cloud cycles on my dissertation work for that reason.  But let's be clear--these are not traditional uses of modern HPC, and unless you want to redefine what you're calling "HPC" to explicitly include cloud-amenable problems, HPC in the cloud is nowhere near as great an idea as popular science would have you believe.

The proof of this is out there, but none of it has really garnered much attention amidst the huge marketing buzz surrounding cloud computing.  I can see why people would start believing the hype given this fact, but there's a wealth of hard information (vs. marketing) out there that paints a truer picture of what role cloud computing is playing in HPC and scientific computing.  What follows is a quick overview of one such source of information.

The NSF's former Office of Cyberinfrastructure (OCI) commissioned a cloud use survey for XSEDE users to determine exactly what researchers wanted, needed, and used in terms of cloud computing last year.  The results are available online, but it's taken a long time (many months at this point) to distill the data into a presentable format so it hasn't gained much attention as far as I know.  Earlier in the year I was asked for an opinion on user demand for HPC-in-the-cloud though, and I read through every single survey response and tabulated some basic metrics.  Here's what I found:

Capability Demands

The average cloud-based scientific project used 617 cores per compute cluster instance.  This is a relatively large number of cores for a lab-scale cluster and fits in nicely with the use model of cloud cluster computing addressing the burst-capability needs of lab-scale users at a fraction of the cost of purchasing a physical machine.

However, this 617-core average hides a very skewed distribution--the median is significantly lower, at 16 cores per cluster.  That is, the median cloud cluster size could fit on your typical workstation.  There are definitely a variety of possible rationalizations as to why the surveyed demand for cloud computing resources is so modest in scale, but the fact remains that there is no huge demand for capability computing in the cloud.  This suggests that users are not comfortable scaling up so soon, they realize that the performance of capability computing in the cloud is junk, or they are not doing actual science in the cloud.  Spoiler alert: the third case is what's happening.  See below.

A fairer way of looking at capability demands was proposed by Dave Hart, whose position is that classifying a project by its maximum core-size requirements is a better representation of its needs, because that determines what size of system that project will need to satisfy all of its research goals.  While the elasticity of the cloud greatly softens the impact of this distinction, it turns out to not really change a lot.  The average peak instance size is 2,411 cores, which is quite a respectable figure--getting this many cores on a supercomputer would require a significant amount of waiting.

However, the median of this project capability is only 128 cores.  Assuming a cluster node has 8-16 cores, this means that 50% of cloud computing projects' compute requirements can be fully met by an 8- to 16-node cluster.  This is squarely within the capability of lab-scale clusters and certainly not a scale beyond the reach of any reasonably funded department or laboratory.  As a frame of reference, my very modestly funded, 3-member research group back in New Jersey could afford to buy an 8-node cluster every two years.

Capacity and capability requirements of surveyed cloud-based projects.  Colored lines represent medians of collected data.

Capacity Demands

Given the modest scale of your average cloud-based cluster, it should be little surprise that the average project burned only 114,273 core-hours per year in the cloud.  By comparison, the average project on SDSC Trestles burned 298,146 core-hours in the last year, or 2.6x more supercomputing time.  Trestles is a relatively small supercomputer (324 nodes, 10,368 cores, 100 TF peak) that is targeted at new users who are also coming from the lab scale.  Furthermore, Trestles users are not allowed to schedule more than 1024 cores per job, attracting the smaller-scale users who would have the best chance of fitting the aforementioned cloud-scale workload.  Even then, though, cloud-based resources are just not seeing a lot of use.

Again, bearing in mind the very uneven distribution of cluster sizes in the cloud, looking at the median SU burn reveals even lower utilization: the median compute burn over all projects is only 8,340 core-hours.  In terms of compute time needed, the median scientific project using the cloud could have its annual compute time satisfied by running a 16-core workstation for 22 days.  Clearly, all this chatter about computing in the cloud is not representative of what real researchers are really doing.  The total compute time consumed by all of the survey respondents adds up to 8,684,715 core-hours, or the compute capacity that Trestles can deliver in a little over a month.
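
For anyone who wants to check that arithmetic, both comparisons fall straight out of the numbers above.

    median_burn = 8340                 # core-hours per year, median project
    workstation_cores = 16
    print(median_burn / workstation_cores / 24.0)    # ~21.7 days on a 16-core box

    total_burn = 8684715               # core-hours across all survey respondents
    trestles_cores = 10368
    print(total_burn / (trestles_cores * 24.0))      # ~34.9 days of Trestles capacity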

Scientific Output

The above data paints a grim picture for the current usage of cloud computing, but there are a variety of reasons that can rationalize why the quantity of HPC in the cloud is so low.  Coming from a background in research science myself, I can appreciate quality work regardless of how much or how little compute time it needed.  I wanted to know if scientists using the cloud are producing actual scientific output in a domain science--that is, are there any discoveries which we can point to and say "this was made possible by the cloud!"?

Project breakdown


Unfortunately, the answer is "not really." Reading the abstracts of each of the survey respondents revealed that a majority of them, 62%, are projects aimed at developing new software and tools that other scientists can use in the cloud.  In a sense, these tools are being developed with no specific scientific issue at the end of the tunnel; they are being developed before a demand exists.  While one could argue that "if you build it, they will come," such has not proven to be the case in either the compute capacity or capability made available by cloud computing.  Only 25% of surveyed users actually claim to be pursuing problems related to a domain science.  The remainder were either using the cloud for teaching purposes or to host data or web services.

This was rather shocking to me, as it seems rather self-serving to develop technologies that "might be useful" for someone later on.  What's more, the biggest provider of cloud services to the researchers surveyed was Microsoft's Azure platform (30% of users).  This struck me as odd since Amazon is the most well-known cloud provider out there; as it turns out, the majority of these projects based on Microsoft Azure were funded, in part or in whole, by Microsoft.  22% of projects were principally using Amazon Web Services, and again, Amazon provided funding for the research performed.

Outlook

This all paints a picture where HPC in the cloud is almost entirely self-driven.  Cloud providers are paying researchers to develop infrastructure and tools for a demand that doesn't exist.  They are giving away compute time, and even then, the published scientific output from the cloud is so scarce that, after several days of scouring journals, I have found virtually no published science that acknowledges compute time on a cloud platform.  There are a huge number of frameworks and hypothesized use cases, but the only case studies I could find in a domain science are rooted in private companies developing and marketing a product using cloud-based infrastructure.

Ultimately, HPC in the cloud is just not here.  There's a lot of talk, a lot of tools, and a lot of development, but there just isn't a lot of science coming out.  Cloud computing is ultimately a business model, and it is extremely attractive as a business model to a business.  However, it is not a magical new paradigm-shifting, disruptive, or whatever else technology for science, and serious studies commissioned to objectively consider cloud-HPC have consistently come back with overall negative outlooks.

The technology is rapidly changing, and I firmly believe that the performance gap is slowly closing.  I've been involved with the performance testing of some pretty advanced technologies aimed at exactly this, and the real-world application performance (which I would like to write up and share online sometime) looks really promising.  The idea of being able to deploy a "virtual appliance" for turn-key HPC in a virtualized environment is very appealing.  Remember, though, that everything that is virtual is not necessarily cloud.

The benefits of the cloud over a dedicated supercomputing platform which supports VMs do not add up.  Yes, cloud can be cheap, but supercomputing is free.  Between the U.S. Department of Energy and the National Science Foundation, there already exists an open high-performance computing ecosystem and cyberinfrastructure with capability, capacity, and ease of use that outstrips Amazon and Azure.  The collaborations, scientific expertise, and technological knowhow are already all in one place.  User support is free and unlimited, whereas cloud platforms provide none of that.

Turning to the cloud for HPC just doesn't make a lot of sense, and the facts and demonstrated use cases back that up.  If I haven't beaten the dead horse by this point, I've got another set of notes I've prepared from another in-depth assessment of HPC in the cloud.  If the above survey (which is admittedly limited in scope) doesn't prove my point, I'd be happy to provide more.

Now let's wait until July 1 to see if the National Science Foundation's interpretation of the survey is anywhere close to mine.

Thoughts on the NSF Future Directions Interim Report

The National Academies recently released an interim report entitled Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science and Engineering in 2017-2020 as a part of a $723,000 award commissioned to take a hard look at where the NSF's supercomputing program is going.  Since releasing the interim report, the committee has been soliciting feedback and input from the research community to consider as they draft their final report, and I felt compelled to put some of my thoughts into a response.

NSF's HPC programs are something I hold near and dear since I got my start in the industry by supporting two NSF-owned supercomputers.  I put a huge amount of myself into Trestles and Gordon, and I still maintain that job encompassed the most engaging and rewarding work I've ever done.  However, the NSF's lack of a future roadmap for its HPC program made my future feel perpetually uncertain, and this factored heavily in my decision to eventually pursue other opportunities.

Now that I am no longer affiliated with NSF, I wanted to delineate some of the problems I observed during my time on the inside with the hope that someone more important than me really thinks about how they can be addressed.  The report requested feedback in nine principal areas, so I've done my best to contextualize my thoughts with the committee's findings.

With that being said, I wrote this all up pretty hastily.  Some of it may be worded strongly, and although I don't mean to offend anybody, I stand by what I say.  That doesn't mean that my understanding of everything is correct though, so it's probably best to assume that I have no idea what I'm talking about here.

Finally, a glossary of terms may make this more understandable:

  • XD is the NSF program that funds XSEDE; it finances infrastructure and people, but it does not fund supercomputer procurements or operations
  • Track 1 is the program that funded Blue Waters, the NSF's leadership-class HPC resource
  • Track 2 is the program that funds most of the XSEDE supercomputers.  It funded systems like Ranger, Keeneland, Gordon, and Stampede



1. How to create advanced computing infrastructure that enables integrated discovery involving experiments, observations, analysis, theory, and simulation.

Answering this question involves a few key points:
  1. Stop treating NSF's cyberinfrastructure as a computer science research project and start treating it like research infrastructure operation.  Office of Cyberinfrastructure (OCI) does not belong in Computer & Information Science & Engineering (CISE).
  2. Stop funding cyberinfrastructure solely through capital acquisition solicitations and restore reliable core funding to NSF HPC centers.  This will restore a community that is conducive to retaining expert staff.
  3. Focus OCI/ACI and raise the bar for accountability and transparency.   Stop funding projects and centers that have no proven understanding of operational (rather than theoretical) HPC.
  4. Either put up or give up.  The present trends in funding lie on a road to death by attrition.  
  5. Don't waste time and funding by presuming that outsourcing responsibility and resources to commercial cloud or other federal agencies will effectively serve the needs of the NSF research community.
I elaborate on these points below.

2. Technical challenges to building future, more capable advanced computing systems and how NSF might best respond to them.

"Today’s approach of federating distributed compute- and data-intensive resources to meet the increasing demand for combined computing and data capabilities is technically challenging and expensive."
This is true.
"New approaches that co-locate computational and data resources might reduce costs and improve performance. Recent advances in cloud data center design may provide a viable integrated solution for a significant fraction of (but not all) data- and compute-intensive and combined workloads."
This strong statement is markedly unqualified and unsubstantiated.  If it is really recommending that the NSF start investing in the cloud, consider the following:
  • Cloud computing resources are designed for burst capabilities and are only economical when workloads are similarly uneven.  In stark contrast, most well-managed HPC systems see constant, high utilization, which is exactly where the cloud becomes economically intractable.
  • The suggestion that cloud solutions can "improve performance" is unfounded.  At a purely technological level, the cloud will never perform as well as unvirtualized HPC resources, period.  Data-intensive workloads and calculations that require modest inter-node communication will suffer substantially.

In fact, if any cost reduction or performance improvement can be gained by moving to the cloud, I can almost guarantee that incrementally more can be gained by simply addressing the non-technological aspects of the current approach of operating federated HPC.  Namely, the NSF must
  1. Stop propping up failing NSF centers who have been unable to demonstrate the ability to effectively design and operate supercomputers. 
  2. Stop spending money on purely experimental systems that domain scientists cannot or will not use.

The NSF needs to re-focus its priorities and stop treating the XD program like a research project and start treating it like a business.  Its principal function should be to deliver a product (computing resources) to customers (the research community).  Any component that is not helping domain scientists accelerate discovery should be strongly scrutinized.  Who are these investments truly satisfying?
"New knowledge and skills will be needed to effectively use these new advanced computing technologies."
This is a critical component of XD that is extremely undervalued and underfunded.  Nobody is born with the ability to know how to use HPC resources, and optimization should be performed on users in addition to code.  There is huge untapped potential in collaborative training between U.S. federal agencies (DOE, DOD) and European organizations (PRACE).  If there is bureaucratic red tape in the way, it needs to be dealt with at an official level or circumvented at the grassroots level.

3. The computing needs of individual research areas.

XDMoD shows this.  The principal workloads across XSEDE are from traditional domains like physics and chemistry, and the NSF needs to recognize that this is not going to change substantially over the lifetime of a program like XD.

Straight from XDMoD for 2014.  MPS = math and physical sciences, BIO = biological sciences, GEO = geosciences.  The mapping to NSF directorate is not perfect; for example, I found that many projects in BIO were actually chemistry and materials science.


While I wholeheartedly agree that new communities should be engaged by lowering the barriers to entry, these activities cannot come at the expense of undercutting the resources required by the majority of XD users.

The cost per CPU cycle should not deviate wildly between Track 2 awards because the ROI on very expensive cycles will be extremely poor.  If the NSF wants to fund experimental systems, it needs to do that as an activity that is separate from the production resources.  Alternatively, only a small fraction of each award should be earmarked for new technologies that represent a high risk; the Stampede award was a fantastic model of how a conservative fraction of the award (10%) can fund an innovative and high-risk technology.

4. How to balance resources and demand for the full spectrum of systems, for both compute- and data-intensive applications, and the impacts on the research community if NSF can no longer provide state-of-the-art computing for its research community.

"But it is unclear, given their likely cost, whether NSF will be able to invest in future highest-tier systems in the same class as those being pursued by the Department of Energy, Department of Defense, and other federal mission agencies and overseas."
The NSF does not have the budget to support leadership computing.  This is clear even from a bird's eye view: DOE ASCR's budget for FY2012 was $428 million and, by comparison, NSF ACI's budget was only $211 million.  Worse yet, despite having half the funding of its DOE counterpart, the NSF owned HPC resources at seven universities in FY2012 compared to ASCR's three centers.

Even if given the proper funding, the NSF's practice of spreading Track 2 awards across many universities to operate its HPC assets is not conducive to operating leadership computing.  The unpredictable nature of Track 2 awards has resulted in very uneven funding for NSF centers which, quite frankly, is a terrible way to attract and retain the highly knowledgeable world-class staff that is necessary to operate world-class supercomputers.

5. The role of private industry and other federal agencies in providing advanced computing infrastructure.

The report makes some very troubling statements in reference to this question.
"Options for providing highest-tier capabilities that merit further exploration include purchasing computing services from federal agencies…"
This sounds dirty.  Aren't there regulations in place that restrict the way in which money can flow between the NSF and DOE?  I'm also a little put off by the fact that this option is being put forth in a report that is crafted by a number of US DOE folks whose DOE affiliations are masked by university affiliations in the introductory material.
"…or by making arrangements with commercial services (rather than more expensive purchases by individual researchers)."
Providing advanced cyberinfrastructure for the open science community is not a profitable venture.  There is no money in HPC operations.  I do not see any "leadership" commercial cloud providers offering the NSF a deal on spare cycles, and the going rate for commercial cloud time is known to be far more expensive than deploying HPC resources in-house at the national scale.

6. The challenges facing researchers in obtaining allocations of advanced computing resources and suggestions for improving the allocation and review processes.

"Given the “double jeopardy” that arises when researchers must clear two hurdles—first, to obtain funding for their research proposal and, second, to be allocated the necessary computing resources—the chances that a researcher with a good idea can carry out the proposed work under such conditions is diminished."
XD needs to be more tightly integrated with other award processes to mitigate the double jeopardy issue.  I have a difficult time envisioning the form which this integration would take, but the NSF GRF's approach of prominently featuring NSF HPC resources as a part of the award might be a good start.  As an adaptive proposal reviewer within XSEDE and a front-line interface with first-time users, I found that having the NSF GRF bundle XSEDE time greatly reduced the entry barrier for new users and made it easier for us reviewers to stratify the proposals.  Another idea may be to invite NSF center staff to NSF contractors' meetings (if such things exist; I know they do for DOE BES) to show a greater amount of integration across NSF divisions.

In addition, the current XSEDE allocation proposal process is extremely onerous.  The document that describes the process is ridiculously long and contains obscure requirements that serve absolutely no purpose.  For example, all XSEDE proposals require a separate document detailing the scaling performance of their scientific software.  Demonstrating an awareness of the true costs of performing certain calculations has its merits, but a detailed analysis of scaling is not even relevant for the majority of users who run modest-scale jobs or use off-the-shelf black-box software like Gaussian.  The only thing these obscure requirements do is prevent new users, who are generally less familiar with all of the scaling requirements nonsense, from getting any time.  If massive scalability is truly required by an application, the PI needs to be moved over to the Track 1 system (Blue Waters) or referred to INCITE.

As a personal anecdote, many of us center staff found ourselves simply short-circuiting the aforementioned allocations guide and providing potential new users with a guide to the guide.  It was often sufficient to provide a checklist of minutiae whose absence would result in an immediate proposal rejection and then allow the PIs to do what they do best—write scientific proposals for their work.  Quite frankly, the fact that we had to provide a guide to understanding the guide to the allocations process suggests that the allocations process itself is grossly over-engineered.

7. Whether wider and more frequent collection of requirements for advanced computing could be used to inform strategic planning and resource allocation; how these requirements might be used; and how they might best be collected and analyzed.

The XD program has already established a solid foundation for reporting the popularity and usability of NSF HPC resources in XDMoD.  The requirements of the majority are evolving more slowly than computer scientists would have everyone believe.

Having been personally invested in two Track 2 proposals, I have gotten the impression that the review panels who decide the destiny of the NSF's future HPC portfolio are more impressed by cutting-edge, albeit untested and under-demanded, proposals.  Consequently, taking a "functional rather than a technology-focused or structural approach" to future planning will result in further loss of focus.  Instead of delivering conservatively designed architectures that will enjoy guaranteed high utilization, functional approaches will give way to computer scientists on review panels dictating what resources domain scientists should be using to solve their problems.  The cart will be before the horse.

Instead, it would be far more valuable to include more operational staff in strategic planning.  The people on the ground know how users interact with systems and what will and won't work.  As with the case of leadership computing, the NSF does not have the financial commitment to be leading the design of novel computing architectures at large scales.  Exotic and high-risk technologies should be simply left out of the NSF's Track 2 program, incorporated peripherally but funded through other means (e.g., MRIs), or incorporated in the form of a small fraction of a larger, lower-risk resource investment.

A perspective on the greater context of this has been eloquently written by Dr. Steven Gottlieb.  Given his description of the OCI conversion to ACI, it seems like taking away the Office of Cyberinfrastructure's (OCI's) autonomy and placing it under Computer & Information Science & Engineering (CISE) exemplifies an ongoing and significant loss of focus within NSF.  This change reflected the misconception that architecting and operating HPC resources for domain sciences is a computer science discipline.

This is wrong.

Computer scientists have a nasty habit of creating tools that are intellectually interesting but impractical for domain scientists.  These tools get "thrown over the wall," never to be picked up, and represent an overall waste of effort in the context of operating HPC services for non-computer scientists.  Rather, operating HPC resources for the research community requires experienced technical engineers with a pragmatic approach to HPC.  Such people are most often not computer scientists, but former domain scientists who know what does and doesn't work for their respective communities.

8. The tension between the benefits of competition and the need for continuity as well as alternative models that might more clearly delineate the distinction between performance review and accountability and organizational continuity and service capabilities.

"Although NSF’s use of frequent open competitions has stimulated intellectual competition and increased NSF’s financial leverage, it has also impeded collaboration among frequent competitors, made it more difficult to recruit and retain talented staff, and inhibited longer-term planning."
Speaking from firsthand experience, I can say that working for an NSF center means a life of perpetually uncertain funding and of dicing up FTEs into frustratingly tiny pieces.  While some people are driven by competition and fundraising (I am one of them), an entire organization built up to support multi-million dollar cyberinfrastructure cannot be sustained this way.

At the time I left my job at an NSF center, my salary was covered by six different funding sources at levels ranging from 0.05 to 0.30 FTEs.  Although this officially meant that I was only 30% committed to directly supporting the operation of one of our NSF supercomputers, the reality was that I (and many of my colleagues) simply had to put more than 100% of my time into the job.  This is a very high-risk way to operate because committed individuals get noticed and almost invariably receive offers of stable salaries elsewhere.  Retaining talent is extremely difficult when you have the least to offer, and the current NSF funding structure makes it very difficult for centers to do much more than continually hire entry-level people to replace the rising stars who find greener pastures.

Restoring reliable, core funding to the NSF centers would allow them to re-establish a strong foundation that can be an anchor point for other sites wishing to participate in XD.  This will effectively cut off some of the current sites operating Track 2 machines, but frankly, the NSF has spread its HPC resources over too many sites at present and is diluting its investments in people and infrastructure.  The basis for issuing this core funding could follow a pattern similar to that of XD where long-term (10-year) funding is provisioned with a critical 5-year review.

If the NSF cannot find a way to re-establish reliable funding, it needs to accept defeat and stop trying to provide advanced cyberinfrastructure.  The current method of only funding centers indirectly through HPC acquisitions and associated operations costs is unsustainable for two reasons:
  • The length of these Track 2 awards (typically 3 years of operations) makes future planning impossible.  Thus, this current approach forces centers to follow high-risk and inadequately planned roadmaps.
  • All of the costs associated with maintaining world-class expertise and facilities have to come from someone else's coffers.  Competitive proposals for HPC acquisitions simply cannot afford to request budgets that include strong education, training, and outreach programs, so these efforts wind up suffering.


9. How NSF might best set overall strategy for advanced computing-related activities and investments as well as the relative merits of both formal, top-down coordination and enhanced, bottom-up process.

Regarding the top-down coordination, the NSF should drop the Track 2 program's current solicitation model where proposers must have a vendor partner to get in the door.  This is unnecessarily restrictive and fosters an unhealthy ecosystem where vendors and NSF centers are both scrambling to pair up, resulting in high-risk proposals.  Consider the implications:
  1. Vendors are forced to make promises that they may not be able to fulfill (e.g., Track 2C and Blue Waters).  Given that these two of the nine solicitations resulted in substantial wastes of time and money (over a 20% vendor failure rate!), I find it shocking that the NSF continues to operate this way.
  2. NSF centers are only capable of choosing the subset of vendors who are willing to play ball with them, resulting in a high risk of sub-optimal pricing and configurations for the end users of the system.

I would recommend a model, similar to many European nations', where a solicitation is issued for a vendor-neutral proposal to deploy and support a program that is built around a resource.  A winning proposal is selected based on not only the system features, its architecture, and the science it will support, but the plan for training, education, collaboration, and outreach as well.  Following this award, the bidding process for a specific hardware solution begins.

This addresses the two high-risk processes mentioned above and simultaneously eliminates the current qualification in Track 2 solicitations that no external funding can be included in the proposal.  By leaving the capital expenses out of the selection process, the NSF stands to get the best deal from all vendors and other external entities independent of the winning institution.

Bottom-up coordination is much more labor-intensive because it requires highly motivated people at the grassroots to participate.  Given the NSF's current inability to provide stable funding for highly qualified technical staff, I cannot envision how this would actually come together.

More Conjecture on KNL's Near Memory

This morning, The Platform ran an interesting collection of conjectures on how KNL's on-package MCDRAM might be used, and I recommend reading through it if you're following the race to exascale.  I was originally going to write this commentary as a Google+ post, but it got a little long, so pardon the lack of a proper lead-in here.

I appreciated Mr. Funk's detailed description of how processor caches interact with DRAM, and how this might translate into KNL's caching mode.  However, in his discussion of how MCDRAM may act as an L3 cache, he underplays exactly why MCDRAM (and the GDDR on KNC) exists on these manycore architectures.  On-package memory is not simply another way to get better performance out of the manycore processor; rather, it is a hard requirement for keeping all 60+ cores (and their 120+ 512-bit vector registers, 1.8+ MB of L1 data cache, etc.) loaded.  Without MCDRAM, it would be physically impossible for these KNL processors to achieve their peak performance due to memory starvation.  By extension, Mr. Funk's assumption that this MCDRAM will come with substantially lower latency than DRAM might not be true.

As a matter of fact, the massive parallelism game is not about latency at all; it came about as a result of latencies hitting a physical floor.  So, rather than driving clocks up to lower latency and increase performance, the industry has been throwing more, slower cores at a given problem to mask the latencies of data access for any given worker.  While one thread may be stalled due to a cache miss on a Xeon Phi core, the other three threads are keeping the FPU busy to achieve the high efficiency required for performance.  This is at the core of the Xeon Phi architecture (as well as every other massively parallel architecture including GPUs and Blue Gene), so it is unlikely that Intel has sacrificed its power envelope to actually give MCDRAM lower latency than the off-package DRAM on KNL nodes.

At an architectural level, accesses to MCDRAM still need to go through memory controllers, just like accesses to off-package DRAM.  Intel hasn't been marketing the MCDRAM controllers as "cache controllers," so it is likely that the latencies of memory access are on par with those of the off-package memory controllers.  There are simply more of these parallel MCDRAM controllers (eight) relative to off-package DRAM controllers (two), again suggesting that bandwidth, not latency, is MCDRAM's primary capability.

Judging by current trends in GPGPU and KNC programming, I think it is far more likely that this caching mode acts at a much higher level, and Intel is providing it as a convenience for (1) algorithmically simple workloads with highly predictable memory access patterns, and (2) problems that will fit entirely within MCDRAM.  As with OpenACC, I'm sure there will be some problems where explicit on/off-package memory management (analogous to OpenACC's copyin, copyout, etc.) isn't necessary and cache mode will be fine.  Intel will also likely provide all of the necessary optimizations in their compiler collection and MKL to make many common operations (BLAS, FFTs, etc.) work well in cache mode, as they did for KNC's offload mode.

However, to answer Mr. Funk's question of "Can pre-knowledge of our application’s data use--and, perhaps, even reorganization of that data--allow our application to run still faster if we instead use Flat Model mode," the answer is almost unequivocally "YES!" Programming massively parallel architectures has never been easy, and magically transparent caches rarely deliver reliable, high performance.  Even the L1 and L2 caches do not work well without very deliberate application design to accommodate wide vectors; cache alignment and access patterns are at the core of why, in practice, it's difficult to get OpenMP codes working with high efficiency on current KNC processors.  As much as I'd like to believe otherwise, the caching mode on KNL will likely be even harder to effectively utilize, and explicitly managing the MCDRAM will be an absolute requirement for the majority of applications.
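
For those wondering what explicit MCDRAM management might look like in flat mode, below is a minimal sketch using the memkind library's hbwmalloc interface.  To be clear, this is my own guess at a plausible programming model rather than anything Intel has committed to for KNL; the array names and sizes are placeholders, and the code assumes a system where memkind is installed and MCDRAM is exposed as high-bandwidth memory.

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth memory API; link with -lmemkind */

int main(void)
{
    size_t n = 1 << 24;   /* placeholder working-set size */

    /* Put the bandwidth-critical array in on-package MCDRAM... */
    double *hot = hbw_malloc(n * sizeof(double));

    /* ...and leave the bulk data in ordinary off-package DRAM. */
    double *cold = malloc(n * sizeof(double));

    if (hot == NULL || cold == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        hot[i] = cold[i] = (double)i;

    hbw_free(hot);
    free(cold);
    return 0;
}

The point is simply that, much like OpenACC's data directives, the programmer ends up deciding which arrays live in the fast memory rather than trusting a transparent cache to get it right.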

What "university-owned" supercomputers mean to me

The opinions and analysis expressed here are solely my own and do not reflect those of my employer or its funders.

insideHPC ran an article yesterday called "Purdue Supercomputing Empowers Researchers" which references an editorial called "What a supercomputer at Purdue means to you." I gather the latter article, written by a faculty member at Purdue, is intended to talk up Purdue's recent deployment of Conte, a $4.6 million cluster packed with Xeon Phis, to a general and non-technical public audience.  I can appreciate the desire to drum up public interest in a new supercomputer deployment, and I'd be making as many press releases as I could if I just stood up a $4.6 million machine.

With that being said, I found the article a bit offensive because it appears to ignore the fact that the U.S. has a national supercomputing infrastructure that gives supercomputing time to anyone at a university or research institution.  The author is either uninformed or disingenuous and seems to throw programs like XSEDE and INCITE under the bus, all in the name of making Purdue's new machine sound like the best thing on the planet.  Like I said, I can understand wanting to promote a new supercomputing investment, but not when that public promotion comes at the expense of the reality of research computing.

The Real Buy-in Price of Supercomputing

One of the first selling points presented is that a core-hour at Purdue costs $0.005 instead of $1.00, and this 200x discount seems like a great deal.  In reality though, $1.00 per core-hour is a ludicrous price that nobody actually pays; our own campus cluster at UC San Diego charges $0.025 per real "CPU" core hour (as opposed to a Xeon Phi "accelerator" core, which is about 10x slower than a real CPU) for no-commitment supercomputing time.  This cost drops to the equivalent of $0.015 for researchers who want to buy in by contributing hardware; this sort of hardware buy-in is analogous to the $2 million that Purdue's faculty contributed to Conte's $4.6 million acquisition cost.

Of course, this cost of several cents per CPU-hour is the price you'd pay only if you wanted to pay at all; any researcher with a scientifically compelling need for supercomputing time can actually get it for free through the National Science Foundation's XSEDE program.  Researchers in the U.S. can obtain time on any of XSEDE's twelve supercomputers at absolutely no cost.  In fact, anyone at a U.S. university who can write a brief abstract and wait two weeks can easily get up to 100,000 CPU-hours on Stampede alone, the #6 fastest computer in the world.  Those 100,000 hours come with access to free, 24/7 user support from a nationwide pool of supercomputing specialists (disclaimer: myself included), the opportunity to apply for extended collaborative support, and the ability to enjoy the luxuries offered by the economies of scale that come with XSEDE's funding.

The Long Wait Times Myth

Another statement that the article's author makes is particularly misinformed and a bit offensive to me:
"This allows our NEMO group to do large-scale development work every day on campus, instead of waiting to run a few experiments at a national center, as is the experience of most researchers in my field."
This line seems to imply that one must wait days to run a job on a national supercomputer, which is an outdated viewpoint.  Below are the average wait times for jobs across all of XSEDE in 2012:



This data, as well as a wealth of other user metrics, is publicly available at XDMoD, and anyone can see that the average job waits far less than "days" to run.  Even then, the hours-long wait times given above are proportional to how much compute time the job needs.  With careful planning, Trestles, XSEDE's high-throughput supercomputer, routinely makes users wait only a fraction of their requested runtime (e.g., a 2-hour job only has to wait a half hour in the queue) before the job launches.  By limiting how many core-hours are given out each quarter and using a custom-made job scheduler, researchers can have their jobs launch with minimal waiting on a large, shared resource.

Ultimately, wait times are a necessary part of supercomputing simply because of demand.  The article's author implies that there is little or no wait time on Conte; this is only physically possible if there is little or no demand for the full capacity of the supercomputer.  The prospect of an under-utilized $4.6 million investment would probably not sit well with Indiana's taxpayers, tuition payers, project management, or deans, so something about this statement of not having to wait doesn't add up.  Of course, demand and wait times are very low when a supercomputer has just been put into production, and both of these figures go up as users get access to the new machine (e.g., TACC's Ranger):


Remote Resources Comparable to Physical Labs?

The author then draws odd parallels between universities offering state-of-the-art labs and universities offering state-of-the-art supercomputers--the fact is, 99% of supercomputer users never physically interact with the machine itself.  It makes little difference if a supercomputer's user is on the same campus or in a different state because access is inherently remote.  This is wholly unlike a university owning a very expensive microscope, where people have to physically load samples and run analyses.

Even then, the state of the art in experimental research is moving away from every university having its own state-of-the-art equipment.  The capital investment required to do some of the cutting-edge research that needs to be done is inherently beyond the affordability of individual research groups, and an increasing number of collaborative efforts (a notable example being the Large Hadron Collider) involve locating a very expensive instrument at one place, but allowing researchers from around the nation or world to use it remotely.  The U.S.'s national labs operate various accelerators and beam-lines which, like supercomputers, are too expensive for universities to purchase.  Incidentally, there can also be wait times associated with those facilities.

Research is necessarily becoming collaborative beyond the university level, and that is the whole point of consolidating and dividing the costs of research equipment.  The perspective that a single university having a single piece of cutting edge equipment is what will make its research output world-class is very outdated.

Gateways

The talk of "HUBs" that follows in the article actually describes what the rest of the world calls "gateways," and many supercomputing consortiums have been supporting these for years.  XSEDE hosts several dozen gateways alone, and although the article downplays it, they are a really powerful way to bridge the gap between researchers and supercomputing.  So powerful, in fact, that one of the gateways that XSEDE supports, CIPRES, burns 1.6 million core-hours per month.  This represents about a quarter of the total core-hours that Conte can provide with its CPUs, and a scale of use that really isn't satisfied by one or two medium-sized clusters as the article might lead the reader to believe.

The author also seems to play both sides of the argument that a university-owned supercomputer will provide unique opportunities to the faculty at that university.  Supporting gateways means accepting users from around the world, as the author states.  However, this also increases demand significantly, which increases wait times and puts machines that host gateways on the same level as the supercomputers available nationally.  Either Conte shows a high demand for supercomputing time through gateways and, as a result, its users compete with non-Purdue users for time, or it shows low demand for compute time through gateways and it remains an exclusive, low-wait-time resource.

Supercomputer Envy

Saying that "Schools such as Michigan, UCLA and Berkeley are looking with envy at what Purdue is able to offer new faculty" seems extremely presumptuous as well.  As I mentioned above, supercomputing is being seen as a part of the national research infrastructure, and programs like XSEDE and INCITE provide no-cost access to serious computing power.  Unlike these "university-owned" supercomputers though, such national cyberinfrastructure programs come with much better economies of scale--I will guarantee that the richness of features and the level of user support and services offered through XSEDE or INCITE surpass those provided by campus clusters.

This is especially important for machines like Conte, which use new technology like Xeon Phi coprocessors.  XSEDE staff worked closely with Intel during the design and deployment of Stampede, the world's first at-scale Xeon Phi cluster.  This expertise is now shared with the user community via regular training sessions that are provided both in person and via webcast, and getting individual support from these experts is just a matter of submitting a help ticket.  A comparable level of support is not often practical at the campus scale, where clusters are typically supported by one or two systems administrators and a mailing-list-style community support model.

Final Remarks

I'm not trying to throw water on Purdue's fire, because Conte's 580 nodes of Sandy Bridge can do a lot of useful science.  However, I don't find it productive to go to the press with misleading statements about the state of supercomputing at U.S. universities.  Despite what the author posits, there are a lot of accessible supercomputing resources available to university researchers, and they don't cost researchers millions of dollars in research expenses.  The National Science Foundation and the Department of Energy have been doing a good job of providing research cyberinfrastructure in the country, and it's disingenuous to discount these programs in any pitch of a new supercomputer for open science.

On "active learning" and teaching science

Nature ran an article last week by Dr. Mitchell Waldrop titled "Why we are teaching science wrong, and how to make it right" (or alternatively, "The science of teaching science") which really ground my gears.  The piece puts forward this growing trend of "active learning" where, rather than traditional lecture-based course instruction, students are put in a position where they must apply subject matter to solve open-ended problems.  In turn, this process of applying knowledge leads students to walk away with a more meaningful understanding of the material and demonstrate a much longer retention of the information.

It bothers me that the article seems to conflate "life sciences" with "science."  The fact that students learn material more effectively when they are required to engage with the information, rather than memorize and regurgitate it, is not new.  This "active learning" methodology may seem revolutionary to the life sciences (six of the eight advocates quoted are from the life sciences), but the fact of the matter is that this method has been the foundation of physics and engineering education for literally thousands of years.  "Active learning," which seems to be a re-branding of the Socratic method, is how critical thinking skills are developed.  If this concept of education by application is truly new to the life sciences, then that is a shortcoming that is not endemic throughout the sciences as the article's title would suggest.

The article goes on to highlight a few reasons why adoption of the Socratic method in teaching "science" is slow going, but does so while failing to acknowledge two fundamental facts about education and science: effective education takes time, and scientists are not synonymous with educators.

I have had the benefit of studying under some of the best educators I have ever known.  The views I express below are no doubt colored by this, and perhaps all of science truly is filled with ineffective educators.  However, as a former materials scientist now working in the biotech industry, I suspect that the assumptions expressed in this article (which mirror the attitudes of the biologists with whom I work) are not as universal throughout science as Dr. Waldrop would have us think.  With that being said, I haven't taught anything other than workshops for the better part of a decade, so the usual caveats about my writing apply here--I don't know what I'm talking about, so take it all with a grain of salt.

Effective education takes time

The article opens with an anecdote about how Tammy Tobin, a biology professor at Susquehanna University, has her third- and fourth-year students work through a mock viral outbreak.  While this is an undoubtedly memorable exercise that gives students a chance to apply what they learned in class, the article fails to acknowledge that one cannot actually teach virology or epidemiology this way.  This exercise is only effective for third- and fourth-year students who have spent two or three years obtaining the foundational knowledge that allows them to translate the lessons learned from this mock outbreak to different scenarios--that is, to actually demonstrate higher-order cognitive understanding of the scientific material.

As I said above though, this is not a new or novel concept.  In fact, all engineering and applied sciences curricula accredited by ABET are required to include a course exactly like this Susquehanna University experience.  Called the capstone design component, students spend their last year at university working in a collaborative setting with their peers to tackle an applied project like designing a concrete factory or executing an independent research program.  As a result, it is a fact that literally every single graduate of an accredited engineering undergraduate degree program in the United States has gone through an "active learning" project where they have to apply their coursework knowledge to solving a real-world problem.

In all fairness, the capstone project requirement is just a single course that represents a small fraction (typically less than 5%) of students' overall credits towards graduation.  This is a result of a greater fact that the article completely ignores--education takes time.  Professor Tobin's virus outbreak exercise had students looking at flight schedules to Chicago to ensure there were enough seats for a mock trip to ground zero, but remember that students were paying tuition money to do this.  In the time it took students to book fake plane tickets, how much information about epidemiology could have been conveyed in lecture format?  When Prof. Tobin says her course "looked at the intersection of politics, sociology, biology, even some economics," is that really appropriate for a virology course?

This is not to say that the detail with which Prof. Tobin's exercise was executed was a waste of time, tuition dollars, or anything else; as the article rightly points out, the students who took this course are likely to have walked away from it with a more meaningful grasp of applied virology and epidemiology than they would have otherwise.  However, the time it takes to execute these active learning projects at such a scale cuts deeply into the two or three years that most programs have to deliver all of the required material for a four-year degree.  This is why "standard lectures" remain the prevailing way to teach scientific courses--lectures are informationally dense, and the "active learning" component comes in the form of homework and projects that are done outside of the classroom.

While the article implies that homework and exercises in this context are just "cookbook exercises," I get the impression that such is only true in the life sciences.  Rote memorization in physics and engineering is simply not valued, and this is why students are typically allowed to bring cheat sheets full of equations, constants, and notes with them into exams.  Rather than providing cookbook exercises, assignments and examinations require that students be able to apply the physical concepts learned in lecture to solve problems.  This is simply how physics and engineering are taught, and it is a direct result of the fact that there are not enough hours in a four-year program to forego lecturing and still effectively convey all of the required content.

And this is not to say that lecturing has to be completely one-way communication; the Socratic method can be extremely effective in lectures.  The article cites a great example of this when describing a question posed by Dr. Sarah Leupen to her students:  What would happen if the sensory neurons in your legs stopped working as you were walking down the street?  Rather than providing all of the information to answer the question before posing the question itself, posing the question first allows students to figure out the material themselves through discussion.  The discussion is guided towards the correct answer by the lecturer's careful choice of follow-up questions to students' hypotheses to further stimulate critical thinking.

Of course, this Socratic approach in class can waste a tremendous amount of time if the lecturer is not able to effectively dial into each student's aptitudes when posing questions.  In addition, this only works for small classroom sizes; in practice, the discussion is often dominated by a minority of students and the majority simply remain unengaged.  Being able to keep all students engaged, even in a small-classroom setting, requires a great deal of skill in understanding people and how to motivate them.   Finding the right balance of one-sided lecturing and Socratic teaching is an exercise in careful time economics which can change every week.  As a result, it is often easier to simply forego the Socratic method and just deliver lecture; however, this is not always a matter of stodginess or laziness as the article implies, but simply weighing the costs given a fixed amount of material and a fixed period of time.

"Active learning"can be applied in a time-conservative way; this is the basis for a growing number of intensive, hands-on bootcamp programs that teach computer programming skills in twelve weeks. These programs eschew teaching the foundational knowledge of computer science and throw their students directly into applying it in useful (read: employable) ways.  While these programs certainly produce graduates who can write computer programs, these graduates are often unable to grasp important design and performance considerations because they lack a knowledge of the foundations.  In a sense, this example of how applied-only coursework produces technicians, not scientists and engineers.

Scientists are not always educators

The article also cites a number of educators and scientists (all in the life sciences, of course) who are critical of other researchers for not investing time (or alternatively, not being incentivized to invest time) into exploring more effective teaching methodologies.  While I agree that effective teaching is the responsibility of anyone whose job is to teach, the article carries an additional undertone asserting that researchers should be effective teachers.  The problem is that this is not true; the entanglement of scientific research and scientific education is a result of necessity, and the fact of the matter is that there is a large group of science educators who teach simply because they are required to.

I cannot name a single scientist who went through the process of earning a doctorate in science or engineering because he or she wanted to teach.  Generally speaking, scientists become scientists because they want to do science, and teaching is often a byproduct of being one of the elite few who have the requisite knowledge to actually teach others how to be scientists or engineers.  This is not to say that there are no good researchers who also value education; this article's interviews are a testament to that.  Further, the hallmarks of great researchers and great educators overlap; dissemination of new discoveries is little more than being the first person to teach a new concept to other scientists.  However, the issue of science educators often being uninterested in effective teaching techniques can only be remedied by first acknowledging that teaching is not always most suitably performed by researchers.

The article does speak to some progress being made by institutions that include teaching as a criterion for tenure review.  However, the notion of tenure is, at its roots, tied to preserving the academic freedom to do research in controversial areas.  It has little to do with the educational component of being a professor, so to a large degree, it does make sense to base tenure decisions largely on the research productivity, not the pedagogical productivity, of individuals.  Thus, the fact that educators are being driven to focus on research over education is a failing of the university brought about by this entanglement of education and research.

Actually building a sustainable financial model that supports this disentangling of education from research is not something I can pretend to do.  Just as effective teaching takes time, it also costs money, and matching every full-time researcher with a full-time educator across every science and engineering department at a university would not be economical.  However just as there are research professors whose income is derived solely from grants, perhaps there should be equivalent positions for distinguished educators who are fully supported by the university.  As it stands, there is little incentive (outside of financial necessity) for any scientist with a gift for teaching to become a full-time lecturer within the typical university system.

Whatever form progress may take though, as long as education remains entangled with research, the cadence of improvement will be set by the lowest common denominator.

An uninformed perspective on TaihuLight's design

Note: What follows are my own personal thoughts, opinions, and analyses.  I am not a computer scientist and I don't really know anything about processor design or application performance, so it is safe to assume I don't know what I'm talking about.  None of this represents the views of my employer, the U.S. government, or anyone except me.

China's new 93 PF TaihuLight system is impressive given its indigenous processor design and the substantial increase in HPL score over the #2 system, Tianhe-2.  The popular media has started covering this new system and the increasing presence of Chinese systems on Top500, suggesting that China's string of #1 systems may be a sign of shifting tides.  And maybe it is.  China is undeniably committed to investing in supercomputing and positioning itself as a leader in extreme-scale computing.

That being said, the TaihuLight system isn't quite the technological marvel and threat to the HPC hegemony that it may seem at first glance.  The system features some critically limiting design choices that make it smell like a supercomputer that was designed to be #1 on Top500, not to solve scientific problems.  This probably sounds like sour grapes at this point, so let's take a look at some of the details.

Back-of-the-envelope math

Consider the fact that each TaihuLight node turns in 3,062 GFLOPS (that's 3 TFLOPS) and has 136.51 GB/sec of memory bandwidth.  This means that in the time it takes the processor to load two 64-bit floats from memory, it could theoretically perform over 350 floating point operations.  But it won't, because in that time it can only load the two operands needed for a single FLOP.

Of course, this is an oversimplification of how CPUs work.  Caches exist to feed the extremely high operation rate of modern processors, and where there are so many cores that their caches can't be fed fast enough, we see technologies like GDDR DRAM and HBM (on accelerators) and on-package MCDRAM (on KNL) appearing so that dozens or hundreds of cores can all retrieve enough floating-point operands from memory to sustain high rates of floating point calculations.

However, the ShenWei SW26010 chips in the TaihuLight machine have neither GDDR nor MCDRAM; they rely on four DDR3 controllers delivering an aggregate 136 GB/sec to keep all 256 compute elements fed with data.  Dongarra's report on the TaihuLight design briefly mentions this high skew:

"The ratio of floating point operations per byte of data from memory on the SW26010 is 22.4 Flops(DP)/Byte transfer, which shows an imbalance or an overcapacity of floating point operations per data transfer from memory. By comparison the Intel Knights Landing processor with 7.2 Flops(DP)/Byte transfer."

This measure of "Flops(DP)/Byte transfer" is called arithmetic intensity, and it is a critical optimization parameter when writing applications for manycore architectures.  Highly optimized GPU codes can show arithmetic intensities of around 10 FLOPS/byte, but such applications are often the exception; there are classes of problems that simply do not have high arithmetic intensities.  This diagram, which I stole from the Performance and Algorithms Research group at Berkeley Lab, illustrates the spectrum:


To put this into perspective in the context of hardware, let's look at the #3 supercomputer, the Titan system at Oak Ridge National Lab.  The GPUs on which it is built (NVIDIA's K20X) each have a GDDR5-based memory subsystem that can feed the 1.3 TFLOP GPUs at 250 GB/sec.  This means that Titan's FLOPS/byte ratio is around 5.3, or over 4x lower (more balanced) than the 22 FLOPS/byte of TaihuLight's SW26010 chips.

This huge gap means that an application that is perfectly balanced to run on a Titan GPU--that is, an application with an arithmetic intensity of 5.3--will only be able to pull operands out of memory fast enough to sustain about a quarter of the SW26010's peak performance.  Put simply, despite being theoretically capable of doing 3 TFLOPS of computing, TaihuLight's processors would only be able to deliver about 0.75 TFLOPS to this application.  Because of the severely limited per-node memory bandwidth, this 93 PFLOP system would perform like a 23 PFLOP system on an application that, given an arithmetic intensity of 5.3, would be considered highly optimized by most standards.
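
For the curious, here is the back-of-the-envelope arithmetic behind those numbers written out as a small roofline-style sketch.  This is just my own math using the per-node figures quoted above, not anything taken from the Dongarra report:

#include <stdio.h>

int main(void)
{
    const double peak_gflops = 3062.0;    /* SW26010 peak double-precision rate    */
    const double bw_gbs      = 136.51;    /* aggregate DDR3 memory bandwidth, GB/s */
    const double app_ai      = 5.3;       /* FLOPS/byte of a Titan-balanced app    */

    /* FLOPS the chip could retire in the time it takes to load two doubles (16 bytes) */
    printf("flops per 16-byte load: %.0f\n", peak_gflops / bw_gbs * 16.0);

    /* Roofline-style ceiling: min(peak, arithmetic intensity x bandwidth) */
    double ceiling = app_ai * bw_gbs;
    if (ceiling > peak_gflops)
        ceiling = peak_gflops;
    printf("memory-bound ceiling: %.0f GFLOPS (%.0f%% of peak)\n",
           ceiling, 100.0 * ceiling / peak_gflops);

    return 0;
}

Running this prints a ceiling of roughly 720 GFLOPS, or about 24% of the SW26010's peak, which is where the "93 PFLOP system performing like a 23 PFLOP system" figure comes from.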

Of course, the indigenous architecture also means that application developers will have to rely on indigenous implementations or ports of performance runtimes like OpenMP and OpenACC, libraries like BLAS, and ISA-specific vector intrinsics.  The maturity of this software stack for the ShenWei-64 architecture remains unknown.

What is interesting

This all isn't to say that the TaihuLight system isn't a notable achievement; it is the first massive-scale deployment of a CPU-based manycore processor, it is the first massive-scale deployment of EDR InfiniBand, and its CPU design is extremely interesting in a number of ways.

The CPU block diagrams included in Dongarra's report are a bit like a Rorschach test; my esteemed colleagues at The Next Platform astutely pointed out the SW26010's similarities to KNL, but my first reaction was to compare it with IBM's Cell processor:

IBM Cell BE vs. ShenWei SW26010.  Cell diagram stolen from NAS; SW26010 diagram stolen from the Dongarra report.

The Cell processor was ahead of its time in many ways and arguably the first manycore chip targeted at HPC.  It had
  • a single controller core (the PPE) with L1 and L2 caches
  • eight simpler cores (the SPEs) on an on-chip network with no L2 cache, but an embedded SRAM scratchpad
and by comparison, the SW26010 has
  • a single controller core (the MPE) with L1 and L2 caches
  • sixty-four simpler cores (the CPEs) on an on-chip network with no L2 cache, but an embedded SRAM scratchpad
Of course, the similarities are largely superficial and there are vast differences between the two architectures, but the incorporation of heterogeneous (albeit very similar) cores on a single package is quite bold and is a design point that may play a role in exascale processor designs:

What an exascale processor might look like, as stolen from Kathy Yelick

which may feature a combination of many lightweight cores (not unlike the CPE arrays on the TaihuLight processor) accompanied by a few capable cores (not unlike the MPE cores).

The scratchpad SRAM present on all of the CPE cores is also quite intriguing, as it is a marked departure from the cache-oriented design of on-package SRAM that has dominated CPU architectures for decades.  The Dongarra report doesn't detail how the scratchpad SRAM is used by applications, but it may offer a unique new way to perform byte-granular loads and stores that do not necessarily waste a full cache line's worth of memory bandwidth if the application knows that memory access is to be unaligned.

This is a rather forward-looking design decision that makes the CPU look a little more like a GPU.  Some experimental processor designs targeting exascale have proposed eschewing deep cache hierarchies in favor of similar scratchpads:

The Traleika Glacier processor design, featuring separate control and execution blocks and scratchpad SRAM.  Adapted from the Traleika Glacier wiki page.

Whether or not we ever hear about how successful or unsuccessful these processor features are remains to be seen, but there may be valuable lessons to be learned ahead of the first generation of exascale processors from architectures like those in the TaihuLight system.

Outlook

At a glance, it is easy to call out the irony in the U.S. government's decision to ban the sale of Intel's KNL processors to the Chinese now that the TaihuLight system is public.  It is clear that China is in a position to begin building extreme-scale supercomputers without the help of Intel, and it is very likely that the U.S. embargo accelerated this effort.  As pondered by a notable pundit in the HPC community,


And this may have been the case.  However, despite the TaihuLight system's #1 position and very noteworthy Linpack performance and efficiency, it is not the massive disruptor that puts the U.S. in the back seat.  Underneath TaihuLight's shiny, 93-petaflop veneer are some cut corners that substantially lower its ability to reliably deliver performance and scientific impact commensurate with its Linpack score.  As pointed out by a colleague wiser than me, Intel's impending KNL chip is the product of years of effort, and it is likely that it will be years before ShenWei's chip designs and fabs are able to really deliver a fully balanced, competitive, HPC-oriented microarchitecture.

With that being said, TaihuLight is still a massive system, and even if its peak Linpack score is not representative of its actual achievable performance in solving real scientific problems, it is undeniably a leadership system.  Even if applications can only realize a small fraction of its Linpack performance, there is a lot of discovery to be made in petascale computing.

Further, the SW26010 processor itself features some bold design points, and being able to test a heterogeneous processor with scratchpad SRAM at extreme scale may give China a leg up in the exascale architecture design space.  Only time will tell if these opportunities are pursued, or if TaihuLight follows its predecessors into an existence of disuse in a moldy datacenter, brought on by a high electric bill, poor system design, and a lack of software.

Basics of I/O Benchmarking

Most people in the supercomputing business are familiar with using FLOPS as a proxy for how fast or capable a supercomputer is.  This measurement, as observed using the High-Performance Linpack (HPL) benchmark, is the basis for the Top500 list.  However, I/O performance is becoming increasingly important as data-intensive computing becomes a driving force in the HPC community, and even though there is no Top500 list for I/O subsystems, the IOR benchmark has become the de facto standard way to measure the I/O capability for clusters and supercomputers.

Unfortunately, I/O performance tends to be trickier to measure using synthetic benchmarks because of the complexity of the I/O stack that lies between where data is generated (the CPU) and where it will ultimately be stored (a spinning disk or SSD on a network file system).  In the interests of clarifying some of the confusion that can arise when trying to determine how capable an I/O subsystem really is, let's take a look at some of the specifics of running IOR.

Getting Started with IOR

IOR writes data sequentially with the following parameters:
  • blockSize (-b)
  • transferSize (-t)
  • segmentCount (-s)
  • numTasks (-n)
which are best illustrated with a diagram:


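In other words, each MPI process writes segmentCount blocks of blockSize bytes, moved transferSize bytes at a time, so the total data volume of a run is numTasks × blockSize × segmentCount.  For the example runs below, that works out to 64 × 16 MiB × 16 = 16 GiB.
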
These four parameters are all you need to get started with IOR.  However, naively running IOR usually gives disappointing results.  For example, if we run a four-node IOR test that writes a total of 16 GiB:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 427.36 16384 1024.00 0.107961 38.34 32.48 38.34 2
read 239.08 16384 1024.00 0.005789 68.53 65.53 68.53 2
remove - - - - - - 0.534400 2

we can only get a couple hundred megabytes per second out of a Lustre file system that should be capable of a lot more.

Switching from writing to a single shared file to one file per process using the -F (filePerProcess=1) option changes the performance dramatically:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 33645 16384 1024.00 0.007693 0.486249 0.195494 0.486972 1
read 149473 16384 1024.00 0.004936 0.108627 0.016479 0.109612 1
remove - - - - - - 6.08 1

This is in large part because letting each MPI process work on its own file cuts out any contention that would arise because of file locking.  

However, the performance difference between our naive test and the file-per-process test is a bit extreme.  In fact, the only way a 146 GB/sec read rate could be achieved on Lustre is if each of the four compute nodes had over 45 GB/sec of network bandwidth to Lustre--that is, a 400 Gbit link on every compute and storage node.

Effect of Page Cache on Benchmarking

What's really happening is that the data being read by IOR isn't actually coming from Lustre; rather, files' contents are already cached, and IOR is able to read them directly out of each compute node's DRAM.  The data wound up getting cached during the write phase of IOR as a result of Linux (and Lustre) using a write-back cache to buffer I/O, so that instead of IOR writing and reading data directly to Lustre, it's actually mostly talking to the memory on each compute node.

To be more specific, although each IOR process thinks it is writing to a file on Lustre and then reading back the contents of that file from Lustre, it is actually
  1. writing data to a copy of the file that is cached in memory.  If there is no copy of the file cached in memory before this write, the parts being modified are loaded into memory first.
  2. those parts of the file in memory (called "pages") that are now different from what's on Lustre are marked as being "dirty"
  3. the write() call completes and IOR continues on, even though the written data still hasn't been committed to Lustre
  4. independent of IOR, the OS kernel continually scans the page cache for pages that have been updated in memory but not yet on Lustre ("dirty pages"), and then commits those cached modifications to Lustre
  5. dirty pages are declared non-dirty since they are now in sync with what's on disk, but they remain in memory
Then when the read phase of IOR follows the write phase, IOR is able to just retrieve the file's contents from memory instead of having to communicate with Lustre over the network.

There are a couple of ways to measure the read performance of the underlying Lustre file system. The most crude way is to simply write more data than will fit into the total page cache so that by the time the write phase has completed, the beginning of the file has already been evicted from cache. For example, increasing the number of segments (-s) to write more data reveals the point at which the nodes' page cache on my test system runs over very clearly:


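As a rough rule of thumb (my own, not anything built into IOR), the aggregate write volume has to comfortably exceed the combined page cache of all participating compute nodes before the read phase starts coming from Lustre instead of memory; on four hypothetical nodes with 128 GiB of DRAM each, for example, that means writing well over 4 × 128 GiB = 512 GiB.
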
However, this can make running IOR on systems with a lot of on-node memory take forever.

A better option would be to get the MPI processes on each node to only read data that they didn't write.  For example, on a four-process-per-node test, shifting the mapping of MPI processes to blocks by four makes each node N read the data written by node N-1.


Since page cache is not shared between compute nodes, shifting tasks this way ensures that each MPI process is reading data it did not write.

IOR provides the -C option (reorderTasks) to do this, and it forces each MPI process to read the data written by its neighboring node.  Running IOR with this option gives much more credible read performance:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 41326 16384 1024.00 0.005756 0.395859 0.095360 0.396453 0
read 3310.00 16384 1024.00 0.011786 4.95 4.20 4.95 1
remove - - - - - - 0.237291 1

But now it should seem obvious that the write performance is also ridiculously high. And again, this is due to the page cache, which signals to IOR that writes are complete when they have been committed to memory rather than the underlying Lustre file system.

To work around the effects of the page cache on write performance, we can issue an fsync() call immediately after all of the write()s return to force the dirty pages we just wrote to flush out to Lustre. Including the time it takes for fsync() to finish gives us a measure of how long it takes for our data to write to the page cache and for the page cache to write back to Lustre.

IOR provides another convenient option, -e (fsync), to do just this. And, once again, using this option changes our performance measurement quite a bit:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C -e
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 2937.89 16384 1024.00 0.011841 5.56 4.93 5.58 0
read 2712.55 16384 1024.00 0.005214 6.04 5.08 6.04 3
remove - - - - - - 0.037706 0

and we finally have a believable bandwidth measurement for our file system.

Defeating Page Cache

Since IOR is specifically designed to benchmark I/O, it provides these options that make it as easy as possible to ensure that you are actually measuring the performance of your file system and not your compute nodes' memory.  That being said, the I/O patterns it generates are designed to demonstrate peak performance, not reflect what a real application might be trying to do, and as a result, there are plenty of cases where measuring I/O performance with IOR is not always the best choice.  There are several ways in which we can get clever and defeat page cache in a more general sense to get meaningful performance numbers.

When measuring write performance, bypassing page cache is actually quite simple: opening a file with the O_DIRECT flag causes all reads and writes to skip page cache and go directly to disk.  In addition, the fsync() call can be inserted into applications, as is done with IOR's -e option.
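
To make the write-side options concrete, here is a minimal sketch in Python of both approaches.  The file names are just placeholders, and this assumes a Linux system where os.O_DIRECT is defined; note that O_DIRECT requires the buffer address, file offset, and transfer size to be suitably aligned, which the anonymous mmap below provides.

import mmap, os

XFER = 1 << 20                      # 1 MiB; O_DIRECT needs aligned transfer sizes and offsets
buf = mmap.mmap(-1, XFER)           # an anonymous mmap gives a page-aligned, zero-filled buffer

# O_DIRECT writes bypass page cache entirely and go straight to the file system
fd = os.open("direct.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.write(fd, buf)
os.close(fd)

# Alternatively, write through page cache but force dirty pages out before closing,
# which is the same thing IOR's -e (fsync) option does
fd = os.open("buffered.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, bytes(XFER))
os.fsync(fd)
os.close(fd)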

Measuring read performance is a lot trickier.  If you are fortunate enough to have root access on a test system, you can force the Linux kernel to empty out its page cache by doing
# echo 1 > /proc/sys/vm/drop_caches
and in fact, this is often good practice before running any benchmark (e.g., Linpack) because it ensures that you aren't losing performance to the kernel trying to evict pages as your benchmark application starts allocating memory for its own use.

Unfortunately, many of us do not have root on our systems, so we have to get even more clever.  As it turns out, there is a way to pass a hint to the kernel that a file is no longer needed in page cache:
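
In Python, for example, this hint can be issued through os.posix_fadvise (available on Linux since Python 3.3); a minimal sketch, with the file name being just a placeholder:

import os

fd = os.open("/mnt/lustre/glock/bigfile.bin", os.O_RDONLY)
# POSIX_FADV_DONTNEED asks the kernel to drop this file's pages from page cache;
# an offset and length of 0 mean "the whole file"
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)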


The effect of passing POSIX_FADV_DONTNEED using posix_fadvise() is usually that all pages belonging to that file are evicted from page cache in Linux.  However, this is just a hint--not a guarantee--and the kernel evicts these pages asynchronously, so it may take a second or two for pages to actually leave page cache.  Fortunately, Linux also provides a way to probe pages in a file to see if they are resident in memory.

Finally, it's often easiest to just limit the amount of memory available for page cache.  Because application memory always takes precedence over cache memory, simply allocating most of the memory on a node will force most of the cached pages to be evicted.  Newer versions of IOR provide the memoryPerNode option that does just that, and the effects are what one would expect:


The above diagram shows the measured bandwidth from a single node with 128 GiB of total DRAM.  The first percentage in each x-axis label is the fraction of this 128 GiB that the benchmark reserved as application memory, and the second is the total I/O volume, expressed as a percentage of that same 128 GiB.  For example, the "50%/150%" data points correspond to 50% of the node memory (64 GiB) being allocated for the application, and a total of 192 GiB of data being read.

This benchmark was run on a single spinning disk which is not capable of more than 130 MB/sec, so the conditions that showed performance higher than this were benefiting from some pages being served from cache.  And this makes perfect sense given that the anomalously high performance measurements were obtained when there was plenty of memory to cache relative to the amount of data being read.
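
If you aren't using IOR, the same squeeze can be approximated with a couple of lines of Python; this is a rough sketch, and the 96 GiB figure assumes a 128 GiB node like the one above and is purely illustrative:

GiB = 1 << 30
hog = b"\0" * (96 * GiB)    # actually writes out 96 GiB of zeros, forcing the kernel to evict cached pages
input("Page cache is now squeezed; run the I/O benchmark, then press Enter to release the memory...")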

Corollary 

Measuring I/O performance is a bit trickier than CPU performance in large part due to the effects of page caching.  That being said, page cache exists for a reason, and there are many cases where an application's I/O performance really is best represented by a benchmark that heavily utilizes cache.

For example, the BLAST bioinformatics application re-reads all of its input data twice; the first time initializes data structures, and the second time fills them up.  Because the first read caches each page and allows the second read to come out of cache rather than the file system, running this I/O pattern with page cache disabled causes it to be about 2x slower:


Thus, letting the page cache do its thing is often the most realistic way to benchmark real application I/O patterns.  Once you know how page cache might be affecting your measurements, you stand a good chance of being able to reason about which performance metrics are the most meaningful.

Learning electronics with roulette, datasheets, and Raspberry Pi

I've had a few electronics kits kicking around for years now that I'd never sat down and put together.  At a glance, these kits all seemed like they were designed to be soldering practice that resulted in a fun gadget at the end of the day.  All the magical functionality was always hidden in black-box integrated circuits, so I could never figure out exactly how the circuit worked, and this frustration (combined with my poor soldering abilities) left me without much desire to do much with them.

Very recently though, it occurred to me that we now live in an age where the datasheets for many of these black-box chips are online, and it's now actually possible to pull back the curtain on what they're doing under the hood.  As it turns out, most of them are a lot simpler than I would have guessed.  And after digging through my old kits, I also realized that they are often just simple IC components connected in clever ways to perform their magic.

With this epiphany and newfound confidence understanding how these kits work, I set out to learn something new about electronics.  And given that my background in electronics has been limited to a week of electronics camp at age 13 and an 8 AM physics class in college, I figured my odds at accomplishing this were pretty good.

Velleman MK152 Spinning LED Wheel

This endeavor started with a Spinning LED Wheel kit by a Belgian company called Velleman.  It's a simple LED roulette wheel circuit where, upon pressing a button, a light spins around a ring of ten LEDs very quickly at first, then slows and eventually stops on a single "winning" LED.  The kit comes with a couple resistors, capacitors, LEDs, and two DIP chips, and is really inexpensive.


It also comes with a printed circuit board and battery pack which are supposed to be all soldered together, but I wanted to assemble this all on a breadboard for a couple of reasons:
  1. It would be a lot easier to experiment: changing resistors and capacitors to see what would happen would help me understand which circuit components are the most important.
  2. It would be easier to rebuild and improve the circuit with additional features later on.
  3. It would be easier to interface with my Raspberry Pi for debugging and improvement.
  4. It's a lot harder to screw up assembly when a soldering iron is not required!
So, with a trusty $3 breadboard and a handful of jumper wires, I set out to reproduce the circuit diagram that ships with the Velleman MK152 kit:


The biggest mystery of this kit is its two DIP chips since they are, at a glance, little black boxes:


The MK152 kit documentation includes no mention of what they actually do, making it really difficult to figure out what the circuit does with only the contents of the kit.  However, Googling their part numbers brings up a wealth of information about these chips, and it turns out that these two DIPs are a set of inverters and a decade counter:
  • The CD4069UBE chip is just six NOT gates (inverters) stuffed into a DIP package.
  • The CD4017BE chip is a decade counter, which is a neat component that has ten numbered output pins (called Q0 through Q9) and a single input pin (called CLK).  It determines which of the ten output pins is lit up at any given time using the following logic:
    • When the input pin (CLK) is first lit up, the first output pin (Q0) is lit up.
    • The next time CLK is bounced (turned off, then turned on again), the first output pin (Q0) turns off and the second pin (Q1) turns on.
    • This cycle repeats every time CLK is bounced, wrapping back around to Q0 after the tenth pin (Q9) has been lit up.
After understanding how these two ICs worked, building the kit's circuit on a breadboard seems a lot less daunting.  Because I only had long braided jumper wires though, my final product looked a bit ugly:


But it worked!



Understanding the Circuit

Not having any practical experience with electronics, I had a hard time understanding exactly how this circuit was working.  The CD4017BE IC is certainly central to this circuit's operation, and I understood that every time the voltage going into the CLK pin went up and back down, a new LED would light up.  I also understood that resistor-capacitor series have time-dependent behavior that can be used to make voltages go low and high in a very predictable manner, which could drive the CLK pin.  But how do these concepts translate into a wheel that spins, slows down, and eventually stops?

Aside from the CD4017BE decade counter, this circuit really has two distinct sections.  The first section handles the input:


Pressing the switch (SW1) charges up the 47 µF capacitor (C3) and starts the roulette wheel going.  From here, I figured out that
  • Since the C3 capacitor is the biggest one in the kit, it made sense that this is probably what drives the entire circuit after the switch is opened and the battery pack is no longer connected.  And indeed, replacing this C3 capacitor with one of smaller capacitance causes the roulette wheel to spin for a much shorter period of time before shutting off.
  • The combination of the 1 µF capacitor (C2) and the 100 kΩ resistor (R4) looks a lot like an RC series that can be used as a timer to drive the other half of the circuit (see the quick sanity check after this list).  And again, changing the capacitance of this capacitor changes the speed at which the LED wheel "spins."
  • The NOT gates (inverters) are directly connected to the C3 capacitor driving the whole circuit, so they are probably acting as a shutoff mechanism.  After the C3 capacitor discharges enough (effectively turning "off"), everything on the other side of the inverters (IC1F, IC1B, IC1C) switches on.  Since there is nothing but our LEDs north of these gates, this reversal of polarity causes the LEDs to shut off for good.
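
As a quick sanity check on the R4/C2 timer mentioned above (my own back-of-the-envelope arithmetic, not something from the kit's documentation), the time constant of that pair is τ = R × C = 100 kΩ × 1 µF = 0.1 s, which is the right order of magnitude for a wheel that visibly steps from LED to LED several times per second.
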
The other half of the circuit is what drives the actual CLK signal that causes the LEDs to light up in order.  It effectively converts the analog signal coming from our RC series into a digital signal that drives the CD4017BE decade counter.


This was (and still is) a bit harder for me to figure out since the subtleties of how analog signals interact with digital components like the NOT gates aren't very clear to me.  That being said, I figured out that
  • The IC1A inverter is what holds the CLK pin high (on) when the rest of the circuit is completely discharged.  This means that full CLK signals (going fully on, then fully off again) are driven by this IC1A gate being momentarily shut off, since its default state is high (on).
  • The 10 nF capacitor (C1) is a bit of a red herring.  The CD4069UBE datasheet recommends conditioning power using small capacitors like this, and that's exactly what this component does--removing it doesn't actually affect the rest of the circuit under normal conditions.
  • The combination of the 3.3 MΩ resistor (R2), the 470 kΩ resistor (R1), and the IC1E and IC1D inverters forms a pulse-shaping circuit.  This converts the falling (analog) voltage coming from the 1 µF capacitor (C2) on the input section into an unambiguous high or low (digital) voltage that drives IC1A, which in turn drives the CLK signal.

Integrating with Raspberry Pi

As a fun exercise in both programming and understanding the digital aspects of this circuit, I then thought it would be fun to replace the CD4017BE decade counter IC with a Raspberry Pi.  This is admittedly a very silly thing to do--that is, replacing a simple IC with a full-blown microprocessor running Linux--but I wanted to see if I could replicate what I thought the CD4017BE chip was doing using the Raspberry Pi's GPIO pins and a bit of Python.

The basic idea is that each pin on the actual CD4017BE will map to a GPIO pin on the Raspberry Pi, and then a Python script will mimic the functionality of each CD4017BE pin.  Removing all the jumper wires that fed into the CD4017BE DIP and instead plugging them into GPIO headers on the Raspberry Pi was a little messy:


I also removed the battery pack that came with the MK152 and just powered the whole circuit off of the Raspberry Pi's 5V rail.  Then, each CD4017BE pin had to be mapped to a GPIO pin:
  • CD4017BE pin 1 (Q5) mapped to GPIO pin 12
  • CD4017BE pin 2 (Q1) mapped to GPIO pin 17
  • CD4017BE pin 3 (Q0) mapped to GPIO pin 22
  • CD4017BE pin 4 (Q2) mapped to GPIO pin 5
  • CD4017BE pin 5 (Q6) mapped to GPIO pin 25
  • CD4017BE pin 6 (Q7) mapped to GPIO pin 24
  • CD4017BE pin 7 (Q3) mapped to GPIO pin 6
  • CD4017BE pin 8 (VSS) isn't needed
  • CD4017BE pin 9 (Q8) mapped to GPIO pin 27
  • CD4017BE pin 10 (Q4) mapped to GPIO pin 13
  • CD4017BE pin 11 (Q9) mapped to GPIO pin 23
  • CD4017BE pin 12 (CARRY OUT) isn't needed
  • CD4017BE pin 13 (CLOCK INHIBIT) isn't needed
  • CD4017BE pin 14 (CLOCK) mapped to GPIO pin 4
  • CD4017BE pin 15 (RESET) isn't needed
  • CD4017BE pin 16 (VDD) isn't needed
Because the logic performed by this decade counter chip is so simple, the Python code that implements the same logic is also quite simple.  Here's the minimum working code:
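
A minimal sketch of that logic, using the RPi.GPIO library and the BCM pin numbering from the mapping above (illustrative rather than a verbatim copy of any particular script):

import RPi.GPIO as GPIO

CLK = 4                                           # CD4017BE pin 14 (CLOCK)
Q = [22, 17, 5, 6, 13, 12, 25, 24, 27, 23]        # Q0 through Q9, per the mapping above

GPIO.setmode(GPIO.BCM)
GPIO.setup(CLK, GPIO.IN)
GPIO.setup(Q, GPIO.OUT, initial=GPIO.LOW)

position = 0
GPIO.output(Q[position], GPIO.HIGH)               # like the real chip, Q0 starts lit

try:
    while True:
        # every rising edge on CLK advances the lit output by one, wrapping after Q9
        GPIO.wait_for_edge(CLK, GPIO.RISING)
        GPIO.output(Q[position], GPIO.LOW)
        position = (position + 1) % 10
        GPIO.output(Q[position], GPIO.HIGH)
except KeyboardInterrupt:
    GPIO.cleanup()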

Since the Raspberry Pi only replaces the CD4017BE chip (and the battery pack), the physical button still has to be pressed to activate the circuit after the above Python script is started.  Once it's pressed though, the LED wheel works just like before!


This Python version of the decade counter logic doesn't have to stop here though; for example, I went on to implement the full CD4017BE chip in Python (including pins we don't use in this project like CARRY OUT and CLOCK INHIBIT) just for fun.  It would be trivial to also implement the CD4069UBE's NOT gates too and convert this kit into a real Frankenstein circuit.

Wrap-Up

This Velleman MK152 kit turned out to be a really fun project to start learning about both analog and digital circuitry.  Once I realized that IC datasheets are easily and freely found online nowadays, the idea of understanding the circuit became tractable.  This gave me a basis on which I could experiment; I could easily prod different segments with a multimeter, try to guess what would happen if I removed or replaced a component, then actually perform the experiment.  For example, I found that messing with the C2 and C3 capacitors changes how long and how quickly the roulette wheel spins, and sticking a passive piezo buzzer in parallel with the CLK signal adds roulette wheel-like sound effects too.

This kit is really a neat demonstration of a digital circuit using pretty simple analog and digital components.  What's more, it's a great boilerplate design for how analog components like resistors and capacitors can work with the Raspberry Pi.  The decade counter and inverter DIPs are also versatile components that can be used in other projects; this contrasts with many of the electronics kits that ship with a full microcontroller which, despite being able to perform more complex tasks, are truly black boxes.  And because microcontrollers cost more than simple ICs, these more versatile kits also wind up being cheaper, making them an economical way to build up a parts collection too.

If nothing else, messing with this kit along with my Raspberry Pi was a good excuse to get familiar with basic electronics and get in some practice programming GPIO.  Assembly and basic testing fit into an afternoon, but there is still plenty of opportunity to experiment and expand after that.  

Reviewing the state of the art of burst buffers

If you're interested in burst buffers and happen to be a student, please reach out and contact me! We have an internship opportunity in performance analysis of our 1.8 PB/1.5 TB/sec burst buffer for students of all levels of experience.
Just over two years ago I attended my first DOE workshop as a guest representative of the NSF supercomputing centers, and I wrote a post that summarized my key observations of how the DOE was approaching the increase in data-intensive computing problems.  At the time, the most significant thrusts seemed to be
  1. understanding scientific workflows to keep pace with the need to process data in complex ways
  2. deploying burst buffers to overcome the performance limitations of spinning disk relative to the increasing scale of simulation data
  3. developing methods and processes to curate scientific data
Here we are now two years later, and these issues still take center stage in the discussion surrounding the future of  data-intensive computing.  The DOE has made significant progress in defining its path forward in these areas though, and in particular, both the roles of burst buffers and scientific workflows have a much clearer focus on DOE’s HPC roadmap.  Burst buffers in particular are becoming a major area of interest since they are now becoming commercially available, so in the interests of updating some of the incorrect or incomplete thoughts I wrote about two years ago, I thought I'd write about the current state of the art in burst buffers in HPC.

Two years ago I had observed that there were two major camps in burst buffer implementations: one that is more tightly integrated with the compute side of the platform that utilizes explicit allocation and use, and another that is more closely integrated with the storage subsystem and acts as a transparent I/O accelerator.  Shortly after I made that observation though, Oak Ridge and Lawrence Livermore announced their GPU-based leadership systems, Summit and Sierra, which would feature a new type of burst buffer design altogether that featured on-node nonvolatile memory.

This CORAL announcement, combined with the deployment of production, large-scale burst buffers at NERSC, Los Alamos, and KAUST, has led me to re-think my taxonomy of burst buffers.  Specifically, it really is important to divide burst buffers into their hardware architectures and software usage modes; different burst buffer architectures can provide the same usage modalities to users, and different modalities can be supported by the same architecture.

For the sake of laying it all out, let's walk through the taxonomy of burst buffer hardware architectures and burst buffer software usage modalities.

Burst Buffer Hardware Architectures

First, consider your typical medium- or large-scale HPC system architecture without a burst buffer:


In this design, you have

  • Compute Nodes (CN), which might be commodity whitebox nodes like the Dell C6320 nodes in SDSC's Comet system or Cray XC compute blades
  • I/O Nodes (ION), which might be commodity Lustre LNET routers (commodity clusters), Cray DVS nodes (Cray XC), or CIOD forwarders (Blue Gene)
  • Storage Nodes (SN), which might be Lustre Object Storage Servers (OSSes) or GPFS Network Shared Disk (NSD) servers
  • The compute fabric (blue lines), which is typically Mellanox InfiniBand, Intel OmniPath, or Cray Aries
  • The storage fabric (red lines), which is typically Mellanox InfiniBand or Intel OmniPath

Given all these parts, there are a bunch of different places you can stick flash devices to create a burst buffer.  For example...

ION-attached Flash

You can put SSDs inside IO nodes, resulting in an ION-attached flash architecture that looks like this:


Gordon, which was the first large-scale deployment of what one could call a burst buffer, had this architecture.  The flash was presented to the compute nodes as block devices using iSCSI, and a compute node could have anywhere between zero and sixteen SSDs mounted to it entirely via software.  More recently, the Tianhe-2 system at NUDT also deployed this architecture and exposes the flash to user applications via their H2FS middleware.

Fabric-attached Flash

A very similar architecture is to add specific burst buffer nodes on the compute fabric that don't route I/O, resulting in a fabric-attached flash architecture:

Like the ION-attached flash design of Gordon, the flash is still embedded within the compute fabric and is logically closer to the compute nodes than the storage nodes.  Cray's DataWarp solution uses this architecture.

Because the flash is still on the compute fabric, this design is very similar to ION-attached flash, and the decision to choose it over the ION-attached flash design is mostly non-technical.  It can be more economical to embed flash directly in I/O nodes if those nodes have enough peripheral ports (or physical space!) to support the NICs for the compute fabric, the NICs for the storage fabric, and the flash devices.  However, as flash technology moves away from being attached via SAS and towards being directly attached to PCIe, it becomes more difficult to stuff that many high-performance peripherals into a single box without creating an imbalance somewhere.  As such, it is likely that fabric-attached flash architectures will replace ION-attached flash going forward.

Fortunately, any burst buffer software designed for ION-attached flash designs will also probably work on fabric-attached flash designs just fine.  The only difference is that the burst buffer software will no longer have to compete against the I/O routing software for on-node resources like memory or PCIe bandwidth.

CN-attached Flash

A very different approach to building burst buffers is to attach a flash device to every single compute node in the system, resulting in a CN-attached flash architecture:


This design is neither superior nor inferior to the ION/fabric-attached flash design.  The advantages it has over ION/fabric-attached flash include

  • Extremely high peak I/O performance - The peak performance scales linearly with the number of compute nodes, so the larger your job, the more performance your job can have.
  • Very low variation in I/O performance - Because each compute node has direct access to its locally attached SSD, contention on the compute fabric doesn't affect I/O performance.
However, these advantages come at a cost:
  • Limited support for shared-file I/O - Because each compute node doesn't share its SSD with other compute nodes, having many compute nodes write to a single shared file is not a straightforward process.  The solutions to this issue range from such N-1 style I/O simply being impossible (the default case), to relying on I/O middleware like the SCR library to manage data distribution, to relying on sophisticated I/O services like Intel CPPR to essentially journal all I/O to the node-local flash and flush it to the parallel file system asynchronously.
  • Data movement outside of jobs becomes difficult - Burst buffers allow users to stage data into the flash before their job starts and stage data back to the parallel file system after their job ends.  However in CN-attached flash, this staging will occur while someone else's job might be using the node.  This can cause interference, capacity contention, or bandwidth contention.  Furthermore, it becomes very difficult to persist data on a burst buffer allocation across multiple jobs without flushing and re-staging it.
  • Node failures become more problematic - The point of writing out a checkpoint file is to allow you to restart a job in case one of its nodes fails.  If your checkpoint file is actually stored on one of the nodes that failed, though, the whole checkpoint gets lost when a node fails.  Thus, it becomes critical to flush checkpoint files to the parallel file system as quickly as possible so that your checkpoint file is safe if a node fails.  Realistically though, most application failures are not caused by node failures; a study by LLNL found that 85% of job interrupts do not take out the whole node.
  • Performance cannot be decoupled from job size - Since you get more SSDs by requesting more compute nodes, there is no way to request only a few nodes and a lot of SSDs.  While this is less an issue for extremely large HPC jobs whose I/O volumes typically scale linearly with the number of compute nodes, data-intensive applications often have to read and write large volumes of data but cannot effectively use a huge number of compute nodes.
If you take a step back and look at what these strengths and weaknesses play to, you might be able to envision what sort of supercomputer design might be best suited for this type of architecture:
  • Relatively low node count, so that you aren't buying way more SSD capacity or performance than you can realistically use given the bandwidth of the parallel file system to which the SSDs must eventually flush
  • Relatively beefy compute nodes, so that the low node count doesn't hurt you and so that you can tolerate running I/O services to facilitate the asynchronous staging of data and middleware to support shared-file I/O
  • Relatively beefy network injection bandwidth, so that asynchronous stage in/out doesn't severely impact the MPI performance of the jobs that run before/after yours
There are also specific application workloads that are better suited to this CN-attached flash design:
  • Relatively large job sizes on average, so that applications routinely use enough compute nodes to get enough I/O bandwidth.  Small jobs may be better off using the parallel file system directly, since parallel file systems can usually deliver more I/O bandwidth to smaller compute node counts.
  • Relatively low diversity of applications, so that any applications that rely on shared-file I/O (which is not well supported by CN-attached flash, as we'll discuss later) can either be converted into using the necessary I/O middleware like SCR, or can be restructured to use only file-per-process or not rely on any strong consistency semantics.
And indeed, if you look at the systems that are planning on deploying this type of CN-attached flash burst buffer in the near future, they all fit this mold.  In particular, the CORAL Summit and Sierra systems will be deploying these burst buffers at extreme scale, and before them, Tokyo Tech's Tsubame 3.0 will as well.  All of these systems derive the majority of their performance from GPUs, leaving the CPUs with the capacity to implement more functionality of their burst buffers in software on the CNs.

Storage Fabric-attached Flash

The last notable burst buffer architecture involves attaching the flash on the storage fabric rather than the compute fabric, resulting in SF-attached flash:


This is not a terribly popular design because
  1. it moves the flash far away from the compute node, which is counterproductive to low latency
  2. it requires that the I/O forwarding layer (the IONs) support enough bandwidth to saturate the burst buffer, which can get expensive
However, for those HPC systems with custom compute fabrics that are not amenable to adding third-party burst buffers, this may be the only possible architecture.  For example, the Argonne Leadership Computing Facility has deployed a high-performance GPFS file system as a burst buffer alongside their high-capacity GPFS file system in this fashion because it is impractical to integrate flash into their Blue Gene/Q's proprietary compute fabric.  Similarly, sites that deploy DDN's Infinite Memory Engine burst buffer solution on systems with proprietary compute fabrics (e.g., Cray Aries on Cray XC) will have to deploy their burst buffer nodes on the storage fabric.

Burst Buffer Software

Ultimately, all of the different burst buffer architectures still amount to sticking a bunch of SSDs into a supercomputing system, and if that were all it took to make a burst buffer, burst buffers wouldn't be very interesting.  Thus, there is another half of the burst buffer ecosystem: the software and middleware that transform a pile of flash into an I/O layer that applications can actually use productively.

In the absolute simplest case, this software layer can just be an XFS file system atop RAIDed SSDs that is presented to user applications as node-local storage.  And indeed, this is what SDSC's Gordon system did; for many workloads such as file-per-process I/O, it is a suitable way to get great performance.  However, as commercial vendors have gotten into the burst buffer game, they have all started using this software layer to differentiate their burst buffer solutions from their competitors'.  This has resulted in modern burst buffers now having a lot of functionality that allow users to do interesting new things with their I/O.

Because this burst buffer differentiation happens entirely in software, it should be no surprise that these burst buffer software solutions look a lot like the software-defined storage products being sold in the enterprise cloud space.  The difference is that burst buffer software can be optimized specifically for HPC workloads and technologies, resulting in much nicer and accessible ways in which they can be used by HPC applications.

Common Software Features

Before getting too far, it may be helpful to enumerate the features common to many burst buffer software solutions:
  • Stage-in and stage-out - Burst buffers are designed to make a job's input data already be available on the burst buffer immediately when the job starts, and to allow the flushing of output data to the parallel file system after the job ends.  To make this happen, the burst buffer service must give users a way to indicate what files they want to be available on the burst buffer when they submit their job, and they must also have a way to indicate what files they want to flush back to the file system after the job ends.
  • Background data movement - Burst buffers are also not designed to be long-term storage, so their reliability can be lower than the underlying parallel file system.  As such, users must also have a way to tell the burst buffer to flush intermediate data back to the parallel file system while the job is still running.  This should happen using server-to-server copying that doesn't involve the compute node at all.
  • POSIX I/O API compatibility - The vast majority of HPC applications rely on the POSIX I/O API (open/close/read/write) to perform I/O, and most job scripts rely on tools developed for the POSIX I/O API (cd, ls, cp, mkdir).  As such, all burst buffers provide the ability to interact with data through the POSIX I/O API so that they look like regular old file systems to user applications.  That said, the POSIX I/O semantics might not be fully supported; as will be described below, you may get an I/O error if you try to perform I/O in a fashion that is not supported by the burst buffer.
With all this being said, there are still a variety of ways in which these core features can be implemented into a complete burst buffer software solution.  Specifically, burst buffers can be accessed through one of several different modes, and each mode provides a different balance of peak performance and usability.

Transparent Caching Mode

The most user-friendly burst buffer mode uses flash to simply act as a giant cache for the parallel file system, which I call transparent caching mode.  Applications see the burst buffer as a mount point on their compute nodes; this mount point mirrors the contents of the parallel file system, and any changes made to one appear on the other.  For example,

$ ls /mnt/lustre/glock
bin project1 project2 public_html src

### Burst buffer mount point contains the same stuff as Lustre
$ ls /mnt/burstbuffer/glock
bin project1 project2 public_html src

### Create a file on Lustre...
$ touch /mnt/lustre/glock/hello.txt

$ ls /mnt/lustre/glock
bin hello.txt project1 project2 public_html src

### ...and it automatically appears on the burst buffer.
$ ls /mnt/burstbuffer/glock
bin hello.txt project1 project2 public_html src

### However its contents are probably not on the burst buffer's flash
### yet since we haven't read its contents through the burst buffer
### mount point, which is what would cause it to be cached

However, if I access a file through the burst buffer mount (/mnt/burstbuffer/glock) rather than the parallel file system mount (/mnt/lustre/glock),
  1. if hello.txt is already cached on the burst buffer's SSDs, it will be read directly from flash
  2. if hello.txt is not already cached on the SSDs, the burst buffer will read it from the parallel file system, cache its contents on the SSDs, and return its contents to me
Similarly, if I write to hello.txt via the burst buffer mount, my data will be cached to the SSDs and will not immediately appear on the parallel file system.  It will eventually flush out to the parallel file system, or I could tell the burst buffer service to explicitly flush it myself.

This transparent caching mode is by far the easiest, since it looks exactly like the parallel file system for all intents and purposes.  However if you know that your application will never read any data more than once, it's far less useful in this fully transparent mode.  As such, burst buffers that implement this mode provide proprietary APIs that allow you to stage-in data, control the caching heuristics, and explicitly flush data from the flash to the parallel file system.  

DDN's Infinite Memory Engine and Cray's DataWarp both implement this transparent caching mode, and, in principle, it can be implemented on any of the burst buffer architectures outlined above.

Private PFS Mode

Although the transparent caching mode is the easiest to use, it doesn't give users a lot of control over what data does or doesn't need to be staged into the burst buffer.  Another access mode involves creating a private parallel file system on-demand for jobs, which I will call private PFS mode.  It provides a new parallel file system that is only mounted on your job's compute nodes, and this mount point contains only the data you explicitly copy to it:

### Burst buffer mount point is empty; we haven't put anything there,
### and this file system is private to my job
$ ls /mnt/burstbuffer

### Create a file on the burst buffer file system...
$ dd if=/dev/urandom of=/mnt/burstbuffer/mydata.bin bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.776115 s, 13.5 MB/s

### ...it appears on the burst buffer file system...
$ ls -l /mnt/burstbuffer
-rw-r----- 1 glock glock 10485760 Jan 1 00:00 mydata.bin

### ...and Lustre remains entirely unaffected
$ ls /mnt/lustre/glock
bin project1 project2 public_html src

This is a little more complicated than transparent caching mode because you must now manage two file system namespaces: the parallel file system and your private burst buffer file system.  However this gives you the option to target your I/O to one or the other, so that a tiny input deck can stay on Lustre while your checkpoints are written out to the burst buffer file system.

In addition, the burst buffer private file system is strongly consistent; as soon as you write data out to it, you can read that data back from any other node in your compute job.  While this is true of transparent caching mode if you always access your data through the burst buffer mount point, you can run into trouble if you accidentally try to read a file from the original parallel file system mount point after writing out to the burst buffer mount.  Since private PFS mode provides a completely different file system and namespace, it's a bit harder to make this mistake.

Cray's DataWarp implements private PFS mode, and the Tsubame 3.0 burst buffer will be implementing private PFS mode using on-demand BeeGFS.  This mode is most easily implemented on fabric/ION-attached flash architectures, but Tsubame 3.0 is demonstrating that it can also be done on CN-attached flash.

Log-structured/Journaling Mode

As probably the least user-friendly but highest-performing use mode, log-structured (or journaling) mode burst buffers present themselves to users like a file system, but they do not support the full extent of file system features.  Under the hood, writes are saved to the flash not as files, but as records that contain a timestamp, the data to be written, and the location in the file to which the data should be written.  These logs are continually appended as the application performs its writes, and when it comes time to flush the data to the parallel file system, the logs are replayed to effectively reconstruct the file that the application was trying to write.
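
To make this concrete, here is a toy sketch in Python of the idea--purely illustrative and in no way representative of any vendor's actual implementation--showing how random writes become appended records and how a flush replays them:

import time

log = []                                   # one append-only log per writer

def logged_write(offset, data):
    # a "write" just appends a record; nothing is modified in place
    log.append((time.time(), offset, data))

def flush(log, file_size):
    # replaying the records in timestamp order reconstructs the file
    buf = bytearray(file_size)
    for _, offset, data in sorted(log, key=lambda rec: rec[0]):
        buf[offset:offset + len(data)] = data
    return bytes(buf)

# random-offset writes land on flash as purely sequential appends...
logged_write(4096, b"BBBB")
logged_write(0, b"AAAA")

# ...and the eventual flush to the parallel file system replays the log
reconstructed = flush(log, 8192)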

This can perform extremely well since even random I/O winds up being restructured as sequentially appended I/O.  Furthermore, there can be as many logs as there are writers; this allows writes to happen with zero lock contention, since conflicting writes are resolved when the data is replayed and flushed.

Unfortunately, log-structured writes make reading very difficult, since the read can no longer seek directly to a file offset to find the data it needs.  Instead, the log needs to be replayed to some degree, effectively forcing a flush to occur.  Furthermore, if the logs are spread out across different logical flash domains (as would happen in CN-attached flash architectures), read-back may require the logs to be centrally collected before the replay can happen, or it may require inter-node communication to coordinate who owns the different bytes that the application needs to read.

What this amounts to is functionality that may present itself like a private parallel file system burst buffer, but behaves very differently on reads and writes.  For example, attempting to read the data that exists in a log that doesn't belong to the writer might generate an I/O error, so applications (or I/O middleware) probably need to have very well-behaved I/O to get the full performance benefits of this mode.  Most extreme-scale HPC applications already do this, so log-structured/journaling mode is a very attractive approach for very large applications that rely on extreme write performance to checkpoint their progress.

Log-structured/journaling mode is well suited for CN-attached flash since logs do not need to live on a file system that presents a single shared namespace across all compute nodes.  In practice, the IBM CORAL systems will probably provide log-structured/journaling mode through IBM's burst buffer software.  Oak Ridge National Laboratory has also demonstrated a log-structured burst buffer system called BurstMem on a fabric-attached flash architecture.  Intel's CPPR library, to be deployed with the Argonne Aurora system, may also implement this functionality atop the 3D XPoint to be embedded in each compute node.

Other Modes

The above three modes are not the only ones that burst buffers may implement, and some burst buffers support more than one of the above modes.  For example, Cray's DataWarp, in addition to supporting private PFS and transparent caching modes, also has a swap mode that allows compute nodes to use the flash as swap space to prevent hard failures for data analysis applications that consume non-deterministic amounts of memory.  In addition, Intel's CPPR library is targeting byte-addressable nonvolatile memory which would expose a load/store interface, rather than the typical POSIX open/write/read/close interface, to applications.

Outlook

Burst buffers, practically speaking, remain in their infancy, and there is a lot of room for the landscape I've outlined here to change.  For example, the common software features I highlighted (staging, background data movement, and POSIX API support) are still largely implemented via proprietary, non-standard APIs at present.  There is effort to get burst buffer vendors to agree to a common API, and as this process proceeds, features may appear or disappear as customers define what is and isn't a worthwhile differentiating feature.

On the hardware front, the burst buffer ecosystem is also in flux.  ION-attached flash is where burst buffers began, but as discussed above, they are likely to be replaced by dedicated fabric-attached flash servers.  In addition, the emergence of storage-class memory (that is, byte-addressable nonvolatile memory) will also add a new dimension to burst buffers that may make one architecture the clear winner over the others.  At present though, both fabric-attached and CN-attached burst buffers have their strengths and weaknesses, and neither is at risk of disappearing in the next five years.

As more extreme-scale systems begin to hit the floor and users figure out what does and doesn't work across the diversity of burst buffer hardware and software features, the picture is certain to become clearer.  Once that happens, I'll be sure to post another update.

A less-biased look at tape versus disks


Executive Summary

Tape isn't dead despite what object store vendors may tell you, and it still plays an important role in both small- and large-scale storage environments.  Disk-based object stores certainly have eroded some of the areas where tape has historically been the obvious choice, but in the many circumstances where low latency is not required and high cost cannot be tolerated, tape remains a great option.

This post is a technical breakdown of some of the misconceptions surrounding the future of tape in the era of disk-based object stores as expressed in a recent blog post from an object store vendor's chief marketing officer.  Please note that the opinions stated below are mine alone and not a reflection of my employer or the organizations and companies mentioned.  I also have no direct financial interests in any tape, disk, or object store vendors or technologies.

Introduction

IBM 701 tape drive--what many people picture when they hear about tape-based storage.  It's really not still like this, I promise.
Scality, an object store software vendor whose product relies on hard disk-based (HDD-based) storage, recently posted a marketing blog post claiming that tape is finally going to die and disk is the way of the future.  While I don't often rise to the bait of marketing material, tape takes a lot more flak than it deserves because of how old a technology it is.  There is no denying that tape is old--it actually precedes the first computers by decades, and digital tape recording goes back to the early 1950s.  Like it or not though, tape technology is about as up-to-date as HDD technology (more on this later), and you're likely still using tape on a regular basis whether you like it or not.  For example, Google relies on tape to archive your everyday data, including Gmail, because, in terms of cost per bit and power consumption, tape will continue to beat disk for years to come.  So in the interests of sticking up for tape, for both its good and its bad, let's walk through Scality's blog post, authored by their chief of marketing Paul Turner, and tell the other side of the story.

1. Declining Tape Revenues

Mr. Turner starts by pointing out that "As far back as 2010, The Register reported a 25% decline in tape drive and media sales." This decrease is undeniably true:

Market trends for LTO tape, 2008-2015.  Data from the Santa Clara Consulting Group, presented at MSST 2016 by Bob Fontana (IBM)

Although tape revenue has been decreasing, an increasing amount of data is landing on tape.  How can these seemingly contradictory trends be reconciled?

The reality is that the tape industry at large is not technologically limited like CPU processors, flash storage, or even spinning disk.  Rather, the technology that underlies both the magnetic tape media and the drive heads that read and write that media is actually lifted over from the HDD industry.  That is, the hottest tape drives on the market today are using technology that the HDD industry figured out years ago.  As such, even if HDD innovation completely halted overnight, the tape industry would still be able to release new products for at least one or two more technology generations.

This is all to say that the rate at which new tape technologies reach market is not limited by the rate of innovation in the underlying storage technology.  Tape vendors simply lift HDD innovations into new tape products when it becomes optimally profitable to do so, so declining tape revenues simply mean that the cadence of the tape technology refresh will stretch out.  While this certainly widens the gap between HDD and tape and suggests a slow down-ramping of tape as a storage medium, you cannot simply extrapolate these market trends in tape down to zero.  The tape industry simply doesn't work like that.

2. The Shrinking Tape Vendor Ecosystem

Mr. Turner goes on to cite an article published in The Register about Oracle's EOL of the StorageTek line of enterprise tape:
"While this falls short of a definitive end-of-life statement, it certainly casts serious doubt on the product’s future. In fairness, we’ll note that StreamLine is a legacy product family originally designed and built for mainframes. Oracle continues to promote the open LTO tape format, which is supported by products from IBM, HPE, Quantum, and SpectraLogic."
To be fair, Mr. Turner deserves credit for pointing out that StorageTek (which was being EOL'ed) and LTO are different tape technologies, and Oracle continues to support LTO.  But let's be clear here--the enterprise (aka mainframe) tape market has been roughly only 10% of the global tape market by exabytes shipped, and even then, IBM and Oracle have been the only vendors in this space.  Oracle's exit from the enterprise tape market is roughly analogous to Intel recently EOL'ing Itanium with the 9700-series Kittson chips in that a boutique product is being phased out in favor of a product that hits a much wider market.

3. The Decreasing Cost of Disk

Mr. Turner goes on to cite a Network Computing article:
"In its own evaluation of storage trends, including the increasing prevalence of cloud backup and archiving, Network Computing concludes that “…tape finally appears on the way to extinction.” As evidence, they cite the declining price of hard disks,"
Hard disk prices decrease on a cost per bit basis, but there are a few facts that temper the impact of this trend:

Point #1: HDDs include both the media and the drive that reads the media.  This makes the performance of HDDs scale a lot more quickly than tape, but it also means HDDs have a price floor of around $40 per device.  The cost of the read heads, voice coil, and drive controller are not decreasing.  When compared to the tape cartridges of today (whose cost floor is limited by the magnetic tape media itself) or the archival-quality flash of tomorrow (think of how cheaply thumb drives can be manufactured), HDD costs don't scale very well.  And while one can envision shipping magnetic disk platters that rely on external drives to drive the cost per bit down, such a solution would look an awful lot like a tape archive.

Point #2: The technology that underpins the bit density of hard drives has been rapidly decelerating.  The ultra high-density HDDs of today seem to have maxed out at around 1 terabit per square inch using perpendicular magnetic recording (PMR) technology, so HDD vendors are just cramming more and more platters into individual drives.  As an example, Seagate's recently unveiled 12 TB PMR drives contain an astounding eight platters and sixteen drive heads; their previous 10 TB PMR drives contained seven platters, and their 6 TB PMR drives contained five platters.  Notice a trend?

There are truly new technologies that radically change the cost-per-bit trajectory for hard drives which include shingled magnetic recording (SMR), heat-assisted magnetic recording (HAMR), and bit-patterned media (BPM).  However, SMR's severe performance limitations for non-sequential writes make them a harder sell as a wholesale replacement for tape.  HAMR and BPM hold much more universal promise, but they simply don't exist as products yet and therefore simply don't compete with tape.  Furthermore, considering our previous discussion of how tape technology evolves, the tape industry has the option to adopt these very same technologies to drive down the cost-per-bit of tape by a commensurate amount.

4. The Decreasing Cost of Cloud

Mr. Turner continues citing the Network Computing article, making the bold claim that two other signs of the end of tape are
"...the ever-greater affordability of cloud storage,"
This is deceptive.  The cloud is not a charitable organization; their decreasing costs are a direct reflection of the decreasing cost per bit of media, which are savings that are realized irrespective of whether the media is hosted by a cloud provider or on-premise.  To be clear, the big cloud providers are definitely also reducing their costs by improving their efficiencies at scale; however, these savings are transferred to their customers only to the extent that they can be price competitive with each other.  My guess, which is admittedly uneducated, is that most of these cost savings are going to shareholders, not customers.
"and the fact that cloud is labor-free."
Let's be real here--labor is never "free" in the context of data management.  It is true that you don't need to pay technicians to swap disks in your datacenter if you have no tape (or no datacenter).  However, it's a bit insulting to presume that the only labor done by storage engineers is replacing disks.  Storage requires babysitting regardless of whether it lives in the cloud or on-premise, and regardless of whether it is backed by tape or disk.  It needs to be integrated with the rest of a company's infrastructure and operations, and this is where the principal opex of storage should be spent.  Any company that actually has to scale personnel linearly with storage is doing something terribly wrong, and making the choice to migrate to the cloud to save opex is likely putting a band-aid over a much bigger wound.

Finally, this cloud-tape argument conflates disk as a technology and cloud as a business model.  There's nothing preventing tape from existing in the cloud; in fact, the Oracle Cloud does exactly this and hosts archival data in StorageTek archives at absolute rock-bottom prices--$0.001/GB, which shakes out to $1,000 per month to host a petabyte of archive.  Amazon Glacier also offers a tape-like performance and cost balance relative to its disk-based offerings.  The fact that you don't have to see the tapes in the cloud doesn't mean they don't exist and aren't providing you value.

5. The Performance of Archival Disk over Tape

The next argument posed by Mr. Turner is the same one that people have been using to beat up on tape for decades:
"...spotlighting a tape deficit that’s even more critical than price: namely, serial--and glacially slow--access to data."
This was a convincing argument back in the 1980s, but to be frank, it's really tired at this point.  If you are buying tape for low latency, you are doing something wrong.

As I discussed above, tape's benefits lie in its
  1. rock-bottom cost per bit, achievable because it uses older magnetic recording technology and does not package the drive machinery with the media like disk does, and
  2. total cost of ownership, which is due in large part to the fact that it does not draw power when data is at rest.
I would argue that if 
  1. you don't care about buying the cheapest bits possible (for example, if the cost of learning how to manage tape outweighs the cost benefits of tape at your scale), or
  2. you don't care about keeping power bills low (for example, if your university foots the power bill)
there are definitely better options for mass storage than tape.  Furthermore, if you need to access any bit of your data at nearline speeds, you should definitely be buying nearline storage media.  Tape is absolutely not nearline, and it would just be the wrong tool for the job.

However, tape remains the obvious choice in cases where data needs to be archived or a second copy has to be retained offline.  In both cases--offline second copy and offline archive--storing data in nearline storage often just doesn't make economic sense since the data is not being frequently accessed.

However, it is critical to point out that there are scales at which using tape does not make great sense. Let's break these scales out and look at each:

At small scales where the number of cartridges is on the same order as the number of drives (e.g., a single drive with a handful of cartridges), tape is not too difficult to manage.  At these scales, such as those which might be found in a small business' IT department, performing offline backups of financials to tape is a lot less expensive than continually buying external USB drives and juggling them.

At large scales where the number of cartridges is far larger than the number of drives (e.g., in a data-driven enterprise or large-scale scientific computing complex), tape is also not too difficult to manage.  The up-front cost of tape library infrastructure and robotics is amortized by the annual cost of media, and sophisticated data management software (more on this below!) prevents humans from having to juggle tapes manually.

At medium scales, tape can be painful.  If the cost of libraries and robotics is difficult to justify when compared to the cost of the media (and therefore has a significant impact on the net $/GB of tape), you wind up having to pay people to do the job of robots in managing tapes.  This is a dangerous way to operate, as you are tickling the upper limits of how far you can scale people and you have to carefully consider how much runway you've got before you are better off buying robotics, disks, or cloud-based resources.

6. The Usability of Archival Disk over Tape

The Scality post then begins to paint with broad strokes:
"To access data from a disk-based archive, you simply search the index, click on the object or file you want, and presto, it’s yours.  By contrast, pulling a specific file from tape is akin to pulling teeth. First, you physically comb through a pile of cartridges, either at a remote site or by having them trucked to you."
The mistake that Mr. Turner makes here is conflating disk media with archival software.  Tape archives come with archival software just like disk archives do.  For example, HPSS indexes metadata from objects stored on tape in a DB2 database.  There's no "pulling teeth" to "identify a cartridge that seems to contain what you're looking for" and no "manually scroll[ing] through to pinpoint and retrieve the data."

Data management software systems including HPSS, IBM's Spectrum Protect, Cray's TAS, and SGI's DMF all provide features that can make your tape archive look an awful lot like an object store if you want them to.  The logical semantics of storing data on disks versus tape are identical--you put some objects into an archive, and you get some objects out later.  The only difference is the latency of retrieving data on a tape.

That said, these archival software solutions also allow you to use both tape and disk together to ameliorate the latency hit of retrieving warmer data from the archive based on heuristics, data management policies, or manual intervention.  In fact, they provide S3 interfaces too, so you can make your tapes and disk-based object stores all look like one archive--imagine that!

What this all boils down to is that the perceived usability of tape is a function of the software on top of it, not the fact that it's tape and not disk.

7. Disks Enable Magical Business Intelligence

The Scality post tries to drive the last nail in the coffin of tape by conjuring up tales of great insight enabled by disk:
"...mountains of historical data are a treasure trove of hidden gems—patterns and trends of purchasing activity, customer preferences, and user behavior that marketing, sales, and product development can use to create smarter strategies and forecasts."
and
"Using disk-based storage, you can retrieve haystacks of core data on-demand, load it into analytic engines, and emerge with proverbial “needles” of undiscovered business insight."
which is to imply that tape is keeping your company stupid, and migrating to disk will propel you into a world of deep new insights:

Those of us doing statistical analysis on a daily basis keep this xkcd comic taped to our doors and pinned to our cubes.  We hear it all the time.

This is not to say that the technological sentiment expressed by Mr. Turner is wrong; if you have specific analyses you would like to perform over massive quantities of data on a regular basis, hosting that data in offline tape is a poor idea.  But if you plan on storing your large archive on disk because you might want to jump on the machine learning bandwagon someday, realize that you may be trading significant, guaranteed savings on media for a very poorly defined opportunity cost.  This tradeoff may be worth the risk in some early-stage, fast-moving startups, but it is unappetizing in more conservative organizations.

I also have to point out that "[g]one are the days when data was retained only for compliance and auditing" is being quite dramatic and disconnected from the realities of data and lifecycle management.  A few anecdotes:

  • Compliance: The United States Department of Energy and the National Science Foundation both have very specific guidance regarding the retention and management of data generated during federally funded research.  At the same time, extra funding is generally not provided to help support this data management, so eating the total cost of ownership of storing such data on disk over tape can be very difficult to justify when there is no funding to maintain compliance, let alone perform open-ended analytics on such data.
  • Auditing: Keeping second copies of data critical to business continuity is often a basic requirement in demonstrating due diligence.  In data-driven companies and enterprises, it can be difficult to rationalize keeping the second archival copy of such data nearline.  Again, it comes down to figuring out the total cost of ownership.
That said, the sentiment expressed by Mr. Turner is not wrong, and there are a variety of cases where keeping archival data nearline has clear benefits:
  • Cloud providers host user data on disk because they cannot predict when a user may want to look at an e-mail they received in 2008.  While it may cost more in media, power, and cooling to keep all users' e-mails nearline, being able to deliver desktop-like latency to users in a scalable way can drive significantly larger returns.  The technological details driving this use case have been documented in a fantastic whitepaper from Google.
  • Applying realtime analytics to e-commerce is a massive industry that is only enabled by keeping customer data nearline.  Cutting through the buzz and marketing surrounding this space, it's pretty darned neat that companies like Amazon, Netflix, and Pandora can suggest things to me that I might actually want to buy or consume.  These sorts of analytics could not happen if my purchase history was archived to tape.

Tape's like New Jersey - Not Really That Bad

Mr. Turner turns out to be the Chief Marketing Officer of Scality, a company that relies on disk to sell its product.  The greatest amount of irony, though, comes from the following statement of his:
"...Iron Mountain opines that tape is best. This is hardly a surprising conclusion from a provider of offsite tape archive services. It just happens to be incorrect."
Takeoff from Newark Liberty International Airport--what most people think of New Jersey.  It's really not all like this, I promise.
I suppose I shouldn't have been surprised that a provider of disk-dependent archival storage would conclude that tape is dead and disks are the future, and I shouldn't have risen to the bait.  But, like my home state of New Jersey, tape is a great punching bag for people with only a cursory knowledge of it.  Just as Newark Airport shapes most people's opinions of New Jersey, old images of reel-to-reel PDP-11s and audio cassettes make it easy to trash tape as a digital storage medium.  And just as I will always feel unduly compelled to stick up for my home state, I can't help but fact-check people who want to beat up tape.

The reality is that tape really isn't that terrible, and there are plenty of aspects to it that make it a great storage technology.  Like everything in computing, understanding its strengths (its really low total cost) and weaknesses (its high access latency) is the best way to figure out if the costs of deploying or maintaining a tape-based archive make it a better alternative to disk-based archives.  For very small-scale or large-scale offline data archive, tape can be very cost effective.  As the Scality blog points out though, if you're somewhere in between, or if you need low-latency access to all of your data for analytics or serving user data, disk-based object storage may be a better value overall.

Many of Mr. Turner's points, if boiled down to their objective kernels, are not wrong.  Tape is on a slow decline in terms of revenue, and this may stretch out the cadence of new tape technologies hitting the market.  However, there will always be a demand for high-endurance, low-cost, offline archive no matter how good object stores become, and I have a difficult time envisioning a way in which tape completely implodes in the next ten years.  Just as spinning disk is rapidly disappearing from home PCs, tape may become even more of a boutique technology that primarily exists as the invisible backing store for a cloud-based archival solution.  I just don't buy into the doom and gloom, and I'll bet blog posts heralding the death of tape will keep coming for years to come.

Understanding I/O on the mid-2017 iMac

My wife recently bought me a brand new mid-2017 iMac to replace my ailing, nine-year-old HP desktop.  Back when I got the HP, I was just starting to learn about how computers really worked and didn't understand much about how the CPU connected to all of the other ports that came off the motherboard--everything that sat between the SATA ports and the CPU itself was a no-man's land of mystery to me.

Between then and now though, I've somehow gone from being a poor graduate student doing molecular simulation to a supercomputer I/O architect.  Combined with the fact that my new iMac had a bunch of magical new ports that I didn't understand (USB-C ports that can tunnel PCIe, USB 3.1, and Thunderbolt??), I figured I'd sit down and see if I could actually figure out exactly how the I/O subsystem on this latest Kaby Lake iMac was wired up.

I'll start out by saying that the odds were in my favor--over the last decade, the I/O subsystem of modern computers has gotten a lot simpler as more of the critical components (like the memory controllers and PCIe controllers) have moved on-chip.  As CPUs become more tightly integrated, individual CPU cores, system memory, and PCIe peripherals can all talk to each other without having to cross a bunch of proprietary middlemen like in days past.  Having to understand how the front-side bus clock is related to the memory channel frequency all gets swept under the rug that is the on-chip network, and I/O (that is, moving data between system memory and stuff outside of the CPU) is a lot easier.

With all that said, let's cut to the chase.  Here's a block diagram showing exactly how my iMac is plumbed, complete with bridges to external interfaces (like PCIe, SATA, and so on) and the bandwidths connecting them all:




Aside from the AMD Radeon GPU, just about every I/O device and interface hangs off of the Platform Controller Hub (PCH) through a DMI 3.0 connection.  When I first saw this, I was a bit surprised by how little I understood; PCIe makes sense since that is the way almost all modern CPUs (and their memory) talk to the outside world, but I'd never given the PCH a second thought, and I didn't even know what DMI was.

As with any complex system though, the first step towards figuring out how it all works is to break it down into simpler components.  Here's what I figured out.

Understanding the PCH

In the HPC world, all of the performance-critical I/O devices (such as InfiniBand channel adapters, NICs, SSDs, and GPUs) are directly attached to the PCIe controller on the CPU.  By comparison, the PCH is almost a non-entity in HPC nodes since all it does is provide low-level administrative interfaces like USB and VGA ports for crash carts.  It had never occurred to me that desktops, which are usually optimized for universality over performance, would depend so heavily on the rinky-dink PCH.

Taking a closer look at the PCIe devices that talk to the Sunrise Point PCH:



we can see that the PCH chip provides PCIe devices that act as

  • a USB 3.0 controller
  • a SATA controller
  • a HECI controller (which acts as an SMBus controller)
  • an LPC controller (which acts as an ISA controller)
  • a PCI bridge (0000:00:1b) (to which the NVMe drive, not a real PCI device, is attached)
  • a PCIe bridge (0000:00:1c) that breaks out three PCIe root ports
Logically speaking, these PCIe devices are all directly attached to the same PCIe bus (domain #0000, bus #00; abbreviated 0000:00) as the CPU itself (that is, the host bridge device #00, or 0000:00:00).  However, we know that the PCH, by definition, is not integrated directly into the on-chip network of the CPU (that is, the ring that allows each core to maintain cache coherence with its neighbors).  So how can this be?  Shouldn't there be a bridge that connects the CPU's bus (0000:00) to a different bus on the PCH?

Clearly the answer is no, and this is a result of Intel's proprietary DMI interface which connects the CPU's on-chip network to the PCH in a way that is transparent to the operating system.  Exactly how DMI works is still opaque to me, but it acts like an invisible PCIe bridge that glues together physically separate PCIe buses into a single logical bus.  The major limitation to DMI as implemented on Kaby Lake is that it only has the bandwidth to support four lanes of PCIe Gen 3.

Given that DMI can only support the traffic of a 4x PCIe 3.0 device (roughly 4 GB/sec), there is an interesting corollary: the NVMe device, which attaches to the PCH via a 4x PCIe 3.0 link itself, can theoretically saturate the DMI link.  In such a case, all other I/O traffic (such as that coming from the SATA-attached hard drive and the gigabit NIC) is either choked out by the NVMe device or competes with it for bandwidth.  In practice, very few NVMe devices can actually saturate a PCIe 3.0 4x link though, so unless you replace the iMac's NVMe device with an Optane SSD, this shouldn't be an issue.

Understanding Alpine Ridge

The other mystery component in the I/O subsystem is the Thunderbolt 3 controller (DSL6540), called Alpine Ridge.  These are curious devices that I still admittedly don't understand fully (they play no role in HPC) because, among other magical properties, they can tunnel PCIe to external devices.  For example, the Thunderbolt-to-Ethernet adapters widely available for MacBooks are actually fully fledged PCIe NICs, wrapped in a neat white plastic package, that tunnel PCIe signaling over a cable.  In addition, they can somehow deliver this PCIe signaling, DisplayPort, and USB 3.1 through a single self-configuring physical interface.

It turns out that being able to run multiple protocols over a single cable is a feature of the USB-C physical specification, which is a completely separate standard from USB 3.1.  However, the PCIe magic that happens inside Alpine Ridge is a result of an integrated PCIe switch which looks like this:



The Alpine Ridge PCIe switch connects up to the PCH with a single PCIe 3.0 4x link and provides four downstream 4x ports for peripherals.  If you read the product literature for Alpine Ridge, it advertises two of these 4x ports for external connectivity; the remaining two 4x ports are internally wired up to two other controllers:

  • an Intel 15d4 USB 3.1 controller.  Since USB 3.1 runs at 10 Gbit/sec, this 15d4 USB controller should support at least two USB 3.1 ports that can talk to the upstream PCH at full speed
  • a Thunderbolt NHI controller.  According to a developer document from Apple, NHI is the native host interface for Thunderbolt and is therefore the true heart of Alpine Ridge.
The presence of the NHI on the PCIe switch is itself kind of interesting; it's not a peripheral device so much as a bridge that allows non-PCIe peripherals to speak native Thunderbolt and still get to the CPU memory via PCIe.  For example, Alpine Ridge also has a DisplayPort interface, and it's likely that DisplayPort signals enter the PCIe subsystem through this NHI controller.

Although Alpine Ridge delivers some impressive I/O and connectivity options, it has some pretty critical architectural qualities that limit its overall performance in a desktop.  Notably,

  • Apple recently added support for external GPUs that connect to MacBooks through Thunderbolt 3.  While this sounds really awesome in the sense that you could turn a laptop into a gaming computer on demand, note that the best bandwidth you can get between an external GPU and the system memory is about 4 GB/sec, or the performance of a single PCIe 3.0 4x link.  This pales in comparison to the 16 GB/sec bandwidth available to the AMD Radeon which is directly attached to the CPU's PCIe controller in the iMac.
  • Except in the cases where Thunderbolt-attached peripherals are talking to each other via DMA, they all appear to compete with each other for access to the host memory through the single PCIe 4x upstream link.  4 GB/sec is a lot of bandwidth for most peripherals, but it does mean that an external GPU and a USB 3.1 external SSD or a 4K display will degrade each other's performance.
In addition, Thunderbolt 3 advertises 40 Gbit/sec performance, but PCIe 3.0 4x only provides 32 Gbit/sec.  Thus, it doesn't look like you can actually get 40 Gbit/sec from Thunderbolt all the way to system memory under any conditions; the peak Thunderbolt performance is only available between Thunderbolt peripherals.

Overall Performance Implications

The way I/O in the iMac is connected definitely introduces a lot of performance bottlenecks that would make this a pretty scary building block for a supercomputer.  The fact that the Alpine Ridge's PCIe switch has a 4:1 taper to the PCH, and the PCH then further tapers all of its peripherals to a single 4x link to the CPU, introduces a lot of cases where performance of one component (for example, the NVMe SSD) can depend on what another device (for example, a USB 3.1 peripheral) is doing.  The only component which does not compromise on performance is the Radeon GPU, which has a direct connection to the CPU and its memory; this is how all I/O devices in typical HPC nodes are connected.

With all that being said, the iMac's I/O subsystem is a great design for its intended use.  It effectively trades peak I/O performance for extreme I/O flexibility; whereas a typical HPC node would ensure enough bandwidth to operate an InfiniBand adapter at full speed while simultaneously transferring data to a GPU, it wouldn't support plugging in a USB 3.1 hard drive or a 4K monitor.

Plugging USB 3 hard drives into an HPC node is surprisingly annoying.  I've had to do this for bioinformaticians, and it involves installing a discrete PCIe USB 3 controller alongside high-bandwidth network controllers.

Curiously, as I/O becomes an increasingly prominent bottleneck in HPC though, we are beginning to see very high-performance and exotic I/O devices entering the market.  For example, IBM's BlueLink is able to carry a variety of protocols at extreme speeds directly into the CPU, and NVLink over BlueLink is a key technology enabling scaled-out GPU nodes in the OpenPOWER ecosystem.  Similarly, sophisticated PCIe switches are now proliferating to meet the extreme on-node bandwidth requirements of NVMe storage nodes.

Ultimately though, PCH and Thunderbolt aren't positioned well to become HPC technologies.  If nothing else, I hope this breakdown helps illustrate how performance, flexibility, and cost drive the system design decisions that make desktops quite different from what you'd see in the datacenter.

Appendix: Deciphering the PCIe Topology

Figuring out everything I needed to write this up involved a little bit of anguish.  For the interested reader, here's exactly how I dissected my iMac to figure out how its I/O subsystem was plumbed.

Foremost, I had to boot my iMac into Linux to get access to dmidecode and lspci since I don't actually know how to get at all the detailed device information from macOS.  From this,

ubuntu@ubuntu:~$ lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Device 591f
           +-01.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
           +-14.0  Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
           +-16.0  Intel Corporation Sunrise Point-H CSME HECI #1
           +-17.0  Intel Corporation Sunrise Point-H SATA controller [AHCI mode]
           +-1b.0-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
           +-1c.0-[03]----00.0  Broadcom Limited BCM43602 802.11ac Wireless LAN SoC
           +-1c.1-[04]--+-00.0  Broadcom Limited NetXtreme BCM57766 Gigabit Ethernet PCIe
           |            \-00.1  Broadcom Limited BCM57765/57785 SDXC/MMC Card Reader
...

we see a couple of notable things right away:

  • there's a single PCIe domain, numbered 0000
  • everything branches off of PCIe bus number 00
  • there are a bunch of PCIe bridges hanging off of bus 00 (which connect to bus numbers 01, 02, etc.)
  • there are a bunch of PCIe devices hanging off both bus 00 and the other buses, such as device 0000:00:14 (a USB 3.0 controller) and device 0000:01:00 (the AMD/ATI GPU)
  • at least one device (the GPU) has multiple PCIe functions (0000:01:00.0, a video output, and 0000:01:00.1, an HDMI audio output)

But lspci -t -v actually doesn't list everything that we know about.  For example, we know that there are bridges that connect bus 00 to the other buses, but we need to use lspci -vD to actually see the information those bridges provide to the OS:

ubuntu@ubuntu:~$ lspci -vD
0000:00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
DeviceName: SATA
Subsystem: Apple Inc. Device 0180
        ...
0000:00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        ...
Kernel driver in use: pcieport
0000:00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31) (prog-if 30 [XHCI])
Subsystem: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
        ...
Kernel driver in use: xhci_hcd
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c0) (prog-if 00 [VGA controller])
Subsystem: Apple Inc. Ellesmere [Radeon RX 470/480]
        ...
Kernel driver in use: amdgpu
0000:01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
        ...
Kernel driver in use: snd_hda_intel
This tells us more useful information:

  • Device 0000:00:00 is the PCIe host bridge--this is the endpoint that all PCIe devices use to talk to the CPU and, by extension, system memory (since the system memory controller lives on the same on-chip network that the PCIe controller and the CPU cores do)
  • The PCIe bridge connecting bus 00 and bus 01 (0000:00:01) is integrated into the PCIe controller on the CPU.  In addition, the PCI ID for this bridge is the same as the one used on Intel Skylake processors--not surprising, since Kaby Lake is an optimization (not re-architecture) of Skylake.
  • The two PCIe functions on the GPU--0000:01:00.0 and 0000:01:00.1--are indeed a video interface (as evidenced by the amdgpu driver) and an audio interface (snd_hda_intel driver).  Their bus id (01) also indicates that they are directly attached to the Kaby Lake processor's PCIe controller--and therefore enjoy the lowest latency and highest bandwidth available to system memory.
Finally, the Linux kernel's sysfs interface provides a very straightforward view of every PCIe device's connectivity by presenting the devices as symlinks:

ubuntu@ubuntu:/sys/bus/pci/devices$ ls -l
... 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
... 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
... 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
... 0000:00:16.0 -> ../../../devices/pci0000:00/0000:00:16.0
... 0000:00:17.0 -> ../../../devices/pci0000:00/0000:00:17.0
... 0000:00:1b.0 -> ../../../devices/pci0000:00/0000:00:1b.0
... 0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0
... 0000:00:1c.1 -> ../../../devices/pci0000:00/0000:00:1c.1
... 0000:00:1c.4 -> ../../../devices/pci0000:00/0000:00:1c.4
... 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
... 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
... 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
... 0000:00:1f.4 -> ../../../devices/pci0000:00/0000:00:1f.4
... 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
... 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
... 0000:02:00.0 -> ../../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0
... 0000:03:00.0 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0
... 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.0
... 0000:04:00.1 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.1
... 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0
... 0000:06:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0
... 0000:06:01.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:01.0
... 0000:06:02.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0
... 0000:06:04.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:04.0
... 0000:07:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0
... 0000:08:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:08:00.0

This topology, combined with the lspci outputs above, reveals that most of the I/O peripherals either are provided directly by the Sunrise Point chip or hang off of its root ports (0000:00:1b.0 and 0000:00:1c.{0,1,4}).  There is another fan-out of PCIe ports inside the Alpine Ridge chip, which sits behind the 0000:00:1c.4 root port (appearing as 0000:05:00.0 with its downstream ports on bus 0000:06), and what's not shown are the native Thunderbolt (NHI) connections, such as DisplayPort, on the other side of the Alpine Ridge.  Although I haven't looked very hard, I did not find a way to enumerate these Thunderbolt NHI devices.

There remain a few other open mysteries to me as well; for example, lspci -vv reveals the PCIe lane width of most PCIe-attached devices, but it does not obviously display the maximum lane width for each connection.  Furthermore, the USB, HECI, SATA, and LPC bridges hanging off the Sunrise Point do not list a lane width at all, so I still don't know exactly what level of bandwidth is available to these bridges.
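For what it's worth, the link capability registers that lspci does decode can be isolated with a quick grep; LnkCap reports the maximum speed and width a port advertises, while LnkSta reports what was actually negotiated.  Something like

sudo lspci -vvD | grep -E '^0000|LnkCap:|LnkSta:'

narrows the output to one header line per device plus its link capabilities, though the PCH's integrated controllers still won't report anything, presumably because they are integrated endpoints rather than devices at the far end of a physical PCIe link.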

If anyone knows more about how to peel back the onion on some of these bridges, or if I'm missing any important I/O connections between the CPU, PCH, or Alpine Ridge that are not enumerated via PCIe, please do let me know!  I'd love to share the knowledge and make this more accurate if possible.

Are FPGAs the answer to HPC's woes?


Executive Summary

Not yet.  I'll demonstrate why no domain scientist would ever want to program in Verilog, then highlight a few promising directions of development that are addressing this fact.

The usual disclaimer also applies: the opinions and conjectures expressed below are mine alone and not those of my employer.  Also I am not a computer scientist, so I probably don't know what I'm talking about.  And even if it seems like I do, remember that I am a storage architect who is wholly unqualified to speak on applications and processor performance.

Premise

We're now in an age where CPU cores aren't getting any faster, and the difficulty of shrinking processes below 10 nm means we can't really pack any more CPU cores on a die.  Where's performance going to come from if we ever want to get to exascale and beyond?

Some vendors are betting on larger and larger vectors--ARM (with its Scalable Vector Extensions) and NEC (with its Aurora coprocessors) are going down this path.  However, algorithms that aren't predominantly dense linear algebra will need very efficient scatter and gather operations that can pack vector registers quickly enough to make doing a single vector operation worthwhile.  For example, gathering eight 64-bit values from different parts of memory to issue an eight-wide (512-bit) vector multiply requires pulling eight different cache lines--that's moving 4096 bits of memory for what amounts to 512 bits of computation.  In order to continue scaling vectors out, CPUs will have to rethink how their vector units interact with memory.  This means either (a) getting a lot more memory bandwidth to support these low flops-per-byte ratios, or (b) packing vectors closer to the memory so that pre-packed vectors can be fetched through the existing memory channels.
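To make the access pattern concrete, the indexed gather at the heart of this problem looks like the following sketch in C (the function and variable names are mine, not from any particular code):

#include <stddef.h>

/* Before an eight-wide vector multiply can be issued, the x[index[i]]
 * operands have to be gathered from up to eight different cache lines. */
double dot_gather(const double *x, const double *y, const size_t *index, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[index[i]] * y[i];
    return sum;
}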

Another option to consider is the GPU, which works around the vector packing issue by implementing massive numbers of registers and giant crossbars to plumb those bytes into arithmetic units.  Even then, though, relying on a crossbar to connect compute and data is difficult to continue scaling; the interconnect industry gave up on this long ago, which is why today's clusters now connect hundreds or thousands of crossbars into larger fat trees, hypercubes, and dragonflies.  GPUs are still using larger and larger crossbars--NVIDIA's V100 GPU is one of the physically largest single-die chips ever made--but there's an economic limit to how large a die can be.

This bleak outlook has begun to drive HPC designers towards thinking about smarter ways to use silicon.  Rather than build a general-purpose processor that can do all multiplication and addition operations at a constant rate, the notion is to bring hardware design closer to the algorithms being implemented.  This isn't a new idea (for example, RIKEN's MDGRAPE and DESRES's Anton are famous examples of purpose-built chips for specific scientific application areas), but this approach historically has been very expensive relative to just using general-purpose processor parts.  Only now are we at a place where special-purpose hardware may be the only way to sustain HPC's performance trajectory.

Given the diversity of applications that run on the modern supercomputer, though, expensive custom chips that only solve one problem aren't very appetizing.  FPGAs are a close compromise, and there has been growing buzz surrounding the viability of relying on them for mainstream HPC workloads.

Many of us non-computer scientists in the HPC business only have a vague and qualitative notion of how FPGAs can realistically be used to carry out computations, though.  Since there is growing excitement around FPGAs for HPC as exascale approaches, I set out to get my hands dirty and figure out how they might fit into the larger HPC ecosystem.

Crash course in Verilog

Verilog can be very difficult to grasp for people who already know how to program in languages like C or Fortran (like me!).  On the one hand, it looks a bit like C in that it has variables to which values can be assigned, if/then/else controls, for loops, and so on.  However, these similarities are deceptive because Verilog does not execute like C; whereas a C program executes code line by line, one statement after the other, Verilog sort of executes all of the lines at the same time, all the time.

A C program to turn an LED on and off repeatedly might look like:
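Something along these lines, assuming a hypothetical set_led() helper that drives the LED's pin (stubbed out here so the sketch compiles):

#include <stdbool.h>

void set_led(bool on) { (void)on; }   /* hypothetical helper; the real one would drive a GPIO pin */

int main(void)
{
    while (1) {
        set_led(true);    /* turn the LED on  */
        set_led(false);   /* turn the LED off */
    }
}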

where the LED is turned on, then the LED is turned off, then we repeat.

In Verilog, you really have to describe what components your program will have and how they are connected. In the most basic way, the code to blink an LED in Verilog would look more like


Whereas C is a procedural language in that you describe a procedure for solving a problem, Verilog is more like a declarative language in that you describe how widgets can be arranged to solve the problem.

This can make tasks that are simple to accomplish in C comparatively awkward in Verilog. Take our LED blinker C code above as an example; if you want to slow down the blinking frequency, you can do something like
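Perhaps something like this sketch, again with the hypothetical set_led() helper, now using sleep() to idle between transitions:

#include <stdbool.h>
#include <unistd.h>

void set_led(bool on) { (void)on; }   /* hypothetical helper, as before */

int main(void)
{
    while (1) {
        set_led(true);
        sleep(1);          /* leave the LED on for one second  */
        set_led(false);
        sleep(1);          /* leave the LED off for one second */
    }
}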


Because Verilog is not procedural, there is no simple way to say "wait a second after you turn on the LED before doing something else." Instead, you have to rely on knowing how much time passes between consecutive clock edges (rising edges of clk).

For example, the DE10-Nano has a 50 MHz clock generator, so its clock signal rises once every 1/(50 MHz) = 20 nanoseconds, and everything time-based has to be derived from this fundamental clock.  The following Verilog statement:
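A minimal sketch of such a statement, assuming the counter register is named cnt as in the discussion below:

always @(posedge clk)
    cnt <= cnt + 1'b1;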


indicates that on every rising clock edge (that is, every 20 ns), the cnt register (variable) is incremented by one. To make our circuit wait for one second after the LED is turned on, we need to figure out a way to do nothing for 50,000,000 clock cycles (1 second / 20 nanoseconds). The canonical way to do this is to
  1. create a big register that can store a number up to 50 million
  2. express that this register should be incremented by 1 on every clock cycle
  3. create a logic block that turns on the LED when our register is larger than 50 million
  4. rely on the register eventually overflowing to go back to zero
If we make cnt a 26-bit register, it can count up to 67,108,864 different numbers and our Verilog can look something like


However, we are still left with two problems:
  1. cnt will overflow back to zero once cnt surpasses 2^26 - 1
  2. We don't yet know how to express how the LED is connected to our FPGA and should be controlled by our circuit
Problem #1 (cnt overflows) means that the LED will stay on for exactly 50,000,000 clock cycles (1 second), but it'll turn off for only 2^26 - 1 - 50,000,000 cycles (17,108,863 cycles, or about 0.34 seconds). Not exactly the one second on, one second off that our C code does.

Problem #2 is solved by understanding the following:

  • our LED is external to the FPGA, so it will be at the end of an output wire
  • the other end of that output wire must be connected to something inside our circuit--a register, another wire, or something else

The conceptually simplest solution to this problem is to create another register (variable), this time only one bit wide, in which our LED state will be stored. We can then change the state of this register in our if (cnt > 50000000) block and wire that register to our external LED:


Note that our assign statement is outside of our always @(posedge clk) block because this assignment--connecting our led output wire to our led_state register--is a persistent declaration, not the assignment of a particular value. We are saying "whatever value is stored in led_state should always be carried to whatever is on the other end of the led wire." Whenever led_state changes, led will simultaneously change as a result.

With this knowledge, we can actually solve Problem #1 now by
  1. only counting up to 50 million and not relying on overflow of cnt to turn the LED on or off, and
  2. overflowing the 1-bit led_state register every 50 million clock cycles
Our Verilog module would look like
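Something like this minimal sketch (the module and port names are my own; only cnt and led_state come from the discussion above):

module blinker (
    input  wire clk,   // 50 MHz clock on the DE10-Nano
    output wire led
);
    reg [25:0] cnt       = 26'd0;   // counts 0 through 49,999,999
    reg        led_state = 1'b0;    // current LED state

    always @(posedge clk) begin
        if (cnt == 26'd49_999_999) begin
            cnt       <= 26'd0;              // only count up to 50 million...
            led_state <= led_state + 1'b1;   // ...and let the 1-bit register overflow every second
        end else begin
            cnt <= cnt + 26'd1;
        end
    end

    assign led = led_state;   // persistently wire the register to the external LED
endmodule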


and we accomplish the "hello world" of circuit design:


This Verilog is actually still missing a number of additional pieces and makes very inefficient use of the FPGA's hardware resources. However, it shows how awkward it can be to express a simple, four-line procedural program using a hardware description language like Verilog.

So why bother with FPGAs at all?

It should be clear that solving a scientific problem using a procedural language like C is generally more straightforward than with a declarative language like Verilog. That ease of programming is made possible by a ton of hardware logic that isn't always used, though.

Consider our blinking LED example; because the C program is procedural, it takes one CPU thread to walk through the code in our program. Assuming we're using a 64-core computer, that means we can only blink up to 64 LEDs at once. On the other hand, our Verilog module consumes a tiny number of the programmable logic blocks on an FPGA. When compiled for a $100 hobbyist-grade DE10-Nano FPGA system, it uses only 21 of 41,910 programmable blocks, meaning it can control almost 2,000 LEDs concurrently**. A high-end FPGA would easily support tens of thousands.

The CM2 illuminated an LED whenever an operation was in flight. Blinking the LED in Verilog is easy.  Reproducing the CM2 microarchitecture is a different story.  Image credit to Corestore.
Of course, blinking LEDs haven't been relevant to HPC since the days of Connection Machines, but if you were to replace LED-blinking logic with floating point arithmetic units, the same conclusions apply.  In principle, a single FPGA can perform a huge number of floating point operations every cycle by giving up its ability to perform many of the tasks that a more general-purpose CPU would be able to do.  And because FPGAs are reprogrammable, they can be quickly configured to have an optimal mix of special-purpose parallel ALUs and general-purpose capabilities to suit different application requirements.

However, the fact that the fantastic potential of FPGAs hasn't materialized into widespread adoption is a testament to how difficult it is to bridge the wide chasm between understanding how to solve a physics problem and understanding how to design a microarchitecture.

Where FPGAs fit in HPC today

To date, a few scientific domains have had success in using FPGAs--most notably in the data acquisition pipelines of experimental detectors and in appliance-style products from vendors like Convey and Edico.

The success of these FPGA products is due in large part to the fact that the end-user scientists don't ever have to directly interact with the FPGAs.  In the case of experimental detectors, FPGAs are sufficiently close to the detector that the "raw" data delivered to the researcher has already been processed by the FPGAs.  Convey and Edico products incorporate their FPGAs into an appliance, and the process of offloading certain tasks to the FPGA is hidden inside proprietary applications that, to the research scientist, look like any other command-line analysis program.

With all this said, the fact remains that these use cases are all on the fringe of HPC.  They present a black-and-white decision to researchers; to benefit from FPGAs, scientists must completely buy into the applications, algorithms, and software stacks.  Seeing as how these FPGA HPC stacks are often closed-source and proprietary, the benefit of being able to see, modify, and innovate on open-source scientific code often outweighs the speedup benefits of the fast-but-rigid FPGA software ecosystem.

Where FPGAs will fit in HPC tomorrow

The way I see it, there are two things that must happen before FPGAs can become a viable general-purpose technology for accelerating HPC:
  1. Users must be able to integrate FPGA acceleration into their existing applications rather than replace their applications wholesale with proprietary FPGA analogues.
  2. It has to be as easy as f90 -fopenacc or nvcc to build an FPGA-accelerated application, and running the resulting accelerated binary has to be as easy as running an unaccelerated binary.
The first steps towards realizing this have already been made; both Xilinx and Intel/Altera now offer OpenCL runtime environments that allow scientific applications to offload computational kernels to the FPGA.  The Xilinx environment operates much like an OpenCL accelerator, where specific kernels are compiled for the FPGA and loaded as application-specific logic; the Altera environment installs a special OpenCL runtime environment on the FPGA.  However, there are a few challenges:
  • OpenCL tends to be very messy to code in compared to simpler APIs such as OpenACC, OpenMP, CUDA, or HIP.  As a result, not many HPC application developers are investing in OpenCL anymore.
  • Compiling an application for OpenCL on an FPGA still requires going through the entire Xilinx or Altera toolchain.  At present, this is not as simple as f90 -fopenacc or nvcc, and the process of compiling code that targets an FPGA can take orders of magnitude longer than it would for a CPU due to the NP-hard nature of placing and routing across all the programmable blocks.
  • The FPGA OpenCL stacks are not yet very polished or scientist-friendly; performance analysis and debugging generally still have to be done at the circuit level, which is untenable for domain scientists.
Fortunately, these issues are under very active development, and the story surrounding FPGAs for HPC applications improves on a month-by-month basis.  We're still years from FPGAs becoming a viable option for accelerating scientific applications in a general sense, but when that day comes, I predict that programming in Verilog for FPGAs will seem as exotic as programming in assembly does for CPUs.

Rather, applications will likely rely on large collections of pre-compiled FPGA IP blocks (often called FPGA overlays) that map to common compute kernels.  It will then be the responsibility of compilers to identify places in the application source code where these logic blocks should be used to offload certain loops.  Since it's unlikely that a magic compiler will be able to identify these loops on their own, users will still have to rely on OpenMP, OpenACC, or some other API to provide hints at compile time.  Common high-level functions, such as those provided by LAPACK, will probably also be provided by FPGA vendors as pre-compiled overlays that are hand-tuned.
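As a rough illustration of what that hint-driven model could look like, here is a sketch that borrows OpenACC's existing directive syntax.  Today this kind of directive targets GPUs or multicore CPUs; whether FPGA toolchains will ever accept this exact form is purely speculative:

void vec_mult(const float *a, const float *b, float *c, int n)
{
    /* The pragma is only a hint marking the region to offload; an FPGA-aware
     * compiler could map this loop onto a pre-compiled multiply overlay. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}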

Concluding Thoughts

We're still years away from FPGAs being a viable option for mainstream HPC, and as such, I don't anticipate them as being the key technology that will underpin the world's first exascale systems.  Until the FPGA software ecosystem and toolchain mature to a point where domain scientists never have to look at a line of Verilog, FPGAs will remain an accelerator technology at the fringes of HPC.

However, there is definitely a path for FPGAs to become mainstream, and forward progress is being made.  Today's clunky OpenCL implementations are already being followed up by research into providing OpenMP-based FPGA acceleration, and proofs of concept demonstrating OpenACC-based FPGA acceleration have shown promising levels of performance portability.  On the hardware side, FPGAs are also approaching first-class citizenship with Intel planning to ship Xeons with integrated FPGAs in 2H2018 and OpenPOWER beginning to ship Xilinx FPGAs with OpenCAPI-based coherence links for POWER9.

The momentum is growing, and the growing urgency surrounding post-Moore computing technology is driving investments and demand from both public and private sectors.  FPGAs won't be the end-all solution that gets us to exascale, nor will they be the silver bullet that gets us beyond Moore's Law computing, but they will definitely play an increasingly important role in HPC over the next five to ten years.

If you've gotten this far and are interested in more information, I strongly encourage you to check out FPGAs for Supercomputing: The Why and How, presented by Hal Finkel, Kazutomo Yoshii, and Franck Cappello at ASCAC.  It provides more insight into the application motifs that FPGAs can accelerate, and a deeper architectural treatment of FPGAs as understood by real computer scientists.

** This is not really true.  Such a design would be limited by the number of physical pins coming out of the FPGA; in reality, output pins would have to be multiplexed, and additional logic to drive this multiplexing would take up FPGA real estate.  But you get the point.

A week in the life of an SC attendee

Last week was the annual Supercomputing conference, held this year in Dallas, and it was as busy as they always are.  Every year I take plenty of photos and post plenty of tweets throughout the week, but this year I thought it might be fun to share some of those photos (and the related things I learned) now that the dust has settled.  Since some people might also be interested in how someone might approach the conference from a technical and philosophical perspective, I figured I'd write a more general piece documenting my entire SC experience this year.

This post wound up being a massive, meandering, chronological documentary of a week in my life that includes both technical and non-technical commentary.  For anyone who is only interested in the technical insights I gained during SC, check out the items prefixed with (tech) in this table of contents:

Everything that's not labeled (tech) is part diary and part career development perspective.  Hopefully someone will find something in here that's of some value.

Finally, disclosures:
  • I omitted some names in the interests of respecting the privacy of the folks who took the time to talk to me one-on-one.  If you're part of this story and don't mind having your name out there, I'd be happy to include it.
  • Everything I paraphrase here is public information or conjecture on my part.  Nothing in this post is either confidential or sensitive.  That said, check your references before citing anything here.  I don't know what I'm talking about.
  • Everything here is my personal opinion and does not necessarily reflect the viewpoint of my employer or its funding agency.  I attended the conference as part of the regular course of business for which I am employed.  However, I took all photos for personal purposes, and the entirety of this post was written on my own personal time.

Before the conference

Everyone's SC experience is different because it draws such a diverse range of professionals.  There are plenty of activities for everyone ranging from students and early-career staff to senior management and leadership, and people on different career tracks (e.g., facilities staff, computer science researchers, program managers, product sales) are likely to be drawn to very different parts of the conference agenda.  My priorities during the week of SC are definitely shaped by where I am in my career, so when filling out my calendar a few weeks ahead of the conference, I considered the following:

My job is half research and half facilities staff. 50% of my time is funded by grant money to do applied research in characterizing parallel I/O systems.  The other half of my time is spent staying current on emerging technologies in computing and storage.  These two responsibilities mean that my SC is usually a mix of attending technical program sessions (to see what my peers in research are doing and see what research ideas might turn up in future technologies) and engaging with vendors.

I work in advanced technologies.  This means I am generally not in the trenches directly feeling the pains of operating HPCs today; instead, my job is to identify technologies that will cause fewer problems tomorrow.  This also means that I don't have purchasing authority, and I am less likely to be involved with anything that's going to hit the floor in the next year.  As such, I generally don't do vendor sales meetings or briefings at SC because they are generally focused on nearer-term products and sales.

I did not get to where I am by myself.  I first heard about SC in 2010 when I was a graduate student, and it sounded almost infinitely more exciting than the materials science conferences I was attending.  I had no experience in HPC at the time, but it made me realize what I really wanted to pursue as a career.  I relied heavily on the good will of the online HPC community to learn enough to get my first HPC job at SDSC, and after that, the faith of a great many more to get me to where I am now.  SC is often the only time I get to see people who have helped me out in my early career, and I always make time to connect with them.

The net result of these goals was a pretty full schedule this year:

My SC'18 schedule.  Note that the time zone is PST, or two hours behind Dallas time.


I mark everything that I must attend (usually because I'm a speaker) in red to know the immovable obligations. Blue items are things I will attend unless an emergency comes up, and grey things are events I should attend because they sound interesting.

White space is very important to me too; between 10am and 6pm, white space is when I can walk the expo floor.  A lot of people write off the expo as a waste of time, but I actually feel that it's one of the most valuable parts of SC.  Since my job is to understand emerging technologies (and the market trends that drive them), accosting a pre-sales engineer or product manager at a strategically important technology provider can yield an invaluable peek into the markets they're serving.  White space in the evenings is equally important for engagements of opportunity or for working on slides that have to be presented the next day.

Saturday, November 10

I always fly to SC on the Saturday before the conference starts.  I have historically opted to do workshops on both Sunday and Monday, as I really enjoy attending both PMBS and PDSW-DISCS.  I bring a suitcase that has extra room for conference swag, and doing so this year was critically important because I opted to bring along a pair of cowboy boots that I knew I would not want to wear on the flight home.

My brown kicks.  Also Harriet the cat.

On just about every work flight I'm on, I've got PowerPoint slides to review; this trip was no different, and I spent the 3.5-hour flight time reviewing the slides I had to present the next day. Once in Dallas and at my hotel, I carried out my usual work-travel night-of-arrival ritual: order the specialty pizza from a local pizza joint, text home saying I arrived safely, and iron my clothes while watching Forensic Files.

Sunday, November 11

This year I had the honor of presenting one part of the famed Parallel I/O in Practice tutorial at SC along with Rob Ross, Brent Welch, and Rob Latham.  This tutorial has been running for over fifteen years now, and at some point over those years, it picked up the curious ritual of being kicked off with some juggling:

Brent leading up to the tutorial start time with some juggling.  He brought the pins with him.

The tutorial itself is really comprehensive and includes everything from device-level performance behavior to parallel file systems architecture and I/O middleware.  Even though I can proudly say that I knew 95% of the material being presented throughout the day (as I probably should since I was a presenter!), I found this particular slide that Rob Latham presented particularly insightful:

The ease and portability of using I/O middleware comes without sacrificing performance!  Sorry for the odd angle; this is the screen as us presenters were able to view it.

It makes the case that there is no significant performance penalty for using higher-level I/O libraries (like PnetCDF or parallel HDF5) despite how much easier they are to use than raw MPI-IO.  One of the biggest take-home messages of the entire tutorial is to use I/O middleware wherever possible; doing so means that understanding parallel file system architecture isn't prerequisite to getting good I/O performance.

Monday, November 12

Monday was the official first day of SC.  Workshops and tutorials went on throughout the day, and the opening keynote and exhibition hall opening gala started in the evening.

PDSW-DISCS 2018

The 3rd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS) was on Monday, and I had the honor of being asked to serve as its Publicity Chair this year.

The PDSW-DISCS full-day workshop agenda

It's a really great workshop for people working in I/O, storage, and data and always draws a large crowd:


For researchers, it's a great venue for short papers that IEEE or ACM publishes, and it also has a really nice Work-in-Progress track where a page-long abstract gives you a seven minute spot to pitch your work.  For attendees, it's always chock full of good talks that range from pure research to applied development.

This year's keynote speaker was Rangan Sukumar, Cray's analytics guru.  His talk was interesting in that it approached the oft-mentioned convergence between HPC and AI (which has become an over-used trope by itself) from the perspective of a system architect (which is where the rubber meets the road):


As many great keynote speakers are, Rangan used hyperbole at times to contrast HPC and "Big Data" workloads, and this stimulated some discussion online.  Although the slides alone tell only part of the story, you can download them from the PDSW-DISCS'18 website.

Later in the morning, Margaret Lawson (University of Illinois, Sandia Labs) presented a follow-on to the EMPRESS metadata system she presented last year:


Last year, EMPRESS seemed a little too researchy for me (as a facilities person) to sink my teeth into.  This year though, the picture seems a lot more complete and I quite like the architectural framework.  Although EMPRESS may not ever be a household name, the concept of separating data streams and metadata streams underneath some sort of I/O middleware is really solid.  I think that storing data and metadata in different, architecturally distinct storage systems that map to the unique access patterns of data and metadata is ultimately the right way to approach large-scale data and metadata management in HPC, and I expect to see this design pattern proliferate as scientific data analysis becomes a bigger part of large-scale HPC workloads.

In the afternoon, researchers from OSU offered a rare peek into Alibaba through a high-level analysis of SSD failure data provided by the Chinese hyperscaler:



The most alarming finding to me was that 20% of SSD failures were caused by humans yanking the wrong SSD.  This immediately made me wonder who Alibaba is hiring to do routine operational support at their data centers; if people are causing a significant fraction of storage faults, either they aren't hiring with the same standards as their US counterparts, or their data centers are a mess.  The speaker's proposed remedy was to use a different SSD form factor for each logical use case for SSDs so that operators could visually identify an SSD reserved for metadata versus one reserved for data.  I personally think a label maker, a barcode scanner, and a decent salary is an easier, standards-based solution.

Other highlights included
  • Characterizing Deep-Learning I/O Workloads in TensorFlow, presented by Stefano Markidis of KTH.  The first time I've seen an I/O-centric evaluation of how deep learning workflows will affect storage requirements of future systems.  I learned a lot.
  • Toward Understanding I/O Behavior in HPC Workflows, presented by Jakob Lüttgau of DKRZ/ANL.  Rather than analyze the I/O pattern of a single MPI job, this paper began examining the I/O patterns of related jobs that all work towards a single scientific objective.  Again, one of the first research papers I've seen that takes a critical look at end-to-end workflows from an I/O perspective.
  • Methodology for the Rapid Development of Scalable HPC Data Services, presented by Matthieu Dorier of ANL.  I think this paper is intended to be the canonical reference for the Mochi project, which I was glad to finally see.  The idea of enabling quickly composable, purpose-built I/O services that are optimized for next-generation media and interconnects is a brilliant one, and I am a huge believer that this approach will be what demonstrates the earliest scientific successes that rely on storage-class memory at scale.

There were a number of really promising ideas presented at the WIP sessions as well, and recapping the entirety of the workshop is a blog post in and of itself.  Fortunately, all the papers and slides are openly available on the PDSW-DISCS website.

SC Opening Keynote and Gala

I've actually stopped going to the SC keynotes over the last year since they're increasingly focused on the societal impacts enabled by HPC rather than HPC itself.  While I'm definitely not knocking that theme--it's a great way to inspire early-career individuals, big-picture program management types, and disenchanted technical folks in the trenches--it's just not why I attend SC.  Instead, I make use of my exhibitor badge and head into the expo floor before it opens to the public; this is the only time during the conference where I seem to be able to reliably find the people I want to meet at their booths.

This year I visited a few small businesses with whom I've fostered good will over the last few years to say hello, then dropped in on the SDSC booth to catch up with the latest news from my former coworkers.  They also happen to have free beer on the opening night.

Once the expo floor opens to the public following the opening keynote, booth activity goes from zero to eleven really quickly.  Every booth has a big splash during the gala which makes it hard to choose just one, but my decision this year was made easier by Cray choosing to unveil its new exascale HPC platform, Shasta, and celebrate its first sale of a Shasta system to NERSC.

Cray CEO Pete Ungaro at the Shasta unveiling ceremony

This new system, named Perlmutter, will be delivered in 2020 and has a bunch of really slick new technologies incorporated into it.

After Cray CEO Pete Ungaro unveiled the prototype Shasta blades, there was a celebratory toast and both NERSC and Cray staff donned their "ASK ME ABOUT SAUL" pins:

NERSC and Cray staff got these VIP pins to promote NERSC's next system, named after astrophysicist, Nobel laureate, and Berkeley Lab scientist Saul Perlmutter.

I stuck around to shake hands with my colleagues at Cray (including the CEO himself!  Haven't washed my hand since) and catch up with some of my counterparts in storage R&D there.

The Beowulf Bash

The gala shut down at 9 PM, at which time I headed over to the Beowulf Bash to try to find some colleagues who said they would be there.  I generally don't prioritize parties at SC for a few reasons:
  1. Shouting over music all night is a great way to burn out one's voice.  This is not good when I have to present something the next day.
  2. The crowds and lines often undercut my enjoyment of catching up with old colleagues (and meeting new ones).
  3. I almost always have slides that need to be finished by the end of the night.
I make an exception for the Bash because I personally value many of the people behind organizing and sponsoring it, and it captures the scrappier side of the HPC community which helped me get my foot in the door of the industry.  This year I specifically went to catch up with my colleagues at The Next Platform; Nicole and Tim are uncommonly insightful and talented writers and editors, and they always have wacky anecdotes to share about some of the more public figures in our industry.

More generally and self-servingly though, maintaining a good relationship with members of the HPC trade press at large has tremendous value over time regardless of your affiliation or job title.  Behind every interesting HPC news article is an editor with incomparable access to a broad network of people in the industry.  Despite this though, they still are subject to the same haters as anyone else who puts something out in the spotlight, so I have to imagine that putting in a kind word in person is always worth it.

At around midnight, only the die-hards were still around.

Late night Beowulf Bash at Eddie Deen's Ranch.

Regrettably, I barely had any time to catch up with my colleagues from the FreeNode HPC community at the Bash (or at all).  Maybe at ISC.

After getting back to the hotel, I realized I hadn't eaten anything since lunch.  I also learned that absolutely nothing that delivers food in the downtown Dallas area is open after midnight.  After waiting an hour for a food delivery that wound up going to a restaurant that wasn't even open, I had to settle for a hearty dinner of Hot Pockets from the hotel lobby.

I hadn't eaten a Hot Pocket since graduate school.  They still taste the same.

Fortunately my Tuesday was relatively light on hard obligations.

Tuesday, November 13

Tuesday was the first day in which the SC technical program and expo were both in full swing.  I split the day between paper talks, meetings, and the expo floor.

Technical Program, Part 1 - Data and Storage

My Tuesday morning began at 10:30 AM with the Data and Storage paper presentation session in the technical program.  Of note, the first two papers presented were about cloud-centric storage paradigms, and only the third one was clearly focused on scientific HPC workloads.

  • SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition by Yu et al was a paper squarely aimed at reducing tail latency of reads.  Very important if you want to open an old Gmail message without waiting more than a few seconds for it to load.  Less useful for most scientific HPC workloads.
  • BESPOKV: Application Tailored Scale-Out Key-Value Stores by Anwar et al was a paper presenting a framework that is uncannily similar to the Mochi paper presented at PDSW the day before.  The premise was to allow people to compose their own Cassandra-like KV store with a specific balance of consistency and durability without having to reinvent the basic building blocks.
  • Scaling Embedded In Situ Indexing with DeltaFS by Zheng et al was the talk I really wanted to hear but I had to miss on account of a conflicting meeting.  The DeltaFS work being done by CMU and LANL is a really innovative way to deal with the scalability challenges of parallel file system metadata, and I think it's going to ultimately be where many of the nascent software-defined storage technologies aimed at HPC will converge.
Unfortunately I had to cut out of the session early to meet with a vendor partner at a nearby hotel.

Interlude of Meetings

The first of my two vendor meetings at this year's SC was less a sales call and more about continuing a long-running discussion about technology futures in the five-to-ten year timeframe.  No sane vendor will commit to any roadmap that far out, especially given the uncertainty surrounding post-Moore's Law technologies, but they are receptive to input from customers who are formulating their own strategic directions for the same time period.  Maintaining these sorts of ongoing conversations is a major part of what falls under my job title in "advanced technologies."

Unfortunately that vendor meeting overlapped with the Lustre BOF, but other staff from my institution were able to attend and ensure that our interests were represented.  I was also able to attend the Lustre Lunch that followed the BOF, which was very fruitful; in addition to simply being present to remind the Lustre community that I (and the institution I represent) am a part of it, I happened to meet someone in person whom I had known for a few years only via Twitter and made a valuable connection.  I then had to leave the Lustre Lunch early to make another meeting, unrelated to SC, that allowed a geographically distributed committee to meet face-to-face.

After that committee meeting, I seized the free hour I had to visit the show room floor.

Expo Floor, Part 1

The first photo-worthy tech I saw was the Shasta blade at the Cray booth.  Because the booth was mobbed with people during the previous night's gala, this was actually my first time seeing Shasta hardware up close.  Here's the compute blade:

Part of a Cray Shasta compute blade up-close

Unlike the Cray XC blade of today's systems which uses a combination of forced-air convection and heat exchangers to enable liquid cooling, these Shasta blades have direct liquid cooling which is rapidly becoming a de facto minimum requirement for an exascale-capable rack and node design.  I had some questions, so I struck up a conversation with a Cray employee at the booth and learned some neat things about the Shasta packaging.

For the sake of clarity, here is a hand-drawn, annotated version of the same photo:

Part of a Cray Shasta compute blade up-close with my annotations

What stood out to me immediately was the interesting way in which the DIMMs were direct-liquid cooled.  Unlike IBM's attempt at this with the POWER 775 system (the PERCS system of Blue Waters infamy) where cold plates were attached to every DIMM, Cray has opted to use what looks like a heat-conductive foam that wraps copper cooling lines.  To service the DIMMs, the entire copper cooling complex that runs between the two rows of two DIMMs unfastens and lifts up.  There's enough slack in the liquid cooling lines (highlighted in purple) so that DIMMs (and presumably every other field-replaceable part in the blade) can be serviced without draining the coolant from the blade.

The NIC is also pretty interesting; it is a commercial high-end data center Ethernet NIC that's manufactured in a custom form factor to fit this blade.  It looks like a second CPU is housed underneath the NIC, so the NIC and one of the CPUs may share a common cooling block.  The NIC is also positioned perpendicular to the long edge of the blade, meaning that there are probably some pretty good cable runs going from the front-most NIC all the way to the rear of the blade.  Finally, because the NIC is on a discrete mezzanine card, the networking technology is no longer soldered to the compute as it is with Aries on today's XC.

The network switch (which I did not photograph, but others did) is another blade that slots into the rear of the Shasta cabinet and mates perpendicularly with a row of compute blades such that a single switch blade can service a fully populated compute chassis.  The engineer with whom I spoke said that these Shasta cabinets have no actual midplane; the compute blades connect directly to the switch blades through a bunch of holes cut out of the sheet metal that separates the front of the cabinet from the rear.  Without a midplane there is presumably one less single point of failure; at the same time though, it wasn't clear to me how out-of-band management works without a centralized controller somewhere in the chassis.

At this point I should point out that all of the above information is what I learned by talking to a Cray booth employee at SC without any special privilege; although I'm sure that more details are available under non-disclosure, I frankly don't remember any of it because I don't work on the compute side of the system.

My next big stop on the show room floor was at the Fujitsu booth, where they had their post-K prototype hardware on display.  Of particular note was their A64FX engineering sample:



If you look very carefully, you can see the four stacks of high-bandwidth memory (HBM) on-package alongside the ARM cores, which is fantastically historic in that this is the first general-purpose CPU (of which I am aware) with integrated HBM2.  What's not present is any indication of how the on-chip Tofu NIC is broken out; I guess I was expecting something like Intel's -F series KNLs with on-package OmniPath.

A sample node of the post-K system was also on display:


Seeing as how both this post-K system and Cray Shasta are exascale-capable system architectures, it's interesting to compare and contrast them.  Both have direct liquid cooling, but the post-K compute blade does not appear to have any field-replaceable units.  Instead, the entire board seems to be a single FRU, so CPUs must be serviced in pairs.  I think the A64FX lacks any cache coherence bus, meaning that two CPUs correspond to two nodes per FRU.

That all said, the post-K design does not appear to have any DDR DRAM, and the NIC is integrated directly into the CPU.  With those two components out of the picture, the rate of a single component failure is probably a lot lower in post-K than it would be in Shasta.  Hopefully the post-K HBM has ECC though!

While chatting about the post-K node architecture at the Fujitsu booth, I also met an engineer who just happened to be developing LLIO, the post-K system's burst buffer service:

LLIO burst buffer slide shown at the Fujitsu booth

It sounds a lot like DataWarp in terms of features, and given that Fujitsu is also developing a new Lustre-based file system (FEFS 2.0?) for post-K, we might see a tighter integration between the LLIO burst buffer layer and the FEFS back-end disk storage.  This is definitely a technology that wasn't on my radar before SC but is definitely worth keeping an eye on as 2021 approaches.

As I was racing between a few other booths, I also happened upon my boss (and NERSC-9 chief architect) presenting the Perlmutter system architecture at the NVIDIA booth:

NERSC's Nick Wright, chief architect of the Perlmutter system, describing its architecture at the NVIDIA booth


The talk drew a crowd--I'm glad to see people as jazzed about the new system as I am.

Analyzing Parallel I/O BOF

The Analyzing Parallel I/O BOF is a must-attend event for anyone in the parallel I/O business, and this year's BOF was especially good.  Andreas Dilger (of Lustre fame; now CTO of Whamcloud) gave a brief but insightful retrospective on understanding I/O performance:


Unfortunately I did not take a picture of Andreas' second slide (available on the Analyzing Parallel I/O BOF's website), a "what is needed?" slide that largely revolves around better integration between storage system software (like Lustre) and user applications.  I/O middleware seems to be at the center of most of the bullets that called for increased development, which bodes well for scientific application developers who attended the Parallel I/O in Practice tutorial on Sunday--recall that this was my key takeaway.  It's good to know that the lead of Lustre development agrees with this vision of the future, and I hope Whamcloud moves Lustre in this direction so users and middleware developers can meet the storage system software somewhere in the middle.

The BOF took a darker turn after this, starting with a presentation from Si Liu of TACC about the Optimal Overloaded IO Protection System, or OOOPS.  It's a library that wraps the standard POSIX I/O calls:

OOOPS operates by hijacking standard I/O calls and lagging them.


But in addition to passively monitoring how an application performs I/O, it purposely injects latency to throttle the rate at which an application issues I/O operations.  That is, it deliberately slows down I/O from clients to reduce server-side load and, by extension, the effects of a single bad actor on the I/O performance of all the other users.
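For anyone unfamiliar with how a tool like this wedges itself between an application and libc, here is a minimal sketch of the general LD_PRELOAD interposition technique.  This is not OOOPS's actual code, and the IO_DELAY_US environment variable is a name I made up for illustration; the wrapper simply intercepts write(2), sleeps for a configurable number of microseconds, and then forwards the call to the real implementation.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>

/* Interpose write(2): sleep for IO_DELAY_US microseconds (a made-up knob
 * for this sketch), then hand the call off to the real libc write. */
ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");   /* look up libc's write */

    const char *delay = getenv("IO_DELAY_US");
    if (delay)
        usleep((useconds_t)atoi(delay));          /* inject latency */

    return real_write(fd, buf, count);            /* forward the call */
}
```

Built with something like `gcc -shared -fPIC -o liblag.so lag.c -ldl` and activated with `LD_PRELOAD=./liblag.so IO_DELAY_US=1000 ./my_app`, it throttles every write the application issues--and because the user controls the environment, it is just as easy to switch off, which is relevant to the policy discussion below.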

Ideologically, I have a lot of problems with an HPC facility inserting itself into the user's workflow and reducing the efficiency with which they can accomplish their science relative to the peak capability of the HPC resource.  If a storage system allows a single user to accidentally deny service to other users in pursuit of peak performance, that is a problem with the storage system and it should be addressed at the system level.  And as Andreas pointed out in the BOF, tools exist to allow storage systems to accomplish fair sharing, which is distinctly different from explicitly penalizing users.  Granted, TACC is also the facility where one of its staff went on record as saying that the R language should not be used by anyone since it is a waste of energy.  Perhaps they have an institutionally different relationship with their user community.

Fortunately, anything that relies on LD_PRELOAD can be circumvented by users, so OOOPS is unlikely to be used to enforce any kind of resource usage policy as it was pitched during the BOF.  I do see a lot of value in using it to fence data analysis workflows that may hit a pathological condition as a result of their inputs, and being able to trigger changes in application behavior by tracking I/O rates is a technique that could be useful in auto-tuning I/O middleware.

Rosemary Francis, CEO of Ellexus, also spoke at the BOF and made the case for making I/O performance analysis a little more accessible to end users.  I was quite delighted by the visualizations she presented (presumably from her company's Breeze product) which used both color and human-readable "bad" I/O patterns to create a pie graph that quickly shows how much time an application spent doing I/O in various good, bad, and neutral ways.  Darshan, the tried-and-true open source I/O profiling library, operates at a slightly lower level and assumes a slightly higher level of user sophistication by comparison.

The discussion half of the BOF was packed with engagement from the audience--so much so that I didn't find any moments of silence to seize the opportunity to stump for my own view of the world.  The combination of OOOPS and Rosemary's I/O war stories did steer the discussion towards ways to punish bad users though.  I can appreciate HPC operators' frustration with novice users causing system-wide problems, but I don't think shaming users who do bad I/O is a great solution.  Rather, something between OOOPS' automatic identification of bad I/O at runtime and Ellexus' user-centric reporting and feedback, combined with storage systems capable of enforcing QOS, is where we need to go.

The Cray Celebration

I wrote earlier that I normally don't do the SC vendor party circuit, but the Cray party this year was another exception for two reasons: (1) we had just announced Perlmutter along with Cray's Shasta unveiling which is worth celebrating, and (2) there were specific Cray staff with whom I wanted to confer sometime during the week.  So after the Parallel I/O BOF, I headed over to the event venue:


The event was quite nice in that it was not held at a loud bar (which made conversation much easier), it had plenty of food (no need for 2 AM Hot Pockets), and the format was conducive to moving around and meeting a lot of different people.  The event was awash with representatives from all the major Cray customers including the DOE labs, the big oil & gas companies, and the regional leadership computing centers in EMEA including CSCS and KAUST, as well as alumni of all those employers and Cray itself.  I've only worked at a Cray customer site for three years now, but I couldn't walk ten feet without running into someone I knew; in that sense, it felt a little like an event at the annual Cray User Group meeting but with a broader range of attendees.

I don't know what this event would've been like if I were a student or otherwise didn't already know many of the regular faces within the Cray user community and instead had to start conversations cold.  That said, I was busy the entire evening getting to know the people behind all the conference calls I'm on; I find that getting to know my industry counterparts as people rather than just vendor reps really pays dividends when surprises happen and conflicts need to be resolved.  Events like this at SC are invaluable for building and maintaining these sorts of relationships.

Wednesday, November 14

My Wednesday began bright and early with a quick run-around of the expo floor to figure out who I needed to visit before the end of the week.


The expo floor was awkwardly laid out this year, so I really needed to do this to make sure I didn't spin my tires trying to find certain booths once the crowd showed up.  Incidentally, I did witness a sales person violate the unwritten rule of keeping everything friendly until the expo floor opened to the public--a sales rep selling "the world's fastest storage system" tried to stir up cold sales leads at my employer's booth at 8 AM while we were all still drinking our coffee and catching up on e-mail.  If you do this, shame on you!  Respect the exhibitor access and don't put your game face on until the public is allowed in.

SC Student Career Fair and Booth Talk

My first meeting was a chat over coffee with VAST Data, a storage technology company that has some really innovative and exciting ideas in the pipeline, to keep up to date with the latest news as they approach public launch.

My second obligation was volunteering at my employer's booth at the SC Career Fair.  I generally enjoy booth duty and talking to students, and this year I was doubly motivated by my desire to fill some career and student job openings related to my responsibilities.  A diverse cross section of students dropped by our booth looking for both summer internships and full-time jobs; many seemed very well rehearsed in their cold pitch, while some others were a little more casual or cautious.  Although I'm not particularly qualified to give career advice, I will say that knowing how to sell yourself cold can be a valuable skill in your early career.  If you are seeking employment, be prepared to respond to a request to "tell me about yourself" in a way that makes you stand out.

After the Career Fair, I wound up hunkering down at the SDSC booth to have lunch with my former coworkers and review the slides I volunteered to present at the adjacent DDN booth.

At 2 PM I took the stage (booth?) and one of my colleagues was not only kind enough to sit in on this booth talk, but also share this photo he took right before I started:

Beginning of my talk at the DDN booth.  Photo credit goes to Suhaib Khan via Twitter.

I continue to be humbled that anyone would go out of their way to come hear what I have to say, especially when my talk is as unvetted as booth talks tend to be.  Talking at booths rarely goes well for me; the audio is always a wildcard, the audience is often unwitting, and auditory and visual distractions are literally everywhere.  The DDN booth was my sole booth talk of this year and it went about as well as I would have expected.  On the up side, quite a few attendees seemed genuinely interested to hear what I had to say about the variety of ways one can deploy flash in an HPC system.  Unfortunately, I ran a few minutes long and got derailed by external distractions several times during the presentation.  Flubbing presentations happens, and none of the audience members seemed to mind.

Shortly after the booth talk, I had to find a quiet spot to jump on a telecon.  This was no easy task; since cell phones killed the public phone booth, there are very few places to take a call on the expo floor.

Expo Floor, Part 2

The afternoon afforded me two more hours to race around the expo floor.  Despite my planning earlier in the morning, I wound up spinning my tires looking for a few key vendors who simply didn't show up to SC this year, including

  • Samsung and SK Hynix, two of the top three DRAM vendors and the sole manufacturers of HBM2
  • Seagate, one of two hard disk drive manufacturers
  • Broadcom/Avago, the company manufacturing most of the serdes used in the upcoming 200G and 400G network devices
  • Juniper, one of the major players in the 400 GbE space
  • AdvancedHPC, one of the few US integrators selling BeeGFS

I'm not really sure why so many vendors didn't show up this year, but it made getting a holistic view of the storage and networking technologies markets impossible.  That said, I still saw a few noteworthy things.

One of the big open questions in high-performance storage revolves around the battle between the NF1 (formerly NGSFF, promoted by Samsung) and EDSFF (promoted by Intel) form factors for NVMe.  It's clear that these long-and-skinny NVMe designs are going to have to replace the thermally inefficient 2.5" U.2 and unserviceable HHHL PCIe form factors, but the dust is far from being settled.  On the one hand, Samsung leads flash storage sales worldwide, but its NF1 form factor caps the power consumption (and therefore performance) of its devices to levels that are squarely aimed at cheaper data center flash.  On the other, the EDSFF form factor being pushed by Intel has a short version (competing directly with NF1) and a longer version that allows higher power.

The Supermicro booth had actual EDSFF drives on display, and this was the first time I could actually see one up-close:

A long-type EDSFF NVMe drive at the Supermicro booth.  The aluminum casing is actually required to meet the thermals.


What I didn't realize is that the higher thermal specification enabled by the long-version EDSFF drives requires that the entire SSD circuit board be enclosed in the aluminum casing shown to enable better heat dissipation.  This has the nasty side effect of reducing density; while a standard 19" 1U chassis can fit up to 36 NF1 SSDs, the aluminum casing on long EDSFFs reduces the equivalent density to 32 SSDs.  Although long EDSFF drives can compensate for this by packing more NAND dies on the physically longer EDSFF board, supporting these longer SSDs requires more engineering on the chassis design to fit the same amount of compute into a smaller area.

Similarly but differently, the Lenovo booth was showcasing their D3284 JBOD which packs 84x 3.5" HDDs into a double-decker 5U chassis.  I had naively assumed that all of these super-dense 84-drive enclosures were top-loading such that each drive mates to a backplane that is mounted to the floor of the chassis, but it turns out that's not the case:

Lenovo's 5U84 JBOD

Instead, each 3.5" drive goes into its 2.5U shelf on its side, and each drive attaches to a carrier that has to be slid slightly toward the front of the JBOD to release the drive, and then slid towards the back of the JBOD to secure it.  This seems a little harder to service than a simple top-load JBOD, but I assume there are thermal efficiencies to be gained by this layout.

The Western Digital booth had a pretty broad portfolio of data center products on display.  Their newest gadget seems to be a planar NAND-based U.2 device that can present itself as DRAM through a custom hypervisor.  This sounds like a direct competitor to Intel's Memory Drive offering which uses ScaleMP's hypervisor to expose flash as DRAM to a guest VM.  The combination of exposing flash as very slow memory and relying on software virtualization to do so makes this a technology that isn't really meant for HPC, and the engineer with whom I spoke confirmed as much.  Virtualized big-and-slow memory is much more appealing to in-memory databases such as SAP HANA.

Perhaps more interesting was the lack of any mention of Western Digital's investment in storage-class memory and microwave-assisted magnetic recording (MAMR) disk drives.  When I prodded about the state of MAMR, I was assured that the technology will work because there is no future for hard drives without some form of energy-assisted magnetic recording.  However, product announcements are still 18-24 months away, and these drives will enter the market at the rather underwhelming capacity of ~20 TB.  Conveniently, this matches Seagate's recent cry of wolf that they will launch HAMR drives in 2020 at a 20 TB capacity point.  Western Digital also made no mention of multi-actuator drives, and asking about it only got me a sly grin; this suggests that Western Digital is either playing slow and steady so as not to over-promise, or Seagate has a slight technological lead.

My last substantive stop of the afternoon was at the IBM booth, where they had one of their new TS4500 tape libraries operating in demo mode.  The window was too reflective to take a video of the robotics, but I will say that there was a perceptible difference between the robotics in IBM's enterprise tape library and the robotics in another vendor's LTO tape library.  The IBM enterprise robotics are downright savage in how forcefully they slam tapes around, and I now fully believe IBM's claims that their enterprise cartridges are constructed to be more physically durable than standard LTO.  I'm sure there's some latency benefit to being able to ram tapes into drives and library slots at full speed, but it's unnerving to watch.

IBM also had this cheeky infographic on display that was worth a photo:


If I built a tape drive that was still operating after forty years in outer space, I'd want to brag about it too.  But there are a couple of factual issues with this marketing material that probably made every physical scientist who saw it roll their eyes.

Over at the compute side of the IBM booth, I learned that the Summit and Sierra systems sitting at the #1 and #2 positions on Top500 are built using node architectures that IBM is selling commercially.  There are 2 CPU + 6 GPU nodes (which is what Summit at OLCF has) which require liquid cooling, and 2 CPU + 4 GPU nodes (which is what Sierra at LLNL has) which can be air- or liquid-cooled.  I asked an IBM technologist which configuration is more commercially popular, and the Sierra configuration is currently leading sales due to the relative lack of infrastructure to support direct liquid cooling in commercial data centers.

This has interesting implications for the exascale technologies I looked at on Tuesday; given that the exascale-capable system designs presented by both Fujitsu and Cray rely on direct liquid cooling, the gap between achieving exascale-level performance and delivering a commercially viable product is pretty wide from a facilities perspective.  Fortunately, the Fujitsu A64FX chip usually runs below 200 W and can feasibly be air-cooled with lower-density packaging, and Cray's Shasta will support standard air-cooled 19" racks via lower-density nodes.

The IO-500/VI4IO BOF

The second must-attend BOF for people working in I/O is the IO-500 and Virtual Institute for I/O BOF.  It's a very pragmatic BOF where people discuss system architecture, benchmarking, and various related community efforts, and since 2017 it has also included the semiannual unveiling of the IO-500 list.

This year was exciting in that the top system, a DDN IME installation at JCAHPC, was unseated by the monstrous storage system attached to the Summit system at Oak Ridge, which sustained an astounding 2 TiB/sec and 3 million opens/sec.  In fact, the previous #1 system dropped to #4, and each of the new top three systems was of a different architecture (Spectrum Scale at Oak Ridge, IME at KISTI, and Lustre at Cambridge).

Perhaps the most interesting of these new submissions was the #3 system, the Data Accelerator at Cambridge, which is a home-grown whitebox system that was designed to be functionally equivalent to DataWarp's scratch mode:

Alasdair King presenting the Data Accelerator design at the IO-500 BOF


The hardware is just Dell boxes with six NVMe drives and one OPA NIC per socket, and the magic is actually handled by a cleanroom reimplementation of the interface that Slurm uses to instantiate DataWarp partitions on Cray XC systems.  Rather than use a sophisticated orchestration system as DataWarp does though, the Data Accelerator translates Slurm #DW pragmas into Ansible plays that spin up and tear down ephemeral Lustre file systems.

The fact that the #3 fastest storage system in the world is a whitebox NVMe system is really remarkable, and my hat is off to the team at Cambridge that did this work.  As all-flash parallel file systems move from being a high-end boutique solution to being affordably mainstream, relatively scrappy but innovative engineering efforts like the Cambridge system are surely going to cause a rapid proliferation of flash adoption in HPC centers.

DDN also presented their software-defined IO-500 submission, this time run in Google Cloud and landing in the #8 position:


Since DDN's embedded SFA product line already runs virtual machines on their controller hardware, it doesn't seem like a big stretch to run the same SFA VMs in the cloud.  While this sounds a little counterproductive to DDN's biggest differentiator in providing a fully integrated hardware platform, this idea of running SFA in Google Cloud arose from the growing need for parallel file systems in the cloud.  I can only assume that this need is being largely driven by AI workloads which require a combination of high I/O bandwidth, high IOPS, and POSIX file interfaces.

Thursday, November 15

The conference was showing signs of winding down by Thursday, as many attendees brought their luggage with them to the convention center so they could head back home that night.  The expo floor also closes in the mid-afternoon on Thursday.

Technical Program, Part 2 - Exhibitor Forum

My Thursday began at 10:30 AM with the HPC Storage and Memory Architectures session of the Exhibitor Forum.  Liran Zvibel, former CTO and now CEO of WekaIO, was the first presenter and gave a surprisingly technical description of the WekaIO Matrix parallel file system architecture:

WekaIO's Matrix file system architecture block diagram.  A surprising amount of detail can be gleaned by examining this carefully.

In terms of building a modern parallel file system from the ground up for all-flash, WekaIO checks off almost all of the right boxes.  It runs almost entirely in user space to keep latency down, it runs in its own reserved pool of CPU cores on each client, and capitalizes on the approximate parity between NVMe latency and modern high-speed network latency.  They make use of a lot of the smart ideas implemented in the enterprise and hyperscale storage space too and are one of the few really future-looking storage companies out there who are really thinking about the new possibilities in the all-flash world while still courting the HPC market.

There is a fair amount of magic involved that was not broken down in the talk, although I've found that the WekaIO folks are happy to explain some of the more complex details if asked specific questions about how their file system works.  I'm not sure what is and isn't public though, so I'll save an architectural deep-dive of their technology for a later date.

Andreas Schlapka of Micron Technology was the next speaker, and his talk was quite a bit more high-level.  Aside from the grand statements about how AI will transform technology though, he did have a couple of nice slides that filled some knowledge gaps in my mind.  For example:

Broad strokes highlighting the different computational (and architectural) demands of training and inference workloads

Training is what the vast majority of casual AI+HPC pundits are really talking about when extolling the huge compute requirements of deep learning.  Part of that is because GPUs are almost the ideal hardware solution to tackle the mathematics of training (dense matrix-matrix multiplication) and post impressive numbers; the other part is that inference can't happen without a well-trained model, and models are continually being refined and re-trained.  What I hadn't fully appreciated is that inference is much more of an interesting computational problem in that it more closely resembles the non-uniform and latency-bound workloads of scientific computing.

This has interesting implications for memory technology; while HBM2 definitely delivers more bandwidth than DDR, it does this by increasing the channel width to 128 bits and hard-wiring 8 channels into each stack.  The extra bandwidth helps feed GPUs for training, but it's not doing much for the inference side of AI which, presumably, will become a much more significant fraction of the cycles required overall.  In my mind, increasing the size of SRAM-based caches, scratchpads, and register files is the more obvious way to reduce latency for inference, but we haven't really seen a lot of fundamentally new ideas on how to effectively do that yet.

The speaker went on to show the following apples-to-apples system-level reference:

System-level speeds and feeds of the memory products available now or in the near future as presented by Micron

It's not terribly insightful, but it lets you back out the bus width of each memory technology (bandwidth / data rate / device count) and figure out where its bandwidth is coming from (a quick sanity check of this arithmetic follows the list below):
  • DDR4 and DDR5 use 64-bit channels and rely on increasing channel-level parallelism to improve bandwidth.  This is now putting them in a place where you wind up having to buy way more capacity than you may want just to get sufficient bandwidth.  This is analogous to where HDDs are in the HPC storage hierarchy today; it's rapidly becoming uneconomical to rely on DDR for bandwidth.
  • GDDR uses narrower channels (32 bits) but more of them to get better bandwidth.  It also relies on phenomenally high data rates per pin; I don't really understand how this is possible given the inefficient single-ended signaling involved.
  • HBM uses both wide (128 bits) and plentiful channels to get its performance; the table is a little misleading in this regard since each "device" (HBM stack) contains eight channels.  This is fine for feeding highly parallel arithmetic units like vector ALUs, but this offers no benefit to latency-bound workloads that, for example, chase pointers to traverse a graph. (It turns out HBM is just fine for pointer chasing--thanks to one of HPC's memory wizards at large for pointing this out to me!)
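Here's the quick sanity check promised above: the little program below converts an aggregate bandwidth and a per-pin data rate into an implied bus width.  The input numbers are round, illustrative figures (roughly one HBM2 stack and one DDR4-3200 DIMM), not the exact values from Micron's table.

```c
#include <stdio.h>

/* Implied bus width in bits = (GB/s * 8 bits/byte) / (Gb/s per pin) / devices */
static double bus_bits(double gbytes_per_s, double gbits_per_pin, int devices)
{
    return gbytes_per_s * 8.0 / gbits_per_pin / (double)devices;
}

int main(void)
{
    /* ~256 GB/s from one HBM2 stack at 2 Gb/s per pin -> 1024 bits,
     * i.e., eight 128-bit channels per stack */
    printf("HBM2 stack: %4.0f bits\n", bus_bits(256.0, 2.0, 1));

    /* ~25.6 GB/s from one DDR4-3200 DIMM at 3.2 Gb/s per pin -> 64 bits */
    printf("DDR4 DIMM : %4.0f bits\n", bus_bits(25.6, 3.2, 1));
    return 0;
}
```

Running it prints 1024 bits for the HBM2 stack and 64 bits for the DDR4 DIMM, which is where the bandwidth disparity in the table ultimately comes from.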
Micron also made the strange assertion that they are the only company that offers the entire range of memory products.  I guess since Samsung and SK Hynix both opted to skip SC, Micron can say whatever it likes; however, Samsung is currently the only company shipping commercial quantities of HBM, and Hynix's HBM capability just came online.  As far as I know, Micron has never manufactured a stack of HBM since they spent years promoting the competing-but-now-defunct Hybrid Memory Cube technology.

The NSF Future Directions BOF

I opted to see what was new with National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) at their noon BOF.  Despite having left the NSF world when I left San Diego, I still care deeply about NSF computing because they pay for many of the most accessible HPC resources in the US.  I certainly got my start in HPC on the NSF's dime at SDSC, and I got to see firsthand the huge breadth of impact that SDSC's XSEDE resources had in enabling smaller research groups at smaller institutions to perform world-class research.  As such, it's also no surprise that the NSF leads the pack in developing and deploying many of the peripheral technologies that can make HPC accessible such as federated identity, science gateways, and wide-area file systems.

That all said, actually listening to the NSF HPC strategic vision makes me rather grumpy since the directions of such an important federal office sometimes appear so scattershot.  And judging by the audience questions at the end of the BOF, I am not the only one--Very Important People(tm) in two different national-level HPC consortia asked very pointed questions of Manish Parashar, the NSF OAC director, that highlighted the dichotomy between OAC's strategic vision and where it was actually putting money.  I really believe in the critical importance of NSF investment in maintaining national cyberinfrastructure which is probably why I keep showing up to these BOFs and do my best to support my colleagues at SDSC and the other XSEDE SPs.

After sitting through this Future Directions BOF, I could write another updated rant about how I feel about the NSF's direction in HPC and get myself in trouble.  Instead, I'll share just a few slides I photographed from afar along with some objective statements and leave it at that.

The future directions summary slide:

NSF OAC's future directions
  • Performance, capability computing, and global leadership are not mentioned in the above slide.  Terms like "agility," "responsiveness," and "accessibility" are often used to describe the cloud.
  • "reduce barriers to CI adoption" indicates that NSF wants to serve more users.  NSF is not increasing investment in capital acquisition (i.e., more or larger HPC systems beyond the status quo of technology refreshes).
  • "Prioritize investments to maximize impact" does not define what impacts are to be maximized.

The Frontera slide:

NSF's next leadership-class HPC, Frontera, to be deployed by TACC
  • The award amount was $60M.  The previous Track-1 solicitation that funded Blue Waters was $200M.  Stampede was $30M, and Stampede 2 was another $30M.
  • "leadership-class ... for all [science and engineering] applications" either suggests that all science and engineering applications are leadership-capable, or this leadership-class system is not primarily designed to support a leadership computing workload.
  • It is unclear what the significance of the "CPU" qualifier in "largest CPU system" is in the larger context of leadership computing.
  • There is mention of "leadership-class" computing.  There is no mention of exascale computing.  There is nothing that acknowledges leveraging the multi-billion-dollar investment the US has made into the Exascale Computing Project.  An audience member politely asked about this omission.

The Midscale Research Infrastructure slide:

Upcoming solicitations for research cyberinfrastructure
  • NSF OAC expects to issue one $6M-$20M solicitation and another $20M-$70M solicitation "soon" to fund HPC systems and the associated infrastructure.
  • $6M-$20M is on the same order of magnitude as the Track-2 solicitations that funded SDSC's Gordon ($10M) and Comet ($12M).
  • $20M-$70M is on the same order of magnitude as the Track-2 solicitations that funded TACC's Stampede 1 and 2 ($30M).  NSF's next leadership-class investment (Frontera) is $60M.

My SC Paper

The next major item on my agenda was presenting my paper, A Year in the Life of a Parallel File System, as the final talk in the final session of the paper track.

My name in lights--or something like that.

I was admittedly bummed out when I found out that I was going to be the conference closer since a significant number of SC attendees tend to fly out on Thursday night and, presumably, would not stick around for my presentation.  As a result, I didn't take preparation for it as seriously in the weeks leading up to SC as I normally would have.  I knew the presentation was a 30-35 minute talk that had to be fit into a 25-minute slot, but I figured I would work out how to manage that the night before the talk and mostly wing it.

What I realized after arriving at SC was that a bunch of people--most of whom weren't the expected audience of storage researchers--were looking forward to hearing the talk.  This left me scrambling to seriously step up the effort I was going to put into making sure the presentation was well composed despite needing to drop ten minutes of material and fit it into the 25 minutes I was given.  I documented my general approach to crafting presentations in my patented Glenn K. Lockwood Five Keys to Being a Successful Researcher (FKBSR) method, but I'll mention some of my considerations for the benefit of anyone who is interested in how others approach public speaking.
  1. I absolutely could not overshoot the timing because some attendees had to leave at 5 PM to catch 7 PM flights.  This meant that it would be better for me to undershoot the time and either draw out the conclusions and acknowledgments slides to finish on time or finish early and leave extra time for questions.
  2. The people I met at SC who indicated interest in my talk were storage systems people, not statisticians.  This meant I could probably tone down the statistical rigor in the presentation without offending people's scientific sensibilities.
  3. Similarly, because attendees were already familiar with typical HPC I/O systems and the relevant technologies, I could gloss over the experimental setup and description of the different compute and storage systems.
  4. Given the above considerations, a reasonable approach would be to punt as many non-essential details into the Q&A after the talk and let people try to poke holes in my methods only if they really cared.
I also know two things about myself and the way I present:
  1. I can present either at a casual pace where I average ~70 seconds per slide or in turbo mode where I average ~50 seconds per slide.  Orating at turbo speed requires a lot more preparation because it requires speaking through slide transitions rather than pausing to reorient after each slide transition.
  2. I get distracted easily, so I would rather have people begin to leave after my monologue ended and Q&A began than have the commotion of people getting up derail the tail end of my presentation.

As a result of all these factors, I opted to cut a lot of details to get the talk down to ~25-30 minutes when presented at a casual pace, and to prepare to present in turbo mode just in case the previous speakers went long (I was last of three speakers), there were A/V issues (they were prolific at this SC, especially for Mac users), or there were any audience interruptions.

I also opted to present from my iPad rather than a full laptop since it did a fine job earlier at both PDSW-DISCS and the IO-500/VI4IO BOF.  In sticking with this decision though, I learned two valuable things during the actual presentation:
  1. The iOS "do not disturb" mode does not suppress Twitter notifications.  A couple of people were kind enough to tweet about my presentation as I was giving it, but this meant that my presenter view was blowing up with Twitter noise as I was trying to present!  Fortunately I only needed to look down at my iPad when transitioning between slides so it didn't derail me.
  2. There's no usefully sized timer or clock in PowerPoint for iOS's presenter view, and as a result, I had no idea how I was doing on time as I entered the final third of my slides.  This became a distraction because I was fully expecting a five-minute warning from the session moderator at some point and got worried that I wasn't going to get one.  As such, I didn't want to slow down the tail of the presentation without knowing how close I was getting to the target.  It turned out that I didn't get a five-minute warning because I was already concluding at that point.
Fortunately the audience was sufficiently engaged to pad out the Q&A period with many of the questions that would've been answered by the slides I had dropped.  Afterwards I got feedback that indicated the presentation was noticeably short to the audience (not great) but that the narrative remained understandable to most attendees throughout the entire presentation (good).

As far as the technical content of the presentation though, I won't recap that here--until I write up the high-level presentation as another blog post, you may have to read the paper (or invite me to present it at your institution!).

SC Technical Program Reception

I've never attended the reception that wraps up the last full day of SC for a variety of reasons, and I was going to skip it again this year to fit some me-time into the otherwise frantic week.  However the venue (the Perot Museum) and its close proximity to my hotel lured me out.

The entryway to the Perot Museum

I am not a "never eat alone" kind of person because I find that my ability to be at the top of my game diminishes without at least some intermittent time to sit back and digest.  As such, I approached the reception with very selfish intent: I wanted to see the museum, learn about something that had nothing to do with supercomputing, have a drink and a meal, and then go back to my hotel.  So I did just that.

The dinosaurs seemed like a major feature of the museum:

Rapetosaurus skeleton on display at the Perot Museum

The paleontological diversity of the dinosaur room reminded me of the dinosaur museum near my wife's hometown in the Canadian prairies, but the exhibit seemed to be largely reproduction fossils that blended science with entertainment.

More impressive to me was the extensive mineral collection:

I'm a sucker for quartz.  I did my PhD research on silicates.

Not only were the minerals on display of remarkable quality, but many of them were found in Texas.  In fact, the museum overall had a remarkably Texas-focused set of exhibits which really impressed me.  The most interesting exhibit that caught my attention was a mini-documentary on the geologic history of Texas that explained how plate tectonics and hundreds of millions of years resulted in the world-famous oil and gas reserves throughout the state.

Having learned something and enjoyed some delightful food at the museum, I then called it quits and cashed out.

Friday, November 16

The last day of SC is always a bit odd because the expo has already wrapped up, most of the vendors and casual attendees have gone home, and the conference is much more quiet and focused.  My day started with a surreal shuttle ride to the conference center in what appeared to be a 90's-era party bus:

Conference shuttle, complete with taped-together audio system, faux leather sofa, and a door that had to be poked with a broom stick to open.


Only six concurrent half-day workshops and a panel were on the agenda:

The entire Friday agenda fit on a single screen

I stuck my head into the P3HPC workshop's first panel discussion to catch the age-old but ever-lively argument over someone's proposed definition of performance portability and productivity either being too broad or too narrow.  I/O performance portability generally does not have a place in these sorts of conversations (which I don't fault--algorithmic complexity in I/O is usually hidden from user applications) so I attended only as an interested observer and wasn't as fastidious about taking notes as I was earlier in the week.

At 10:30 AM I headed over to the Convergence between HPC and Big Data: The Day After Tomorrow panel discussion which had a star-studded speaker lineup.  NERSC's Katie Antypas gave a great overview of the NERSC-9/Perlmutter architecture which fit the panel topic uncannily well since it is a system design from the ground up to meet the needs of both traditional HPC and large-scale data analysis.

The NERSC-9 Project Director describing how the Perlmutter system embodies the convergence of HPC and Big Data in front of a remarkably big crowd in the final session of SC.

Unfortunately I had to duck out shortly after she spoke to get to my last meeting of the week with an old colleague for whom I always make time at SC.  Incidentally, some of the most valuable time you can spend at SC is talking to industry consultants.  Not unlike getting to know members of the trade press, good consultants have exposure to a tremendous breadth of problem and solution spaces.  They can give you all manner of interesting insights into different vendors, industry verticals, and market trends in an otherwise brief conversation.

After my final meeting was cut short by my colleague's need to run to the airport, I had a quick bite with another Friday holdout then made my own way to the airport to catch up on a week's worth of e-mails.  The flight back to Oakland was one of the rare occasions where I was just too worn out to try to catch up on some delinquent report writing and just watched three hours of Dark Tourist on Netflix.

After the Conference

It was technically Saturday by the time I finally got home, but the family was happy to see me (and the swag I had in tow):

George fully appreciating the giant pile of conference swag with which I came home

This was definitely the busiest SC of my career, but in many ways it was also the most productive.  I owe sincere thanks to everyone in the HPC community who made it such a worthwhile conference to attend--vendors, presenters, old colleagues, and even the new colleagues who occasionally just wanted to introduce themselves and express that they enjoy reading the nonsense I post on Twitter.  I always leave SC more amazed and humbled by all the bright minds with whom I connect, and I hope that I am doing my part to pay that experience forward for others now and in the SC conferences to come.