Premise
This is the first in a series of posts that I intend to make detailing a real use of Hadoop to do some actual publishable science in a field that is still dominated by traditional, FLOP-intensive high-performance computing. In putting this information together, I hope to provide a look at how map/reduce might be useful to the majority of traditional computational scientists who aren't knee deep in "Big Data." As a prototype, I'll be examining a specific analysis that was commonplace throughout my own research, illustrating how I turned it from a "traditional" HPC problem into a map/reduce problem and how I solved it in the simplest, most accessible way possible.
This work is an offshoot of a few professional projects in which I am involved, but I should stress that none of this information represents the thoughts, opinions, or endorsements of my employer or any of the resources on which this work was done. With that being said, I gratefully acknowledge Hadoop time on SDSC Gordon and FutureGrid's Sierra and Hotel systems.
I should also say that this project, this post, and the posts that will follow are a work in progress. I hope to edit these posts as the project's direction becomes clearer.
Scientific Background
I spent about two years of my graduate studies doing research in modeling materials for nuclear waste forms as a part of a now-defunct AFCI/GNEP University Research Fellowship. The project involved conducting molecular dynamics simulations of relatively large systems (50,000 atoms) with a fully reactive model for water to understand how radiation damage would affect how quickly nuclear waste forms would degrade. I published two papers based on this work, and one of the simplest measures of "damage" to the material (a silica glass) that I used was counting up the number of defect sites that formed over time as the material was subjected to radiation. In practice, this meant running an irradiation simulation that dumped out a complete snapshot of the system periodically, then performing detailed analysis on the series of snapshots that remained after the simulation had completed.
[Figure: A typical 50,000-atom system used to simulate radiation damage of nuclear waste forms in contact with water. Taken from chapter 5 of my dissertation.]
The analysis of these snapshots was simple yet computationally intensive, because the bonding topology of the entire system had to be determined for every snapshot. This involved calculating the distance between every single atom and all of its neighbors, figuring out which neighbors were close enough to be considered "bonded," then counting up the number of bonded neighbors each of these atoms had. The number and type of neighbors then let me tell whether an oxygen atom was, say, holding the glass's atomic structure together (bonded to two silicons) or holding it open (bonded to one silicon and, optionally, some hydrogen).
[Figure: Measuring the rate of transient and permanent damage of an irradiated glass, calculated by literally counting the number of defects in my simulated systems, as described above. Taken from chapter 4 of my dissertation.]
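To give a concrete picture of what this analysis entails, here is a minimal sketch of the neighbor-counting step in Python with NumPy. The 2.0 Å cutoff, the brute-force all-pairs distance calculation, and the neglect of periodic boundary conditions are simplifying assumptions for illustration only; the production analysis handled all of this differently (and in Fortran), as the next section explains.

```python
import numpy as np

def count_bonded_neighbors(positions, cutoff=2.0):
    """Count how many neighbors lie within `cutoff` of each atom.

    positions : (N, 3) array of Cartesian coordinates (angstroms)
    cutoff    : bond distance criterion (2.0 A is an illustrative value)

    Note: this brute-force O(N^2) version ignores periodic boundary
    conditions, which a real simulation cell would require.
    """
    deltas = positions[:, None, :] - positions[None, :, :]   # (N, N, 3)
    distances = np.linalg.norm(deltas, axis=-1)               # (N, N)
    bonded = (distances < cutoff) & (distances > 0.0)         # exclude self
    return bonded.sum(axis=1)                                 # neighbors per atom

# Example: three atoms, the first two close enough to be "bonded"
atoms = np.array([[0.0, 0.0, 0.0],
                  [1.6, 0.0, 0.0],
                  [5.0, 5.0, 5.0]])
print(count_bonded_neighbors(atoms))   # -> [1 1 0]
```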
Why Fortran? And why Python?
Calculating the bonding topology of my snapshots was done in Fortran for two very compelling reasons:
- The snapshot data was saved in a binary format easily read by Fortran but nothing else, and
- Fortran is very fast and very easily parallelized for these sorts of computationally dense tasks
However, classifying atoms based on this bonding information was much less suited to Fortran for two reasons:
- Atoms can be bonded to a wide range of neighbors that all need to be described by the code. For example, an oxygen could be hydroxyl (OH-), water (H2O), hydronium (H3O+), non-bridging (-O-Si), bridging (Si-O-Si), protonated non-bridging (H-O-Si), doubly protonated (+H2-O-Si), protonated bridging (Si-OH+-Si), or any number of unstable states caused by irradiation. However, providing this sort of open-ended description of bonding characteristics is not straightforward in Fortran (a sketch of what I mean follows this list).
- My data needed to be self-evident so colleagues could pick it up and make sense of it. Calling an oxygen "SiOH" (a string) is far less ambiguous than attributing some combination of on and off bits to it. Strings are not one of Fortran's strong suits.
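To illustrate the kind of open-ended, string-based classification I have in mind, here is a minimal sketch in Python. The particular labels, the neighbor bookkeeping, and the species covered are illustrative assumptions rather than my actual analysis code.

```python
def classify_oxygen(neighbors):
    """Map an oxygen's bonded neighbors to a human-readable species label.

    neighbors : list of element symbols bonded to this oxygen,
                e.g. ["Si", "H"] for a protonated non-bridging oxygen.
    The labels below are illustrative; a real analysis would cover many
    more irradiation-induced states.
    """
    n_si = neighbors.count("Si")
    n_h = neighbors.count("H")

    if n_si == 2 and n_h == 0:
        return "Si-O-Si"        # bridging oxygen
    if n_si == 1 and n_h == 0:
        return "SiO"            # non-bridging oxygen
    if n_si == 1 and n_h == 1:
        return "SiOH"           # protonated non-bridging oxygen
    if n_si == 0 and n_h == 1:
        return "OH"             # hydroxyl
    if n_si == 0 and n_h == 2:
        return "H2O"            # water
    if n_si == 0 and n_h == 3:
        return "H3O"            # hydronium
    return "other"              # unstable / radiation-induced states

print(classify_oxygen(["Si", "H"]))   # -> SiOH
```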
Thus, this defect analysis was done in stages following a pattern that I wound up using throughout my research:
1. Run the simulation using high-performance code (Fortran) and dump binary snapshots into "knite files" (2 GB output)
2. Run the analysis using high-performance code (also Fortran) on those binary snapshots and generate insightful, human-readable analysis called "coord files" (8 GB output)
3. Parse the human-readable analysis (Python or Perl) and distill it into a variety of concise and meaningful representations of that data
I was equipped to deal with stages #1 and #2 because they fall squarely within the realm of traditional computational science, where if your calculation is taking too long, you parallelize it with MPI or OpenMP. However, stage #3 was a major bottleneck despite the absurd simplicity of it all--I just wanted to count up how many oxygens of a certain chemistry (e.g., how many water molecules) I had in the simulation at various points in time. It was very easy to run Perl out of memory when trying to manipulate these files, and parsing them serially, one line at a time, was extremely slow.
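For reference, the serial version of stage #3 amounted to little more than the following sketch. The record layout assumed here (one atom per line, species label in the last whitespace-delimited column) is purely illustrative and not the actual coord-file format.

```python
import sys
from collections import Counter

def count_species(coord_file):
    """Tally species labels (e.g. "SiOH", "H2O") across one coord file.

    Assumes one atom per line with the species label in the last
    column -- an illustrative layout, not the real coord-file format.
    """
    counts = Counter()
    with open(coord_file) as f:
        for line in f:               # serial, one line at a time
            fields = line.split()
            if fields:
                counts[fields[-1]] += 1
    return counts

if __name__ == "__main__":
    for label, n in count_species(sys.argv[1]).most_common():
        print(label, n)
```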
Why Map/Reduce?
Stage #3 does not yield to the usual tricks of traditional HPC, but the trivially parallel task of counting up records across a very large file is an ideal use case for the map/reduce paradigm (sketched at the end of this post). However, map/reduce and Hadoop are surrounded by a lot of aggressive marketing from cloud service providers, and this level of buzz had left me feeling that Hadoop, like "HPC in the cloud," is a solution to a scientific problem that doesn't exist. I maintain the mindset of a domain scientist rather than a computer scientist, and I want to answer the following questions:
- Do map/reduce and Hadoop actually perform well? Hadoop is written in Java and runs over TCP, and there's nothing high-performance about either.
- Is it easy to take existing applications and actually use them with Hadoop or some other map/reduce framework? Domain scientists aren't doing publishable work if they are struggling with programming, so if it's not easy, it's not getting used.
- Is it easy to actually run map/reduce jobs on real data on real compute resources? If scientists have to possess arcane knowledge of deploying a Hadoop cluster, they won't waste their time.
In addition to these questions of broad relevance, I also hope to establish a simple set of guidelines to minimize the hassle of trying out map/reduce in the context of traditional HPC for anyone whose work might benefit.
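To make the shape of the map/reduce version concrete before digging into details in later posts, here is roughly how stage #3 recasts as a pair of Hadoop streaming scripts in Python. This is a minimal sketch using the same illustrative record layout as above, not the code I actually ran: the mapper emits one (species, 1) pair per atom record, and the reducer sums the counts for each species.

```python
#!/usr/bin/env python
# mapper.py -- emit one (species, 1) pair per atom record read from stdin.
# Assumes the same illustrative record layout as the serial sketch above.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        print("%s\t1" % fields[-1])
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each species key.
# Hadoop streaming delivers the mapper output sorted by key, so all
# records for a given species arrive consecutively.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```

Because Hadoop streaming simply pipes data through executables over stdin and stdout, a pair of scripts like this can be tested locally with something like `cat coord_file | ./mapper.py | sort | ./reducer.py` before ever touching a real cluster.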