Super-computers are continually pushing the boundaries of technology, pursuing the exponential growth in performance enjoyed for at least two decades. This performance increase allows advances in physical modelling techniques that make unsteady, viscous Navier-Stokes solvers so powerful for research and development.
Tianhe-2, in China, is currently the fastest super-computer in the world according to the Top500 organisation, which tracks and studies the development of high-performance computing around the world. Tianhe-2 is capable of 33.86 petaflops (roughly 34 thousand trillion floating-point operations per second), though it is not expected to hold the top spot for much longer.
The United States Department of Energy has commissioned a new super-computer for 2018, Aurora, which will be capable of 180 petaflops. Technology is expected to allow construction of the first exa-scale machine, capable of 1 exaflop (10¹⁸ flops), by 2020. Overall, these computers continue to grow at an exponential rate, but under the bonnet there are significant architectural changes which developers must adapt to if they wish to use this new hardware.
The technology that drives these huge machines ripples down through the whole industry. Whether our CFD codes run on exa-scale super-computers, industrial or academic clusters, or under the desk in a workstation; the hardware will have changed to make use of the most efficient technology. The codes which can adapt to and exploit this hardware will see exponential growth in speed and capability. Those that don’t will perish.
Until around 2004, the limitation driving hardware technology was peak clock frequency. Shrinking transistors made individual cores simultaneously faster and more efficient (in terms of electrical power consumption), and so manufacturers invested in shrinking technologies. As transistors continue to shrink, however, quantum tunnelling allows electrons to ‘jump’ across the silicon in a phenomenon known as leakage.
Leakage current sets a lower limit on the power consumption of each core, creating a ‘power wall’. This power wall caused core clock frequencies to stagnate, and manufacturers turned to multi-core technologies to continue increasing total flop rates. Today, server processors such as Intel’s Haswell-EX have up to 18 cores in a single processor. Parallelisation is now increasing roughly 20 times faster than it once did; codes that were designed for the modest parallelism of the pre-2004 era will require serious modification.
Seymour Cray, the founder of Cray Research, once commented on the disadvantages of parallel computing: “If you were ploughing a field, which would you rather use: Two strong oxen or 1,024 chickens?” In a few years, CFD practitioners may have to employ 20,000 metaphorical mice.
To push core frequencies further, modern processors often have dynamic core clock rates, allowing cores to speed up temporarily within the power envelope of the processor. This is cleverly marketed as a ‘turbo’ feature rather than a throttling feature. In future processors, these power-management systems will become more commonplace, with much of the processor ‘going dark’ to stay within power limits. This dark-silicon effect will create a highly heterogeneous system, which will wreak havoc on parallel computations that previously had good load-balancing.
Many machines, both at workstation and super-computer level, utilise accelerators alongside their standard processors. These may be in the form of graphics processing units (GPUs) or specialist co-processors. These accelerators feature many cores running at low frequencies (the latest Xeon Phi has 72 cores, and the latest GPUs have almost 5,000) and as such they are a stepping stone to a many-core, heterogeneous era. Accelerators may not be a long-term solution, but they represent an architecture which will be here for the foreseeable future.
Further efforts to increase flop rates have come via the re-introduction of vector processing at the sub-core level, similar to the array processors of old. This allows multiple elements of an array to be operated on at once, providing yet more parallelisation. Modern compilers can perform some auto-vectorisation, though it is mostly up to the developer to exploit it fully.
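As a minimal illustration (not drawn from any particular CFD code), the AXPY-style loop below is the kind of kernel a compiler can auto-vectorise; the `omp simd` directive and the `restrict` qualifiers are hints to the compiler that the iterations are independent:

```c
/* AXPY-style update: every iteration is independent, so the compiler can
 * pack several array elements into one SIMD register and operate on them
 * simultaneously. `restrict` promises that the arrays do not overlap. */
void axpy(double a, const double *restrict x, double *restrict y, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compiled with vectorisation enabled (e.g. `gcc -O3 -fopenmp-simd`), several iterations of this loop typically execute per instruction on modern hardware; without those flags the directive is simply ignored and the loop runs serially.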
As the cost of floating-point operations has decreased, in terms of electrical power and time, many other parts of the computer have become a bottleneck. The cost of storing data, and of moving data back and forth to memory, has become proportionally more expensive, improving at only half the rate of processors. Similarly, the networks connecting the nodes in a super-computer have seen limited improvement. Fundamentally, transferring a byte of data down a copper wire has a fixed cost: it takes a fixed amount of electrical power and a fixed amount of time. As such, our efficient processors are starved by our inefficient memory systems. Developers must pay careful attention to data locality, ensuring that data is stored physically close to the cores which operate on it, so as to avoid unnecessary movement.
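To make the locality point concrete, here is a hypothetical sketch (field names invented for illustration) of the same cell data in two layouts. A sweep that reads only the pressure field drags unused fields through the cache in the first layout, but touches exactly the bytes it needs in the second:

```c
#define NCELLS 1024

/* Array-of-structs: a pressure-only sweep strides over five doubles per
 * cell, wasting most of each cache line on fields it never reads. */
typedef struct { double rho, u, v, w, p; } CellAoS;

/* Struct-of-arrays: each field is stored contiguously, so a sweep over
 * one field is a unit-stride, cache-friendly traversal. */
typedef struct {
    double rho[NCELLS], u[NCELLS], v[NCELLS], w[NCELLS], p[NCELLS];
} CellsSoA;

double sum_pressure_aos(const CellAoS *cells, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += cells[i].p;   /* stride: 5 doubles */
    return s;
}

double sum_pressure_soa(const CellsSoA *cells, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += cells->p[i];  /* stride: 1 double */
    return s;
}
```

Both functions compute the same result; the difference is purely in how much memory traffic each generates, which is exactly the cost that now dominates.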
Overall, the ecosystem of high-performance computing is changing dramatically. The many-core era is approaching, and CFD codes must adapt if they wish to survive and thrive.
How can CFD survive?
Many CFD codes were fostered under the pre-2004 computing architecture, using message-passing techniques (sending packages of data between nodes) to allow inter-nodal parallelisation. As more nodes were added to super-computers, and more memory became available, these codes could naïvely spread across the extra nodes and increase the size of their meshes. Developers were free to focus on physical modelling, enjoying higher resolution and higher speeds with relatively little effort.
Most of these codes are based on distributed-memory computing. That is, multiple copies of a process run on a machine, each with their own address space. This is a perfect system for inter-nodal parallelisation, where each node has its own memory; but when multi-core processors arrived in 2004 this method started to become outdated. CFD codes often advertise their near-perfect parallel efficiency at the inter-nodal level, but completely disregard multi-core scaling within a node, and this is where the real challenge lies. There are several efficiency losses when packing multiple distributed-memory processes into a single node, and this will be amplified by changes in technology:
Memory Usage: Processes have a large memory footprint, and any data common to the entire CFD problem is needlessly duplicated within a node. The more cores and processes a node hosts, the more memory is duplicated. In a typical simulation, each process may create up to 100MB of duplicated memory, which limits the memory available for the actual simulation. With the number of cores growing twice as fast as memory capacity, this is a dangerous situation. Furthermore, sending messages between processes on the same node is an unnecessary waste of resources: it requires expensive copying of data from one process’s buffer to another, when the target process could, in theory, access the data directly. It is the equivalent of asking your co-worker to pass you a document that is already in your hand.
Load Balancing: Most load-balancing is performed prior to a simulation, dividing the computational mesh among the processes. Accurate load-balancing is important because the overall simulation can only progress at the rate of the slowest process. Changing this load-balance mid-simulation is expensive and complicated, and is usually implemented only in codes featuring automatic mesh refinement. Such dynamic load-balancing is nowhere near efficient enough to handle the fine-grained fluctuations in core clock frequencies caused by dark-silicon effects. Simulations on a many-core machine will be limited to the speed of the most throttled core at any time. Pessimistic estimates expect 50% of a machine to be dark by 2018, potentially doubling simulation times; a dangerous situation.
So what can CFD do to survive this? The most comprehensive method for dealing with this problem is to implement a hierarchical parallelisation scheme. For example, using one distributed-memory process per node, and exploiting multi-threading within that process, with the common programming standards being MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) respectively. This allows all of the cores within a node to access the same memory and thus share the work of each process. Data does not need to be duplicated, which is favourable for memory capacity. No messages need to be sent between threads, which is favourable for data locality. Finally, the work of each process can be split dynamically, allowing faster threads (running on faster cores) to steal work from slower ones, thereby dealing with heterogeneity.
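A minimal sketch of the intra-node half of such a scheme is shown below (the MPI layer, one rank per node, is only indicated in comments, and the kernel is a stand-in rather than a real stencil):

```c
/* Intra-node half of a hypothetical hybrid MPI+OpenMP scheme: one MPI
 * rank per node would own `cells` (rank setup and halo exchange omitted).
 * The threads below all share that memory directly, so nothing is
 * duplicated and no intra-node messages are copied. schedule(dynamic)
 * hands out chunks of iterations on demand, so threads on faster,
 * un-throttled cores naturally take work from slower ones. */
double relax_cells(double *cells, int n)
{
    double residual = 0.0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:residual)
    for (int i = 0; i < n; ++i) {
        double updated = 0.5 * cells[i];   /* stand-in for a real stencil */
        residual += (cells[i] - updated) * (cells[i] - updated);
        cells[i] = updated;
    }
    return residual;
}
```

With `schedule(dynamic)`, a thread pinned to a throttled core simply completes fewer chunks rather than holding up the whole node; without OpenMP enabled the pragma is ignored and the loop runs serially with identical results.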
This hierarchical parallelisation technique provides good support for many-core processors and co-processors, and can be mixed with GPU programming interfaces to support all types of accelerator. At a finer level, vectorisation can also be applied using OpenMP directives, though this comes with its own set of challenges. For structured meshes, vectorisation is trivial; for unstructured meshes, careful optimisation must be performed, usually involving “padding” of data structures to make their memory-access patterns more uniform.
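As a hypothetical example of such padding (the structure and names are invented for illustration), variable-length neighbour lists can be padded to a fixed width so that the inner gather loop has a uniform, vectorisable trip count; pad entries carry a zero weight, so they contribute nothing to the sum:

```c
/* Hypothetical padded connectivity: every cell stores exactly MAX_NB
 * neighbour indices. Cells with fewer neighbours are padded with a valid
 * index (e.g. the cell itself) and a zero weight, so the inner loop has
 * a fixed trip count and no branching on list length. */
#define MAX_NB 4

typedef struct {
    int    nb[MAX_NB];   /* neighbour indices, padded               */
    double w[MAX_NB];    /* face weights, 0.0 for pad entries       */
} CellConn;

void weighted_gather(const CellConn *conn, const double *phi,
                     double *out, int ncells)
{
    for (int c = 0; c < ncells; ++c) {
        double acc = 0.0;
        #pragma omp simd reduction(+:acc)
        for (int k = 0; k < MAX_NB; ++k)   /* uniform, vectorisable */
            acc += conn[c].w[k] * phi[conn[c].nb[k]];
        out[c] = acc;
    }
}
```

The trade-off is wasted storage and a few dead multiplies per short list, paid in exchange for regular memory-access patterns that the vector units can exploit.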
Fortunately, this hybrid scheme is easy to adopt, but it is only the tip of the iceberg. The CFD codes that stand out from the crowd will be the ones that capitalise on the changes to super-computing architecture and make fundamental changes to the way they run.
How can CFD thrive?
Another problem facing CFD is the sheer amount of parallelisation; utilising it efficiently goes beyond hierarchical parallelisation or optimisation. The core CFD algorithms must be updated to avoid certain communication patterns and obtain the highest possible performance. Global communications often creep into CFD algorithms via additional transport equations or through specialist routines (e.g. for adaptive meshing). They are also a fundamental part of the linear equation-system solvers at the heart of CFD. Global communication is caused by, for example, finding the mean value of a flow variable or computing a residual. To continue the metaphor, it is like trying to find which of your 20,000 mice has the longest tail.
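A residual computation illustrates why: each partition reduces its own cells locally, and the partial sums must then be combined across every rank, a global synchronisation point. The sketch below mimics that combine step serially; a real distributed code would use a collective such as `MPI_Allreduce` there:

```c
/* Each partition sums the squares of its own residual entries... */
double partial_residual_sq(const double *r, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += r[i] * r[i];
    return s;
}

/* ...and the partial sums are then combined across all ranks. In a real
 * code this loop is replaced by a collective such as MPI_Allreduce, whose
 * cost grows with the number of participating cores, which is exactly why
 * global communication hurts at extreme scale. */
double global_residual_sq(const double *partials, int nranks)
{
    double s = 0.0;
    for (int p = 0; p < nranks; ++p) s += partials[p];
    return s;
}
```

Every rank must wait at the combine step for the slowest participant, so the more such reductions an algorithm contains, the worse it scales.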
As the demand for higher accuracy in CFD increases, more pressure is placed on these creaking algorithms, via higher-order discretisation schemes and closer coupling of the equation systems. Additionally, as more flexibility is demanded via mesh refinement, complex modelling capabilities, overlapping meshes, etc. it becomes difficult to maintain a scalable code. The codes that thrive will be those that can offer these capabilities without sacrificing performance.
Work is currently being undertaken at the University of Southampton, in conjunction with the Maritime Research Institute Netherlands (MARIN), to improve the scalability of the core algorithms. In particular, this focuses on scaling the linear solvers to a many-core environment and investigating dynamic balancing for dark silicon. Other developments for scalable CFD tackle such areas as:
Parallel methods for grid generation: ensuring that reading or creating mesh data does not become a bottleneck to the solver.
Scalable post-processing: using Big Data analytics to create useful results from millions of cells in a computationally efficient way.
Scalable discretisation techniques: especially concerning sliding or overset mesh technology.
Parallel time-stepping: allowing the time domain to be parallelised, improving scalability by partitioning the simulation in four dimensions.
New thinking is required in the development of CFD to achieve harmony between hardware, software and hydrodynamics. Whether the codes are commercial, open-source or in-house; utilising super-computers or desktop workstations; for one-off high-resolution simulations or batch optimisation, they must adapt to stay ahead of the game.